Big data and the cloud are each revolutionary IT concepts, and their intersection is doubly so. Today, users and cloud providers alike are focusing on that intersection, planning applications and service offerings to exploit the technologies. To address the intersection of big data and the cloud, and to choose the best cloud big data platform, cloud planners need to decide on a database model, select between cloud database services and cloud database platforms, and review the features of each platform against their own special needs.
There are three popular models for big data: the distributed MapReduce model popularized by Hadoop; the NoSQL model used for non-relational, non-tabular storage; and the SQL RDBMS model for relational tabular storage of structured data. You can use all three in the cloud, so in most cases database design and usage concerns, rather than cloud considerations, will determine the model choice. Once you've identified a database model, you can explore cloud options for it.
Most business transactions are best stored and accessed in RDBMS form, where SQL queries and tabular summarization can be supported easily. This is the model that both enterprise users and database architects are most likely to be familiar with, and a good rule of thumb is to assume SQL/RDBMS until you can prove another option is better.
The most common "don't use SQL" determinant is that the data to be stored is object-structured rather than tabular in nature. Object data collects information for a given entity (the object) as a set of properties that may be relatively free-form within the object. Although the object typically has a unique "key," the properties may or may not relate among objects. If you can't visualize data as a set of tables with fixed fields and valuable field-to-field relationships, then SQL/RDBMS may be difficult to adopt, and one of the other options may be better.
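One rough way to apply the "fixed fields" test is to measure how many fields every record shares. The sketch below is illustrative only: the function name `shared_field_ratio` and the sample `orders` and `devices` records are hypothetical, and a real assessment would also weigh field-to-field relationships, not just field names.

```python
def shared_field_ratio(records):
    """Return the fraction of all observed fields that appear in every record.

    A ratio near 1.0 suggests fixed fields (tabular data); a low ratio
    suggests free-form, object-structured data. Hypothetical heuristic.
    """
    if not records:
        return 0.0
    field_sets = [set(record) for record in records]
    all_fields = set().union(*field_sets)
    common_fields = set.intersection(*field_sets)
    return len(common_fields) / len(all_fields) if all_fields else 0.0


# Tabular: every order carries the same fixed fields.
orders = [
    {"order_id": 1, "customer": "A", "total": 20.0},
    {"order_id": 2, "customer": "B", "total": 35.5},
]

# Object-structured: each device has a unique key plus free-form properties.
devices = [
    {"device_id": "d1", "firmware": "1.2", "sensors": ["temp"]},
    {"device_id": "d2", "location": "roof"},
]

print(shared_field_ratio(orders))   # 1.0
print(shared_field_ratio(devices))  # 0.25
```

In the second data set only the key is common to both objects, which is the pattern that makes a fixed-table design awkward.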
Hadoop and NoSQL work with non-structured data
Both the Hadoop and NoSQL options are easier to adapt to non-structured data. Hadoop is ideal for applications where unstructured data is stored in clusters distributed on a network. NoSQL can also be adapted to this cluster distribution, so in most cases the "object-structured" question will help determine the best choice between Hadoop and NoSQL models. If the database stores information about specific, identified things in the form of "thing-properties," then NoSQL is likely best. Data that has no natural structure, like free-form text, is likely better stored using Hadoop.
Note that, generally, you can query SQL, NoSQL and Hadoop databases using SQL. The latter two may require an overlay product, and the lack of natural tabular organization may render query processing more time-consuming. If you expect most database activity to be in SQL form, you probably have tabular data and should be considering an SQL/RDBMS model.
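The rules of thumb above can be summarized as a small decision function. This is only a sketch of the article's heuristics, not a substitute for design work; the `Model` enum, the `suggest_model` function and its three boolean parameters are names invented here for illustration.

```python
from enum import Enum


class Model(Enum):
    SQL_RDBMS = "SQL/RDBMS"
    NOSQL = "NoSQL"
    HADOOP = "Hadoop"


def suggest_model(tabular: bool, keyed_entities: bool,
                  mostly_sql_queries: bool = False) -> Model:
    """Encode the rules of thumb (hypothetical helper).

    tabular:            data fits tables with fixed fields and useful
                        field-to-field relationships
    keyed_entities:     data describes identified things as "thing-properties"
    mostly_sql_queries: most database activity is expected to be SQL
    """
    # Assume SQL/RDBMS unless the data clearly is not tabular.
    if tabular or mostly_sql_queries:
        return Model.SQL_RDBMS
    # Identified objects with free-form properties point to NoSQL.
    if keyed_entities:
        return Model.NOSQL
    # Data with no natural structure, such as free-form text, points to Hadoop.
    return Model.HADOOP


print(suggest_model(tabular=False, keyed_entities=True).value)
```

Treat the result as a starting point for evaluation; real workloads often mix these characteristics.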
The second point to consider is whether to use a database package from a cloud provider or to host your own database in the cloud. Amazon, Rackspace, Microsoft and Google are well-known providers of cloud big data services. Joyent and Qubole are examples of lesser-known providers with strong big data credentials. Hadoop is usually available from major cloud providers, and other database models may also be supported as cloud services.
Taking a do-it-yourself approach
If you can't get a public cloud big data service, you still have the option of hosting your own big data application in the cloud, running big data software you've selected on public cloud infrastructure as a service (IaaS) or platform as a service (PaaS).
A do-it-yourself approach can offer advantages even when you can get big data services. First, it widens your options for cloud hosting because not all cloud providers will support big data as a service. Second, you can use multiple public clouds or switch between cloud providers with greater ease. Third, often you can create hybrid big data applications more easily if you can adopt the same big data software in the cloud and on-premises. The disadvantage, according to cloud buyers, is that creating in-cloud big data with your own platform tools is more complicated and sometimes more costly.
The best platforms for cloud big data depend on your database model. Top-rated Hadoop options include Apache Hadoop, SAP's HANA/Hadoop combination, Hortonworks, Hadapt and VMware's Cloud Foundry, as well as services provided by IBM, Microsoft and Oracle. For NoSQL, consider Cassandra, HBase or MongoDB. IBM also offers NoSQL for the cloud, and there are plenty of other NoSQL providers. With those other providers, check that they support the level of big data cloud scaling you expect.
SQL big data in the cloud is probably most often supported by extending your on-premises SQL vendor offering. IBM, Oracle and Microsoft all offer SQL that's suitable, with some tuning, for big data cloud deployment. HP's Haven is a general big data architecture for the cloud that embraces both structured and unstructured data and that supports SQL queries.
Understanding your needs is critical
To choose the right cloud big data platform, you need to understand your requirements well enough to evaluate how each platform supports them. Big data applications can be balanced between query and update, mostly transactional or mostly analytic. You may need to run tests to determine whether a given big data option is efficient for your specific mix of update and access. Be particularly careful about SQL queries against non-SQL databases. Analytics that require extensive use of SQL can create major performance issues even with a traditional RDBMS, and more so with other database models. Creative database design and careful use of JOINs may make things more efficient.
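A test of the update/access mix can start as a small timing harness. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; it is not a big data platform, and the `run_mix` function, the `items` table and all parameter values are assumptions made for illustration. The same loop would need to be pointed at the candidate platform's own client library to produce meaningful numbers.

```python
import random
import sqlite3
import time


def run_mix(n_rows=1000, n_ops=5000, update_fraction=0.2, seed=0):
    """Time a random mix of single-row updates and reads (hypothetical harness)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany(
        "INSERT INTO items (id, value) VALUES (?, ?)",
        [(i, 0) for i in range(n_rows)],
    )
    conn.commit()

    updates = reads = 0
    start = time.perf_counter()
    for _ in range(n_ops):
        row_id = rng.randrange(n_rows)
        if rng.random() < update_fraction:
            # Update path: increment one row's counter.
            conn.execute(
                "UPDATE items SET value = value + 1 WHERE id = ?", (row_id,)
            )
            updates += 1
        else:
            # Access path: fetch one row by key.
            conn.execute(
                "SELECT value FROM items WHERE id = ?", (row_id,)
            ).fetchone()
            reads += 1
    conn.commit()
    elapsed = time.perf_counter() - start

    # Sanity check: the sum of counters should equal the number of updates.
    applied = conn.execute("SELECT SUM(value) FROM items").fetchone()[0]
    conn.close()
    return {"updates": updates, "reads": reads,
            "seconds": elapsed, "applied_updates": applied}


result = run_mix(update_fraction=0.2)
print(result)
```

Running the harness at several values of `update_fraction` gives a first impression of how a store responds as the workload shifts from mostly analytic to mostly transactional.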
You also should ensure that distributed big data clusters can be accessed efficiently for combined queries. This can be complicated with cloud-hosted data because users have only limited control over how the data is distributed. A combination of testing to determine optimum data distribution strategies and a contract to ensure data stays at least generally within those distribution guidelines is critical.
Cloud big data hosting involves many variables, so be prepared to gather a lot of operating data on quality of experience to ensure that workers are getting what they need and that costs are managed. Otherwise you'll end up with something too costly to fix and too slow to accept.