Established vendors and startups alike have spearheaded advanced technologies for managing the petabytes of data that have sprung from social computing and data analysis applications, commonly called big data. One vexing problem for enterprise IT management is learning to navigate a growing taxonomy of cloud data stores.
The NoSQL category gained initial developer mindshare with the first no:sql(east) conference in October 2009. The category might be better named no:rdm or no:rel for "no (to the) relational data model." The NoSQL moniker today applies primarily to open source approaches to high-availability, massively scalable and fully durable data stores that don't involve the traditional relational model and Structured Query Language. NoSQL fundamentalists, such as Heroku's Adam Wiggins, contend that "SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale."
Virtually all discussions of relational database scalability involve interpretation of Brewer's Theorem, first expounded in Eric Brewer's "Towards Robust Distributed Systems" keynote speech of July 19, 2000 at the ACM Symposium on Principles of Distributed Computing (PODC). Brewer posited that there are three desirable features of distributed systems, such as Web services, which include databases:
- Consistency represents the C in the ACID test of "atomicity, consistency, isolation and durability" for data store transactions.
- Availability means that every query and update receives a timely response, even when some nodes have failed.
- Partition tolerance refers to the ability of the system to keep operating despite lost network links between nodes or failures of individual sub-components such as servers and disk drives.
Brewer's Theorem, which was proven formally in 2002, states that it is impossible to achieve all three of these features simultaneously in the asynchronous network model typified by stateless Web services. NoSQL proponents most commonly give up strict consistency and settle for the "eventual consistency" exhibited by Amazon.com's Dynamo data store, in which replicas may briefly disagree but converge once updates have propagated.
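To make the trade-off concrete, here is a minimal Python sketch of eventual consistency. It is a toy model under stated assumptions, not Dynamo's actual design: the `Replica` class, the integer version numbers, the last-write-wins rule and the `anti_entropy` function are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node's local copy of the data: key -> (version, value)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, key, value, version):
        # Accept the write only if it is newer than what this node holds
        # (a simple last-write-wins rule, assumed for this sketch).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Sync replicas so each ends up with the newest version of every key."""
    newest = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in newest.items():
            replica.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)
b.write("cart", ["book"], version=1)

# During a partition only replica "a" receives the update, so both nodes
# stay available but briefly disagree.
a.write("cart", ["book", "pen"], version=2)
stale = b.read("cart")

# After the partition heals, a background sync brings the replicas together.
anti_entropy([a, b])
converged = b.read("cart")
```

In this sketch `stale` still holds the old cart while `converged` holds the updated one, which is the window of inconsistency that an eventually consistent store accepts in exchange for availability.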
The following four classifications define most NoSQL data store genres:
- Key-Value/Tuple Stores include Windows Azure table storage, Amazon Dynamo, Dynomite, Project Voldemort, Membase, Riak, Redis, BerkeleyDB and MemcacheDB.
- Wide-Column/Column Family Stores include Apache Hadoop/HBase, Apache Cassandra, Amazon SimpleDB, Hypertable, Cloudata and Cloudera's Hadoop distribution.
- Document Stores include CouchDB, MongoDB and RavenDB.
- Graph Databases emphasize relationships between entities, which are difficult to model with relational databases. Neo4j, Dryad, FlockDB, HyperGraphDB, AllegroGraph and Sones are among the more interesting examples.
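The four genres differ mainly in the shape of the data they hold. The following Python sketch models each shape with plain dictionaries and lists. It is an illustration only: the sample records, the `FOLLOWS` edge type and the `followers_of` helper are hypothetical, and none of this reflects the API of any product named above.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": b'{"name": "Ada"}'}

# Wide-column store: row key -> column family -> column -> value.
wide_column = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2011-05-01"},
    }
}

# Document store: self-describing nested documents, queryable by field.
documents = [
    {"_id": 42, "name": "Ada", "orders": [{"sku": "A1", "qty": 2}]},
    {"_id": 43, "name": "Grace", "orders": []},
]

# Graph database: nodes plus typed edges between them.
nodes = {42: {"name": "Ada"}, 43: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 43)]


def followers_of(node_id):
    """Return names of nodes with a FOLLOWS edge pointing at node_id."""
    return [nodes[src]["name"] for src, rel, dst in edges
            if rel == "FOLLOWS" and dst == node_id]


# A document store can filter on fields inside the document itself.
with_orders = [doc["name"] for doc in documents if doc["orders"]]
```

The key-value form can only be fetched by key, while the other three expose progressively more internal structure to queries, which is roughly the distinction the classifications draw.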
Enterprise-level IT managers commonly rank the viability of NoSQL databases by the breadth of their commercial usage, by their sponsorship by high-visibility Web properties or organizations, or both. Examples include Amazon Web Services (SimpleDB and Amazon RDS), the Apache Software Foundation (Hadoop, CouchDB and Cassandra), LinkedIn (Project Voldemort), Microsoft (Windows Azure, SQL Azure and Dryad), Twitter (Hadoop, Cassandra, Redis and FlockDB) and Yahoo! (Hadoop/Pig). Hadoop/HBase and the MapReduce programming model are probably the most widely used NoSQL members as of mid-2011; Yahoo! is reportedly planning to spin off its seasoned Hadoop development team into a standalone commercial venture.
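For readers who haven't met the MapReduce programming model, the classic word-count example can be sketched in a few lines of single-process Python. This is not Hadoop code; the `map_phase`, `shuffle` and `reduce_phase` function names and the sample documents are assumptions made for illustration, and a real cluster would run the phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data stores scale out"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be spread over as many nodes as there are input records or keys, which is what makes the model attractive for very large data sets.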
Dryad's place in the cloud data store world
The Dryad distributed graph database has been under development by Microsoft Research for the past six years or so. According to Microsoft Research, Dryad "provides a general, flexible execution layer" that uses a "dataflow graph as the computation model. It completely subsumes other computation frameworks, such as Google's map-reduce or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting."
DryadLINQ is "a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ)."
In May 2011, blogger Mary Jo Foley reported that "HPC 2008 R2 SP2 is the slated delivery vehicle for Dryad, Microsoft's closest competitor to Google MapReduce and Apache Hadoop."
SQL and NoSQL: The same?
Erik Meijer, the "Father of LINQ," and Gavin Bierman asserted that "contrary to popular belief, SQL and NoSQL are really just two sides of the same coin" in the introduction to their "A Co-Relational Model of Data for Large Shared Data Banks" article for the Association for Computing Machinery's (ACM) Queue journal. The authors conducted a mathematical analysis, using DryadLINQ as an example, and:
… developed a mathematical data model for the most common form of NoSQL -- namely, key-value stores as the mathematical dual of SQL's foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of NoSQL to CoSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and CoSQL and that many of the strengths and weaknesses of SQL and CoSQL naturally follow from the mathematics. In contrast to common belief, the question of big versus small data is orthogonal to the question of SQL versus CoSQL. While the CoSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for "small" data as well. On the other hand, it is possible to scale SQL databases by careful partitioning.
The article's final determination was that "CoSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony." The authors also noted that, "because of the common query language based on monads, both can be implemented using the same principles."
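The duality the authors describe can be hinted at with a short Python sketch, using list comprehensions as a rough stand-in for LINQ. The product and rating records below are made up for illustration and are not taken from the article. In the relational shape a child row points at its parent through a foreign key; in the key-value shape the parent holds keys that point at its children.

```python
# Relational shape: child rows point at their parent through a foreign key.
products = [{"id": 1, "title": "Example Book"}]
ratings = [
    {"product_id": 1, "stars": 4},
    {"product_id": 1, "stars": 5},
]

# Key-value shape: the parent holds keys that point at its children.
store = {
    "p1": {"title": "Example Book", "ratings": ["r1", "r2"]},
    "r1": {"stars": 4},
    "r2": {"stars": 5},
}

# The same question ("which stars did each title get?") written as a
# comprehension over each shape. Only the pointer direction differs.
relational = [
    (p["title"], r["stars"])
    for p in products
    for r in ratings
    if r["product_id"] == p["id"]
]

key_value = [
    (item["title"], store[key]["stars"])
    for item in store.values()
    if "ratings" in item
    for key in item["ratings"]
]
```

Both comprehensions produce the same title and rating pairs, which is the informal sense in which one query mechanism can serve both models.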
NOSQL, a recent term said by less parochial NoSQLers to be an all-encompassing abbreviation for "not only SQL," hasn't gained much mindshare among enterprise IT managers and Web developers, probably because it's excessively generic. AnySQL has encountered a similar lack of interest, probably for the same reason.
A look at NewSQL databases
Analyst firm The 451 Group recently coined the term NewSQL and described it as "shorthand for the various new scalable/high performance SQL database vendors. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL."
The 451 Group includes Akiban, Clustrix, Drizzle, CodeFutures, GenieDB, MySQL Cluster with NDB, MySQL with HandlerSocket, NimbusDB, RethinkDB, ScalArc, ScaleBase, ScaleDB, Schooner, Translattice and VoltDB in its NewSQL category. It's obvious that the market for NewSQL databases won't support this many vendors in the foreseeable future, but picking winners isn't easy.
NimbusDB, which previewed at Silicon Valley's recent Under the Radar conference and was the audience's choice in the Scalability category, describes its NewSQL database as "a SQL database with 100% ACID semantics. Unlike existing SQL databases NimbusDB delivers the key requirements for cloud-style environments, including dynamically adding or deleting nodes from a live system."
NimbusDB's founders have enterprise-grade credentials. CEO Barry Morris was CEO of IONA Technologies, and CTO Jim Starkey founded Interbase Software, which was acquired by Ashton-Tate, which Borland International later acquired. Interbase is the foundation for the Firebird open-source database. Starkey later founded Netfrastructure, Inc. and sold it to MySQL AB, where it became the kernel for MySQL's Falcon storage engine. According to Morris, NimbusDB will enter the NewSQL market "imminently." If NimbusDB lives up to its pedigree, it should be a NewSQL "keeper."
And when it comes to categorizing databases as "NewSQL as a Service," The 451 Group includes Amazon Relational Database Service (RDS), Salesforce.com's Database.com, FathomDB, Microsoft SQL Azure and Xeround. Amazon RDS currently offers a MySQL implementation but plans to add Oracle Database 11g in the second quarter of 2011. Both Amazon RDS and SQL Azure use replication to maintain data availability, and Amazon and Microsoft tout developer familiarity with MySQL or SQL Server as a major time and cost saver. SQL Azure has a fixed database size limit of 50 GB; sharding with SQL Azure Federations, which entered private Community Technology Preview in mid-May 2011, is expected to eliminate the size restriction.
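Sharding gets around a per-database size cap by spreading rows across several databases according to a key. The Python sketch below shows one generic way to do that, with a stable hash of the key. It makes no claim about how SQL Azure Federations routes rows; the shard names, the `shard_for` helper and the customer IDs are hypothetical.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical names


def shard_for(key, shards=SHARDS):
    """Map a key to one shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# Route a few hypothetical customer IDs to their shards.
placement = {}
for customer_id in ("cust-1001", "cust-1002", "cust-1003"):
    placement.setdefault(shard_for(customer_id), []).append(customer_id)
```

The same key always lands on the same shard, so reads and writes for one customer stay in one database. The drawback of simple modulo hashing is that changing the number of shards remaps most keys, which is one reason production systems tend to prefer range-based or consistent-hashing schemes.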
Choosing the optimum infrastructure for your organization's big data isn't a walk in the park. The most important decision is the choice between the NoSQL and NewSQL approaches, which often is dictated by your data's schema (or lack thereof). For instance, if mapping entity hierarchies is a critical factor, graph databases like Neo4j or Dryad are a logical choice. On the other hand, if transactions are necessary to assure data consistency, consider SQL Azure.
While taking the critical step of budgeting the necessary resources for a thorough investigation and full-scale trial installation will help matters, there's no doubt that choosing your big data infrastructure will be a big headache.
About the author:
Roger Jennings is a data-oriented .NET developer and writer, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.