What is driving the advent of new, non-relational databases? Is it the staggering amount of data we have these days?
Guy Harrison: It's the whole argument that it's easier to leave the data where it is, on commodity hardware, and then have an engine that can churn through it with brute force, rather than pay the costs of bringing it into some structure and locking it down before you can use it. I think if you have a look at where the data warehousing market is going now that Hadoop has taken hold, it's almost on a collision course.
Oracle is working on big data, massively parallel processing, optimized I/O, very fast channels between the CPU and the I/O, and so forth. It's trying to get to that same processing path where it can churn through lots of data within secure file formats, meaning you won't need to treat the data within the database. You will be able to point the database at it and tell it what the structure is, but it will still be stored exactly where it was in the first place.
That's not a long way from what Hadoop does now. That sort of convergence is visible now, although there's still a gap between where the two technologies are, but they will become almost the same thing.
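As an illustration of "point the engine at the data and tell it what the structure is," here is a minimal Python sketch. It is not how Hadoop or Oracle implement this. The sample log lines, the SCHEMA list, and the helper names read_with_schema and count_by are all hypothetical, chosen only to show structure being applied at read time while the data stays in its raw form.

```python
import csv
import io

# Hypothetical raw data, left in its original delimited form.
RAW_LOG = """2011-03-01,GET,/index.html,200
2011-03-01,GET,/missing,404
2011-03-02,POST,/login,200
"""

# The structure is declared at query time, not at load time (assumed layout).
SCHEMA = [("day", str), ("method", str), ("path", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per line, applying the declared schema while reading."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def count_by(rows, column):
    """Brute-force aggregation: one full pass over every row."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


status_counts = count_by(read_with_schema(RAW_LOG, SCHEMA), "status")
print(status_counts)
```

Nothing is loaded or locked down in advance. Changing SCHEMA changes how the same raw lines are interpreted on the next pass.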
What are some current drawbacks in the NoSQL databases? Some proponents would have you believe they can be a solution across the board.
GH: They are very specific to task in many respects, and they support very large implementations. I'm not trying to belittle them in any way, but there are gaps all over the place where things that clearly need to be done have not yet been done, and that's just a reflection of the level of maturity.
I think anyone who's been in the industry for a while, especially from an enterprise point of view, will understand that the data isn't in the database just for the sake of the application. It's not there just to be inserted, and updated, and deleted. It's also there to be mined and analyzed and used for decision-making and trending. And the more data there is, the more granular data I've got, the more fine-grained decision making I can have.
Most of the NoSQL databases offer no solutions for that whatsoever.
What is a practical example of the differences between relational and non-relational databases?
GH: So, you take something like HBase, where there's no real range consistency, and you want something that does SELECT COUNT(*) on your HBase table. So you start counting rows, basically, by reading all the labels. There's no API that tells you how many rows. The only way to find out is to read them all.
If someone's inserting at the same time, there's no real guarantee you'll ever catch up with the person writing, so you literally have a situation where COUNT(*) might go on forever. So, we have to come up with some way of either timing out and saying, "It's impossible, the query has gone on too long," or trying to estimate the size. That's a simple example of where a relational query that would work fine in any relational database is not supported naturally in the non-relational databases.
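The timing-out approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HBase API or Quest's implementation: growing_table is a hypothetical stand-in for a scan that never finishes because a writer keeps appending rows, and count_rows is a hypothetical helper that counts by reading every row until a deadline passes.

```python
import itertools
import time


def growing_table():
    """Hypothetical scan that never ends: a writer keeps adding rows."""
    for row_id in itertools.count():
        yield row_id


def count_rows(scan, timeout_seconds):
    """Count rows by reading them all, giving up once the deadline passes.

    Returns (count, complete). When complete is False the scan was cut
    short, so count is only a lower bound on the real number of rows.
    """
    deadline = time.monotonic() + timeout_seconds
    count = 0
    for _ in scan:
        # Another row arrived after the deadline: stop and report a partial count.
        if time.monotonic() >= deadline:
            return count, False
        count += 1
    return count, True


# A finite scan completes normally.
print(count_rows(iter(range(5)), timeout_seconds=5.0))

# A scan that keeps growing is abandoned when the time budget runs out.
partial_count, complete = count_rows(growing_table(), timeout_seconds=0.05)
print(partial_count, complete)
```

The caller has to decide what a partial count means: report an error ("the query has gone on too long") or treat it as a rough lower-bound estimate.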
So what are they being used for?
GH: Have a look at the landscape at the moment and how popular Hadoop is inside the same organizations that are adopting straight NoSQL, and then how popular Hive is, which puts a SQL-like interface on top of Hadoop, and you can see two types of NoSQL uses. One type is being used for high speed OLTP, infinitely scalable and relatively simplistic data models, and the other, which is offered in Hadoop, is being used like a data warehouse with a SQL layer.
Why are enterprises interested in this? Why not throw more Oracle at it?
GH: We've seen the size of the largest enterprise databases growing steadily and exponentially, and data warehouse technology, by and large until relatively recently, kept up with that. The exponential growth has just outstripped what can be done even by the largest databases now. Oracle and Teradata are struggling, but Hadoop's come along and provided an alternative that's more economical.
Right at this second, there's not a lot of our customers who are likely to adopt NoSQL, but there's a lot of people who will, over the next year or so, adopt Hadoop. The economics for processing large amounts of log data or creating massive data warehouses on Hadoop are cost-effective compared to Oracle's Exadata.
Cloud computing and these non-relational data processing models seem to go hand in hand; why is that?
GH: First, all of the cloud platforms have to provide some sort of elastic storage model otherwise their economic model is jeopardized. As I'm growing my application, I want to be able to scale my spend on the database as I go, and then I want the power to scale as I go, and that's really, really difficult with relational databases. In theory, Oracle's RAC can do it, but you can't shard MySQL automatically. People who have sharded it have to put in an enormous manual effort; they have to plan all sorts of stuff. It's not going to happen on the fly.
So Amazon provides SimpleDB and Microsoft provides Azure Table, which is a non-relational database. Google has something similar.
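To see why manually sharded relational databases do not scale "on the fly," consider the following Python sketch. It assumes the simplest possible scheme, hashing a key and taking it modulo the number of shards, which is only one of several ways people shard MySQL. The names shard_for and keys_that_move and the customer IDs are hypothetical.

```python
import zlib


def shard_for(key, shard_count):
    """Route a key to a shard by hashing it (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def keys_that_move(keys, old_count, new_count):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys if shard_for(k, old_count) != shard_for(k, new_count)]


# Hypothetical workload: 10,000 customers spread over 4 database servers.
customer_ids = [f"customer-{n}" for n in range(10_000)]

# Adding a fifth server changes the modulus, so many keys map to a new shard.
moved = keys_that_move(customer_ids, old_count=4, new_count=5)
print(f"{len(moved)} of {len(customer_ids)} keys must be migrated")
```

With a well-distributed hash, going from four shards to five under this scheme relocates roughly four in five keys, and every one of those rows has to be copied while the application keeps running. That is the kind of planning and manual effort the answer above refers to, and it is the work that elastic stores take on themselves.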
That's the provider end, but are businesses using them?
GH: Netflix is a big adopter of SimpleDB, but we're not really seeing a lot of poster children yet. If you're in the cloud, you're probably motivated more than usual to try NoSQL solutions; you probably want to scale up and down and pay for your resources that way, and relational databases don't help there.
That drives a whole host of NoSQL databases. Rackspace is working on [providing] Cassandra, and SimpleDB is there for the same reason. The other thing that's going on in the cloud is an effort to make Hadoop and other similar types of big-data analytics available in the cloud. Amazon offers a fairly mature implementation of Hadoop that you can pay for by the elapsed hour and scale up to hundreds of thousands of nodes to do short-term processing and shut the whole thing down again. It's not as efficient as running your own cluster, but it's a good solution if you want to transfer a lot of data in a very cost-effective way. There are many examples of companies doing that.
So is cloud the best idea for an enterprise wanting cheap data processing right now?
GH: The problem, of course, is how the data is getting to the cloud. If you've got terabytes or petabytes of data, getting into the cloud costs money and takes time. Amazon's solution of "you send us a hard drive and we'll mount it" is a bit clunky.
Also, data migrating out might be a little technically problematic for some enterprises, but actually placing data inside someone else's data center raises all sorts of issues around sovereignty; it may be illegal. In many cases, companies are required by law or company policy not to do it. I'll trust Amazon, for instance, for my purposes, but I don't know if I'd bet my business on it, that they're never going to lose my data and so forth.
Looking into the future, will relational databases cede their popularity to NoSQL?
GH: I'm certainly not thinking they're going away; relational databases will still be dominant. It's not a question of the relational database going away, but this total lock the relational database had on all data processing, that's what's loosening. However, for all of the interest, justifiable interest, we have in these new databases, we have precious few real-world examples outside of a few very unusual sites.
You've had Cassandra being used at Twitter and Facebook, but they're some of the biggest websites in the world and they're not typical of enterprise customers, or even websites in general. The relational database offers so much, and there are so many professionals who know how to use it, and it's reliable and can be secure, so it's probably going to be the best choice for 95% of data processing needs. But, when you push beyond that, you can imagine a world in which specialized NoSQL databases just become the best fit.
GUY HARRISON'S BIO:
Guy Harrison is a director of research and development at Quest Software, and has more than 20 years of experience in database design, development, administration, and optimization. Guy is an Oracle ACE, and is the author of the Oracle Performance Survival Guide (Prentice Hall, 2009) and MySQL Stored Procedure Programming (O'Reilly with Steven Feuerstein), as well as other books, articles and presentations on database technology. Guy is the architect of Quest's Spotlight family of diagnostic products, and has led the development of Quest's Toad for Cloud Databases.