Web commerce, social media, mobile communication and machine-to-machine traffic generates terabytes or petabytes
of data that enterprise IT departments must store and process. For users dealing with this data in cloud-oriented databases, learning sharding best practices is an important step in the cloud planning process.
Sharding is a process where tables are partitioned into manageable-sized disk files. Some highly elastic key-value data stores -- such as Amazon Simple DB, Google App Engine's data store or Windows Azure Tables -- and document databases like CouchDB, MongoDB or RavenDB can handle the big data in these tables with aplomb. MongoDB features built-in automatic sharding; RavenDB will gain this capability in the near future. Autosharding balances the size of shards automatically and eliminates the need for DevOps teams to oversee the process. Autosharding MongoDB databases isn't a walk in the park, as Todd Hoff reported in his October 2010 blog post on troubles with sharding.
Be prepared to shard relational databases that must handle big data in the cloud.
Almost all but a few key-value and document (called NoSQL) databases lack the transactional data consistency offered by traditional relational database management systems (RDBMSs). You can scale up RDBMSs (vertically) by throwing money into more memory, processors or both. You can fit about 256 GB of RAM into higher-end commodity-grade servers, but adding substantially more CPU cores isn't always practical. If your database is in the cloud, for example, you are limited to memory and process combinations determined by the cloud provider's price list.
Scaling out RDBMSs (horizontally) imposes a substantial technical challenge. Morgan Tocker wrote a detailed blog post on why you don't want to shard MySQL databases in August 2009. Tocker acknowledged that you'll probably need to shard database tables if you encounter either:
- Oversized working sets: Your working set, which usually consists of frequently accessed and updated data and indexes, won't fit in the RAM installable in an on-premise server, the amount your hardware budget will cover, or the total available from your cloud service provider. The solution is sharding.
- Excessive write frequency: Your database's I/O systems can't process the number of writes/second being requested of your on-premise or cloud-based server(s). The solution is splitting out read operations into read replicas, which might require sharding to divide the I/O load across more database servers.
Amazon Web Services' Relational Database Services for MySQL offers its largest High-Memory Quadruple Extra Large DB Instance -- which includes 68 GB of memory and 26 Elastic Compute Units (ECUs) comprising eight virtual cores with 3.25 ECUs each -- at $2.60 per hour (or $1,872 per month). According to AWS's pricing list, one ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. AWS RDS DB instances enable selecting from 5 GB to 1 TB of associated storage capacity for $0.10 per GB per month. Data transfer charges are $0.12/GB out, plus an I/O charge of $0.10 per one million requests. Amazon dropped data ingress charges and reduced data egress charges as of July 1.
You'll also need to budget extra storage for read replicas to achieve high availability with commodity servers. Fortunately, ScaleBase offers a third-party Database Load Balancer application to automate sharding and read/write splitting for MySQL running on Amazon EC2 and Amazon RDS.
Simon Munro described sharding issues with relational databases in general, and Microsoft's SQL Azure customized cloud implementation of SQL Server 2008 in particular, in his "trouble with sharding" blog post in August 2009. At that time, SQL Azure's maximum database size was 10 GB; it's now 50 GB.
Scott Guthrie, the new corporate vice president of Microsoft's Azure Application Platform team, stated in a June 8 keynote to the Norwegian Developer Conference (NDC) 2011 that:
…we also do autosharding as part of SQL Azure, which means that from a scale-out perspective, we can handle super-high loads, and we do all of that kind of load-balancing and scale-out work for you.
Today, SQL Azure supports up to 50 GB of relational storage for a database, but you can have any number of databases. In the future, you'll see us support hundreds of gigabytes and terabytes [that] you can take advantage of.
Autosharding with SQL Azure Federations is currently in the private Community Technical Preview (CTP) stage, and it's not evident from Guthrie's statement whether sharding will be required to "support hundreds of gigabytes and terabytes." SQL Azure Federations also promise to handle schema migrations gracefully. SQL Azure includes a primary and two secondary replicas for high availability at a basic pay-as-you-go price of $9.99 per GB per month from 1 to 5 GB (Web Edition) and $99.99 per 10GB from 10 to 50 GB per month (business databases). Fixed monthly prices don't include data transfer charges of $0.15/GB out in North American and European data centers; $0.20/GB out in Asia. Microsoft dropped data ingress charges July 1. Unlike Amazon RDS, you don't incur I/O charges for SQL Azure.
Microsoft doesn't divulge CPU and memory specifications for SQL Azure, but the company says they are commensurate to database size. You can follow SQL Azure Federations' progress toward commercial release in late 2011 at Cihan Biyikoglu's blog.
Google announced at its I/O 2011 conference in May that the commercial release of the now-beta Google App Engine -- scheduled for the latter part of this year -- will incorporate the RDBMS that was to be included with GAE for Business , but the company hadn't yet divulged feature or pricing details.
Unless Google pulls a high-scalability rabbit out of its hat its forthcoming RDBMS, be prepared to shard relational databases that must handle big data in the cloud.
ABOUT THE AUTHOR:
Roger Jennings is a data-oriented .NET developer and writer, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.