Microsoft's new Flat Network Storage architecture for its Windows Azure data centers debunks Hadoop's "move compute...
to the data" truism and enables using highly available and durable blob storage clusters for big data analytics.
Microsoft corporate vice president Scott Guthrie announced a major upgrade of its HDInsight Service Preview to the Hortonworks Hadoop Data Platform v1.1.0 in March. The updated preview's DevOps features make it easy for developers with a Windows Azure subscription to create high-performance HDInsight compute clusters with the Windows Azure Management Portal (see Figure 1).
The SQL Server group's new version incorporates the following open source Apache Hadoop components and a redistributable Microsoft Java Database Connectivity (JDBC) driver for SQL Server:
- Apache Hadoop, Version 1.0.3
- Apache Hive, Version 0.9.0
- Apache Pig, Version 0.9.3
- Apache Sqoop, Version 1.4.3
- Apache Oozie, Version 3.2.0
- Apache H.Catalog, Version 0.4.1
- Apache Templeton, Version 0.1.4
- SQL Server JDBC Driver, Version 3.0
A cluster consists of an extra-large head node costing $0.48 per hour, and one or more large compute nodes at $0.24 per hour. Therefore, a small-scale cluster with four compute nodes will set users back $1.44 per hour deployed, or about $1,000 per month. (Microsoft bases its charges on a list price of $2.88 per hour but discounts the time to 50% of actual clock hours; learn more about HDInsight Preview pricing here.) The first Hadoop on Windows Azure preview provided a prebuilt, renewable three-node cluster with a 24-hour lifetime; a later update increased the lifetime to five days but disabled renewals. Data storage and egress bandwidth charges aren't discounted in the latest preview, but they're competitive with Amazon Web Services' Simple Storage Service (S3) storage. The discounted Azure services don't offer a service-level agreement.
Moving HDFS storage from the local file system to Windows Azure blobs
One of Hadoop's fundamental DevOps precepts is to "move compute to the data," which ordinarily requires hosting Hadoop Distributed File System (HDFS) data storage files and compute operations in the operating system's file system. Windows Azure-oriented developers are accustomed to working with Azure blob storage, which provides high availability by replicating all stored objects three times. Durability is enhanced and disaster recovery is enabled by geo-replicating the triplicate copy to a Windows Azure data center in the same region after geo-locating the initial and duplicate copies more than 100 miles from the center. For example, an Azure blob store created in Dublin, the West Europe subregion, is auto-replicated to Amsterdam, the North Europe subregion. HDFS doesn't provide such built-in availability and durability features.
HDFS running on the local file system delivered better performance than Azure blobs for MapReduce tasks in the HDInsight service's first-generation network architecture by residing in the same file system as the MapReduce compute executables. Windows Azure storage was hobbled by an early decision to separate virtual machines (VMs) for computation from those for storage to improve isolation for multiple tenants.
Brad Calder of the Windows Azure Storage team described Flat Network Storage and second-generation storage hardware in a November 2012 blog post, Windows Azure's Flat Network Storage and 2012 Scalability Targets. He compared first- and second-generation storage hardware as follows:
|Hardware generation||Storage node network speed||Networking between compute and storage||Load balancer||Storage device for journaling|
|First||1 Gbps||Hierarchical network||Hardware||Hard drives|
|Second||10 Gbps||Flat network||Software||Solid-state drives|
According to Calder, second-generation Quantum 10, or Q10, storage "provides a fully nonblocking, 10-Gbps-based, fully meshed network, providing an aggregate backplane in excess of 50 Tbps of bandwidth for each Windows Azure datacenter." He claimed to have implemented storage account scalability targets that would achieve the following goals by the end of 2012:
- Up to 200 TB of capacity
- A transaction rate of up to 20,000 entities/messages/blobs per second
- Bandwidth for a geo-redundant storage account
- Ingress: Up to 5 Gbps
- Egress: Up to 10 Gbps
- Bandwidth for a locally redundant storage account
- Ingress: Up to 10 Gbps
- Egress: up to 15 Gbps
Storage accounts have geo-replication on by default to provide geo-redundant storage. End users can turn geo-replication off and use locally redundant storage, which results in reduced prices relative to geo-redundant storage and higher ingress and egress targets.
Denny Lee, a member of the SQL Server team's business intelligence group, and Brad Sarsfield, a principal developer in the Windows Azure group, discussed the performance of blob storage with HDInsight on Azure. In short, they found these key points:
- Azure blob storage provides near-identical HDFS access characteristics for reading -- performance and task splitting -- into map tasks.
- Azure blob provides faster write access for Hadoop HDFS, allowing jobs to complete faster when writing data to disk from reduce tasks.
Lee also summarized Nasuni's The State of Cloud Storage 2013 Industry Report with respect to comparisons of Azure blob storage and Amazon Simple Storage Service (S3) performance:
- Speed: Azure was 56% faster than No. 2 Amazon S3 in write speed, and 39% faster at reading files than the No. 2 HP Cloud Object Storage in read speed.
- Availability: Azure's average response time was 25% faster than Amazon S3, which had the second-fastest average time.
- Scalability: Amazon S3 varied only 0.6% from its average in the scaling tests, with Microsoft Windows Azure varying 1.9% -- both very acceptable levels of variance. HP and Rackspace, the two OpenStack-based clouds, showed a variance of 23.5% and 26.1%, respectively, with performance becoming more and more unpredictable as object counts increased.
My blog post, Using Data from Windows Azure Blobs with Apache Hadoop on Windows Azure CTP, explains how to import *.csv files into what HDInsight calls an Azure Storage Vault (ASV).
HDInsight Service's dashboard and sample gallery
The interactive console's Hive window lets developers define structured Hive tables based on Azure blob data (Figure 3). Data specialists query Hive tables with an SQL-like language called HiveQL. An Open Database Connectivity (ODBC) Hive driver lets business intelligence analysts use Microsoft Excel to visualize the results of HiveQL queries. For additional details, refer to my tutorial's sections about setting up the ODBC data source in Excel and executing HiveQL queries from Excel.
The latest HDInsight Service Preview for Azure takes full advantage of the second-generation hardware for flat networking in Microsoft data centers. ASV users gain the availability and durability benefits of Windows Azure blob storage for HDFS clusters without the previous performance penalty. Developers must be vigilant in deleting unneeded clusters to avoid substantial billings for unused compute and, to a lesser degree, ASV storage.
About the author
Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, principal consultant at OakLeaf Systems, and curator of the OakLeaf Systems blog. He's also the author of more than 30 books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. More than 1.25 million English copies of his books are in print, and they have been translated into more than 20 languages.