
Microsoft, OpenStack compete with Amazon EMR in Hadoop as a Service space

Hadoop remains an effective data analysis tool, but also comes with a bit of overhead. Here are the Hadoop as a Service tools available to you.

Hadoop is an increasingly popular tool for data analysis, but the system administration overhead of managing a Hadoop cluster can be daunting. To help companies use Hadoop without that overhead, Amazon pioneered Hadoop as a Service with Elastic MapReduce (EMR). EMR is a good option for Amazon Web Services customers, but users of Windows Azure and OpenStack have similar options, each with its own benefits and drawbacks.

HDInsight gives Windows Azure users access to Hadoop through both Microsoft and Apache tools. OpenStack's Savanna project offers an option for open source shops, but it is still an ongoing development project, so users should not expect a complete turnkey solution yet.

Windows Azure HDInsight

Microsoft teamed up with Hortonworks Inc., an enterprise developer of Hadoop, to offer access to the Hortonworks Data Platform (HDP) for Windows Azure users. HDP enables users to deploy Hadoop clusters on both Windows and Linux servers. While a choice in underlying operating systems is important to system administrators, developers may be more interested in the tools included with HDP.

The combination of Hadoop as a Service with a mix of Microsoft and Apache tools will smooth the integration of Hadoop into an existing data management infrastructure.

One of Microsoft's strengths as a software company is integration across its products, and Windows Azure's implementation of Apache Hadoop, known as HDInsight, is no exception. System administrators can manage Hadoop jobs with PowerShell and .NET, and HDInsight lets users apply Microsoft BI tools in Excel, such as PowerPivot, Power View and Power Query.

In addition to Microsoft tools and applications, HDInsight includes a number of Apache project tools for data management and analysis. Pig is a high-level data analysis language that can be used instead of writing MapReduce code, which is particularly valuable for analysts who would rather not program in Java. Hive, another Apache project, is a data warehouse system for managing large data sets and querying them with a SQL-like language called HiveQL. Apache Sqoop rounds out the set with bulk data transfers between Hadoop and relational databases.
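As a rough illustration of what Pig and Hive abstract away, here is a minimal pure-Python sketch of the map, shuffle and reduce phases behind a word count. This is not Hadoop code, and the function names are ours; in Pig, the same logic is a few lines of Pig Latin rather than a hand-written MapReduce program.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Group values by key, as Hadoop's shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop as a Service", "Hadoop in the cloud"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

The point of Pig and Hive is that analysts express only the grouping and aggregation; the framework handles the mapper, shuffle and reducer plumbing shown above.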


Hadoop users have the choice of using either HDFS or Windows Azure Blob storage. This is similar to using either HDFS or Amazon Simple Storage Service (S3) with Amazon EMR. HDFS is Hadoop's native file system, but because HDInsight clusters are not persistent, data in HDFS must be copied to Blob storage or other persistent storage to preserve it for later Hadoop jobs.

Pricing for HDInsight is based on the number of servers used in the cluster and the type of payment plan. All HDInsight clusters include a head node, a secure gateway node and one or more compute nodes. Under the pay-as-you-go plan, head nodes are billed at $0.64 per hour and compute nodes are $0.32 per hour for a large (S3) instance. There is no charge for the secure gateway node in either plan. Under the six-month and yearly plans, the cost of head nodes ranges from $0.44 to $0.51 per hour, and compute nodes range from $0.22 to $0.26 per hour. The exact price is determined by other factors, such as commitment duration and whether the customer pre-pays or pays monthly.
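Assuming the pay-as-you-go rates above, a back-of-the-envelope cost estimate can be sketched in Python. The helper function and cluster size here are illustrative, not an Azure billing API.

```python
# Pay-as-you-go rates quoted in the article (USD per hour).
HEAD_NODE_RATE = 0.64
COMPUTE_NODE_RATE = 0.32
# The secure gateway node is free of charge, so it adds nothing.

def cluster_cost(compute_nodes, hours=1):
    """Estimated cost: one head node plus N large compute nodes."""
    return (HEAD_NODE_RATE + compute_nodes * COMPUTE_NODE_RATE) * hours

# A cluster with four compute nodes running for 24 hours:
print(round(cluster_cost(4, hours=24), 2))  # 46.08
```

For the six-month and yearly plans, the same arithmetic applies with the lower per-hour rates quoted above.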

OpenStack Savanna project

OpenStack is an open source cloud computing system used for both private clouds and public Infrastructure as a Service implementations, such as Rackspace. Like other Hadoop as a Service offerings, the goal of the Savanna project is to streamline the deployment of Hadoop clusters in a cloud. Savanna is a modular component designed to work within the OpenStack environment and integrates with key OpenStack components, including Horizon for management, Keystone for user authentication, Nova for virtual machine provisioning, Glance for image storage and Swift for data storage. Savanna also supports integration with other vendors' tools, such as Cloudera Manager Admin Console.

Although Amazon EMR and Windows Azure HDInsight users can start Hadoop clusters fairly easily, Savanna users should expect to work with system administrators familiar with Hadoop configuration, which can offset some of the benefits of a managed service. Savanna uses templates to specify server configuration, file system parameters and Hadoop distribution-specific parameters.
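To give a sense of the template approach, here is a sketch, as Python dictionaries, of the kind of parameters a Savanna node group and cluster template capture. The field names are illustrative assumptions, not Savanna's exact schema.

```python
# Illustrative only: the shape of a Savanna-style node group template.
# Field names are assumptions, not the project's exact API schema.
node_group_template = {
    "name": "worker-group",
    "plugin_name": "vanilla",   # which Hadoop distribution plugin to use
    "hadoop_version": "1.2.1",
    "flavor_id": "m1.medium",   # Nova server flavor for each node
    "node_processes": ["tasktracker", "datanode"],
}

# A cluster template then composes node groups with counts.
cluster_template = {
    "name": "demo-cluster",
    "plugin_name": "vanilla",
    "hadoop_version": "1.2.1",
    "node_groups": [
        {"name": "master", "count": 1},
        {"name": "worker-group", "count": 3},
    ],
}

total_nodes = sum(g["count"] for g in cluster_template["node_groups"])
print(total_nodes)  # 4
```

Once defined, a template like this can be reused to provision identically configured clusters, which is where the deployment streamlining comes from.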

Savanna is under active development; version 0.3 has recently been released. Currently available functionality includes basic cluster provisioning, templates for cluster configurations, management application programming interfaces and support for ad hoc queries using Pig and Hive. Support for Hadoop version 2 is planned for the second quarter of 2014.

About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence, and worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.


1 comment

I'd agree that Azure and Google's Big Data offerings compete with AWS EMR, but OpenStack is just some software that you could install in your data center. It requires you to have all the skills to manage a Hadoop environment, which is complex. The cloud offerings remove much of that operational complexity and they give you the ability to pay for the underlying resources on-demand, which doesn't happen in your data center.