Microsoft, OpenStack compete with Amazon EMR in Hadoop as a Service space

Hadoop remains an effective data analysis tool, but also comes with a bit of overhead. Here are the Hadoop as a Service tools available to you.

Hadoop is an increasingly popular tool for data analysis, but the system administration overhead of managing a Hadoop cluster can be daunting. To assist companies looking to use Hadoop without this overhead, Amazon pioneered the deployment of Hadoop as a Service with Elastic MapReduce (EMR). Elastic MapReduce is a good option for Amazon Web Services customers, but users of Windows Azure and OpenStack have similar options, each with its own benefits and drawbacks.

HDInsight gives Windows Azure users access to Hadoop using both Microsoft and Apache tools. OpenStack's Savanna project offers an option for shops that use the open source platform, but it is an ongoing development project, so users should not expect a complete turnkey solution yet.

Windows Azure HDInsight

Microsoft teamed up with Hortonworks Inc., an enterprise developer of Hadoop, to offer access to the Hortonworks Data Platform (HDP) for Windows Azure users. HDP enables users to deploy Hadoop clusters on both Windows and Linux servers. While a choice in underlying operating systems is important to system administrators, developers may be more interested in the tools included with HDP.

One of Microsoft's strengths as a software development company is its ability to integrate its products. Windows Azure's implementation of Apache Hadoop, known as HDInsight, is no exception. System administrators can use PowerShell and .NET to manage Hadoop jobs. HDInsight also lets users work with Microsoft's Excel-based BI tools, such as PowerPivot, Power View and Power Query.

In addition to Microsoft tools and applications, HDInsight includes a number of Apache project tools to facilitate data management and analysis. Pig is a high-level data analysis language that can be used instead of writing MapReduce code, and it is particularly valuable for analysts who are not comfortable coding in Java. Hive, another Apache project, is a data warehouse system for managing large data sets and querying them with a SQL-like language called HiveQL. For those working with both Hadoop and relational databases, Apache Sqoop handles bulk data transfers between the two.
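To give a sense of what Pig and Hive abstract away, here is a minimal sketch of the MapReduce pattern applied to a word count. It is written in plain Python and runs in a single process, so it is only an illustration of the programming model, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop as a Service", "a Hadoop cluster"])
print(counts)
```

In a high-level language such as Pig or HiveQL, the same task is typically expressed as a grouping followed by a count, without hand-written map and reduce functions.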

The combination of Hadoop as a Service with a mix of Microsoft and Apache tools will smooth the integration of Hadoop into an existing data management infrastructure.

HDInsight users have the choice of using either HDFS or Windows Azure Blob storage. This is similar to using either HDFS or Amazon Simple Storage Service (S3) with Amazon EMR. HDFS is the native file system for Hadoop, but since HDInsight clusters are not persistent, data in HDFS must be copied to Blob storage or other persistent storage if it is to be available to later Hadoop jobs.

Pricing for HDInsight is based on the number of servers used in the cluster and the type of payment plan. All HDInsight clusters include a head node, a secure gateway node and one or more compute nodes. Under the pay-as-you-go plan, head nodes are billed at $0.64 per hour and compute nodes at $0.32 per hour for a large (A3) instance. There is no charge for the secure gateway node in either plan. Under the six-month and yearly plans, the cost of head nodes ranges from $0.44 to $0.51 per hour, and the cost of compute nodes ranges from $0.22 to $0.26 per hour. The exact price is determined by other factors, such as commitment duration and whether the customer pre-pays or pays monthly.
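As a rough illustration of how these rates add up, the sketch below estimates the cost of a cluster run. It assumes exactly one head node, a free gateway node and hourly billing with no other charges (storage, data transfer and so on are ignored); the function name `estimate_cluster_cost` is invented for this example, and the rates are the ones quoted above.

```python
# Hourly rates in USD, taken from the pay-as-you-go figures quoted above.
HEAD_NODE_RATE = 0.64
COMPUTE_NODE_RATE = 0.32
GATEWAY_NODE_RATE = 0.0  # the secure gateway node is not charged


def estimate_cluster_cost(compute_nodes, hours,
                          head_rate=HEAD_NODE_RATE,
                          compute_rate=COMPUTE_NODE_RATE):
    """Estimate the cost of one head node plus N compute nodes for a number of hours."""
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    if hours < 0:
        raise ValueError("hours must not be negative")
    hourly = head_rate + compute_nodes * compute_rate + GATEWAY_NODE_RATE
    return round(hourly * hours, 2)


# Four compute nodes for ten hours, pay-as-you-go.
pay_as_you_go = estimate_cluster_cost(4, 10)

# The same cluster at the lowest committed-plan rates quoted above.
committed = estimate_cluster_cost(4, 10, head_rate=0.44, compute_rate=0.22)

print(pay_as_you_go, committed)
```

For this example the pay-as-you-go estimate is $19.20 and the committed-plan estimate is $13.20, which shows how the choice of plan matters more as cluster size and run time grow.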

OpenStack Savanna project

OpenStack is an open source cloud computing system used for both private clouds and public Infrastructure as a Service implementations, such as Rackspace. Like other Hadoop as a Service offerings, the goal of the Savanna project is to streamline the deployment of Hadoop clusters in a cloud. Savanna is a modular component designed to work within the OpenStack environment and integrates with key OpenStack components, including Horizon for management, Keystone for user authentication, Nova for virtual machine provisioning, Glance for image storage and Swift for data storage. Savanna also supports integration with other vendors' tools, such as Cloudera Manager Admin Console.

Although Amazon EMR and Windows Azure HDInsight users can start Hadoop clusters fairly easily, Savanna users should expect to work with system administrators familiar with Hadoop configuration, which may offset some of the benefits of a managed service. Savanna uses templates to specify server configuration, file system parameters and Hadoop distribution-specific parameters.
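To make the idea of a template concrete, the sketch below models one as a plain Python dictionary with the three groups of settings mentioned above, plus a small check for missing groups. The layout, key names and values are purely hypothetical and are not Savanna's actual template schema; consult the Savanna documentation for the real format.

```python
# Hypothetical template layout for illustration only.
# These keys are NOT Savanna's actual schema.
cluster_template = {
    "name": "analytics-small",
    # Server configuration: which VM size to use and how many nodes.
    "server": {"flavor": "m1.large", "node_count": 4},
    # File system parameters, such as the HDFS replication factor.
    "file_system": {"dfs.replication": 3},
    # Parameters specific to the chosen Hadoop distribution.
    "hadoop": {"distribution": "example-distro", "version": "1.x"},
}

REQUIRED_SECTIONS = ("server", "file_system", "hadoop")


def missing_sections(template):
    """Return the required sections that the template does not define."""
    return [section for section in REQUIRED_SECTIONS if section not in template]


print(missing_sections(cluster_template))
```

Keeping the settings in a reusable template means the same cluster shape can be launched repeatedly without re-entering each parameter.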

Savanna is under active development; version 0.3 has recently been released. Currently available functionality includes basic cluster provisioning, templates for cluster configurations, management application programming interfaces and support for ad hoc queries using Pig and Hive. Support for Hadoop version 2 is planned for the second quarter of 2014.

About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence, and worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.

This was first published in January 2014
