A handful of public cloud service providers -- Google, IBM, Microsoft and Oracle -- are taking a cue from Amazon Web Services and getting in on the “big data” analytics trend with Hadoop/MapReduce, a multifaceted open source project.
Cloud-based Hadoop/MapReduce applications appeared in 2009 when AWS released its Elastic MapReduce Web service for EC2 and Simple Storage Service (S3). Google later released an experimental version of its Mapper API, the first component of the App Engine's MapReduce toolkit, in mid-2010, and since May 2011, developers have had the ability to run full MapReduce jobs on Google App Engine. In this instance, however, rate limiting is necessary to prevent the program from consuming all available resources and blocking Web access.
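To make the rate-limiting point concrete, here is a minimal token-bucket sketch in Python. It is an illustration of the general technique only, not the mechanism Google's toolkit uses; the class and function names (`TokenBucket`, `process_batch`) are hypothetical.

```python
import time


class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable clock, handy for testing
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens and return True if available, else return False."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def process_batch(records, handler, bucket):
    """Handle records until the bucket runs dry; return the unprocessed remainder."""
    for i, record in enumerate(records):
        if not bucket.try_acquire():
            return records[i:]  # caller can retry these later
        handler(record)
    return []
```

A batch job that calls `process_batch` repeatedly leaves headroom for other work, such as serving Web requests, because it stops as soon as its budget for the current interval is spent.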
Google added a Files API storage system for intermediate results in March 2011 and Python shuffler functionality for small datasets (up to 100 MB) in July 2011. The company promises to accommodate larger capacities and release a Java version and Mapper API shortly.
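For readers new to the model, the map, shuffle and reduce phases can be sketched in a few lines of plain Python. This is an in-memory illustration under simplifying assumptions, not the appengine-mapreduce API; every function name here is hypothetical, and the sample log lines are made up.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key; this is the work a shuffler does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three phases into one job."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated token in a line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the counts emitted for one key."""
    return sum(values)


if __name__ == "__main__":
    sample_lines = ["GET /index GET /about", "POST /login GET /index"]
    print(run_job(sample_lines, word_count_mapper, sum_reducer))
```

In a real deployment the shuffle step is the expensive part, because grouped values must be moved between machines; that is why a 100 MB ceiling on the shuffler matters for anything beyond small datasets.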
Interest in Hadoop/MapReduce, and plans to integrate it, appear to have mounted further in the second half of 2011.
Integration plans for Hadoop/MapReduce
Oracle announced its Big Data Appliance at Oracle OpenWorld in October 2011. The appliance is a "new engineered system that includes an open source distribution of Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, and an open source distribution of R," according to the announcement.
The appliance appears to use Hadoop primarily for extract, transform and load (ETL) operations for the Oracle relational database, which has a cloud-based version. Oracle's NoSQL Database is based on the BerkeleyDB embedded database, which Oracle acquired when it purchased Sleepycat Software Inc. in 2006.
The Oracle Public Cloud, which also debuted at Open World, supports developing standard JSP, JSF, servlet, EJB, JPA, JAX-RS and JAX-WS applications. Therefore, you can integrate your own Hadoop implementation with the Hadoop Connector. There's no indication that Oracle will pre-package Hadoop/MapReduce for its public cloud offering, but competition from AWS, IBM, Microsoft and even Google might make Hadoop/MapReduce de rigueur for all enterprise-grade public clouds.
At the PASS conference in October 2011, Microsoft promised to release a Hadoop-based service for Windows Azure by the end of 2011; company vice president Ted Kummert said a community technical preview (CTP) for Windows Server would follow in 2012. Kummert also announced a strategic partnership with Hortonworks Inc. to help Windows Azure bring Hadoop to fruition.
Kummert described a new SQL Server-Hadoop Connector for transferring data between SQL Server 2008 R2 and Hadoop, which appears to be similar in concept to Oracle's Hadoop Connector. Denny Lee, a member of the SQL Server team, demonstrated a HiveQL query against log data in a Hadoop for Windows database with a Hive ODBC driver. Kummert said this will be available as a CTP in November 2011. Microsoft generally doesn't charge for Windows Azure CTPs, but hourly Windows Azure compute and monthly storage charges will apply.
Microsoft and IBM projects in the works
Microsoft's High Performance Computing (HPC) team released Beta 2 of HPC Pack for Windows HPC Server 2008 clusters and LINQ to HPC R2 SP2 in June 2011, after several years of incubation as Dryad and Dryad LINQ at Microsoft Research. The most common configuration for this is a hybrid cloud model called the "burst scenario," in which the head node resides on-premises and a number of compute nodes run as Windows Azure virtual machines, depending on the load, with file sets stored on Windows Azure drives.
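The sizing decision behind a burst scenario can be sketched as a simple calculation. The rule below is a hypothetical illustration, not how HPC Pack actually schedules work; the function name and all four parameters are assumptions made for the example.

```python
import math


def cloud_nodes_needed(queued_tasks, tasks_per_node, on_prem_nodes, max_cloud_nodes):
    """Estimate how many cloud compute nodes to start for the current load.

    Hypothetical rule: fill on-premises nodes first, then send the overflow
    to the cloud, capped at a budget of `max_cloud_nodes`.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Total nodes required to work through the queue.
    total_nodes = math.ceil(queued_tasks / tasks_per_node)
    # Only the load that on-premises nodes cannot absorb goes to the cloud.
    overflow = max(0, total_nodes - on_prem_nodes)
    return min(overflow, max_cloud_nodes)


if __name__ == "__main__":
    # 100 queued tasks, 10 per node, 4 local nodes, budget of 20 cloud nodes.
    print(cloud_nodes_needed(100, 10, 4, 20))
```

The cap matters in practice because cloud nodes are billed by the hour, so an unbounded burst can turn a spike in load into a spike in cost.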
Project "Daytona," another Microsoft Research venture, claims to be a MapReduce runtime for Windows Azure with a user-friendly Excel DataScope user interface, but it's still in the early CTP stage. Hadoop's ubiquity and track record probably influenced Microsoft's Server and Cloud Platforms team to offer the genuine article.
IBM was last to climb onto the Hadoop cloud bandwagon with IBM InfoSphere BigInsights, its Hadoop-based analytics software, on the IBM SmartCloud Enterprise. BigInsights on the cloud is available in both basic and enterprise editions with the options of public, private and hybrid cloud deployments.
The basic edition is an entry-level, no-charge option that helps organizations learn how to do big data analytics, including what-if scenarios with its BigSheets component, a browser-based analytics tool. Clients can seamlessly move to the enterprise edition when ready and set up Hadoop clusters to start analyzing data with low usage rates starting at $0.60 per cluster, per hour. Both versions include a developer sandbox where clients can build a new generation of business analytics applications complete with tools and a test-and-development environment. IBM appears to me to be the only player offering a no-charge, try-before-you-buy implementation.
Hadoop and the social network
Other high-profile social computing Hadoop implementers include Yahoo, which runs Hadoop on a 10,000-core Linux cluster for its search service; Facebook, which announced in July 2011 that its Hadoop cluster had reached 30 petabytes in size; and LinkedIn, with a terabyte-scale Hadoop data cycling application. Twitter also uses Hadoop to store and process tweets, log files and other data; eBay claims a 532-node, 5 PB Hadoop cluster.
The upshot for enterprises today is that Amazon is the only cloud provider with a proven (two-and-a-half-year) track record with Hadoop/MapReduce. IBM's BigInsights is in its infancy, and there's no official timetable for release-to-manufacturing versions of Microsoft's Hadoop CTPs.
I'm betting both Amazon and IBM will be contenders for big data analytics in the cloud by mid-2012. Despite Google having introduced MapReduce to the world of big data, I'm not sanguine about the future success of Google’s appengine-mapreduce.
ABOUT THE AUTHOR:
Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.