Enterprises are rapidly adopting Apache Hadoop and MapReduce as their primary data analysis tools. But they’re doing so despite a serious shortage of data analysts with the DevOps skills to set up Hadoop Distributed File System clusters or write Java code for MapReduce jobs.
Amazon Web Services (AWS) does offer hosted Elastic MapReduce (EMR), and Microsoft promotes Apache Hadoop on Windows Azure’s cloud-based MapReduce, Hive, Pig and Mahout implementations. While these products eliminate the capital and management costs of on-premises HDFS SANs, moving Hadoop clusters to the cloud doesn’t lessen the need for a slew of MapReduce tools.
Analysts can use the Apache HiveQL dialect of SQL to translate aggregate queries with built-in count(), sum(), avg() and stddev_pop() functions to MapReduce jobs. The Apache Pig subproject documentation claims that, when you use the Pig Latin language:
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand and maintain.
But Pig Latin is a v0.9.2 release with even fewer expert grammarians than MapReduce. So what’s an enterprise IT department that lacks the DevOps chops supposed to do when trying to analyze “big data” in the cloud? Microsoft’s Codename “Cloud Numerics” is one possible option.
Microsoft Cloud Numerics for C# coders
Microsoft’s January 2012 release of Codename “Cloud Numerics” Lab offers an alternative to Hadoop, HDFS and MapReduce, as well as HiveQL or Pig Latin, for big data analytics. Cloud Numerics provides enterprise-level .NET programmers who are fluent in C# and accustomed to using Visual Studio with:
- A programming model that hides the complexity of developing distributed algorithms
- Access to a .NET library of numerical algorithms that range from basic mathematics to advanced statistics to linear algebra
- The ability to deploy an application to Windows Azure and use the cloud’s computing power
Using parallel processing with Cloud Numerics HPC clusters requires analysts to input data in the form of distributed dense arrays of mostly numeric data, or matrixes. Dense arrays correspond to tables that have non-null values in all columns, as opposed to sparse arrays, which have columns populated only in subsets of rows.
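To make the dense-versus-sparse distinction concrete, here is a minimal sketch in plain Python. It does not use the Cloud Numerics API; the sample values and the helper names `density` and `to_coordinate_form` are hypothetical and exist only for illustration.

```python
# Illustrative only: plain Python, not the Cloud Numerics API.
# A dense table has a value in every cell; a sparse one is mostly empty.

dense = [
    [12.0, 3.5, 7.0],
    [8.0, 1.5, 2.0],
    [4.0, 9.5, 6.0],
]

# Same shape, but only a few cells are populated (None marks a missing value).
sparse = [
    [12.0, None, None],
    [None, None, 2.0],
    [None, 9.5, None],
]


def density(table):
    """Return the fraction of cells that hold a value."""
    cells = [cell for row in table for cell in row]
    populated = sum(1 for cell in cells if cell is not None)
    return populated / len(cells)


def to_coordinate_form(table):
    """Store only the populated cells as {(row, column): value}."""
    return {
        (r, c): cell
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
        if cell is not None
    }


print(density(dense))              # every cell populated
print(density(sparse))             # 3 of 9 cells populated
print(to_coordinate_form(sparse))  # compact form that skips the empty cells
```

A dense array can be split into equal blocks and handed to separate cores without bookkeeping about which cells exist, which is one plausible reason a distributed runtime would favor that layout.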
The invitation-only Cloud Numerics Lab deliverables consist of a Microsoft Cloud Numerics Application template for Visual Studio 2010 and later (Figure 1), which includes the pre-built C# projects shown in Table 1.
|Project|Description|
|---|---|
|AppConfigure|Provides a user interface for configuring and deploying HPC clusters and the HPC scheduler to Windows Azure, as well as classes for creating the ServiceConfiguration.Cloud.cscfg file, handling the Management Certificate and deploying the application to Windows Azure.|
|AzureSampleService|Defines the service hosted by Windows Azure and its ComputeNode, FrontEnd and HeadNode instances.|
|FrontEnd|Provides the Web role for accessing the Windows Azure HPC Scheduler Web Portal.|
|HeadNode|Defines the worker role entry point for interaction with the HPC cluster.|
|ComputeNode|Defines worker roles for accessing Windows Azure storage, data processing and diagnostics.|
|MSCloudNumericsApp|Holds custom code in its Program.cs class to initialize the Microsoft.Numerics runtime, read data in parallel from Windows Azure storage to local dense arrays, convert local to distributed arrays, process them and write the results to Windows Azure storage.|
Table 1. Six prebuilt projects that compose a Microsoft Cloud Numerics app.
The default MSCloudNumericsApp project comes with a basic Main() function for a console application to test operation in the local development fabric with a simple process. The function initializes the Microsoft.Numerics runtime, creates and fills an element array with random numbers, performs a matrix multiplication and then applies the Cholesky decomposition to solve linear equations. Finally, it shuts down the Microsoft.Numerics runtime and returns a completion message.
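The numerical steps in that default Main() function (random matrix, matrix multiplication, Cholesky decomposition, linear solve) can be sketched on a single machine in plain Python. This is not the template's C# code or the Microsoft.Numerics API. The function names, the 4x4 size and the diagonal shift that guarantees a positive-definite matrix are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def cholesky(m):
    """Return lower-triangular L with L * L^T == m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def solve_with_cholesky(m, b):
    """Solve m x = b: forward-substitute L y = b, then back-substitute L^T x = y."""
    n = len(m)
    low = cholesky(m)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


random.seed(0)
n = 4
a = [[random.random() for _ in range(n)] for _ in range(n)]
# A * A^T is symmetric; adding n to the diagonal keeps it safely positive definite.
m = matmul(a, transpose(a))
for i in range(n):
    m[i][i] += n
b = [1.0] * n
x = solve_with_cholesky(m, b)
# Largest absolute error in m x = b; it should be close to machine precision.
residual = max(abs(sum(m[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
print(residual)
```

Cloud Numerics runs the same kind of operations on distributed arrays across cores; the sketch shows only the mathematics, not the distribution.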
In most cases, developers only need to replace a few lines of default code in the Main() function with their own procedures. Figure 2 illustrates the relationship of Cloud Numerics components.
The initial Cloud Numerics Lab release provided the following end-to-end sample applications for download from Microsoft Connect:
- A document classification example using Latent Semantic Indexing (LSICloudApplication).
- An in-depth look at some statistics functionality (StatisticsCloudApplication).
- A time-series analysis of serial yield data (TimeSeriesApplication).
Analyzing air carrier departure delays with Cloud Numerics
On-time performance is an important element in many consumers’ choice of airlines. The U.S. Federal Aviation Administration (FAA) has maintained comprehensive records of arrival and departure delays of all flights by each certificated U.S. air carrier since 1987; the FAA makes this data available to the public in the form of *.zip archives of monthly *.csv files from the Research and Innovative Technology Administration’s (RITA) Bureau of Transportation Statistics (BTS) site. Each *.csv file contains about 500,000 rows and averages about 225 MB. Therefore, the 302 months of data through February 2012 total about 150 million rows and 68 GB.
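The size estimate follows from simple multiplication, shown here as a quick Python check. The variable names are mine, and the use of decimal gigabytes (1 GB = 1,000 MB) is an assumption.

```python
# Back-of-the-envelope check of the data set size quoted above.
months = 302             # monthly files through February 2012
rows_per_file = 500_000  # approximate rows per monthly *.csv file
mb_per_file = 225        # approximate size of each file in MB

total_rows = months * rows_per_file
total_gb = months * mb_per_file / 1000  # assumes decimal gigabytes

print(total_rows)  # 151000000
print(total_gb)    # 67.95
```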
The Cloud Numerics team announced a new sample program that summarizes average delays and the standard deviation of the delay data for 32 months of FAA flight data. Figure 3 shows an Excel histogram of flight arrival delays from 0 to 5 hours for January 2012.
To make the sample easy for developers to use, the Cloud Numerics team copied 32 *.csv files up to January 2012 to a publicly accessible Windows Azure blob container in Microsoft’s North Central U.S. data center. The number 32 is significant because each ExtraLarge ComputeNode instance has 8 CPU cores and the AppConfigure project deploys four instances, for a total of 32 cores, or one core per file.
Most SQL Azure Labs CTPs provide free access to resources used during the preview period, but Cloud Numerics doesn’t. Running the OnTimeStats application with four ComputeNodes, one HeadNode worker role and one FrontEnd Web role will cost you $5.10 per hour. This cost multiplies the incentive to delete Cloud Numerics deployments when you’re not using them.
The MSCloudNumericsApp project’s Main() method contains added code that calculates the average arrival delay time, the standard deviation of arrival times and the proportion of delays above and below 1, 2, 3, 4 and 5 standard deviations, and then writes the values to a FlightDataInfo.csv file stored in a Windows Azure blob (Figure 4). It took less than two minutes to read the nearly 8 million rows into arrays containing arrival delay time in minutes. Analyzing them using two extra-large compute instances (16 cores) took just under two minutes.
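The statistics described above are straightforward to express on a single machine. The following Python sketch is not the sample's C# code. The function name `delay_summary`, the choice of population standard deviation and the ten sample delays are assumptions for illustration and are not FAA data.

```python
import statistics


def delay_summary(delays, max_k=5):
    """Return the mean, the population standard deviation, and for k = 1..max_k
    the share of values more than k standard deviations above and below the mean."""
    mean = statistics.fmean(delays)
    std = statistics.pstdev(delays)
    n = len(delays)
    tails = {}
    for k in range(1, max_k + 1):
        above = sum(1 for d in delays if d > mean + k * std) / n
        below = sum(1 for d in delays if d < mean - k * std) / n
        tails[k] = (above, below)
    return mean, std, tails


# Hypothetical arrival delays in minutes (negative means early); not FAA data.
sample = [-10, -5, -3, 0, 0, 2, 4, 5, 7, 60]
mean, std, tails = delay_summary(sample)

print(mean, std)
for k, (above, below) in tails.items():
    print(k, above, below)
```

Even in this tiny sample, one long delay pulls the mean above most of the values and leaves the lower tail empty, the same kind of skew the real results show.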
Roope Astala, Cloud Numerics program manager, summarized the data in a post, stating:
Let’s take a look at the results. We can immediately see they’re not normal-distributed at all. First, there’s skew: about 70% of the flight delays are better than average of 5 minutes. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution.
This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means—based on conditional probability—that if you have already had to wait for 35 minutes there’s about a 50-50 chance you will have to wait for another 35 minutes.
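The halving pattern and the 50-50 claim can be checked with a few lines of Python. The percentages are the ones quoted above. The exponential model with a half-life of exactly 35 minutes, and the names `rate` and `survival`, are simplifying assumptions of mine rather than part of the Cloud Numerics sample.

```python
import math

# Tail percentages from the quoted sequence, one per standard-deviation step.
tail_pct = [8.5, 4.0, 2.1, 1.1, 0.6]

# Ratio of each value to the one before it; an exponential tail keeps this constant.
ratios = [later / earlier for earlier, later in zip(tail_pct, tail_pct[1:])]
print([round(r, 2) for r in ratios])

step = 35.0                # minutes, about one standard deviation
rate = math.log(2) / step  # assumed: the tail halves with every step


def survival(t):
    """Probability that a delay exceeds t minutes under the assumed exponential tail."""
    return math.exp(-rate * t)


# Memoryless property: P(delay > 70 | delay > 35) equals P(delay > 35).
conditional = survival(2 * step) / survival(step)
print(conditional)
```

All four ratios fall close to one half, and under the assumed model the chance of waiting another 35 minutes after already waiting 35 works out to one half as well.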
Comparing Cloud Numerics with Apache Hive on Windows Azure
The Apache Hadoop on Windows Azure preview lets you analyze data stored in folders of Windows Azure blob containers. However, the feature that enables the substitution of private Azure blobs for HDFS datasets doesn’t work for public blobs like those the Cloud Numerics team uploaded.
I reduced the size of six individual *.csv files for August 2011 through January 2012 by eliminating unnecessary columns and uploaded them to a folder of a blob in the North Central U.S. data center. I specified the blob folder as the data source for a Hive data warehouse table, created the Hive table and then executed a simple HiveQL aggregate query against it with the Hive ODBC Driver and Excel Add-In (Figure 5).
The elapsed time for the Hive approach was considerably shorter than that for the Cloud Numerics sample because it took about two hours to upload the HPC cluster to Windows Azure over a slow DSL Internet connection. However, Cloud Numerics could retrieve the same data faster; I estimate it would have taken me at least half a day to determine this with a HiveQL query. Alternatively, I would have needed another few hours to write and test a Pig Latin script. Without the Internet connection asymmetry, getting the data with Cloud Numerics would have been considerably faster.
Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.