Analyzing 'big data' with Microsoft Cloud Numerics

Microsoft’s Cloud Numerics lets developers use their .NET chops and HPC clusters to analyze "big data" in the cloud.

Enterprises are rapidly adopting Apache Hadoop and MapReduce as their primary data analysis tools. But they’re doing so despite a serious shortage of data analysts with the DevOps skills to set up Hadoop Distributed File System (HDFS) clusters or write Java code for MapReduce jobs.

Amazon Web Services (AWS) offers hosted Elastic MapReduce (EMR), and Microsoft promotes Apache Hadoop on Windows Azure with its cloud-based MapReduce, Hive, Pig and Mahout implementations. While these products eliminate the capital and management costs of on-premises HDFS SANs, moving Hadoop clusters to the cloud doesn’t lessen the need for a slew of MapReduce tools.

Analysts can use the Apache HiveQL dialect of SQL to translate aggregate queries with built-in count(), sum(), avg() and stddev_pop() functions to MapReduce jobs. The Apache Pig subproject documentation claims that when you use the Pig Latin language:

It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand and maintain.

Figure 1. Create a new Visual Studio Cloud Numerics deployment with the Microsoft Cloud Numerics Application C# template.

But Pig Latin is a v0.9.2 release with even fewer expert grammarians than MapReduce. So what’s an enterprise IT department that lacks the DevOps chops supposed to do when trying to analyze “big data” in the cloud? Microsoft’s Codename “Cloud Numerics” is one possible option.  

Microsoft Cloud Numerics for C# coders
Microsoft’s January 2012 release of Codename “Cloud Numerics” Lab offers an alternative to Hadoop, HDFS and MapReduce, as well as to HiveQL or Pig Latin, for big data analytics. Cloud Numerics provides enterprise-level .NET programmers who are fluent in C# and accustomed to using Visual Studio with:

  • A programming model that hides the complexity of developing distributed algorithms
  • Access to a .NET library of numerical algorithms that range from basic mathematics to advanced statistics to linear algebra
  • The ability to deploy an application to Windows Azure and use the cloud’s computing power

Using parallel processing with Cloud Numerics HPC clusters requires analysts to supply data in the form of distributed dense arrays of mostly numeric data, or matrices. Dense arrays correspond to tables that have non-null values in all columns, as opposed to sparse arrays, which have columns populated only in subsets of rows.
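The dense-versus-sparse distinction can be illustrated with a minimal Python sketch. It uses plain lists rather than the Cloud Numerics array types, and the function names and sample values are made up for illustration only.

```python
# Illustrative only: plain Python lists, not the Cloud Numerics array types.

def is_dense(table):
    """Return True when every cell in every row holds a value (no None)."""
    return all(cell is not None for row in table for cell in row)


def to_sparse(table):
    """Store only the populated cells as {(row, column): value}."""
    return {
        (r, c): cell
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
        if cell is not None
    }


# Hypothetical sample values, made up for this sketch.
dense_table = [
    [12.0, 3.5, 0.0],
    [7.0, 1.25, 4.0],
]
sparse_table = [
    [12.0, None, None],
    [None, None, 4.0],
]

print(is_dense(dense_table))    # True
print(is_dense(sparse_table))   # False
print(to_sparse(sparse_table))  # {(0, 0): 12.0, (1, 2): 4.0}
```

Note that a zero is still a value: the first table is dense even though one cell holds 0.0.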

The invitation-only Cloud Numerics Lab deliverables consist of a Microsoft Cloud Numerics Application template for Visual Studio 2010 and later (Figure 1), which includes the pre-built C# projects shown in Table 1.

Project name | Description
-------------|------------
AppConfigure | Provides a user interface for configuring and deploying HPC clusters and the HPC scheduler to Windows Azure, as well as classes for creating the ServiceConfiguration.Cloud.cscfg file, handling the Management Certificate and deploying the application to Windows Azure.
AzureSampleService | Defines the service hosted by Windows Azure and its ComputeNode, FrontEnd and HeadNode instances.
FrontEnd | Provides the Web role for accessing the Windows Azure HPC Scheduler Web Portal.
HeadNode | Defines the worker role entry point for interaction with the HPC cluster.
ComputeNode | Defines worker roles for accessing Windows Azure storage, data processing and diagnostics.
MSCloudNumericsApp | Holds custom code in its Program.cs class to initialize the Microsoft.Numerics runtime, read data in parallel from Windows Azure storage to local dense arrays, convert the local arrays to distributed arrays, process them and write the results to Windows Azure storage.

Table 1. Six prebuilt projects that compose a Microsoft Cloud Numerics app.


The default MSCloudNumericsApp project comes with a basic Main() function for a console application to test operation in the local development fabric with a simple process. The function initializes the Microsoft.Numerics runtime, creates an array and fills it with random numbers, performs a matrix multiplication, applies a Cholesky decomposition to solve linear equations, shuts down the Microsoft.Numerics runtime and returns a completion message.
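The numerical steps in that default Main() function (random fill, matrix multiplication, Cholesky decomposition, linear solve) can be sketched in plain Python on a single machine. This is not the Microsoft.Numerics API: every function name below is an assumption made for illustration, and the diagonal shift is added here only to guarantee a positive definite matrix.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_cholesky(low, b):
    """Solve (L * L^T) x = b by forward substitution, then back substitution."""
    n = len(low)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def run(n=4, seed=0):
    """Build a random system, solve it, and return the solution and largest residual."""
    rng = random.Random(seed)
    m = [[rng.random() for _ in range(n)] for _ in range(n)]
    # M * M^T is symmetric; adding n on the diagonal keeps it positive definite.
    a = matmul(m, transpose(m))
    for i in range(n):
        a[i][i] += n
    b = [rng.random() for _ in range(n)]
    x = solve_cholesky(cholesky(a), b)
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    return x, residual


if __name__ == "__main__":
    _, residual = run()
    print("largest residual:", residual)
```

A residual close to zero indicates that the factor-and-solve steps reproduce the right-hand side.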

Figure 2. The relationship between Cloud Numerics run times, libraries and Windows Azure deployment tools.

In most cases, developers only need to replace a few lines of default code in the Main() function with their own procedures. Figure 2 illustrates the relationship of Cloud Numerics components.

The initial Cloud Numerics Lab release provided the following end-to-end sample applications for download from Microsoft Connect:
 

  1. A document classification example using Latent Semantic Indexing (LSICloudApplication).
  2. An in-depth look at some statistics functionality (StatisticsCloudApplication).
  3. A time-series analysis of serial yield data (TimeSeriesApplication).

I posted an illustrated step-by-step tutorial for installing and running the LSICloudApplication in the local development fabric and deploying it to a Windows Azure account.


Analyzing air carrier departure delays with Cloud Numerics
On-time performance is an important element in many consumers’ choice of airlines. The U.S. Federal Aviation Administration (FAA) has maintained comprehensive records of arrival and departure delays of all flights by each certificated U.S. air carrier since 1987. The FAA makes this data available to the public as *.zip archives of monthly *.csv files from the Research and Innovative Technology Administration’s (RITA) Bureau of Transportation Statistics (BTS) site. Each *.csv file contains about 500,000 rows and averages about 225 MB, so the 302 months of data through February 2012 total about 150 million rows and 68 GB.
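The totals follow directly from the per-file averages. A quick check in Python, using only the figures quoted above and assuming decimal gigabytes (1 GB = 1,000 MB):

```python
rows_per_file = 500_000   # "about 500,000 rows" per monthly file
mb_per_file = 225         # "about 225 MB" per monthly file
months = 302              # months of data through February 2012

total_rows = rows_per_file * months
total_gb = mb_per_file * months / 1000  # assumes decimal gigabytes

print(f"{total_rows:,} rows")  # 151,000,000 rows
print(f"{total_gb:.0f} GB")    # 68 GB
```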

Figure 3. A histogram of flight delays from 0 to 5 hours for U.S.-certificated air carriers in January 2012.

The Cloud Numerics team announced a new sample program that summarizes average delays and the standard deviation of the delay data for 32 months of FAA flight data. Figure 3 shows an Excel histogram of flight arrival delays from 0 to 5 hours for January 2012.

To make the sample easy for developers to use, the Cloud Numerics team copied 32 *.csv files up to January 2012 to a publicly accessible Windows Azure blob container in Microsoft’s North Central U.S. data center. The number 32 is significant because each ExtraLarge ComputeNode instance has 8 CPU cores and the AppConfigure project deploys four instances, for a total of 32 cores, one for each file.

Most SQL Azure Labs CTPs provide free access to resources used during the preview period, but Cloud Numerics doesn’t. Running the OnTimeStats application with four ComputeNodes, one HeadNode worker role and one FrontEnd Web role will cost you $5.10 per hour. This cost multiplies the incentive to delete Cloud Numerics deployments when you’re not using them.

Figure 4. Excel worksheet with mean and standard deviation data from the FlightDataInfo.csv file.

The MSCloudNumericsApp project’s Main() method contains added code that calculates the average arrival delay time, the standard deviation of arrival delay times and the share of times above and below 1, 2, 3, 4 and 5 standard deviations, and then writes the values to a FlightDataInfo.csv file stored in a Windows Azure blob (Figure 4). It took less than two minutes to read the nearly 8 million rows into arrays containing arrival delay time in minutes. Analyzing them using two extra-large compute instances (16 cores) took just under two minutes.
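The calculation described above can be sketched on a single machine in Python. This is not the sample's C# code: the function names, the CSV layout, the choice of population standard deviation and the sample delays are all assumptions made for illustration.

```python
import csv
import io
import math


def summarize_delays(delays, max_k=5):
    """Return mean, population standard deviation, and tail shares per k sigma."""
    n = len(delays)
    mean = sum(delays) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in delays) / n)
    rows = []
    for k in range(1, max_k + 1):
        above = sum(1 for d in delays if d > mean + k * std) / n
        below = sum(1 for d in delays if d < mean - k * std) / n
        rows.append((k, above, below))
    return mean, std, rows


def write_summary(mean, std, rows, out):
    """Write the summary as CSV text to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["mean", mean])
    writer.writerow(["std_dev", std])
    writer.writerow(["k", "share_above", "share_below"])
    writer.writerows(rows)


# Hypothetical arrival delays in minutes, made up for this sketch.
sample = [-5, -3, 0, 0, 2, 4, 6, 10, 45, 121]
mean, std, rows = summarize_delays(sample)
buffer = io.StringIO()
write_summary(mean, std, rows, buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the sketch independent of where the CSV ends up; the sample itself writes to Windows Azure blob storage.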

Roope Astala, Cloud Numerics program manager, summarized the data in a post, stating:

Let’s take a look at the results. We can immediately see they’re not normal-distributed at all. First, there’s skew—about 70% of the flight delays are better than average of 5 minute. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution.

This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means—based on conditional probability—that if you have already had to wait for 35 minutes there’s about a 50-50 chance you will have to wait for another 35 minutes.
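The "50-50" claim can be checked against the quoted sequence with a few lines of Python. The sketch assumes the five percentages are the shares of flights delayed by more than 1, 2, 3, 4 and 5 standard deviations; dividing each share by the previous one then gives the conditional probability of waiting at least one more standard deviation.

```python
# Shares quoted above, read here as the percentage of flights delayed by more
# than 1, 2, 3, 4 and 5 standard deviations (one step is about 35 minutes).
tail_percent = [8.5, 4.0, 2.1, 1.1, 0.6]

# P(delay > (k+1) sigma | delay > k sigma) = P(delay > (k+1) sigma) / P(delay > k sigma)
conditional = [
    later / earlier for earlier, later in zip(tail_percent, tail_percent[1:])
]

for k, p in enumerate(conditional, start=1):
    print(f"already {k} sigma late: chance of at least one more sigma is {p:.2f}")
```

Each ratio comes out close to one half, which is the roughly constant step-to-step ratio an exponential tail would produce.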


Comparing Cloud Numerics with Apache Hive on Windows Azure

The Apache Hadoop on Windows Azure preview lets you analyze data stored in folders of Windows Azure blob containers. However, the feature that enables the substitution of private Azure blobs for HDFS datasets doesn’t work for public blobs like those the Cloud Numerics team uploaded.

Figure 5. Flight Delays by Carrier Excel Graph from Hadoop

I reduced the size of six individual *.csv files for August 2011 through January 2012 by eliminating unnecessary columns and uploaded them to a folder of a blob in the North Central U.S. data center. I specified the blob folder as the data source for a Hive data warehouse table, created the Hive table and then executed a simple HiveQL aggregate query against it with the Hive ODBC Driver and Excel Add-In (Figure 5).  
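The article does not show the HiveQL query itself. As a rough single-machine analogue, the kind of per-carrier aggregate that count(), avg() and stddev_pop() produce can be sketched in Python. The column names and sample rows below are hypothetical, not taken from the BTS files.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical trimmed rows; the column names and values are made up for this sketch.
SAMPLE_CSV = """carrier,dep_delay
AA,10
AA,20
UA,5
UA,15
UA,25
"""


def delays_by_carrier(csv_text):
    """Group rows by carrier and compute count, average and population std dev."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["carrier"]].append(float(row["dep_delay"]))
    result = {}
    for carrier, delays in sorted(groups.items()):
        n = len(delays)
        avg = sum(delays) / n
        std = math.sqrt(sum((d - avg) ** 2 for d in delays) / n)
        result[carrier] = (n, avg, std)
    return result


for carrier, (n, avg, std) in delays_by_carrier(SAMPLE_CSV).items():
    print(carrier, n, avg, round(std, 2))
```

Hive would run the equivalent grouping as MapReduce jobs across the cluster; the sketch only shows what the aggregate computes.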

The elapsed time for the Hive approach was considerably shorter than that for the Cloud Numerics sample because it takes about two hours to upload the HPC cluster to Windows Azure over a slow DSL Internet connection. However, Cloud Numerics could retrieve the same data faster; I estimate it would have taken me at least half a day to determine this using a HiveQL query. Otherwise, I would need another few hours to write and test a Pig Latin script. Without the Internet connection asymmetry, getting the data with Cloud Numerics would have been considerably faster.

 

Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.

This was first published in June 2012
