CIOs are looking to pull actionable information from ultra-large data sets, or big data. But big data can mean big bucks for many companies. Public cloud providers could make the big data dream more of a reality.
[Researchers] want to manipulate and share data in the cloud.
Roger Barga, architect and group lead, XCG's Cloud Computing Futures (CCF) team
Big data, which is measured in terabytes or petabytes, often consists of Web server logs, product sales data, and social network and messaging data. Of the 3,000 CIOs interviewed in IBM's Essential CIO study, 83% listed an investment in business analytics as their top priority. And according to Gartner, by 2015, companies that have adopted big data and extreme information management will begin to outperform unprepared competitors by 20% in every available financial metric.
Budgeting for big data storage and the computational resources that advanced analytic methods require, however, isn't so easy in today's anemic economy. Many CIOs are turning to public cloud computing providers to deliver on-demand, elastic infrastructure platforms and Software as a Service. When referring to the companies' search engines, data center investments and cloud computing knowledge, Steve Ballmer said, "Nobody plays in big data, really, except Microsoft and Google."
Microsoft's LINQ Pack, LINQ to HPC, Project "Daytona" and the forthcoming Excel DataScope were designed explicitly to make big data analytics in Windows Azure accessible to researchers and everyday business analysts. Google's Fusion Tables aren't set up to process big data in the cloud yet, but the app is very easy to use, which likely will increase its popularity. It seems like the time is now to prepare for extreme data management in the enterprise so you can outperform your "unprepared competitors."
LINQ to HPC marks Microsoft's investment in big data
Microsoft has dipped its toe in the big data syndication market with Windows Azure Marketplace DataMarket. However, the company's major investments in cloud-based big data analytics are beginning to emerge as revenue-producing software and services candidates. For example, in June 2011, Microsoft's High Performance Computing (HPC) team released Beta 2 of HPC Pack and LINQ to HPC R2 SP2 for Windows HPC Server 2008 clusters.
Bing search analytics use HPC Pack and LINQ to HPC, which were called Dryad and DryadLINQ, respectively, during several years of development at Microsoft Research. LINQ to HPC is used to analyze unstructured big data stored in file sets that are defined by a Distributed Storage Catalog (DSC). By default, three DSC file replicas are stored on separate machines running HPC Server 2008 with HPC Pack R2 SP2 across multiple compute nodes. LINQ to HPC applications, or jobs, process DSC file sets. According to David Chappell, principal of Chappell & Associates, LINQ to HPC combines "data-intensive computing with Windows HPC Server" with on-premises hardware (Figure 1).
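To make the replica idea concrete, here is a minimal sketch of keeping three copies of each file on distinct machines. It is not the DSC's actual placement algorithm, which the article does not describe; the round-robin policy, the `place_replicas` function and the node names are all assumptions made for illustration.

```python
def place_replicas(file_names, nodes, replicas=3):
    """Assign each file to `replicas` distinct nodes using a simple round-robin.

    Hypothetical policy for illustration only; not how the DSC decides placement.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, name in enumerate(file_names):
        # Consecutive offsets modulo len(nodes) never repeat a node
        # as long as replicas <= len(nodes).
        placement[name] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["logs-a", "logs-b"], nodes)
```

With every file held on three different machines, losing any single node still leaves two readable copies of each file.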
The LINQ to HPC client contains a .NET C# or VB project that executes LINQ queries, which a LINQ to HPC Provider then sends to the head node's Job Scheduler. LINQ to HPC represents each job as a directed acyclic graph (DAG), in which the vertices are computations and the edges are the data channels between them. Unlike a graph database, which stores data as nodes and relationships, this graph describes how the job executes. The Job Scheduler then creates a Graph Manager that manipulates the graphs.
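The way a graph-shaped job can be turned into an execution order can be sketched in a few lines of Python. This is only an illustration of the directed-acyclic-graph idea, not the Graph Manager's implementation; the stage names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each key is a stage, each value is the set of
# stages whose output it consumes. Names are illustrative only.
job_graph = {
    "read_logs": set(),
    "read_sales": set(),
    "parse_logs": {"read_logs"},
    "join": {"parse_logs", "read_sales"},
    "aggregate": {"join"},
}


def execution_waves(graph):
    """Group stages into waves; stages in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(job_graph)
```

Stages that land in the same wave could be handed to different compute nodes at the same time, which is what makes the graph representation useful to a scheduler.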
One major benefit of the LINQ to HPC architecture is that it enables .NET developers to easily write jobs that execute in parallel across many compute nodes, a situation commonly called an "embarrassingly parallel" workload.
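An embarrassingly parallel workload is one where each partition of the data can be processed without consulting any other partition. The sketch below shows the pattern in Python rather than LINQ; threads stand in for compute nodes, and the log format, `count_status_codes` and `parallel_count` are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_status_codes(partition):
    """Process one partition on its own; nothing is shared with other partitions."""
    return Counter(line.split()[-1] for line in partition)


def parallel_count(partitions, max_workers=4):
    """Run the per-partition work concurrently, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_status_codes, partitions):
            total.update(partial)
    return total


# Hypothetical Web server log lines, split into three partitions.
partitions = [
    ["GET /a 200", "GET /b 404"],
    ["GET /c 200"],
    ["GET /d 500", "GET /a 200"],
]
totals = parallel_count(partitions)
```

Because the only coordination is the final merge, adding more workers (or nodes) shortens the job without changing the per-partition code.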
Microsoft recently folded the HPC business into the Server and Cloud group and increased its emphasis on running HPC in Windows Azure. Service Pack 2 allows you to run compute nodes as Windows Azure virtual machines (VMs). The most common configuration is a hybrid cloud mode called the "burst scenario" where the head node resides in an on-premises data center and a number of compute nodes run as Windows Azure VMs -- depending on the load -- with file sets stored on Windows Azure drives. In LINQ to HPC, customers can perform data-intensive computing with the LINQ programming model on Windows HPC Server.
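The load-dependent part of the burst scenario can be illustrated with a small sizing calculation. This is a hypothetical policy, not the logic HPC Pack uses; the function name, its parameters and the cap on cloud nodes are all assumptions.

```python
import math


def cloud_nodes_needed(queued_tasks, tasks_per_node, on_prem_nodes, max_cloud_nodes):
    """Return how many cloud compute nodes to add once on-premises capacity is used up.

    Hypothetical sizing rule for illustration only.
    """
    required = math.ceil(queued_tasks / tasks_per_node)  # nodes needed in total
    overflow = max(0, required - on_prem_nodes)          # nodes the data center cannot supply
    return min(overflow, max_cloud_nodes)                # respect a spending cap


light_load = cloud_nodes_needed(50, 10, on_prem_nodes=8, max_cloud_nodes=20)
peak_load = cloud_nodes_needed(100, 10, on_prem_nodes=8, max_cloud_nodes=20)
```

Under light load the result is zero and everything runs on-premises; the cloud nodes appear only when the queue outgrows the local cluster.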
Will "Daytona" and Excel DataScope simplify development?
The eXtreme Computing Group (XCG), an organization in Microsoft Research (MSR) established to push the boundaries of computing, released the "Daytona" platform as a Community Technical Preview (CTP) in July 2011 as part of the group's Cloud Research Engagement Initiative. The group refreshed the project later that month.
Daytona is a MapReduce runtime for Windows Azure that competes with Amazon Web Services' Elastic MapReduce, the Apache Software Foundation's Hadoop MapReduce, MapR's Apache Hadoop distribution and Cloudera Enterprise Hadoop. A major advantage of "Daytona" is that it's easy to deploy to Windows Azure. The CTP includes a basic deployment package with pre-built .NET MapReduce libraries and host source code, C# code and sample data for k-means clustering and outlier detection analysis, as well as complete documentation.
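For readers new to the model, one iteration of k-means clustering can be written as a map step and a reduce step. The sketch below is plain Python, not Daytona's .NET API or its bundled sample; the function names and data are hypothetical, and a real runtime would distribute the map and reduce calls across machines.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances)), point


def reduce_cluster(points):
    """Reduce step: average the points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)  # the "shuffle": group mapped values by key
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
new_centroids = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
```

Repeating the iteration until the centroids stop moving gives the usual k-means result; each map call touches only one point, which is why the algorithm suits a MapReduce runtime.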
"Daytona has a very simple, easy-to-use programming interface for developers to write machine-learning and data-analytics algorithms," said Roger Barga, an architect and group lead on the XCG's Cloud Computing Futures (CCF) team. "[Developers] don't have to know too much about distributed computing or how they're going to spread the computation out, and they don't need to know the specifics of Windows Azure."
Barga said in a telephone interview that the Daytona CTP will be updated in eight-week sprints. This interval parallels the update schedule for Windows Azure CTP during the later stages of its preview in 2010. Plans for the next Daytona CTP update include a RESTful API and performance improvements. In fall 2011, you can expect an upgrade to the MapReduce engine that will enable the addition of stream processing to traditional batch processing capabilities. Barga also said the team is considering an open-source "Daytona" release, depending on community interest in contributing to the project.
In June 2011, Microsoft Research released Excel DataScope, its newest big data analytical and visualization candidate. Excel DataScope lets users upload data, extract patterns from data stored in the cloud, identify hidden associations, discover similarities between datasets and perform time series forecasting using a familiar spreadsheet user interface called the research ribbon (Figure 2).
"Excel presents a closed worldview with access only to the resources on the user's machine. Researchers are a class of programmers who use different models; they want to manipulate and share data in the cloud," explained Barga.
"Excel DataScope keeps a Windows Azure session open for uploading and downloading data to a workspace stored in an Azure blob. A workspace is a private sandbox for sharing access to data and analytics algorithms. Users can queue jobs, disconnect from Excel and come back and pick up where they left off; a progress bar tracks status of the analyses." The Silverlight PivotViewer provides Excel DataScope's data visualization feature. Barga expects the first Excel DataScope CTP will drop in fall 2011.
Google Base is a goner, but Google Fusion Tables holds hope
Google Base was the first Web-accessible, non-relational data management system based on the company's BigTable technology. There was a flurry of interest when Google introduced a Base beta version in 2005, but many early users became disenchanted with its restrictive schema and poor performance.
I first voiced my Google Base frustrations after trying to use it as a general-purpose cloud data store. Google moved Base into its Merchant Center as the data store for Google Product Search in September 2010 and sounded Base's death knell by year's end, when it deprecated its API in favor of a new set of Google Shopping APIs.
Google Fusion Tables. The free beta version of Google Fusion Tables, which was introduced on Google Labs in 2009, lets users upload and download *.csv files with a maximum of 100 MB per dataset and 250 MB per user. Users can share files with the public or other designated users. However, these storage limits are too low for production big data projects; they would need to be expanded by at least four orders of magnitude, from 250 MB to roughly 2.5 TB per user, to get to that point.
Users can filter and aggregate the data, as well as visualize it with Google Maps or other methods offered by the Google Visualization API. Fusion Tables also enable data set and individual item annotation; users can also join, or fuse, tables according to primary key values.
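The fuse-then-aggregate workflow described above can be sketched with ordinary Python data structures. This does not use the Fusion Tables API; the tables, column names and the `fuse` and `total_by` helpers are hypothetical and only show what joining on a key and aggregating the result mean.

```python
from collections import defaultdict

# Two small hypothetical tables that share the key column "store_id".
stores = [
    {"store_id": 1, "city": "Oakland"},
    {"store_id": 2, "city": "Seattle"},
]
sales = [
    {"store_id": 1, "amount": 120.0},
    {"store_id": 2, "amount": 80.0},
    {"store_id": 1, "amount": 30.0},
]


def fuse(left, right, key):
    """Inner-join two tables (lists of dicts) on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def total_by(rows, group_col, value_col):
    """Aggregate: sum value_col for each distinct value of group_col."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)


fused = fuse(stores, sales, "store_id")
totals = total_by(fused, "city", "amount")
```

A filter is just a further pass over the fused rows, for example keeping only rows whose amount exceeds a threshold before aggregating.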
According to a November 2010 Google Operating Systems post, Fusion Tables "graduated" from Google Labs in September and will be included in a Google Docs app. "Google Docs includes fusion tables in the list of document types and there's also an icon for fusion tables. Users can already import tables from Google Spreadsheets and sharing works just like in Google Docs." "Graduation" means Fusion Tables should escape Larry Page's "More wood behind fewer arrows" fatwa of July 2011, which closed down Google Labs.
"The 'Map' and 'Intensity Map' table data should be of special interest to all the GIS folks. It makes the process of mapping data real easy. The 'Location' field type in Fusion Tables supports both street address strings and KML string representation of geometries. The street addresses entered into the location field get automatically geocoded and are viewable on the map visualization."
Another Fusion Tables project includes 2010 U.S. Election Ratings data that lets users choose Senate, House and governor races with projections from a range of reports.
ABOUT THE AUTHOR
Roger Jennings is a data-oriented .NET developer and writer, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.