Big data tutorial: Everything you need to know
A comprehensive collection of articles, videos and more, hand-picked by our editors
Apache Hadoop and MapReduce get the most attention from “big data” analysts and business intelligence specialists,...
but populating Hadoop Distributed File System tables and writing or executing MapReduce jobs in Java demands some serious software development skills. Apache Hive could be the answer.
The Apache Software Foundation’s Hive data warehousing component, an element of the cloud-based Hadoop ecosystem, offers a set-based query language called HiveQL that translates SQL-like queries into MapReduce jobs automatically.
Relational databases, such as IBM DB2, Oracle and SQL Server as well as data warehousing applications, have dominated the business intelligence (BI) scene for so long most data analysts have substantial SQL language skills. Similarly, business analysts are accustomed to summarizing data for managers with Excel spreadsheets, PivotTables and graphs.
Let’s look at an end-to-end BI project that runs under Windows Azure. This example starts with a large amount of data and delivers an Excel graph of on-time flight arrival statistics for U.S. certified air carriers -- without having to write programming code.
Hooking up with Apache Hadoop on Windows Azure CTP
Microsoft’s SQL Server team announced the availability of community technical preview of Apache Hadoop on Windows Azure, or HadoopOnAzure CTP,in December 2011. The company claims it simplifies Hadoop access and setup, can issue Hive queries to pull and analyze unstructured data from Hadoop in Excel and offers the elasticity of Windows Azure.
HadoopOnAzure CTP is private, so users need to request an invitation by filling out a brief survey on Microsoft Connect. After you receive your invitation, proceed to the HadoopOnAzure website and sign in with your Windows Live ID. Type a globally unique DNS name, select an initial Hadoop cluster size, provide a cluster login name and password, and click the Request Cluster button (Figure 1).
Provisioning the cluster should take about 15 to 30 minutes. There’s no charge for HadoopOnAzure CTP resources, but clusters require you to renew the subscription during the final six hours of the initial 24-hour duration; you’ll need to renew credentials daily after that.
You’ll need a Windows Azure subscription and a storage account if you want to use Windows Azure blobs for persistent data storage, otherwise you’ll lose data stored in Hadoop Distributed File System (HDFS) when the cluster times out. If you don’t have a subscription, you can sign up for a free three-month Windows Azure account that gives you 20 GB of storage, a million storage transactions and 20 GB of outbound bandwidth.
Populating SQL Azure blobs with big data
This Apache Hive project takes data from the U.S. Federal Aviation Administration (FAA) on six months of air carrier arrival delay information for January 2012 and the last five months of 2011. The six tab-delimited text files, which have a subset of the FAA files’ columns, contain about 500,000 rows and about 25 MB each.
You’ll need to upload the data files to a folder within a blob container to make them visible to Hive. I detailed the process of creating Azure blob source data on my blog. I also included information on data files and how to download them from my Windows Live SkyDrive account and upload them to Windows Azure blobs in Microsoft’s Chicago-based data center.
Figure 3. The Hive Tables list box will include the name of the new table; the Columns box displays field names of a selected table. Clicking the >> button inserts the selected list item in the text box.
After cluster provisioning is complete, the Elastic MapReduce (EMR) portal’s landing page appears with metro-style tiles to open pages for HadoopOnAzure MapReduce tasks, as well a cluster and account management (Figure 2).
Copy the Primary Access Key for your storage account from the Windows Azure Management portal to the clipboard, click the Manage Cluster tile to open the page of the same name and then click Set Up ASV (Azure Storage Vault) to use a Windows storage account as the data source for a Hive table. Alternatively, you could use data stored in Amazon S3 (Simple Storage Service) or from the Windows Azure Dataplace DataMarket as the data source for a Hive table.
Type your storage account name, paste the Primary Access Key value in the Passkey text boxes and click Save Settings to enable Hive to access blobs in the account containers. You will receive a message stating the Azure account setup was successful, if your credentials are accepted.
Unlike HDFS, Hive tables require a schema, even for simple key-value data. To generate a Hive table from non-HDFS, or external, tab-delimited data and define its column names and data types, you’ll need to run a CREATE EXTERNAL TABLE statement, such as the following HiveQL statement for data in a flightdata folder of an aircarrier container:
CREATE EXTERNAL TABLE flightdata_asv (
COMMENT 'FAA on-time data'
ROW FORMAT DELIMITED FIELDS TERMINATED by '9'
STORED AS TEXTFILE
Apache Hive has relatively few data types and doesn’t support date/time fields, so it’s fortunate that the source *.csv file has integer year, month and day values. Departure (depdelay) and arrival (arrdelay) delay values are displayed in minutes.
To execute interactive HiveQL statements, click the Elastic MapReduce’s Interactive Console tile and then click the Hive button to open the Interactive Hive page with a read-only results pane at the top and a text box to enter instructions below it (Figure 3, above).
Download and install Apache Hive ODBC Driver and Excel add-in
Return to the Elastic MapReduce console’s home page and click the Downloads tile. Click the installer link that corresponds with your Excel 2010 edition (32-bit, for example) and then click Run to open a warning dialog. Click More Options to expose the Run Anyway option and then click it to start the installation and open the ODBC Driver for Hive Setup dialog. Then mark the I Accept check box.
Click Install to install the driver, click Finish to exit the installer. Next, open Excel and click the Data tab to verify presence of the Hive Pane icon and click it to display the Hive Query panel at the right of the worksheet. Installing the Add-In places a Hive Pane icon in the Hive Data section of the menu.
Return to the EMR console’s home page and click Open Ports to open the Configure Ports page and click ODBC Server to move the slider to the right and open TCP port 10000 for the ODBC Server.
Executing an interactive Apache Hive query
Return to Excel, click the Hive pane icon to open the Hive Query task pane, click Enter Cluster Details to open the ODBC Hive Setup dialog, type a description and the DNS host name and accept the TCP port. Next, select Username/Password authentication and then type your username and password for the Elastic MapReduce portal instance (Figure 4).
If the Hive Setup settings you made in the ODBC Hive Setup dialog are correct, the name you typed as the description value will appear when you open the Select or Enter Hive Connection list. Open the Select the Hive Object to Query list and select flightdata_asv [Table] to populate the Columns list. For additional details, refer to my tutorial sections on setting up the ODBC data source in Excel and executing HiveQL queries from Excel.
To execute an aggregate query that reports the average arrival delay in minutes for each U.S. certified carrier for the dates covered by the source data, mark the carrier and arrdelay check boxes, open the function list for the arrdelay field and double-click avg in the list to add it to the HiveQL statement (Figure 5).
Next, clear the Limit Results check box, open the Aggregate Grouping list and select the carrier column.
Type arrdelay between in the parentheses of avg(), as such: avg(arrdelay), to correct a bug in the query design process and click Execute Query to return the result set (Figure 6).
Remove the spurious Carrier item, which was probably added when a column header was mistakenly left in a text file, reduce the length of the decimal fraction to 1, close the task pane and add a bar chart to the worksheet. Add a chart title, x-axis title and data labels, as shown in Figure 7.
This example demonstrates how easy it is for data analysts and BI specialists to summarize big data using the HadoopOnAzure CTP implementation. The Microsoft Codename “Cloud Numerics” CTP has similar aggregation capabilities but requires the use of Visual Studio 10 or later to process similar FAA data. The capability to send tabular data directly to Excel worksheets for further analysis and charting makes HadoopOnAzure’s Interactive Hive with the Hive ODBC data source and Excel add-in the logical choice for grappling ad hoc big data.
Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.