Public clouds not only change the cost structure of computing and storage, they also extend the range of analysis enterprise IT can perform. This is especially true when working with big data sets that would not have been practical without access to elastic computing and storage.
"Big data" is loosely defined as data sets that are too large to work with using conventional data management techniques and infrastructure. Detailed server logs, clickstream data, social networking data and mobile device data can all be used to supplement the types of transactional data often captured in data warehouses and business intelligence (BI) systems. In addition, public data repositories and third-party aggregators offer big data sets on a wide range of topics.
Incorporating these data sources can allow for more detailed and precise analysis. Obtaining details about how customers browsed through your site and how long they viewed various products can lead to more insights about customer preferences than tracking product purchases can.
Searching for big data: Three sources
Before you can work with big data, it's important to determine what type of data you're dealing with. Big data sources fall into three broad categories: internally generated data, data set marketplaces and third-party data generators.
Internally generated big data is often a byproduct of IT operations. It includes network traffic, clickstream data and application logs. In the past, companies would capture limited information about a significant event, such as a customer who made a purchase. Now companies can capture and, more importantly, analyze low-level details about customers' interactions with their business applications. Combine these details with data mining algorithms, and you may find insights into the usability of your interface, patterns associated with low-margin transactions or unexpected clusters of customer types.
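As a rough illustration of this kind of low-level analysis, the sketch below tallies page views and dwell time from clickstream records. The record format and page names are invented for the example; real clickstream data would be parsed from server or application logs.

```python
from collections import Counter

# Hypothetical clickstream records: (customer_id, page, seconds_viewed).
# In practice these would be parsed out of web server or application logs.
clicks = [
    ("c1", "/products/widget", 45),
    ("c1", "/products/gadget", 5),
    ("c2", "/products/widget", 60),
    ("c2", "/checkout", 12),
    ("c3", "/products/widget", 30),
]

# Count how often each page is viewed and the total time spent on it.
views = Counter(page for _, page, _ in clicks)
dwell = Counter()
for _, page, secs in clicks:
    dwell[page] += secs

# Pages customers linger on can signal interest that purchase records miss.
top_page, top_views = views.most_common(1)[0]
print(top_page, top_views, dwell[top_page])  # /products/widget 3 135
```

At scale, the same counting logic would run as a distributed job rather than an in-memory loop, but the analysis is the same.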
Data set marketplaces, such as Infochimps, Amazon Web Services' (AWS) public data sets and the Windows Azure Marketplace, offer convenient access to a wide range of data sets to supplement your internal data. If you're interested in prescription drug use, retail sales data, transportation data or a wide array of other topics, you can find the data in one of these data marketplaces. Many of these marketplaces offer cloud data analytics, so you can work with them directly from virtual machines in the cloud.
Third-party generators are organizations that focus on collecting and providing data for customers or public use. The U.S. federal government and the European Union, for example, generate large volumes of data on demographics, economics and public health. Commercial providers, such as Hoover's, also sell value-added data sets, including marketing and risk-management data.
Enterprise tools can mine big data's potential
It can be difficult to incorporate large volumes of unstructured and semi-structured data into relational databases. Cloud data analytics tools give enterprises of all sizes the ability to analyze that data.
If data is well-structured, you may want to stick with relational databases like Oracle or Microsoft SQL Server, both of which are available from AWS, Microsoft Windows Azure and other cloud providers.
When you start working with hundreds of millions or billions of rows of data, it's time to consider Hadoop or Google BigQuery. AWS has a Hadoop service called Elastic MapReduce, which can save you from having to install and configure a Hadoop cluster yourself. Hadoop is a good fit for batch-oriented analysis, but BigQuery is a better option for interactive analysis. BigQuery uses a SQL-like query language and supports Tableau Software's visualization tool, two important considerations for ad hoc analysis.
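To show the style of ad hoc, SQL-like aggregation that makes BigQuery attractive for interactive analysis, the sketch below runs an aggregate query against an in-memory SQLite database. SQLite stands in for BigQuery here (the dialects differ, and the table and column names are made up); the point is the query pattern, not the engine.

```python
import sqlite3

# In-memory SQLite as a stand-in for a cloud query service; the
# table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (customer_id TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("c1", "/products/widget", 45),
     ("c2", "/products/widget", 60),
     ("c2", "/checkout", 12)],
)

# An interactive aggregation: views and total dwell time per page.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views, SUM(seconds) AS total_seconds "
    "FROM clicks GROUP BY page ORDER BY views DESC"
).fetchall()
print(rows)  # [('/products/widget', 2, 105), ('/checkout', 1, 12)]
```

The same GROUP BY query, pointed at billions of rows in a managed service, is what distinguishes interactive analysis from a batch-oriented Hadoop job.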
Data integration and management gotchas
Many of the tasks associated with extraction, transformation and load (ETL) operations in data warehouses carry over into big data analysis. Linking entities across multiple data sets is challenging when each data set uses its own identifiers, and data formats need to be standardized.
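A minimal sketch of that standardization step, assuming two invented sources that describe the same customers with different identifier and date conventions:

```python
from datetime import datetime

# Two hypothetical sources for the same customers; the formats and
# field names are assumptions for illustration only.
crm = [{"cust": "ACME-0042", "signup": "2012-11-05"}]
web = [{"customer_id": "42", "last_visit": "11/05/2012"}]

def normalize_crm(rec):
    # Strip the "ACME-" prefix and leading zeros to get a shared key.
    return {"id": str(int(rec["cust"].split("-")[1])),
            "signup": datetime.strptime(rec["signup"], "%Y-%m-%d").date()}

def normalize_web(rec):
    return {"id": rec["customer_id"],
            "last_visit": datetime.strptime(rec["last_visit"], "%m/%d/%Y").date()}

# Once identifiers and dates are standardized, the records can be linked.
linked = {r["id"]: r for r in map(normalize_crm, crm)}
for r in map(normalize_web, web):
    linked.setdefault(r["id"], {}).update(r)
print(linked["42"])
```

Real ETL pipelines do the same thing at scale: derive a common key, normalize formats, then join.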
Watch for differences in aggregation levels. For example, some data may be available at the household level, while other data is only as detailed as the census-tract level.
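The usual remedy is to roll the finer-grained data up to the coarser level before joining. A small sketch with invented household records:

```python
from collections import defaultdict

# Hypothetical household-level records, each tagged with a census tract.
households = [
    {"tract": "41051001", "income": 52000},
    {"tract": "41051001", "income": 61000},
    {"tract": "41051002", "income": 48000},
]

# Aggregate household data up to the tract level so it can be joined
# with data that only exists at that coarser level.
by_tract = defaultdict(list)
for h in households:
    by_tract[h["tract"]].append(h["income"])

tract_avg = {t: sum(v) / len(v) for t, v in by_tract.items()}
print(tract_avg)  # {'41051001': 56500.0, '41051002': 48000.0}
```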
Most importantly, beware of data transfer costs, which often accompany big data. When possible, use virtual servers in the same cloud where your data is stored. When working with Google BigQuery, keep in mind that you will be charged according to the volume of data processed by queries, so only query for the rows and columns you need.
About the author
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
This was first published in February 2013