Entry-level big data analysis with Google BigQuery

For enterprises intimidated by Hadoop, Google BigQuery offers an entry-level approach to big data analysis, but it may trade accuracy for speed.

Big data projects can be intimidating, especially if they involve setting up and managing Hadoop clusters. If you are more comfortable with SQL than with MapReduce but find that your relational database is not meeting your analysis needs, Google BigQuery is worth a look: it offers an entry-level approach to big data analytics.

BigQuery reduces some of the implementation challenges of analyzing large data sets. The pricing model provides incentives for developing your statistical analysis skills so you can derive useful information from samples of data, rather than from brute force analysis of full data sets.

Working with big data sometimes means trading accuracy for speed. For example, if you want to determine whether two different groups of customers navigate your website differently, you could analyze every log entry associated with all of those customers' activities on the site. Alternatively, you could sample a subset of customers and analyze differences within the sample. Basic statistics gives you the tools to understand the error rate in this kind of analysis and, as long as the error is within an acceptable range, the answers you derive can be sufficient for decision making.
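The sampling idea above can be made concrete with basic statistics. The following is a minimal Python sketch (the function name and the sample numbers are hypothetical, not from the article) that computes a 95% confidence interval for the difference between two sample proportions, such as the click-through rates of two customer groups:

```python
import math

def diff_of_proportions_ci(hits_a, n_a, hits_b, n_b, z=1.96):
    """95% confidence interval for the difference between two
    sample proportions (e.g., conversion rates of two customer groups)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Example: 5,000-customer samples instead of scanning the full log set
low, high = diff_of_proportions_ci(600, 5000, 540, 5000)
```

If the interval is narrow enough for the decision at hand, the sample answers the question and there is no need to scan every log entry.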

This trade-off shows up in the TOP function, which returns the top rows found by a grouping and sorting operation. It may return an approximate result, but it is generally faster than the equivalent combination of GROUP BY, ORDER BY and LIMIT clauses.
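To make the equivalence concrete, here is a small Python sketch (sample data hypothetical) of the exact result that TOP approximates, computed the way the GROUP BY / ORDER BY / LIMIT combination would:

```python
from collections import Counter

page_hits = ["/home", "/pricing", "/home", "/docs", "/home", "/pricing"]

# Exact equivalent of:
#   SELECT page, COUNT(*) FROM hits
#   GROUP BY page ORDER BY 2 DESC LIMIT 2
# BigQuery's TOP(page, 2), COUNT(*) returns approximately this result,
# usually faster, by not guaranteeing an exact ranking.
top_two = Counter(page_hits).most_common(2)
# → [("/home", 3), ("/pricing", 2)]
```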

Google BigQuery operational features

BigQuery is designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It is not a replacement for a fully SQL-compliant database, and it is not suited for transaction processing applications. BigQuery supports an interactive style of analysis. You build your queries using SELECT commands that should be familiar to any SQL developer.

The query language includes support for standard operations such as joining, sorting and grouping, as well as operators for working with nested data structures. As you would expect from a service designed for big data analysis, there is support for aggregate functions such as count, sum, average, variance and standard deviation. The combination of grouping functions and descriptive statistics makes it relatively straightforward to compare means and variances between two populations.

Joins are essential when working with normalized data models, but they can be computationally expensive. The JOIN operator in BigQuery functions like a SQL join, but one of the two tables being joined must be 8 MB or less when compressed. This restriction allows for more efficient join operations, since the smaller table can be cached and joined to the larger table. If you must join two large tables, you can use the JOIN EACH operation -- but expect slower performance.
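A rough Python sketch of why the small-table restriction helps (the tables here are hypothetical): the small table is built into an in-memory hash map once, and the large table is then streamed past it, which is essentially how a hash join against a cached table works:

```python
# Small dimension table: customer_id -> tier (fits comfortably in memory)
small_table = [(1, "gold"), (2, "silver")]
# Large fact table: (order_id, customer_id), streamed row by row
large_table = [(101, 1), (102, 2), (103, 1)]

tiers = dict(small_table)  # one-time build of the cached lookup structure
joined = [(order_id, tiers[cust_id])
          for order_id, cust_id in large_table
          if cust_id in tiers]
# → [(101, "gold"), (102, "silver"), (103, "gold")]
```

When both tables are large, no side fits in memory, which is why joining two large tables requires the slower JOIN EACH path.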

You can work with BigQuery interactively through a simple browser interface. The browser tool keeps a history of queries and provides a workspace for constructing them. It is a basic tool; it does not have the full range of features found in SQL development tools such as MySQL Workbench or DbVisualizer. A command-line interface is also available.

Google BigQuery's pricing model

With access to Google's cloud resources, you might wonder why you would bother with sampling or approximate answers to your queries -- after all, you can just throw more hardware at the problem, right? Yes, you could, but BigQuery charges are based in part on how much data you analyze.

Google's pricing model is based on the amount of data stored and the amount of data analyzed. Storage charges are $0.12/GB per month; interactive queries are billed at $0.035/GB processed, and batch queries at $0.02/GB processed. To limit the amount of data processed, restrict the number of rows you analyze and include only the columns you actually need in your results. BigQuery uses a columnar data store, so queries read only the columns they reference rather than entire rows.
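A back-of-the-envelope cost estimate at the rates quoted above can clarify the incentive to sample. This is a hypothetical sketch using the article's prices, not an official calculator:

```python
# Rates from the pricing model described above
STORAGE_PER_GB_MONTH = 0.12
INTERACTIVE_PER_GB = 0.035
BATCH_PER_GB = 0.02

def monthly_cost(stored_gb, interactive_gb, batch_gb):
    """Estimated monthly charge for storage plus query processing."""
    return (stored_gb * STORAGE_PER_GB_MONTH
            + interactive_gb * INTERACTIVE_PER_GB
            + batch_gb * BATCH_PER_GB)

# 500 GB stored, 2 TB scanned interactively, 1 TB scanned in batch
cost = monthly_cost(500, 2048, 1024)
```

Because the query charge scales with gigabytes processed, a query against a 1% sample costs roughly 1% of a full-table scan.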

Data is loaded into BigQuery using load jobs, which can read data from Google Cloud Storage or from a local file system via a POST request. Files can be in CSV or JSON format. Compressed files are limited to 1 GB; uncompressed files can be up to 1 TB. A load job can include up to 10,000 files, but the combined size of all files cannot exceed 1 TB. Since you can run up to 1,000 load jobs per table per day, these quotas are unlikely to have practical implications for most projects.
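The quotas above are easy to check before submitting a job. The following is a hypothetical pre-flight helper (the function and its input format are illustrative, not part of any BigQuery API) encoding the limits described in the paragraph:

```python
GB = 1024 ** 3
TB = 1024 ** 4

def validate_load_job(files):
    """files: list of (size_bytes, is_compressed) tuples.

    Checks the quotas described above: at most 10,000 files per job,
    compressed files <= 1 GB, uncompressed files <= 1 TB, and a
    combined size of at most 1 TB per job.
    """
    if len(files) > 10_000:
        return False
    for size, compressed in files:
        if compressed and size > 1 * GB:
            return False
        if not compressed and size > 1 * TB:
            return False
    return sum(size for size, _ in files) <= 1 * TB

# Three 500 MB compressed files: within every limit
ok = validate_load_job([(500 * 1024 ** 2, True)] * 3)
```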

About the author
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.