Tip

Entry-level big data analysis with Google BigQuery

Big data projects can be intimidating, especially if they involve setting up and managing Hadoop clusters. If you are more comfortable with SQL than with MapReduce, but find that your relational database is not meeting your analysis needs, Google BigQuery is worth a look. Google Big Query offers an entry-level approach to

    Requires Free Membership to View

big data analytics.

BigQuery reduces some of the implementation challenges of analyzing large data sets. The pricing model provides incentives for developing your statistical analysis skills so you can derive useful information from samples of data, rather than from brute force analysis of full data sets.

Sometimes, to work with big data means trading accuracy for speed. For example, if you want to determine whether two different groups of customers navigate your website differently, you could analyze every log entry associated with all of those customers' activities on the site. Alternatively, you could sample a subset of customers and analyze differences within the sample. Basic statistics gives you the tools to understand the error rate in this kind of analysis and, as long as the error is within an acceptable range, the answers you derive can be sufficient for decision making.

This kind of trade-off is seen in the TOP function, which returns the top rows found by a grouping and sorting operation. It may return an approximate result, but the function is generally faster than using a combination of "group by," "order by" and "limit" clauses.

Google BigQuery operational features

BigQuery is designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It is not a replacement for a fully SQL-compliant database, and it is not suited for transaction processing applications. BigQuery supports an interactive style of analysis. You build your queries using SELECT commands that should be familiar to any SQL developer.

BigQuery is designed for analyzing data on the order of billions of rows, using a SQL-like syntax.

The query language includes support for standard operations, such as joining, sorting and grouping, as well as for operators for working with nested data structures. As you would expect from a service designed for big data analysis, there is support for aggregate functions, such as count, sum, average, variance and standard deviation. The combination of grouping functions and descriptive statistics makes it relatively straightforward to compare means and variance between two populations.

Joins are essential operations when working with normalize data models, but they can be computationally expensive. The JOIN operator in BigQuery functions is like a SQL join but, when joining two tables, one of the tables must be 8 MB or less when compressed. This kind of restriction allows for more efficient join operations, since the smaller table can be cached efficiently and joined to the larger table. If you must join two large tables you can use the JOIN EACH operation -- but expect slower performance.

You can work with BigQuery interactively using a simple browser interface. The browser tool keeps a history of queries and provides a workspace for constructing your queries. This is a basic tool; it does not have the full range of features you would find in SQL developer tools, such as MySQL Workbench or DBVisualizer. You can also work with a command line interface.

Google BigQuery's pricing model

With access to Google's cloud resources, you might wonder why you would bother with sampling or approximate answers to your queries -- after all, you can just throw more hardware at the problem, right? Yes, you could, but BigQuery charges are based in part on how much data you analyze.

Google's pricing model is based on the amount of data stored and the amount of data analyzed. Storage charges are $0.12/GB per month. Interactive queries are billed at $0.035/GB processed and batch queries cost $0.02/GB processed. To limit the amount of data processed, you can limit the number of rows you analyze and include only columns you actually need in your results. BigQuery uses a columnar data store, so not all data in a row is retrieved when one or more of the columns is queried.

Data is loaded into BigQuery using load jobs that can work with data in Google Cloud Storage or from a local file system using a POST request. Files can be formatted as CSV or in JSON format. Compressed files are limited to 1 GB, but uncompressed files can be up to 1 TB. Load jobs can include up to 10,000 files, but the size of all files combined cannot exceed 1 TB. Since you can have up to 1,000 load jobs per table per day, these quotas are not likely to have practical implications for most projects.

About the author
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.

This was first published in April 2013

There are Comments. Add yours.

 
TIP: Want to include a code block in your comment? Use <pre> or <code> tags around the desired text. Ex: <code>insert code</code>

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
Sort by: OldestNewest

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to:

Disclaimer: Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.