Apache Sqoop: a crucial link for big data analytics in the cloud

Hadoop is becoming the popular enterprise choice for big data analytics, but getting your data there isn't so simple. Apache Sqoop is stepping in to help bring the vital data from your database to Hadoop.

As big data becomes more important and more applications are deployed to the cloud, cloud administrators face increasing demands to migrate data from source systems to analysis systems -- the most popular being Hadoop. Custom scripts or ETL tools have helped with this migration in the past, but Apache Sqoop is emerging as a better option.

As moving data between Hadoop and relational databases becomes a standard process, cloud admins can take advantage of Sqoop's parallel bulk data loading capabilities to streamline that work and reduce the need to write custom data load scripts.

Apache Sqoop is a command-line tool for transferring data between relational databases and Hadoop. Connectors are also available for some NoSQL databases. Sqoop, similar to other ETL tools, uses schema metadata to infer data types and ensure type-safe data handling when the data moves from the source to Hadoop. Designed for big data bulk transfers, it partitions data sets and creates Hadoop jobs to process each partition.
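A basic bulk import looks like the following sketch. The connection string, credentials and table name are hypothetical placeholders; substitute your own source database details.

```shell
# Minimal Sqoop import of one table into HDFS (hypothetical MySQL source).
# -P prompts for the password rather than putting it on the command line.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/analyst/orders
```

Sqoop reads the table's schema metadata, generates a MapReduce job, and writes the rows to the target directory in HDFS.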

Sqoop has a fairly small command set that includes importing and exporting, listing database and table information, generating Java classes to manipulate data, evaluating SQL commands, and some more specialized commands. The Java class generation command is especially useful for writing Java applications to manipulate data in Hadoop. The SQL evaluation command shows the results of executing a SQL statement, and it is useful when exploring new databases or crafting queries with complicated logic.
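The commands above can be sketched as follows; again, the connection details and table names are illustrative assumptions, not taken from a real deployment.

```shell
# List the tables in the source database
sqoop list-tables \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P

# Evaluate a SQL statement and print the results -- handy for
# exploring a new database or testing a complicated query
sqoop eval \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --query "SELECT status, COUNT(*) FROM orders GROUP BY status"

# Generate a Java class that can manipulate rows of the orders table
sqoop codegen \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders
```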

There are definite advantages to using Sqoop over custom scripts. For one, Sqoop is designed to be fault-tolerant. You can write custom scripts that monitor job status and recover from failures, but that can be incredibly time-consuming.

What to watch for when using Sqoop

Despite these advantages, there are some things to be wary of when using Sqoop. First, be careful with default parallelism: Sqoop assumes the data is evenly distributed across the range of the split key. That assumption holds when the source system uses a sequence to generate primary keys -- on a 10-node cluster, for example, the workload is then allocated evenly across all 10 servers. However, if the split key is an alphanumeric column with, say, 20 times as many keys beginning with "A" as with "M", the workload can skew heavily toward some servers over others.
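One way to keep partitions balanced is to set the split column and the degree of parallelism explicitly rather than relying on the defaults. The column and mapper count below are hypothetical choices for illustration.

```shell
# Split on a sequentially assigned primary key and run 10 parallel mappers,
# matching a hypothetical 10-node cluster
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --split-by order_id \
  --num-mappers 10
```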

If your top consideration is performance, look into direct loads. Direct loads bypass the usual JDBC import path and use the direct load utility provided by the database, such as MySQL's mysqldump. However, there are database-specific limitations. For example, you cannot import BLOB and CLOB types using the MySQL or PostgreSQL connectors, and neither driver supports importing from views. The Oracle direct driver requires privileges to read metadata sources, such as dba_objects and v_$parameter. Check the documentation for your database's direct connect driver for limitations.
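Enabling a direct load is a one-flag change to the import, as in this sketch against a hypothetical MySQL source:

```shell
# --direct tells Sqoop to use the database's native bulk utility
# (mysqldump for MySQL) instead of the JDBC import path
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --direct
```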

Performing incremental imports is a top efficiency concern because Sqoop is designed for large-scale data sets. Sqoop supports incremental updating based on rows appended to the source since the most recent import, or by specifying a last-modified timestamp.
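The two incremental modes can be sketched as follows; the check columns and last values are hypothetical, and in practice Sqoop reports the new last value at the end of each run for use in the next one.

```shell
# Append mode: import only rows whose order_id exceeds the last imported value
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 1000000

# Last-modified mode: import rows changed since the given timestamp
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2014-06-01 00:00:00"
```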

Given Sqoop's role in moving data in and out of relational databases, it is no surprise that it has specialized support for Hive -- the SQL-like data warehouse facility in the Hadoop ecosystem. The "create-hive-table" command imports a table definition into Hive.
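The Hive integration can be sketched in two ways: creating just the table definition, or importing the data and creating the Hive table in one step. Table names here are hypothetical.

```shell
# Create a Hive table definition matching the source table's schema
sqoop create-hive-table \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --hive-table orders

# Or import the data and create the Hive table in a single step
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --hive-import
```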

About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
