Data management in the cloud often involves loading and maintaining files in a distributed file system, such as Hadoop Distributed File System (HDFS), and then processing that data with a tool like MapReduce. For data warehousing and other analytics tasks, database tools like Hive provide SQL-like functionality on top of distributed file systems.
Even though parallels can be drawn between conventional relational database management systems and cloud-based nonrelational databases, operational differences create issues when moving data between the two segments. And extracting, transforming and loading processes can create even more challenges.
Data migration tools to assist a move to cloud
Extracting data from a database is easy; efficiently mining large volumes of data from a database can be a challenge. If your data warehouses encounter performance or storage issues because of growing data volumes, it may be time to consider using cloud resources. There are several tools to help load data from relational databases to a cloud file system and database.
More on big data in the cloud
Specialized tools, like Sqoop (SQL-to-Hadoop), generate code to extract data from relational databases and copy it to HDFS or Hive. Sqoop uses JDBC drivers to work with multiple types of relational databases, but pushing large volumes of data through JDBC comes with performance costs.
When extracting data from relational databases for a move to the cloud, you may need to transform the data. If all the data you are working with originates from a single database, you can perform the transformations in the source database. If you’re merging data from two separate systems, it’s often more efficient to move the data source after extracting it. However, you should do this before you load the data into the final data store. The Cascading data processing API can help with this task.
Cascading provides functions such as processing, planning and scheduling for workflows running on Hadoop. It works with a pipes-and-filters metaphor; data is streamed from a source to a target through a pipe with applied filters. Other functions, such as grouping, can be applied to data streams. Cascading is implemented in Java and transforms API calls into MapReduce jobs.
If you’re working with MySQL, Sqoop can use a MySQL dump utility to bypass JDBC and extract data more efficiently. Sqoop can also generate Java classes, which can be used to manipulate loaded data and import it directly into Hive. HIHO (Hadoop Input and Output) extracts data from relational tables and provides some basic transformation services, such as deduplication and merging input streams.
When generating files that require minimal transformation before loading them into the HDFS file system or a Hive data warehouse, you may be able to load the files directly. Hive has a command to load data after determining a target table and a partition specification. Pig, a high-level language for data analysis programs, can be useful when working with files on HDFS. Pig is easy to program, especially when compared to coding MapReduce in Java. It provides the basic aggregate functions you would find in a relational database (e.g., min, max, count) as well as math and string manipulation functions. Pig natively supports compression for structured and unstructured text files.
Cloud computing resources complement data warehousing infrastructures. However, to maximize the benefits of moving data warehousing to the cloud, it’s important to structure data properly and implement the right data analysis tools.
Dan Sullivan, M.Sc., is an author, systems architect and consultant with over 20 years of IT experience with engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education, among others. Dan has written extensively about topics ranging from data warehousing, cloud computing and advanced analytics to security management, collaboration, and text mining.
This was first published in February 2012