Hadoop tools combat obstacles of its constant evolution

Hadoop is a top choice for big data analytics, but keeping up with the quickly evolving framework is a struggle. However, a variety of tools can simplify cloud admins' jobs.

When it comes to big data management, Hadoop is the popular kid in school. As its popularity grows, cloud admins have to deal with the challenges of its constantly evolving ecosystem. Hadoop started as a platform for running MapReduce, but it has morphed into a large-scale computing platform that features a variety of tools to support the full lifecycle of data management.

The continuing evolution of Hadoop is enabled by YARN -- the new resource manager in Hadoop 2. With the data management platform constantly changing, cloud admins should be familiar with the variety of Hadoop tools to support the environment.

Three scripts that speak Hadoop's language

Moving data in and out of the Hadoop Distributed File System (HDFS) is common for most admins using Hadoop. While some cloud admins write custom scripts in their favorite programming language, there are three different, efficient options that warrant attention.

  1. Pig. This high-level data flow language for specifying ETL jobs in Hadoop provides commands for processing each line of a file, filtering data sets, grouping subsets of data, sorting and other common operations. Pig works especially well with text files and is easy to learn -- especially for those familiar with SQL. Developers can create user-defined functions to enhance its capabilities.
  2. Sqoop. The bulk data loading tool is designed to simplify moving data between Hadoop and relational databases. It can read metadata and extract data from tables, as well as create files on HDFS containing table data. Sqoop runs bulk data loads in parallel, which makes them faster than a single sequential data loading script. Sqoop also generates Java classes for manipulating imported table data within your applications.
  3. Flume. The distributed system designed for scalability and reliability eliminates the need to design custom apps to scale up or down to varying loads in a stream processing operation. Flume is designed to transport large volumes of data in real time, and it is great for moving data into Hadoop. It uses a basic model based on events, sources, sinks and channels.
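To make the last item's model concrete, here is a minimal Python sketch of how events, sources, channels and sinks relate. It is not Flume's API; the class names (Event, Channel, ListSource, CollectingSink) and the sample log lines are hypothetical, and the in-memory list stands in for a real destination such as HDFS.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """A unit of data: a payload plus optional headers."""
    body: str
    headers: dict = field(default_factory=dict)


class Channel:
    """Buffers events between a source and a sink (in memory here)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class ListSource:
    """Hypothetical source: turns each input line into an event."""

    def __init__(self, lines, channel):
        self.lines = lines
        self.channel = channel

    def run(self):
        for line in self.lines:
            self.channel.put(Event(body=line))


class CollectingSink:
    """Hypothetical sink: drains the channel into a list (stand-in for HDFS)."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def run(self):
        while (event := self.channel.take()) is not None:
            self.delivered.append(event.body)


channel = Channel()
source = ListSource(["login user=ann", "logout user=ann"], channel)
sink = CollectingSink(channel)
source.run()   # source writes events into the channel
sink.run()     # sink drains the channel in arrival order
print(sink.delivered)
```

The channel is the piece that decouples the two ends: the source can keep accepting data while the sink catches up, which is the property that lets this kind of pipeline absorb varying loads.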

Oozie: The workflow engine that could

Admins often need to run an entire series of programs to complete a large-scale big data analysis task. For example, you may need a few Pig and Sqoop jobs to load data, several MapReduce jobs to transform and analyze the data, and another Sqoop job to write the results to a relational database.

Oozie is a workflow engine that allows cloud admins to specify which jobs need to be run and the dependencies among them. For example, a MapReduce job cannot start running until data loading is complete. Administrators specify the details of Oozie workflows using an XML process definition language called hPDL.
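The idea of jobs plus dependencies can be illustrated outside of Oozie. The sketch below is plain Python, not hPDL, and the job names are hypothetical; it uses the standard library's graphlib module (Python 3.9 or later) to derive an order in which every job runs only after the jobs it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "pig_clean": set(),
    "sqoop_import": set(),
    "mapreduce_analyze": {"pig_clean", "sqoop_import"},
    "sqoop_export": {"mapreduce_analyze"},
}

# static_order() yields each job only after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two loading jobs have no dependency on each other, so either may come first; the analysis job always follows both, and the export job always comes last.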

In addition to using a declarative language to describe a workflow, Oozie can restart failed workflows. Instead of needing to re-run each step from the very beginning, Oozie uses information from the workflow to get it back up and running from where it failed.
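A rough sketch of the restart idea, again in plain Python rather than Oozie itself: the runner records which steps finished, so a retry skips them and resumes at the step that failed. The function run_workflow, the step names and the simulated database outage are all assumptions made for illustration.

```python
def run_workflow(steps, completed=None):
    """Run steps in order, skipping any already recorded as completed.

    steps: list of (name, callable) pairs.
    completed: set of step names that finished in an earlier attempt.
    Returns (completed, failed_step); failed_step is None on success.
    """
    completed = set() if completed is None else set(completed)
    for name, action in steps:
        if name in completed:
            continue  # finished in a previous attempt; do not re-run
        try:
            action()
        except Exception:
            return completed, name
        completed.add(name)
    return completed, None


calls = []                   # records every step execution
state = {"db_up": False}     # simulated target database availability


def load():
    calls.append("load")


def analyze():
    calls.append("analyze")


def export():
    calls.append("export")
    if not state["db_up"]:
        raise RuntimeError("target database unavailable")


steps = [("load", load), ("analyze", analyze), ("export", export)]

done, failed = run_workflow(steps)              # first attempt fails at export
state["db_up"] = True                           # the outage is resolved
done, failed_again = run_workflow(steps, done)  # retry resumes at export
print(calls)
```

On the retry, only the export step executes; the load and analyze steps are not repeated.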

Managing data for complex applications

Hadoop is the standard for big data management and analysis. ETL and streaming data tools, workflow support, and data management tools are commonly used in Hadoop environments to streamline DevOps and provide basic data management services at scale.

HDFS alone can meet simple data management requirements for Hadoop, but more complex applications need HBase and Hive.

HBase is a column-oriented NoSQL database. It is designed to support extremely large tables with billions of rows and columns. HBase excels at data management needs such as rapid lookup and updating of data sets with more than a few million rows.
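As a toy illustration of that access pattern, the sketch below models a table as a mapping from row key to a sparse set of columns, so lookups and updates go straight to one row. This is not the HBase client API; the SparseTable class, the "family:qualifier" column names and the sample rows are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for a column-oriented table keyed by row."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; absent columns take no space
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Insert a new cell or overwrite an existing one.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        # Look up one cell, or a copy of the whole row when column is None.
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = SparseTable()
table.put("user#1001", "profile:name", "Ann")
table.put("user#1001", "activity:visits", 3)
table.put("user#1002", "profile:name", "Raj")
table.put("user#1001", "profile:name", "Ann Lee")  # update a single cell

print(table.get("user#1001", "profile:name"))
print(table.get("user#1002"))
```

Each row stores only the columns it actually has, which is what makes very wide tables practical in this style of store.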

Hive is a data warehouse platform that supports SQL-like query capabilities over large-scale data sets, and it takes advantage of Hadoop's parallel architecture to partition large tables over multiple files in HDFS. However, Hive's performance features come with a price. For example, Hive supports overwriting and appending data, but it originally did not support deleting or updating individual rows.
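The partitioning idea can be sketched with ordinary files. The Python below is an illustration under stated assumptions, not Hive itself: the helper names (write_partitioned, read_partition), the part-0.csv file name and the sample sales rows are hypothetical, and a local temporary directory stands in for HDFS. Rows are split into one directory per partition value, so a query on one value reads only that directory's file.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, partition_col, root):
    """Write rows into one CSV file per distinct value of partition_col."""
    root = Path(root)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_col], []).append(row)
    for value, group in by_value.items():
        part_dir = root / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, partition_col, value):
    """Read only the file for one partition value, skipping the others."""
    path = Path(root) / f"{partition_col}={value}" / "part-0.csv"
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


rows = [
    {"region": "east", "sales": "10"},
    {"region": "west", "sales": "7"},
    {"region": "east", "sales": "3"},
]
root = tempfile.mkdtemp()   # local directory standing in for HDFS
write_partitioned(rows, "region", root)

east = read_partition(root, "region", "east")
total_east = sum(int(r["sales"]) for r in east)
print(total_east)
```

The trade-off mentioned above shows up here too: appending a new file to a partition or replacing a whole partition is cheap, while changing a single row means rewriting the file that contains it.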

About the author:
Dan Sullivan holds a master of science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
