This content is part of the Essential Guide: An enterprise guide to big data in cloud computing

Does Hadoop in the cloud give big data a boost?

Hadoop is an option for handling the vast and diverse data sets that come with big data. But how exactly can it be used in an organization?

Why is Hadoop in the cloud important for big data analytics?

To be meaningful, data must be processed, cleaned, filtered, transformed and modeled. When data sets are small or limited in scope, organizations can analyze and model data on a single server. Enterprises can also use conventional data processing techniques, including those with software development platforms such as the R programming language.

However, when data sets grow, processing tasks must be distributed among multiple servers. To coordinate distributed storage and distributed processing tasks, it's also necessary to have an additional software layer. Tools like Apache Hadoop provide this additional layer as an open source software framework. Hadoop supports distributed data storage and processing for big data sets across computer clusters. It has emerged as the standard framework for big data analytics projects, and version 2.6.0 was released in November.

Hadoop is not a single tool, but a series of related modules. These include Hadoop Common with the core libraries and utilities; the Hadoop Distributed File System (HDFS) for storing data on distributed machines; Hadoop YARN for managing and scheduling cluster computing resources; and Hadoop MapReduce, the programming language for big data processing. Enterprises can deploy Hadoop in an on-premises data center, or through cloud providers such as Microsoft Azure, Google App Engine, and Amazon S3.

Hadoop, while the most common, is not the only big data storage and processing platform. There are a number of alternatives, including Hydra, Zillabyte, Stratosphere and Sparc. These alternatives are still emerging, so it's important to test them each before deployment. To do this, evaluate the potential benefits of each platform. For example, the Zillabyte platform caters to Ruby and Python, so it might be a solid option for big data developers who are proficient in those languages. Once enterprises select a platform, arrange a test deployment using a limited cloud environment that duplicates an established Hadoop environment as closely as possible.

About the author:
Stephen J. Bigelow is the senior technology editor of the Data Center and Virtualization Media Group. He can be reached at [email protected].

Next Steps

Getting to know Hadoop cloud tools

Hadoop as a service competition heats up

Following customer trends with cloud predictive analytics

Dig Deeper on Big data, machine learning and AI