Why is Hadoop in the cloud important for big data analytics?
To be meaningful, data must be processed, cleaned, filtered, transformed and modeled. When data sets are small or limited in scope, organizations can analyze and model data on a single server. Enterprises can also use conventional data processing techniques, including those built on statistical computing environments such as the R programming language.
However, when data sets grow, processing tasks must be distributed among multiple servers. Coordinating distributed storage and distributed processing requires an additional software layer. Tools like Apache Hadoop provide this layer as an open source software framework. Hadoop supports distributed data storage and processing for big data sets across computer clusters. It has emerged as the standard framework for big data analytics projects, and version 2.6.0 was released in November 2014.
Hadoop is not a single tool, but a series of related modules. These include Hadoop Common, with the core libraries and utilities; the Hadoop Distributed File System (HDFS), for storing data on distributed machines; Hadoop YARN, for managing and scheduling cluster computing resources; and Hadoop MapReduce, the programming model for big data processing. Enterprises can deploy Hadoop in an on-premises data center, or through cloud providers such as Microsoft Azure, Google Cloud Platform and Amazon Web Services.
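To make the MapReduce model more concrete, here is a minimal sketch of its three stages (map, shuffle, reduce) applied to a word count. It runs in a single Python process and does not use Hadoop's own API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration. In a real cluster, the framework would run many map and reduce tasks in parallel on different machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three stages in sequence on an in-memory list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read records from distributed storage.
    sample = ["big data needs big clusters", "data must be processed"]
    print(word_count(sample))
```

Because each call to `map_phase` looks at only one record and each call to `reduce_phase` looks at only one key, the work can be split across many servers without the tasks needing to coordinate with one another, which is what makes the model suitable for cluster processing.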
Hadoop, while the most common, is not the only big data storage and processing platform. There are a number of alternatives, including Hydra, Zillabyte, Stratosphere and Spark. These alternatives are still emerging, so it's important to test each of them before deployment. Start by evaluating the potential benefits of each platform. For example, the Zillabyte platform caters to Ruby and Python, so it might be a solid option for big data developers who are proficient in those languages. Once a platform is selected, arrange a test deployment in a limited cloud environment that duplicates an established Hadoop environment as closely as possible.
About the author:
Stephen J. Bigelow is the senior technology editor of the Data Center and Virtualization Media Group. He can be reached at firstname.lastname@example.org.