Why is Hadoop in the cloud important for big data analytics?
By submitting your personal information, you agree that TechTarget and its partners may contact you regarding relevant content, products and special offers.
To be meaningful, data must be processed, cleaned, filtered, transformed and modeled. When data sets are small or limited in scope, organizations can analyze and model data on a single server. Enterprises can also use conventional data processing techniques, including those with software development platforms such as the R programming language.
However, when data sets grow, processing tasks must be distributed among multiple servers. To coordinate distributed storage and distributed processing tasks, it's also necessary to have an additional software layer. Tools like Apache Hadoop provide this additional layer as an open source software framework. Hadoop supports distributed data storage and processing for big data sets across computer clusters. It has emerged as the standard framework for big data analytics projects, and version 2.6.0 was released in November.
Hadoop is not a single tool, but a series of related modules. These include Hadoop Common with the core libraries and utilities; the Hadoop Distributed File System (HDFS) for storing data on distributed machines; Hadoop YARN for managing and scheduling cluster computing resources; and Hadoop MapReduce, the programming language for big data processing. Enterprises can deploy Hadoop in an on-premises data center, or through cloud providers such as Microsoft Azure, Google App Engine, and Amazon S3.
Hadoop, while the most common, is not the only big data storage and processing platform. There are a number of alternatives, including Hydra, Zillabyte, Stratosphere and Sparc. These alternatives are still emerging, so it's important to test them each before deployment. To do this, evaluate the potential benefits of each platform. For example, the Zillabyte platform caters to Ruby and Python, so it might be a solid option for big data developers who are proficient in those languages. Once enterprises select a platform, arrange a test deployment using a limited cloud environment that duplicates an established Hadoop environment as closely as possible.
About the author:
Stephen J. Bigelow is the senior technology editor of the Data Center and Virtualization Media Group. He can be reached at email@example.com.
Getting to know Hadoop cloud tools
Hadoop as a service competition heats up
Following customer trends with cloud predictive analytics
Related Q&A from Stephen J. Bigelow
Photon Controller and vSphere Integrated Containers both manage containers, but in different ways. What's the difference between these utilities, and...continue reading
Photon OS optimizes VMware Photon platform deployment, not only in vSphere but in GCE, EC2 and more. Follow these steps to learn how to run Photon OS...continue reading
Performance problems can be caused by a number of things, including overprovisioning and poor vCPU selection and assignment to VMs. Use these ...continue reading
Have a question for an expert?
Please add a title for your question
Get answers from a TechTarget expert on whatever's puzzling you.