Why is Hadoop in the cloud important for big data analytics?
To be meaningful, data must be processed, cleaned, filtered, transformed and modeled. When data sets are small or limited in scope, organizations can analyze and model data on a single server. In that case, conventional data processing techniques are sufficient, including those built on statistical programming environments such as R.
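As a rough illustration of the single-server case, the sketch below chains cleaning, filtering, transforming and a very simple "model" (a per-region average) using only the Python standard library. The records, field meanings and function names are hypothetical and chosen only for this example.

```python
from statistics import mean

# Hypothetical raw records: (region, sales figure as text).
RAW_RECORDS = [
    ("east", " 120 "),
    ("west", "95"),
    ("east", "n/a"),   # not a number; dropped by clean()
    ("west", "-5"),    # negative value; dropped by keep_valid()
    ("east", "130"),
]


def clean(records):
    """Strip whitespace and drop rows whose value is not numeric."""
    for region, value in records:
        try:
            yield region, float(value.strip())
        except ValueError:
            continue


def keep_valid(records):
    """Filter out rows with negative sales figures."""
    return (row for row in records if row[1] >= 0)


def to_thousands(records):
    """Transform raw figures into thousands."""
    return ((region, value / 1000.0) for region, value in records)


def model(records):
    """A deliberately simple model: the average value per region."""
    grouped = {}
    for region, value in records:
        grouped.setdefault(region, []).append(value)
    return {region: mean(values) for region, values in grouped.items()}


def run_pipeline(records):
    """Run every stage in order on one machine."""
    return model(to_thousands(keep_valid(clean(records))))
```

Calling `run_pipeline(RAW_RECORDS)` returns one average per region. Every stage here runs in a single process, which is exactly the assumption that stops holding once the data no longer fits on one server.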
However, when data sets grow, processing tasks must be distributed among multiple servers. Coordinating distributed storage and distributed processing requires an additional software layer. Tools like Apache Hadoop provide this layer as an open source software framework. Hadoop supports distributed data storage and processing for big data sets across computer clusters. It has emerged as the standard framework for big data analytics projects, and version 2.6.0 was released in November 2014.
Hadoop is not a single tool, but a series of related modules. These include Hadoop Common, with the core libraries and utilities; the Hadoop Distributed File System (HDFS), for storing data on distributed machines; Hadoop YARN, for managing and scheduling cluster computing resources; and Hadoop MapReduce, the programming model for big data processing. Enterprises can deploy Hadoop in an on-premises data center, or through cloud providers such as Microsoft Azure, Google Cloud Platform and Amazon Web Services.
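To make the MapReduce idea more concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It does not use Hadoop's actual API; the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (simulated on one machine)."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `word_count(["big data", "Big clusters process data"])` counts "big" and "data" twice each. Because each document is mapped independently and each key is reduced independently, the same logic can be spread across the nodes of a cluster.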
Hadoop, while the most common, is not the only big data storage and processing platform. Alternatives include Hydra, Zillabyte, Stratosphere and Spark. These alternatives are still emerging, so it's important to test each of them before deployment. Start by evaluating the potential benefits of each platform. For example, the Zillabyte platform caters to Ruby and Python, so it might be a solid option for big data developers who are proficient in those languages. After selecting a platform, arrange a test deployment in a limited cloud environment that duplicates an established Hadoop environment as closely as possible.
About the author:
Stephen J. Bigelow is the senior technology editor of the Data Center and Virtualization Media Group. He can be reached at email@example.com.