Managing Hadoop projects: What you need to know to succeed
One of the most recognized concepts in the cloud space is one that few people understand. When asked about Apache Hadoop, a substantial majority of enterprises will name it as the premier cloud data model. Nearly as many won't know what Hadoop is, how it could be applied or whether it's even useful to them.
Apache Hadoop is an open source implementation of a computing model called MapReduce. MapReduce was popularized by Google, which used it to build its index of the Internet. In its original form, MapReduce was developed as a way to distribute work to a cluster of systems. In a cluster, a "master" node takes a problem and divides it into distributable pieces, then sends each piece to a "worker" node for processing. This divide-and-distribute phase, in which every worker applies the same operation to its own piece, is the "map" part of the name. When all the worker nodes have completed their tasks, the results are returned and combined, or "reduced," to create the final result.
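The map and reduce phases described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, the "worker nodes" are simulated by an ordinary loop, and the word-count task is chosen simply because it is the customary teaching example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Worker-side map: emit a (word, 1) pair for each word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values collected for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different worker node;
    # here the workers are simulated by a plain loop in one process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(chunks))
```

The point of the structure is that map_phase needs nothing but its own chunk, so the chunks can be processed anywhere and in any order before the results are combined.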
The interest in MapReduce and Hadoop, however, is really an interest in applying MapReduce concepts to big data, not in distributing processing jobs across a compute grid. While the original mission of MapReduce is arguably very similar to "grid computing," the concept can also be applied to accessing databases that are stored across a number of systems. This is the typical model for big data for two reasons: Most big data is collected in specific locales and stored there for convenience, and big data is typically too big to be concentrated on a single system.
Hadoop's central component is the Hadoop Distributed File System (HDFS), a file system designed to be virtualized across a potentially enormous number of distributed servers. Hadoop uses JobTrackers and TaskTrackers to do the actual map and reduce tasks. With the proper software elements, Hadoop can operate on both structured and unstructured data, and jobs can be written in nearly any programming language. It's available on most compute platforms, and as long as the versions and tools are organized correctly, you can mix platforms in a Hadoop installation without trouble.
Because Hadoop is built around both HDFS, a distributed data model, and JobTrackers/TaskTrackers, a distributable programming model, it's arguably a perfect framework for building cloud applications. In fact, you can argue that Hadoop is the only true, widely available cloud application framework because it is designed specifically to send processing to where the data is located rather than pull data back to where processing is done. This is a critical requirement in the cloud because moving data on a large scale is expensive and slow. It is very possible that over time, as true cloud applications develop, they will evolve from Hadoop.
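To make the data-movement argument concrete, here is a minimal sketch that counts how many items cross the network under each approach. Everything in it is an assumption made for illustration: the node names, the sample records, the two function names, and the simplification that one record or one partial result counts as one unit moved. It does not model Hadoop itself.

```python
def pull_data_to_compute(nodes, predicate):
    """Central approach: copy every record to one place, then filter there."""
    moved = 0
    central = []
    for records in nodes.values():
        central.extend(records)
        moved += len(records)  # every record crosses the network
    matches = sum(1 for record in central if predicate(record))
    return matches, moved


def push_compute_to_data(nodes, predicate):
    """Locality approach: each node counts its own matches and returns one number."""
    moved = 0
    matches = 0
    for records in nodes.values():
        local_matches = sum(1 for record in records if predicate(record))
        matches += local_matches
        moved += 1  # only the partial count crosses the network
    return matches, moved


if __name__ == "__main__":
    # Hypothetical cluster: three nodes, each holding part of the data.
    nodes = {
        "node-a": [3, 18, 25, 7],
        "node-b": [40, 2, 19],
        "node-c": [11, 30],
    }
    print(pull_data_to_compute(nodes, lambda value: value > 10))
    print(push_compute_to_data(nodes, lambda value: value > 10))
```

Both functions return the same answer, but the second moves one value per node instead of one value per record, and that gap widens as the data grows.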
The flip side of Hadoop's 'perfect' framework
Hadoop has its challenges, of course. Any data processing architecture that masks complexity also creates the risk of major inefficiencies from misuse.
The largest challenge in Hadoop is that of data organization. Because data is distributed, it's possible to construct queries that require correlation among components of the data stored on different systems. For example, imagine a spreadsheet-style structure in which half the columns are on one system and the other half are on another. If a query required testing two columns on different systems against each other, virtually the whole database would have to be moved to perform the task, defeating the principle of distributed data and processing. With structured data it's often fairly easy to design the applications to avoid this kind of inefficiency, but with unstructured data or where business intelligence (BI) queries are highly diverse, major performance problems can result.
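The split-column example can be reduced to a small cost model. The sketch below is a deliberate simplification with hypothetical names (values_shipped, the column names, the two system labels): it assumes the query runs on whichever system already holds the most of the columns it needs, and that every other needed column is shipped there in full. Under that assumption, comparing just two columns on different systems moves one value per row, and the cost grows toward the whole table as more split columns are involved.

```python
def values_shipped(placement, columns_needed, row_count):
    """Count the values that must cross the network for a row-by-row query.

    Assumes the query runs on the system that already holds the most needed
    columns and that each needed column stored elsewhere is shipped in full.
    """
    columns_per_system = {}
    for column in columns_needed:
        system = placement[column]
        columns_per_system[system] = columns_per_system.get(system, 0) + 1
    local_columns = max(columns_per_system.values())
    remote_columns = len(columns_needed) - local_columns
    return remote_columns * row_count


if __name__ == "__main__":
    # Hypothetical layout: half the columns on each of two systems.
    split_layout = {
        "price": "system-1",
        "units": "system-1",
        "cost": "system-2",
        "region": "system-2",
    }
    rows = 1_000_000
    print(values_shipped(split_layout, ["price", "units"], rows))  # same system
    print(values_shipped(split_layout, ["price", "cost"], rows))   # split across systems
```

A layout that keeps frequently compared columns on the same system drives the shipped count to zero for those queries, which is the kind of design decision that is easier to make for structured data than for diverse BI workloads.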
Because of this risk, practical applications of big data in enterprise applications often apply Hadoop in conjunction with traditional tools. Some of the largest Hadoop applications build "front ends" to Hadoop to process information from standard DBMS and data collection applications into HDFS. They also summarize Hadoop results in abstract databases. Having BI applications work on summary data is always more efficient than having the same applications work on the raw big data detail, and pre-processing can ensure that data distribution is optimal.
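The value of summarizing results can be shown with a short sketch. It assumes a hypothetical record layout (region, day, amount) and hypothetical function names; the roll-up step stands in for what a batch job would do over the raw detail, and the query step stands in for a BI tool reading the much smaller summary.

```python
from collections import defaultdict


def build_summary(raw_events):
    """Batch step: roll raw detail records up into totals per (region, day)."""
    summary = defaultdict(int)
    for event in raw_events:
        summary[(event["region"], event["day"])] += event["amount"]
    return dict(summary)


def revenue_for_region(summary, region):
    """BI-style query that reads only the summary, never the raw detail."""
    return sum(total for (row_region, _day), total in summary.items()
               if row_region == region)


if __name__ == "__main__":
    # Hypothetical raw detail; in practice this would be far too large to scan per query.
    raw_events = [
        {"region": "east", "day": "day-1", "amount": 10},
        {"region": "east", "day": "day-1", "amount": 5},
        {"region": "west", "day": "day-1", "amount": 7},
        {"region": "east", "day": "day-2", "amount": 3},
    ]
    summary = build_summary(raw_events)
    print(summary)
    print(revenue_for_region(summary, "east"))
```

The summary has one row per region and day no matter how many raw events were collected, so repeated BI queries pay the cost of the full scan only once.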
Another issue with Hadoop is that it tends to solve big-data problems by applying considerable compute power rather than through efficient processing. For structured data in particular, there may be better DBMS-based mechanisms for distributing data and query processing. Complex jobs may occupy so many resources with Hadoop that job scheduling becomes critical to prevent BI inquiries from overburdening the cluster to the point where more real-time tasks simply can't complete on schedule. Most Hadoop users who mix BI and real-time applications in the same cluster will either schedule jobs to avoid collisions or provide a means of distributing compute time in the cluster to keep large BI tasks from stealing all the resources.
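One way to picture "distributing compute time in the cluster" is a guaranteed share per workload. The sketch below is not Hadoop's scheduler and uses no Hadoop API; the function name, queue names, slot counts and percentages are all hypothetical. It gives each queue up to its guaranteed percentage first, then hands any unused slots to queues that still have demand.

```python
def allocate_slots(total_slots, demand, guaranteed_percent):
    """Split task slots among queues using guaranteed minimum shares.

    Pass 1: each queue receives its demand, capped at its guaranteed share.
    Pass 2: slots left idle are lent to queues with unmet demand, in order.
    """
    allocation = {}
    for queue, percent in guaranteed_percent.items():
        guaranteed_slots = total_slots * percent // 100
        allocation[queue] = min(demand.get(queue, 0), guaranteed_slots)

    spare = total_slots - sum(allocation.values())
    for queue in guaranteed_percent:
        unmet = demand.get(queue, 0) - allocation[queue]
        extra = min(spare, unmet)
        allocation[queue] += extra
        spare -= extra
    return allocation


if __name__ == "__main__":
    shares = {"realtime": 40, "bi": 60}
    # A huge BI job can borrow idle slots when real-time demand is light...
    print(allocate_slots(100, {"realtime": 30, "bi": 500}, shares))
    # ...but cannot push real-time work below its guaranteed share.
    print(allocate_slots(100, {"realtime": 50, "bi": 500}, shares))
```

The design choice here is that the guarantee is a floor rather than a hard partition, so the cluster stays busy when one workload is quiet while real-time tasks keep a reserved minimum.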
Hadoop is a paradigm shift, and because of that it's critical that well-trained teams adopt it through a series of carefully designed pilot steps. The notion that adopting Hadoop alone will turn a disconnected set of cloud data resources into a unified database is false and dangerous. Identifying the pitfalls is difficult even for skilled Hadoop developers unless alternatives, particularly data distribution alternatives, are tested extensively before committing to production.
About the author
Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982.