Structuring a big data strategy
A comprehensive collection of articles, videos and more, hand-picked by our editors
MapReduce is a core component of the Apache Hadoop software framework.
Hadoop enables resilient, distributed processing of massive unstructured data sets across commodity computer clusters, in which each node of the cluster includes its own storage. MapReduce serves two essential functions: It parcels out work to various nodes within the cluster or map, and it organizes and reduces the results from each node into a cohesive answer to a query.
MapReduce is composed of several components, including:
- JobTracker -- the master node that manages all jobs and resources in a cluster
- TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks
- JobHistoryServer -- a component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker
To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn't affect a processing job's final results, jobs can be split across almost any number of servers. Therefore, MapReduce and the overall Hadoop framework simplify software development. MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes.
MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node re-assigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.
MapReduce in action
For example, users can list and count the number of times every word appears in a novel as a single server application, but that is time consuming. By contrast, users can split the task among 26 people, so each takes a page, writes a word on a separate sheet of paper and takes a new page when they're finished. This is the map aspect of MapReduce. And if a person leaves, another person takes his place. This exemplifies MapReduce's fault-tolerant element.
When all pages are processed, users sort their single-word pages into 26 boxes, which represent the first letter of each word. Each user takes a box and sorts each word in the stack alphabetically. The number of pages with the same word is an example of the reduce aspect of MapReduce.
Is MapReduce holding back Hadoop?
Apache Spark moves ahead of MapReduce
AWS and Google go head-to-head on big data