MapReduce definition

This definition is part of our Essential Guide: Managing Hadoop projects: What you need to know to succeed
Contributor(s): Stephen J. Bigelow and Mark C. Chu-Carroll

MapReduce is a core component of the Apache Hadoop software framework.

Hadoop enables resilient, distributed processing of massive unstructured data sets across commodity computer clusters, in which each node of the cluster includes its own storage. MapReduce serves two essential functions: It parcels out work to various nodes within the cluster or map, and it organizes and reduces the results from each node into a cohesive answer to a query.

MapReduce is composed of several components, including:

  • JobTracker -- the master node that manages all jobs and resources in a cluster
  • TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks
  • JobHistoryServer -- a component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker

To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn't affect a processing job's final results, jobs can be split across almost any number of servers. Therefore, MapReduce and the overall Hadoop framework simplify software development. MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes.

MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node re-assigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.

MapReduce in action

For example, users can list and count the number of times every word appears in a novel as a single server application, but that is time consuming. By contrast, users can split the task among 26 people, so each takes a page, writes a word on a separate sheet of paper and takes a new page when they're finished. This is the map aspect of MapReduce. And if a person leaves, another person takes his place. This exemplifies MapReduce's fault-tolerant element.

When all pages are processed, users sort their single-word pages into 26 boxes, which represent the first letter of each word. Each user takes a box and sorts each word in the stack alphabetically. The number of pages with the same word is an example of the reduce aspect of MapReduce.

This was first published in March 2015

Next Steps

Is MapReduce holding back Hadoop?

Apache Spark moves ahead of MapReduce

AWS and Google go head-to-head on big data

PRO+

Content

Find more PRO+ content and other member only offers, here.

3 comments

Oldest 

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to:

-ADS BY GOOGLE

File Extensions and File Formats

Powered by:

SearchServerVirtualization

SearchVMware

SearchVirtualDesktop

SearchAWS

SearchDataCenter

SearchWindowsServer

SearchSOA

SearchCRM

Close