
MapReduce

MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google to index Web pages and replaced the company's original indexing algorithms and heuristics in 2004.

The framework is divided into two parts:

  • Map, a function that parcels out work to different nodes in the distributed cluster.
  • Reduce, a function that collates the work and resolves the results into a single value (a minimal sketch of both functions appears below).

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
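
To make that division of labor concrete, here is a minimal word-count sketch in Python. The function names and the word-count task are illustrative assumptions, not part of any particular MapReduce library: the map function turns each input record into key/value pairs, and the reduce function combines every value that shares a key into a single result.

    # Illustrative sketch only: map emits a (word, 1) pair for each word in a
    # line of text, and reduce sums the counts that arrive for a single word.
    def map_func(_, line):
        for word in line.split():
            yield word.lower(), 1

    def reduce_func(word, counts):
        yield word, sum(counts)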

According to software engineer Mark C. Chu-Carroll:

"The key to how MapReduce works is to take input as, conceptually, a list of records. The records are split among the different computers in the cluster by Map. The result of the Map computation is a list of key/value pairs. Reduce then takes each set of values that has the same key and combines them into a single value. So Map takes a set of data chunks and produces key/value pairs and Reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can't tell whether the job was split into 100 pieces or 2 pieces...MapReduce isn't intended to replace relational databases: it's intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines."

MapReduce is important because it lets ordinary developers use MapReduce library routines to create parallel programs without having to worry about intra-cluster communication, task monitoring or failure handling. It is useful for tasks such as data mining, log file analysis, financial analysis and scientific simulations. Several implementations of MapReduce are available in a variety of programming languages, including Java, C++, Python, Perl, Ruby, and C.
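
As one illustration of the log file analysis case, only the two functions need to change while the framework still handles distribution and fault tolerance. The log format assumed below (client IP as the first whitespace-separated field of each line) is a hypothetical simplification, not a requirement of MapReduce.

    # Hypothetical log-analysis job: count requests per client IP, assuming the
    # IP address is the first whitespace-separated field of each log line.
    def log_map(_, log_line):
        fields = log_line.split()
        if fields:                     # skip blank lines
            yield fields[0], 1         # key = client IP, value = one request

    def log_reduce(ip_address, counts):
        yield ip_address, sum(counts)

These functions could be run with the same toy driver shown above, for example run_job(log_lines, log_map, log_reduce, num_chunks).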

See also: Hadoop, cluster computing, distributed computing, cloud computing

Learn more:

Eugene Ciurana asks the question: Why should you care about MapReduce?

John Willis provides an overview of Amazon's Elastic MapReduce.

Rich Seeley explains how MapReduce moved from secret Google goo to enterprise architecture.

Learn what MapReduce and in-database technology mean for data warehouses.

The Apache Hadoop project provides a MapReduce tutorial.

Contributor: Mark C. Chu-Carroll

This was first published in February 2010
