
MapReduce vs. Spark: Which should I choose for big data in the cloud?

MapReduce and Spark are two common options for processing big data in the cloud. But what are the key differences between the two?

For many businesses, the ability to manage, store and process big data effectively could mean the difference between being on top and being forgotten. And when it comes to processing big data in the cloud, users have two popular choices: MapReduce and Spark. Both are distributed processing systems that work well with large volumes of data, especially when the data does not fit within the constraints of a single server.

Before looking at options for running these big data frameworks in the public cloud, let's look at the basic differences between MapReduce and Spark.

MapReduce was the first processing framework released with Hadoop, an open source framework for processing large data sets. As its name suggests, MapReduce is based on the functional programming concepts of mapping and reducing. A map operation applies a function to an input and outputs the result as a key-value pair. A common example is counting words in a book. For each occurrence of a word in the book, a map function, known as a mapper, takes the word as input and emits a key-value pair consisting of the word, which is the key, and the number 1, which is the value. A reducer function then collects all the key-value pairs with the same key -- or, in this example, the same word -- and sums the values to get the total count for that word.
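The word count example can be sketched in a few lines of plain Python. This is a single-process illustration of the map, group and reduce steps, not the Hadoop API; the function names `mapper`, `shuffle`, `reducer` and `word_count` are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict


def mapper(text):
    """Emit a (word, 1) pair for every word occurrence in the text."""
    for word in text.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Sum all the values collected for one key."""
    return (key, sum(values))


def word_count(text):
    """Run map, shuffle and reduce in sequence on a single machine."""
    groups = shuffle(mapper(text))
    return dict(reducer(key, values) for key, values in groups.items())


counts = word_count("the cat saw the dog and the bird")
print(counts["the"])  # the word "the" appears three times
```

In a real cluster, many mappers and reducers would run in parallel on different nodes, and the framework would handle the grouping between them; the sketch only shows the shape of the computation.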

MapReduce works well for batch-oriented processes. The framework writes results to persistent storage on nodes in the cluster, and that high level of I/O can introduce latency. As a result, MapReduce is a better fit for very large data sets that are processed in batches than for workloads that need fast, interactive responses.

Apache Spark is an open source, distributed computing platform. It can run on Hadoop or Mesos, or under its own standalone cluster manager. Spark works similarly to MapReduce, but it keeps intermediate results in memory where it can, rather than writing them to disk between steps. Because of this, Spark applications can run a great deal faster than equivalent MapReduce jobs, particularly for iterative or interactive workloads, and Spark provides more flexibility.
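To make the difference in how intermediate results are handled more concrete, here is a rough single-machine analogy in plain Python. It uses neither Hadoop nor Spark; the two-stage pipeline and the names `stage_one`, `stage_two`, `run_with_disk` and `run_in_memory` are assumptions made up for illustration. One path writes the intermediate result to a file and reads it back, as a disk-based framework would between steps, and the other simply passes it along in memory.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First processing step: square every input value."""
    return [n * n for n in numbers]


def stage_two(squares):
    """Second processing step: add up the intermediate values."""
    return sum(squares)


def run_with_disk(numbers):
    """Write the intermediate result to a file, then read it back."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(stage_one(numbers), f)
        with open(path) as f:
            intermediate = json.load(f)
        return stage_two(intermediate)
    finally:
        os.remove(path)


def run_in_memory(numbers):
    """Hand the intermediate result straight to the next step."""
    return stage_two(stage_one(numbers))


data = list(range(1, 6))
disk_total = run_with_disk(data)
memory_total = run_in_memory(data)
print(disk_total == memory_total)  # both paths compute the same answer
```

Both paths give the same answer; the disk-based one pays for an extra write and read at every step, which is the I/O cost that adds up across a large, multi-step job.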

When evaluating MapReduce vs. Spark, consider your options for running both frameworks in the public cloud. For example, Amazon Web Services' Elastic MapReduce (EMR) includes MapReduce with its base Hadoop installation and also supports Spark. Microsoft Azure offers both MapReduce and Spark in its HDInsight service. Google Cloud Dataproc, currently in beta, is a managed Hadoop service that offers MapReduce as well as Spark.
