MapReduce vs. Spark: Which should I choose for big data in the cloud?

MapReduce and Spark are two common options for processing big data in the cloud. But what are the key differences between the two?

For many businesses, the ability to manage, store and process big data effectively could mean the difference between being on top and being forgotten. And when it comes to processing big data in the cloud, users have two popular choices: MapReduce and Spark. Both are distributed processing systems that work well with large volumes of data, especially when the data does not readily fit within the constraints of a single server.

Before looking at options for running these big data frameworks in the public cloud, let's look at the basic differences between MapReduce and Spark.

MapReduce was the first processing framework released with Hadoop, an open source framework for processing large data sets. As its name suggests, MapReduce is based on the functional programming concepts of mapping and reducing. A map operation applies a function to an argument and outputs the result as a key-value pair. A common example is counting words in a book. For each occurrence of a word, a map function, known as a mapper, takes the word as input and emits a key-value pair in which the word is the key and the number 1 is the value. A reducer function then collects all the key-value pairs with the same key -- or, in this example, the same word -- and sums the values.
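The word count described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, group and reduce steps, not Hadoop's actual API; the function names `mapper`, `shuffle`, `reducer` and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def mapper(word):
    """Emit one (key, value) pair for a single occurrence of a word."""
    return (word.lower(), 1)


def shuffle(pairs):
    """Group values by key, standing in for the step a framework
    performs between the map and reduce phases (assumed helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum all the values collected for one key."""
    return (key, sum(values))


def word_count(text):
    """Run the map, group and reduce steps over whitespace-separated words."""
    pairs = [mapper(word) for word in text.split()]
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count("the cat saw the dog and the bird"))
```

In a real cluster, the mappers and reducers would run in parallel on different nodes, and the framework would handle the grouping step across the network.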

MapReduce writes its results to persistent storage on nodes in the cluster, and this high level of I/O can introduce latency. As a result, it is best suited to very large data sets that are processed in batches, where throughput matters more than response time.

Apache Spark is an open source, distributed computing platform. It runs on Hadoop and Mesos, or you can use its own cluster manager. Spark works similarly to MapReduce, but it keeps intermediate results in memory where possible, rather than writing them to disk. Because of this, Spark applications can run a great deal faster than MapReduce jobs, and provide more flexibility.
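To make the disk-versus-memory distinction concrete, here is a toy, single-process sketch in plain Python. It does not use either framework and says nothing about their real performance; it only contrasts a two-stage job that writes its intermediate result to a file with one that passes the result along in memory. All function names here are hypothetical.

```python
import json
import os
import tempfile


def count_words(lines):
    """Stage 1: count how often each word appears."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def top_word(counts):
    """Stage 2: pick the most frequent word (ties broken alphabetically)."""
    return min(counts, key=lambda word: (-counts[word], word))


def pipeline_via_disk(lines):
    """Disk-backed style: stage 1 writes its output to a file,
    and stage 2 reads that file back before continuing."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as handle:
            json.dump(count_words(lines), handle)
        with open(path) as handle:
            counts = json.load(handle)
    return top_word(counts)


def pipeline_in_memory(lines):
    """In-memory style: stage 2 consumes stage 1's result directly."""
    return top_word(count_words(lines))


if __name__ == "__main__":
    data = ["spark and hadoop", "spark on yarn"]
    print(pipeline_via_disk(data), pipeline_in_memory(data))
```

Both pipelines return the same answer; the difference is the extra write and read between stages, which is the kind of I/O that adds latency when it is repeated across many stages and many nodes.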

When evaluating MapReduce vs. Spark, consider your options for using both frameworks in the public cloud. For example, Amazon Web Services' Elastic MapReduce (EMR) includes MapReduce with its base Hadoop installation and also supports Spark. Microsoft Azure offers both MapReduce and Spark in its HDInsight service. Google Dataproc, currently in beta, is a managed Hadoop service that offers MapReduce as well as Spark.


This was first published in February 2016


Dan Sullivan asks:

How did you decide between MapReduce vs. Spark for big data processing in the cloud?
