What is MapReduce?

MapReduce is a programming framework in which applications can split Big Data into smaller chunks for parallel processing. This approach allows for high-speed analysis of vast data sets. 

MapReduce was originally a proprietary Google technology, but the term has since become generic. The most popular implementation of MapReduce is the open-source version associated with Apache Hadoop.

How Does MapReduce Work?

MapReduce involves two main stages: mapping and reducing. 

First, a mapper application segments and tokenizes data. This approach can help to bring some organization to large quantities of unstructured data. 

Next, a reducer application sorts and combines the data from the mapper. The result of the reducer is a much smaller, logically consistent data set that is suitable for high-speed analysis. 

In a Big Data environment, the MapReduce system will divide the available data into smaller chunks. Discrete instances of mappers and reducers process each of these chunks in parallel.
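
As a rough illustration of this flow, the sketch below runs one mapper call per chunk in a thread pool and then hands the combined output to a single reducer. The function names (map_chunk, reduce_pairs, run_job) and the sample chunks are assumptions made for illustration. A real cluster would distribute this work across separate machines rather than threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_pairs(pairs):
    """Reducer: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(chunks):
    # Each chunk gets its own mapper call; the pool runs them concurrently.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    # Flatten the per-mapper outputs before handing them to the reducer.
    all_pairs = [pair for pairs in mapped for pair in pairs]
    return reduce_pairs(all_pairs)


# Hypothetical sample chunks, standing in for pieces of a much larger data set.
result = run_job(["big data", "data lake", "big big lake"])
print(result)
```

Threads are used here only because they keep the sketch short. The point is that no mapper needs to know about any other chunk.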

An Example of the MapReduce Workflow

Text is a form of unstructured data. It is called unstructured because it does not follow a predefined schema, such as the tables of a relational database, that would make it easy to query. 

MapReduce can bring some structure, which allows for complex natural language processing operations like sentiment analysis. 

The typical workflow for handling text might go like this:

1) Input

At the input stage, a MapReduce application splits the input data into smaller chunks, typically no larger than 128 MB. If an algorithm is dealing with a book's text, it might split the text into individual paragraphs. 

For this example, say that we are analyzing word frequency in this text:

See Spot run

Run Spot run

Come see Spot

The MapReduce algorithm might split the text like this:

1 – See Spot run

2 – Run Spot run

3 – Come see Spot

The contents of this split then pass to the next stage: mapping.
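
The input stage for this sample can be sketched as follows. This is a simplified illustration that assumes one split per line of text. The helper name split_input is hypothetical, and a real framework would normally split by size rather than by line.

```python
def split_input(text):
    """Return (split_id, line) pairs, one per non-empty line of input."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return list(enumerate(lines, start=1))


text = """See Spot run
Run Spot run
Come see Spot"""

splits = split_input(text)
for split_id, line in splits:
    print(split_id, line)
```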

2) Mapping

The mapping process creates key-value pairs for each element of the data in the split. The notation of these pairs is in the format <key, value>. 

In this example algorithm, we are trying to determine word frequency. Each key is an individual word, and each value is 1, which indicates a single instance of that word. Three mappers run, one for each of the three splits. The results look like this: 

M1 – <see, 1> <spot, 1> <run, 1>

M2 – <run, 1> <spot, 1> <run, 1>

M3 – <come, 1> <see, 1> <spot, 1>

The three mappers run concurrently, which reduces the overall processing time. While each mapper's output isn't immediately useful, we can see that the data now has a little more structure. 
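
A mapper for this example might look like the sketch below. The function name mapper is an assumption, and the words are lowercased so that "See" and "see" produce the same key, as in the results shown above.

```python
def mapper(split):
    """Emit a (word, 1) pair for each word in a single split."""
    return [(word.lower(), 1) for word in split.split()]


splits = ["See Spot run", "Run Spot run", "Come see Spot"]

# One mapper call per split; on a cluster these calls would run in parallel.
mapped = [mapper(split) for split in splits]
for number, pairs in enumerate(mapped, start=1):
    print(f"M{number}", pairs)
```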

3) Shuffling 

Each mapper passes its output to a reducer, which is a separate algorithm. The reducer cannot begin processing until it receives data from the mapper. 

The original word order in the sample text followed the rules of English grammar. This order is irrelevant for our purposes, so the shuffle step must reorganize these pairs in a more useful way. For word frequency, it groups the pairs by key. We end up with this:

<see, 1> <see, 1>  

<spot, 1> <spot, 1> <spot, 1>

<run, 1> <run, 1> <run, 1>

<come, 1> 

There is now a much more logical structure to this data, which allows for the final reduction. 
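
The grouping step can be sketched with a dictionary that collects every value emitted for a key. The function name shuffle is an assumption for illustration, and the mapper outputs are hardcoded from the mapping stage. Production frameworks typically also sort the keys and move data between machines, which this sketch leaves out.

```python
from collections import defaultdict


def shuffle(mapper_outputs):
    """Group the values from all mappers under their shared key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)


mapper_outputs = [
    [("see", 1), ("spot", 1), ("run", 1)],   # M1
    [("run", 1), ("spot", 1), ("run", 1)],   # M2
    [("come", 1), ("see", 1), ("spot", 1)],  # M3
]

grouped = shuffle(mapper_outputs)
print(grouped)
```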

4) Reduction

The reducer now condenses the grouped data. In this instance, it simply sums the values for each key. The final output looks like this:

<see, 2> 

<spot, 3> 

<run, 3> 

<come, 1> 

From these values, it is easy to produce a report on the word frequency in the original text. 
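
For word frequency, the reduction amounts to summing the list of values held for each key. The sketch below assumes the grouped data from the shuffling stage. The function name reducer is hypothetical.

```python
def reducer(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


grouped = {
    "see": [1, 1],
    "spot": [1, 1, 1],
    "run": [1, 1, 1],
    "come": [1],
}

counts = dict(reducer(key, values) for key, values in grouped.items())
print(counts)
```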

What are the Uses of MapReduce?

MapReduce applications can be written in a variety of languages. Hadoop supports applications in Java, Ruby, Python, and C++.
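
One common route for languages other than Java is the Hadoop Streaming utility, which feeds input lines to a script on standard input and reads tab-separated key and value pairs from its standard output. The sketch below shows what a Python word-count mapper could look like under that convention. The names stream_mapper and main are assumptions, and the job configuration is omitted.

```python
import sys


def stream_mapper(lines):
    """Yield 'word<TAB>1' records, one per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def main():
    # In a streaming job, input arrives on stdin and pairs leave on stdout.
    for record in stream_mapper(sys.stdin):
        print(record)


# Local check with a hardcoded line instead of stdin.
sample = list(stream_mapper(["See Spot run"]))
print(sample)
```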

The word-frequency example above looks at a very small piece of text. But MapReduce is generally used for Big Data, often working with petabytes of information in a single operation.

There are a number of reasons this can be beneficial.

Faster Queries Across Multiple Physical Devices

In a typical Big Data structure, data exists physically across multiple devices. This may include hard drives or other storage devices within different servers, or even in different buildings. 

For operations involving all of this data, each physical device engages simultaneously, which is a substantial resource overhead. MapReduce is fundamentally about breaking up the query process into more manageable chunks. Mappers and reducers can focus on specific devices or even specific disk sectors. 

Navigation of Unsorted Data Structures

Repositories such as data lakes hold large quantities of unsorted data. This can include structured data that has not passed through a transformation process, as well as semi-structured and unstructured data. 

Performing operations on these data structures would be impractical at this scale without a strategy such as MapReduce. Mappers and reducers can sift through the contents of a data lake and provide structured information. This is a crucial step when performing analytics on a data lake. 

Machine Learning 

Machine learning algorithms are tools for sorting unstructured data. An ML algorithm tokenizes data, looks for patterns, and then stores those patterns for future reference. The more patterns it has available, the more "intelligent" the ML algorithm becomes. 

MapReduce can help ML algorithms to navigate through large data sets. The example given above is part of one application of machine learning, known as natural language processing. MapReduce helps to turn unstructured language data into something that is suitable for storage in a relational database.

Data Exploration

Data exploration is the first step towards analytics. The analytics team may not know where to look in large data sets for clusters and trends that warrant further study. Data exploration can flag up potentially promising areas on which to focus. 

In the word-frequency example above, we saw clusters of certain words, such as "run" and "spot". An analytics consultant may use this cluster information to decide where to focus further study that may yield insights.
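
As a small illustration, the reducer output from the word-frequency example can be filtered to flag the words that cluster. The threshold of 3 and the variable names are assumptions chosen for this sample, not a general rule.

```python
# Final counts from the word-frequency example.
counts = {"see": 2, "spot": 3, "run": 3, "come": 1}

# Hypothetical cutoff: flag any word that appears at least this many times.
THRESHOLD = 3

clusters = {word for word, count in counts.items() if count >= THRESHOLD}
print(sorted(clusters))
```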