Monday, April 14, 2014

Map Reduce Framework and its components

Map Reduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. A basic Map Reduce job consists of the following four components:

 Input
 Mapper
 Reducer &
 Output

Input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. A Map Reduce job therefore splits the input data-set into independent pieces, typically 16 MB to 64 MB per piece.

The Mapper component processes each piece of input in a completely parallel manner, taking records as key/value pairs and generating a set of intermediate key/value pairs.

The Reducer component merges all the intermediate values associated with the same intermediate key and writes the results as output files. Typically both the input and the output of the job are stored in a file system.

In a Map Reduce job, a special program called the master assigns map or reduce tasks to the idle workers. Although a basic Map Reduce job consists of the above components, the following extensions are also useful:

 Partitioner – partitions the intermediate data among the specified number of reduce tasks/output files (see the sketch after this list)
 Combiner – does partial merging of the map output before it is sent over the network
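
To make the Partitioner extension concrete, here is a small sketch of a custom partitioner using Hadoop's org.apache.hadoop.mapreduce API. The class name and the first-letter routing rule are just an illustration; by default Hadoop uses a hash partitioner, and a custom one like this would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: route each word to a reduce task based on its first
// letter, so every output file holds an alphabetic range of the words.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String word = key.toString();
    char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
    return first % numReduceTasks; // same first letter -> same reduce task/output file
  }
}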

Data Flow through Hadoop Map Reduce
Hadoop Map Reduce runs a job by dividing it into tasks, of which there are two types:

1. Map tasks
2. Reduce tasks

There are two types of nodes that control the job execution process: a job tracker (master) and a number of task trackers (workers). The job tracker coordinates all the jobs run on the system by scheduling tasks to run on task trackers.

Hadoop divides the input to a Map Reduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. Having many splits means the time taken to process each split is small compared to the time to process the whole input, so faster machines can take on proportionally more splits and the load is balanced more evenly across the cluster. On the other hand, if the splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
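
As a minimal sketch of how the split size can be influenced in Hadoop (assuming the newer FileInputFormat API; by default the split size follows the HDFS block size, and the 16 MB and 64 MB figures simply mirror the range mentioned earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split size demo");
    // Bound the size of each input split explicitly.
    FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024); // 16 MB
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB
  }
}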

Map tasks write their output to the local disk, not to HDFS, because it is intermediate data that can be regenerated if a task fails. The reducers are fed by the mapper outputs, and the output of each reducer is normally stored in HDFS for reliability.



Explanation of Map Reduce paradigm with reference to Word Count Algorithm

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
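
For concreteness, here is a minimal runnable Hadoop (Java) version of the same algorithm, modeled on the standard WordCount example that ships with Hadoop; the class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit <word, 1> for every word in the input value.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all the counts received for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}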

Straggler Machine

A straggler is a machine that takes an unusually long time to complete one of the map or reduce tasks in the computation. Stragglers usually arise from variation in CPU availability, network traffic, or I/O contention.

Since a job in a Hadoop environment does not finish until all map and reduce tasks are finished, even a small number of stragglers can significantly degrade the overall response time of the job. It is therefore essential for Hadoop to detect stragglers and run speculative copies of their tasks to keep response times low; the earlier stragglers are detected, the better the overall response time. Detection also needs to be accurate, since speculative copies impose an overhead on the available resources.
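
Hadoop exposes this as speculative execution, which can be switched on or off per job. A minimal sketch, assuming the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow the framework to launch speculative (backup) copies of slow tasks.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);
    Job job = Job.getInstance(conf, "job with speculative execution enabled");
    // ... set mapper, reducer, input and output paths as in the WordCount driver above ...
  }
}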

Combiner Function


A Combiner is a user-specified function that does partial merging of the data before it is sent over the network. In some cases there is significant repetition in the intermediate keys produced by each map task. In the word counting example, word frequencies tend to follow a Zipf distribution, so each map task will produce hundreds or thousands of records of the form <the, 1>. All of these counts would be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. To cut down this network traffic, the user is allowed to specify an optional Combiner function that partially merges the data before it is sent over the network. The Combiner function is executed on each machine that performs a map task.
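
Continuing the WordCount driver sketched earlier (so job and IntSumReducer refer to that example), enabling a combiner is a one-line change; reusing the reducer as the combiner works here because summing counts is associative and commutative:

// Partial sums are computed on each map machine before the shuffle, so far
// fewer <word, 1> records are sent over the network.
job.setCombinerClass(IntSumReducer.class);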
