Map Reduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style of Map Reduce are automatically parallelized and executed on a large cluster of commodity machines. A basic Map Reduce job consists of the following four components:
Input
Mapper
Reducer
Output
Input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. A Map Reduce job therefore splits the input data set into independent pieces, typically 16 MB to 64 MB per piece.
The Mapper component processes each input split in a completely parallel manner, taking input key/value pairs and generating a set of intermediate key/value pairs.
The Reducer component merges all the intermediate values associated with the same intermediate key and writes the result to the output files. Typically both the input and the output of the job are stored in a file system.
In a Map Reduce job, a special program called the master assigns map or reduce tasks to the rest of the idle workers. Although a basic Map Reduce job consists of the above components, the following extensions are also useful:
Partitioner – partitions the data into the specified number of reduce tasks/output files (a sketch follows after this list)
Combiner – does partial merging of the data before sending it over the network
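As an illustration of the Partitioner extension, the sketch below shows how a custom Partitioner might be written against the Hadoop Java API. The class name FirstLetterPartitioner and the policy of routing words by their first letter are purely illustrative assumptions, not part of the description above.

    // Illustrative custom Partitioner: routes keys to reducers by the first
    // character of the key, so that words starting with the same letter end
    // up in the same reduce task/output file.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Guard against an empty key or a single-reducer job.
            if (key.getLength() == 0 || numReduceTasks <= 1) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            // Mask the sign bit so the partition number is never negative.
            return (first & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

In the driver, such a class would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); the Combiner extension is revisited with its own sketch near the end of this post.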
Data Flow through Hadoop Map Reduce
Hadoop Map Reduce runs the job by dividing it into tasks. The two types are:
1. Map tasks
2. Reduce tasks
There are two types of nodes that control the job execution
process: a job tracker (master) and a number of task trackers (workers). The
job tracker coordinates all the jobs run on the system by scheduling tasks to
run on task trackers.
Hadoop divides the input to a Map Reduce job into fixed-size
pieces called input splits, or splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input, so the work can be spread evenly across the cluster and the job is better load balanced. On the other hand, if the splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
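For concreteness, the split size can be nudged from the job driver; a minimal sketch using the FileInputFormat helpers of the newer Hadoop Java API is shown below, where the 64 MB and 128 MB values are arbitrary example bounds, not recommendations.

    // Sketch: constraining input split sizes in a Hadoop job driver.
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void configureSplits(Job job) {
            long sixtyFourMb = 64L * 1024 * 1024;
            // Lower bound: avoid a flood of tiny splits whose management and
            // task-creation overhead would dominate the job execution time.
            FileInputFormat.setMinInputSplitSize(job, sixtyFourMb);
            // Upper bound: keep each map task short enough for good load balancing.
            FileInputFormat.setMaxInputSplitSize(job, 2 * sixtyFourMb);
        }
    }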
Map tasks write their output to the local disk, not to HDFS. The reducers are fed by the mapper outputs, and the output of the reducers is normally stored in HDFS for reliability.
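To make this data flow concrete, a minimal Hadoop job driver might look roughly like the sketch below. The class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical placeholders (the mapper and reducer are sketched later in this post), and the input/output paths are taken from the command line.

    // Sketch of a Hadoop job driver: input is read from the distributed file
    // system, map output stays on local disk, and the reducers' output is
    // written back to HDFS for reliability.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }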
Explanation of Map Reduce paradigm with reference to Word Count Algorithm
The computation takes a set of input key/value pairs,
and produces a set of output key/value pairs. The user of the MapReduce
library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate
values associated with the same intermediate key I and passes them to the Reduce
function.
The Reduce function, also written by the user, accepts
an intermediate key I and a set of values for that key. It merges together
these values to form a possibly smaller set of values. Typically just zero or
one output value is produced per Reduce invocation. The intermediate
values are supplied to the user's reduce function via an iterator. This allows
us to handle lists of values that are too large to fit in memory.
Consider the problem of counting the number of occurrences of
each word in a large collection of documents. The user would write code similar
to the following pseudo-code:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
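Expressed against the Hadoop Java API, the same computation might look roughly like the sketch below. It follows the standard word count pattern; the class names WordCountMapper and WordCountReducer match the hypothetical placeholders used in the driver sketch earlier in this post.

    // Sketch of the word count map and reduce functions in the Hadoop Java API.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit an intermediate <word, 1> pair for every word in the line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Merge all the intermediate counts for the same word into one total.
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }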
Straggler Machine
A straggler machine is a machine that takes an unusually long time to complete its map or reduce tasks in the computation. Stragglers usually arise from variation in CPU availability, network traffic, or I/O contention.
Since a job in the Hadoop environment does not finish until all map and reduce tasks are finished, even a small number of stragglers can greatly deteriorate the overall response time for the job. It is therefore essential for Hadoop to find stragglers and run speculative copies of their tasks to ensure lower response times. The earlier the stragglers are detected, the better the overall response time for the job. The detection of stragglers also needs to be accurate, since the speculative copies put an overhead on the available resources.
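This speculative execution can be switched on or off per job. The sketch below assumes a recent Hadoop version, where the relevant configuration properties are mapreduce.map.speculative and mapreduce.reduce.speculative (older releases used different property names).

    // Sketch: toggling speculative execution so that slow (straggler) tasks
    // get a backup copy scheduled on another node.
    import org.apache.hadoop.conf.Configuration;

    public class SpeculationConfig {
        public static Configuration withSpeculation() {
            Configuration conf = new Configuration();
            // Both properties default to true in stock Hadoop; they are set
            // explicitly here only to make the knobs visible.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
            return conf;
        }
    }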
Combiner Function
A Combiner is a user-specified function that does partial merging of the data before it is sent over the network. In some cases, there is significant repetition in the intermediate keys produced by each map task. For example, in the word counting example, word frequencies tend to follow a Zipf distribution, and each map task will produce hundreds or thousands of records of the form <the, 1>. All of these counts would be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. So, to offload part of this counting from the reduce task, the user is allowed to specify an optional Combiner function that does partial merging of the data before it is sent over the network. The Combiner function is executed on each machine that performs a map task.
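In the word count case the reduce operation is just an associative and commutative sum, so the reducer class can usually double as the combiner. Below is a minimal sketch of wiring this up in the driver, reusing the hypothetical WordCountReducer class from the earlier sketch.

    import org.apache.hadoop.mapreduce.Job;

    public class CombinerSetup {
        public static void addCombiner(Job job) {
            // Because summing counts is associative and commutative, the word
            // count reducer can serve as the combiner, collapsing many <the, 1>
            // records into a single partial sum on each map machine before the
            // data is sent over the network.
            job.setCombinerClass(WordCountReducer.class);
        }
    }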