This post is a St. Nicholas’ Day gift for one of my students who asked me to explain the concept of map-reduce used to solve computationally complex problems. I will do this by example, using the Python language, which is known for its brevity. At the same time, I will be happy to test a new plug-in installed in our WordPress that allows source code to be displayed more nicely – namely Enlighter 😉
Map-reduce is a concept inherent in parallel data processing. It consists in dividing a computational problem into smaller sub-problems, solving these sub-problems (possibly in parallel), and then merging the partial solutions into one solution that solves the initial problem.
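As a minimal sketch of this divide / solve / merge idea, consider the toy problem of summing a list of numbers (the data and the chunk size below are arbitrary, chosen purely for illustration):

# A minimal sketch of the divide / solve / merge idea on a toy problem
data = list(range(1, 101))

# divide: split the problem into smaller sub-problems (chunks of 25)
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# solve: each sub-problem can be handled independently (even in parallel)
partial_sums = [sum(chunk) for chunk in chunks]

# merge: combine the partial solutions into the final result
print(sum(partial_sums))  # 5050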
Perhaps the best-known implementations of the above concept are the Apache Hadoop and the slightly younger Apache Spark frameworks. Here, however, the concept will be presented in the simplest possible way, using the Python programming language.
Let’s assume that we have a large data set – e.g. a set of our customers’ foot sizes given in inches – and we want to convert it to a more commonly used metric.
feet_size = [("Alan", 17), ("Bob", 18), ("Carol", 14), ("Dean", 15), ("Elvis", 16)]
Then we define the conversion function using the lambda syntax.
to_shoe_size = lambda data: (data[0], 3/2.0 * (data[1] + 1.5))
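For readers less familiar with the lambda syntax, the same conversion function could equivalently be written as an ordinary named function:

# Equivalent definition as an ordinary function, for comparison
def to_shoe_size(data):
    name, inches = data
    return (name, 3/2.0 * (inches + 1.5))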
Now, we compute and display the converted dataset.
shoe_size = list(map(to_shoe_size, feet_size))
print(shoe_size)
And we get the following result.
[('Alan', 27.75), ('Bob', 29.25), ('Carol', 23.25), ('Dean', 24.75), ('Elvis', 26.25)]
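For comparison – this is the imperative style the closing question refers to – the same transformation written with a plain for-loop could look as follows (shoe_size_loop is just an illustrative name):

# Imperative equivalent of the map step above
shoe_size_loop = []
for record in feet_size:
    shoe_size_loop.append(to_shoe_size(record))
print(shoe_size_loop)  # same output as above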
Let us now filter out, for example, all records whose value is above the average.
import statistics

avg = statistics.mean(map(lambda x: x[1], shoe_size))
above_avg_shoe_size = filter(lambda x: x[1] > avg, shoe_size)
print(list(above_avg_shoe_size))
And we get the following result.
[('Alan', 27.75), ('Bob', 29.25)]
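One detail worth remembering: in Python 3, filter (like map) returns a lazy iterator, so its result can be consumed only once; wrap it in list() if you need to reuse it.

# filter() yields a one-shot iterator in Python 3
above = filter(lambda x: x[1] > avg, shoe_size)
print(list(above))  # [('Alan', 27.75), ('Bob', 29.25)]
print(list(above))  # [] – the iterator is already exhausted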
Now, if we want to find the smallest value that is higher than the average, we use the reduce function as below.
from functools import reduce

min_above_avg = reduce(lambda x, y: min(x, y),
                       map(lambda r: r[1],
                           filter(lambda r: r[1] > avg, shoe_size)))
print(min_above_avg)
And we get the following result.
27.75
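The same reduction can also be expressed with a classic for-loop, which is exactly the substitution mentioned in the remarks below:

# Imperative equivalent of the reduce step above
min_above_avg = None
for name, size in shoe_size:
    if size > avg and (min_above_avg is None or size < min_above_avg):
        min_above_avg = size
print(min_above_avg)  # 27.75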
That’s all. And what do you think about the constructions described above? Are they more readable and self-explanatory compared to loop iterators, or quite the contrary?
Other points of view
- It is worth noting that in Python 3, reduce is no longer a built-in function but has been moved to the functools module. This is intentional and is justified by the far greater readability of the classic for-loop construction. A more detailed explanation by the creator of Python can be found here, while the specification of the reduce function, together with a description of how to replace it with a for-loop, can be found here.
- Both the map and reduce functions used in the above example do NOT run in parallel and have been used for illustrative purposes only. However, there exist Python extensions that enable parallel computing using a similar map-reduce model; a minimal standard-library sketch is given below.
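As an illustration only, the standard library’s multiprocessing module already provides a Pool.map that distributes the map step across processes; a minimal sketch follows (note that Pool.map cannot pickle lambdas, hence the named function):

# A minimal parallel-map sketch using only the standard library
from multiprocessing import Pool

def to_shoe_size(data):
    return (data[0], 3/2.0 * (data[1] + 1.5))

if __name__ == "__main__":
    feet_size = [("Alan", 17), ("Bob", 18), ("Carol", 14),
                 ("Dean", 15), ("Elvis", 16)]
    with Pool(processes=2) as pool:  # 2 worker processes, arbitrary choice
        shoe_size = pool.map(to_shoe_size, feet_size)
    print(shoe_size)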
2 replies on “The concept of map-reduce computing approach on example of Python (declarative vs imperative iteration)”
Does the use of map-reduce as described above have any effect on computing performance?
The documentation says that the version presented in this post is slightly faster than the imperative one, but I have not verified it myself.
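For anyone who would like to verify it themselves, a quick comparison can be made with the standard timeit module; the sketch below makes no claim about the actual numbers, which vary by machine and Python version:

# Compare the declarative and imperative variants yourself
import timeit

setup = "feet_size = [('A', n) for n in range(1000)]"
declarative = "list(map(lambda d: (d[0], 3/2.0 * (d[1] + 1.5)), feet_size))"
imperative = ("result = []\n"
              "for d in feet_size:\n"
              "    result.append((d[0], 3/2.0 * (d[1] + 1.5)))")

print(timeit.timeit(declarative, setup=setup, number=1000))
print(timeit.timeit(imperative, setup=setup, number=1000))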