As an autodidact in some areas, I frequently run across things I probably should understand but don’t. Today, in an interview with Google, I was asked about MapReduce and said, “Err, I couldn’t really explain that.” That’s sad, because a couple of years ago I was having dinner with a Google engineer in Mountain View who was raving about MapReduce, and I didn’t quite understand what it was then either. So I thought I’d start on my new objective: explaining complicated topics that the brain trust at Google (or other places) works on in terms that everyone can understand. Here we go.

The first thing to understand is that there are multiple MapReduces. There is a patented version by Google that is *the* MapReduce, and a bunch of other implementations, the most widely known probably being Hadoop. What does it do? Simply speaking, it is a way to process a massive amount of data. Rather than processing it all on a single machine, you can process it in parallel across many machines. The reason it is called MapReduce is that the first step is to divide the work up among the different machines, each of which turns its chunk of the input into intermediate results (map), and the second step is to combine all of those intermediate results into the final answer (reduce). The names of these functions come out of functional programming.
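To make that concrete, here is a toy, single-machine sketch of the idea using word counting, the canonical MapReduce example. The function names (`map_phase`, `shuffle`, `reduce_phase`) are my own labels, not Google's or Hadoop's API; in a real system the map calls would run on many machines and the framework would do the grouping for you.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's job between map and reduce: group the
    # intermediate pairs by key so each reducer sees one word's values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all the counts emitted for one word.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 3, counts["fox"] == 2
```

The nice property is that every `map_phase` call is independent of the others, and so is every `reduce_phase` call for a given word, which is exactly what lets a framework scatter them across thousands of machines.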

Obviously this is a core part of Google’s proprietary tech, since they process massive amounts of data to build their search index. It is usually used with unstructured data. This may not matter much to you depending on what you are doing, since most of the time you want structured data rather than unstructured data when you can get it. That said, if you’ve got a bunch of machines, an enormous dataset, and something you want to find out from that dataset, MapReduce may be for you!
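You don’t need a cluster to get a feel for the “bunch of machines” part. Here is a sketch that runs the map step in parallel across local worker processes standing in for separate machines; it counts words again, and the `parallel_word_count` helper is my own illustration, not part of any MapReduce framework.

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def count_words(document):
    # Map: each worker turns one document into partial word counts.
    return Counter(document.lower().split())

def merge(a, b):
    # Reduce: fold one worker's partial counts into the running total.
    a.update(b)
    return a

def parallel_word_count(documents, workers=3):
    # Pool.map hands one document to each worker process, playing the
    # role of the many machines in a real MapReduce job.
    with Pool(processes=workers) as pool:
        partial_counts = pool.map(count_words, documents)
    return reduce(merge, partial_counts, Counter())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(parallel_word_count(docs)["the"])  # prints 3
```

The same shape scales from three local processes to thousands of machines precisely because neither `count_words` nor `merge` cares where its input came from.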

A few additional sources: wiki | 2 | 3