| Comments

MapReduce is the generic term for a simple algorithm, but also for a library implementing this algorithm at Google. Details in the Google paper on MapReduce and the Google round table on the subject.

  • Map is the processing of a lot of different entities (like, say, counting words in a html page). This step produces a number of key –> value pairs, with duplicates in the keys, so that the values can be grouped for the next step

  • Reduce is the aggregation of all the different results into the required info (like, say, creating a search index). The values from previous steps are grouped per keys (key –> list of values), and these are aggregated (reduced) to produce the required results.

This is obviously only suited for certain operations (process-intensive map operation on lots of different entities, lighter operation to gather into one result). It’s also very well suited to distributed systems. You start the map on a bunch of different machines, and then you reduce the result of all machines in one or more steps.

The simplicity of this concept made for it success – distributed stuff usually is painfully hard to bend your head round. It’s being used in: CouchDB, Hadoop framework, Microsoft Dryad, and a few others mentioned in the wikipedia article. And at Google.

I’ll talk about CouchDB next.