MapReduce
January 12th 2009MapReduce is the generic term for a simple algorithm, but also for a library implementing this algorithm at Google. Details in the Google paper on MapReduce and the Google round table on the subject.
Map is the processing of a lot of different entities (like, say, counting words in a html page). This step produces a number of key -> value pairs, with duplicates in the keys, so that the values can be grouped for the next step
Reduce is the aggregation of all the different results into the required info (like, say, creating a search index). The values from previous steps are grouped per keys (key -> list of values), and these are aggregated (reduced) to produce the required results.
This is obviously only suited for certain operations (process-intensive map operation on lots of different entities, lighter operation to gather into one result). It's also very well suited to distributed systems. You start the map on a bunch of different machines, and then you reduce the result of all machines in one or more steps.
The simplicity of this concept made for it success - distributed stuff usually is painfully hard to bend your head round. It's being used in: CouchDB, Hadoop framework, Microsoft Dryad, and a few others mentioned in the wikipedia article. And at Google.
I'll talk about CouchDB next.