What the Heck is Map Reduce?

When you first heard of Hadoop, you probably heard the term "Map Reduce*". And, when you looked that term up on the internet, you got something like this:
Map(k1,v1) → list(k2,v2)
Reduce(k2, list (v2)) → list(v3)

And that cleared everything right up, yes? Of course not.

To understand the basics of Map Reduce, I'll describe a situation confronting a King and his kingdom.

One day, a King was holding court, and two well-respected dignitaries named Thick and Thin arrived. The first one, Thin, said: "My dearest King, we have a crisis: The people in your kingdom are overweight and you must act immediately." At that point, Thick interjected and said: "On the contrary! My dearest King, we do have a crisis, but it is not as you have heard from Thin. In fact, it's the opposite: The people in your kingdom are underweight and you must act immediately!"

The King asked Thick and Thin for evidence supporting their respective cases. Each had a similar response. Thick stated, "I have traveled your great land, and I've observed it with my own eyes - the people in the fields are starving!" Thin retorted, "I too have traveled your kingdom, and each diner I visited was populated with largess."

After thinking for a bit, the King decided that he needed facts before he acted either way. He declared, "I will make my decision based on an accurate assessment of my people. I therefore decree that a census be taken, and the average weight of my people shall be determined. Furthermore, the weights shall be averaged based on an individual's height, so that I shall have an average for those who are 5'10", and also for those who are all other heights."

All in attendance admired this approach. But soon, the King's chief adviser inquired: "My dear lord, how will you do this? When we last performed a census, it took our Official Counter five years to gather the results. If you are to act decisively, you will need to make your decision much sooner!"

The King replied, "Yes, I recall how long it took to perform the last census. Clearly, that approach will not work. Would it be better if I gave the Official Counter a team of assistants, increasing his ability to count?"

"Yes, it would decrease the time, but only by six months", the chief adviser replied.

"Why such a minimal gain for a tenfold increase in counting power?", the King inquired.

"Because the Official Counter and his assistants must still travel your great and distant land. And when they reach a village, it usually only contains a handful of villagers, so many of the assistants merely sit idle."

"Then I shall decree that all of the people of my land come to the capitol so that they can all count", the King countered.

"But my lord, even if we could get transportation for your people, we simply haven't enough room in the city's Inns to store all of the people."

The King sighed, and replied, "I see. This is a problem that requires an entirely new approach. Summon the wise men of Googol, and see if they have a suggestion!".

When the wise men arrived, they described a new technique named "Mapus Reducius". At first the King thought it was blasphemy, but as they described how they had used it effectively in their own endeavors, the King began to be convinced of it's elegance. Accordingly, the men of Googol recommended that the census be performed as follows:

  • Send a Paige to each of the villages, and instruct the village elder to group their population by height. A scroll shall be used for each height, and all villagers of a specific height shall have their weight recorded on the scroll.
  • Once each scroll has been completed, they shall be sent to the appropriate Earl. So, for example, each 5'5" scroll shall be sent to the Earle of Short, and each 6'3" scroll shall be sent to the Earl of Dunk, and so on.
  • Each Earl shall calculate the average weight of all people of the same height. Once they have that average, they shall present their result to the King. The King will then have the relevant data to make a decision.

Now, the King harbored some secret doubts about this approach, but since he saw no other reasonable alternative, he proceeded. His doubts were laid to rest when answers started arriving within days. By the end of two weeks, he had all the data he needed, and it told an interesting story: Each height categorization had an appropriate average weight. Armed with this information, he summoned Thick and Thin, and asked them to explain their earlier assessments. As it turned out, Thick and Thin had been secret agents, working for the King of Badness, and they were trying to weaken the land by causing either extreme starvation, or extreme gluttony. They confessed, "We sought to create a problem that was too large to solve. Armed with that problem, we then hoped to force you to make a decision without knowing the facts. We estimated that the problem would take years to uncover, but you and your infernal wise men of Googol have been victorious! Please have mercy on us!"

The King showed mercy, and merely expelled Thick and Thin, and the people within the kingdom were saved. The End.

In the above example, the problem was solved by using Map Reduce. To summarize:

  1. The King wanted to gather information about an large, unstructured set of data (his people), that were too large to be called to a single location.
  2. He sent Paiges to each village and grouped the data in that village by some common key. (In this case, he "Mapped" data by height.)
  3. He then "shuffled" the data, by sending all data with the same key to the appropriate Earl
  4. The Earls then had a much simpler job of "Reducing" the data to a single average, which they presented to the King.

By following this approach, the King was able to undertake a massively parallel operation in far less time than usual. These are the same basic steps that a Map Reduce algorithm uses: Deploy the code, Map the unstructured data, shuffle it to the appropriate Reducers, reduce the intermediate results, and record the final "answer". You are now ready to be the king!