WorDS of Data Science beginning with S
In a Map-Reduce operation, the output from a Mapping operation is broken up and sent to several different Reduce tasks. This is because each Reduce operation is interested in obtaining a different result, and requires only a subset of the data from each Map task. The breaking up and transporting of different pieces of Map results to different Reducers is called ‘shuffling’.
For example, imagine that you and four friends have been tasked with reorganizing a small library of books, and say that you will be ordering them alphabetically. The entire collection is divided up between the five of you, and each of you sort your collection into groups, according to the first letter of each book’s title.
Your collection contains two novels belonging to the ’T’ group:
’To the Moon and Back’
‘There and Back Again’
Your friend Ashley has just one biography in her ’T’ group:
‘Thomas Jefferson: A Biography’
Before the entire library can be alphabetized, All of the books beginning with the letter ‘A’ need to be gathered and given to a single individual, who will alphabetize just the ‘A’ books. Meanwhile, another person will similarly receive and alphabetize all of the ‘B’ books, and so on.
The two books in your ’T’ group, and the one book in Ashley’s ’T’ group will be sent off to one person, while all of the books in your ‘G’ group will be sent to an entirely different individual. This is called ‘shuffling’, after the criss-cross pattern the groups make as they migrate from Mapper to Reducer.
Shuffling can be a time-intensive process, as much as 30% of the total time of a Map-Reduce operation. To save time, Reduce tasks will often begin ‘pulling groups’ of Mapping output as soon as a Map task has completed, rather than wait until all Map operations are completed. Shuffling is often considered a part of the Reduce operation, and is often implemented on the Reduce side.