During Week 7, we learned about MapReduce (or Map-Shuffle-Reduce), which is used in processing large amounts of data over distributed systems.
During the mapping stage, relevant data is selected from tables on each worker node. Then, during the shuffle stage, this data is redistributed and sorted into temporary tables on each node. These two stages may be repeated any number of times, depending on the number of joins is necessary to get the desired results. Finally, we enter the reduce stage, which combines the results from every worker node sent to the controller.
While the concept of MapReduce is easy to understand, it took a while for me to wrap my head around how it should be implemented in our Python model.
However, I ultimately realized that every temporary table created in the map-shuffle stages were equivalent to SELECT subqueries in the FROM clause in MySQL. The reduce stage is equivalent to the main query that uses these subqueries and JOINs to get a desired result. Once I understood this, crafting programs that would return result tables using MapReduce became incredibly easy.
I look forward to learning more about utilizing cluster computing in the future.
Entering week 8, I only have the final to worry about and I feel extremely prepared. I have read over the questions and I have no doubt that I will do well. I have learned much during this class and I look forward to integrating my knowledge of MySQL to Web Programming this summer.