{"id":1028,"date":"2018-09-07T06:36:15","date_gmt":"2018-09-07T06:36:15","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/07\/hadoop-for-beginners-part-2\/"},"modified":"2018-09-07T06:36:15","modified_gmt":"2018-09-07T06:36:15","slug":"hadoop-for-beginners-part-2","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/07\/hadoop-for-beginners-part-2\/","title":{"rendered":"Hadoop for Beginners &#8211; Part 2"},"content":{"rendered":"<p>Author: Aafrin Dabhoiwala<\/p>\n<div>\n<p style=\"text-align: center;\"><span style=\"text-decoration: underline; font-size: 14pt;\"><strong>Hadoop &#8211; MapReduce in an easy way<\/strong><\/span><\/p>\n<p>In the previous blog, we discussed about HDFS, one of the main components of Hadoop. I highly recommend going through <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/hadoop-for-beginners\" target=\"_blank\" rel=\"noopener\">that blog<\/a> before moving onto MapReduce. This blog will introduce you to <strong>MapReduce<\/strong>, which is the main building blocks of processing in Hadoop framework. MapReduce is considered as the heart of Hadoop. Now, let see what makes MapReduce so popular in Hadoop framework.<\/p>\n<\/p>\n<p><span style=\"font-size: 12pt;\"><strong><u>What is MapReduce?<\/u><\/strong><\/span><\/p>\n<p>MapReduce is a <strong>programming framework<\/strong> that allows us to perform <strong>distributed<\/strong> and <strong>parallel<\/strong> processing on large data sets in a distributed environment.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/bWsXXB2erox4J3xfWz3Cgocv*tj4vVQ0CnkkuroM8UJheNjcP5HkmigaG5no7dOaai43xG2n1jwoEzfAO9OyWaSugRqnhboj\/1.PNG\" target=\"_self\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/bWsXXB2erox4J3xfWz3Cgocv*tj4vVQ0CnkkuroM8UJheNjcP5HkmigaG5no7dOaai43xG2n1jwoEzfAO9OyWaSugRqnhboj\/1.PNG\" class=\"align-center\" width=\"625\" height=\"270\"><\/a><\/p>\n<p>As shown in the above figure, input data is divided into partitions that are <strong>Mapped<\/strong> (transformed) and <strong>Reduced<\/strong> (aggregated) by mapper and reduced functions respectively that you define, and finally gives the output.<\/p>\n<p><strong>First \u2013<\/strong> Map takes a set of data (Input) and converts it into another set of data, where individual elements are broken down into <strong>Key\/Value pairs.<\/strong><\/p>\n<p><strong>Second \u2013<\/strong> Reduce takes the output from a map as an input and combines those Key\/Value pairs into smaller set of <strong>Key\/Value pairs.<\/strong><\/p>\n<p>As the sequence of the MapReduce implies, the reduce tasks are always performed after the map tasks.<\/p>\n<\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong>How MapReduce works?<\/strong><\/span><\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong><a href=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnaaOqT*ZVy*GR*NOdJVYHsWMTRzp6*6oZ6PEnxZeh91mZb-eI8blrgdccXHoA1JBkyevlJIoRefVwvpKUGmNlfT\/2.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnaaOqT*ZVy*GR*NOdJVYHsWMTRzp6*6oZ6PEnxZeh91mZb-eI8blrgdccXHoA1JBkyevlJIoRefVwvpKUGmNlfT\/2.PNG\" class=\"align-center\" width=\"496\"><\/a><\/strong><\/span><\/p>\n<p>As shown in the above figure, the input data goes through the following phases:<\/p>\n<p><strong><u>Map Tasks<\/u><\/strong><\/p>\n<ul>\n<li><strong>Splitting<\/strong><\/li>\n<\/ul>\n<p>Input to MapReduce job is divided into fixed-size chunks called input splits. It produces the output in <strong><u>(Key, Value)<\/u><\/strong> pair.<\/p>\n<ul>\n<li><strong>Mapping<\/strong><\/li>\n<\/ul>\n<p>In this phase each input split is passed to a mapping function which divides the split into <strong><u>List (Key, Value).<\/u><\/strong><\/p>\n<p><strong><u>Reduce Tasks<\/u><\/strong><\/p>\n<ul>\n<li><strong>Shuffling and Sorting<\/strong><\/li>\n<\/ul>\n<p>Reduce tasks are the combination of shuffle\/sort and reduce. This phase consumes output of the Mapping phase. Its main task is to <strong>club together<\/strong> the relevant record in sorting manner from the output of mapping phase. The output is in the form of <strong><u>Key, List (Value).<\/u><\/strong><\/p>\n<ul>\n<li><strong>Reducing<\/strong><\/li>\n<\/ul>\n<p>In this phase, output from shuffling and sorting are <strong>aggregated<\/strong> and returns single <strong><u>(Key, Value)<\/u><\/strong> output value. This final output value is then written in the output file of <strong>HDFS<\/strong>.<\/p>\n<\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong>How MapReduce works with an Example<\/strong><\/span><\/p>\n<\/p>\n<ul>\n<li><span style=\"font-size: 10pt;\"><strong>Task<\/strong> \u2013 How many movies did each user rate in the Movie data set?<\/span><\/li>\n<li><span style=\"font-size: 10pt;\"><strong>Sample Dataset (Input File)<\/strong>&#8211;<\/span><\/li>\n<\/ul>\n<table>\n<tbody>\n<tr>\n<td width=\"102\">\n<p><strong>UserId<\/strong><\/p>\n<\/td>\n<td width=\"90\">\n<p><strong>MovieId<\/strong><\/p>\n<\/td>\n<td width=\"72\">\n<p><strong>Rating<\/strong><\/p>\n<\/td>\n<td width=\"132\">\n<p><strong>Timestamp<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>100<\/p>\n<\/td>\n<td width=\"90\">\n<p>319<\/p>\n<\/td>\n<td width=\"72\">\n<p>4<\/p>\n<\/td>\n<td width=\"132\">\n<p>343003432<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>120<\/p>\n<\/td>\n<td width=\"90\">\n<p>387<\/p>\n<\/td>\n<td width=\"72\">\n<p>2<\/p>\n<\/td>\n<td width=\"132\">\n<p>439439839<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>100<\/p>\n<\/td>\n<td width=\"90\">\n<p>435<\/p>\n<\/td>\n<td width=\"72\">\n<p>4<\/p>\n<\/td>\n<td width=\"132\">\n<p>545847584<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>121<\/p>\n<\/td>\n<td width=\"90\">\n<p>34<\/p>\n<\/td>\n<td width=\"72\">\n<p>3<\/p>\n<\/td>\n<td width=\"132\">\n<p>121212121<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>120<\/p>\n<\/td>\n<td width=\"90\">\n<p>212<\/p>\n<\/td>\n<td width=\"72\">\n<p>3<\/p>\n<\/td>\n<td width=\"132\">\n<p>548598459<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>218<\/p>\n<\/td>\n<td width=\"90\">\n<p>78<\/p>\n<\/td>\n<td width=\"72\">\n<p>1<\/p>\n<\/td>\n<td width=\"132\">\n<p>454545454<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"102\">\n<p>100<\/p>\n<\/td>\n<td width=\"90\">\n<p>343<\/p>\n<\/td>\n<td width=\"72\">\n<p>2<\/p>\n<\/td>\n<td width=\"132\">\n<p>323323232<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<ul>\n<li><strong>Map Tasks (Splitting and Mapping)<\/strong><\/li>\n<\/ul>\n<p>As we need to find the number of movies each user rated, we are interested in just two field from the data set \u2013 UserId and MovieId. We will extract and organize only the data what we care about.<\/p>\n<p>The <strong>output<\/strong> from the Map Tasks is the (Key, Value) pair &#8211;<\/p>\n<p><strong>(100,319), (120,387), (100,435), (121,34), (120,212), (218,78), (100,343)<\/strong><\/p>\n<\/p>\n<ul>\n<li><strong>Shuffling and sorting<\/strong><\/li>\n<\/ul>\n<p>This process sorts and groups\/clubs the Mapped data from the above step<\/p>\n<p>The <strong>output<\/strong> from shuffling and sorting is the Key, List (Values)\u2013<\/p>\n<p><strong>100, (319, 435, 343) \u00a0\u00a0\u00a0120, (387,212) \u00a0\u00a0\u00a0121, (34) \u00a0\u00a0\u00a0\u00a0\u00a0218, (78)<\/strong><\/p>\n<\/p>\n<ul>\n<li><strong>Reducing<\/strong><\/li>\n<\/ul>\n<p>This processes each key&#8217;s value from the above step. Reducer function would be to find the number of movies. It computes the aggregation of the MovieIds for each user. Reducer writes the final output to the <strong>HDFS<\/strong>.<\/p>\n<p>The <strong>output<\/strong> of this step is (key, value) pair &#8211;<\/p>\n<p><strong>(100,3) (120,2) (121,1) (218,1)<\/strong><\/p>\n<\/p>\n<ul>\n<li><strong>Conclusion<\/strong><\/li>\n<\/ul>\n<p>From the above MapReduce steps, user id-<\/p>\n<p><strong>100 rated 3 movies<\/strong>,<\/p>\n<p><strong>120 rated 2 movies<\/strong>,<\/p>\n<p><strong>121 and 218 rated 1 movie<\/strong><\/p>\n<\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong>Putting it all together in a Diagram<\/strong><\/span><\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong><a href=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnZofFxOxzIkLPxkBxWflTGSxfxwZw0dkOzanlC2p9ZBLt6bRtrU79CmVLCOZnxK9Ow*Etho7geQHU1Gs4uQzWj6\/3.PNG\" target=\"_self\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnZofFxOxzIkLPxkBxWflTGSxfxwZw0dkOzanlC2p9ZBLt6bRtrU79CmVLCOZnxK9Ow*Etho7geQHU1Gs4uQzWj6\/3.PNG\" class=\"align-center\" width=\"659\" height=\"233\"><\/a><\/strong><\/span><\/p>\n<\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong>How MapReduce distributes processing (In Detail)?<\/strong><\/span><\/p>\n<p>Now let&#8217;s understand the complete end to end workflow of MapReduce in Hadoop, how input is given to the mapper, how mapper process the data, where mapper writes the data, how data is shuffled and sorted from mapper to reducer nodes, where reducers run and what type of processing is done in the reducers? All these questions will be answered in the following:<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnbcZgd0FsJT7Fb0vjB0BOMsuLjDU**A3IuYjKjG3MGYoiUIVZOxjXl2tzuJ-Q5u7f7Ahrafke-6W0lCRGKP4Byt\/4.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/8fzzcZRAVnbcZgd0FsJT7Fb0vjB0BOMsuLjDU**A3IuYjKjG3MGYoiUIVZOxjXl2tzuJ-Q5u7f7Ahrafke-6W0lCRGKP4Byt\/4.PNG\" class=\"align-center\" width=\"717\"><\/a><\/p>\n<p><strong><u>Step 1<\/u> \u2013<\/strong> MapReduce workflow starts with the client program submitting job to the <strong>JobTracker*<\/strong><\/p>\n<ul>\n<li><strong>*JobTracker<\/strong>&#8211; a job configuration which specifies the map and reduce functions, as well as input and output path of the data. It also schedules jobs and tracks the assign jobs to the <strong>TaskTracker*<\/strong><\/li>\n<li><strong>*TastTracker \u2013<\/strong> It tracks the tasks and reports status to JobTracker.<\/li>\n<\/ul>\n<p><strong><u>Step 2<\/u><\/strong> \u2013 JobTracker will determine the number of splits from the input path of the data and it will select some TaskTrackers based on their network proximity to the data source. Then, the JobTracker send the task requests to those selected TaskTrackers.<\/p>\n<\/p>\n<p><strong><u>Step 3<\/u> \u2013<\/strong> Each Tasktracker will start the <strong>Map Phase processing<\/strong> by extracting input data from the splits. For each record parsed by the <strong>InputFormatter<\/strong>, it invokes the user provided &#8216;Map&#8217; function, which stores several key-value pair in the circular <strong>memory buffer<\/strong>* (100MB default size)<\/p>\n<ul>\n<li><strong>*Memory buffer \u2013<\/strong> it is found mainly in RAM and acts as an area where the CPU can store data temporarily.<\/li>\n<\/ul>\n<p>If the memory buffer fills up or if it reaches its maximum threshold (100 MB by default), mapper will block filling of data and spilling\/transferring of data takes place from memory buffer to the local disk until the buffer has space for incoming data.<\/p>\n<\/p>\n<p><strong><u>Step 4<\/u> &#8211;<\/strong> Before spilling the data into the disk, the thread will divide the data into <strong>partition<\/strong> corresponding to the reducers which it will ultimately send to. \u00a0For each partition, the background thread will perform in-memory sort by Key. Each time memory buffer reaches the threshold of filling data, a new spill file is created. There could be several spill files after the map task has written its last output record.<\/p>\n<\/p>\n<p><strong><u>Step 5<\/u><\/strong> \u2013 Before a map task is finished, all spill files are <strong>merged<\/strong> into a single partition and stored into the output files in the disk.<\/p>\n<\/p>\n<p><strong><u>Step 6<\/u><\/strong> \u2013 When the map task completes (all splits are done), the TaskTracker will notify the JobTracker. When all the TaskTrackers are done, the JobTracker will notify the selected TaskTrackers for the <strong>Reduce Phase.<\/strong> Hence, each TaskTracker will read the output files remoted from the above step and <strong>sort the Key-Value pairs<\/strong>. For each key, it invokes the \u201creduce\u201d function, which collects the <strong>key- aggregatedValue<\/strong> and writes it into the output file (one per reducer node).<\/p>\n<p>The JobTracker keep tracks of the progress of each phase and periodically ping the TaskTracker for their health status. When any of the map phase TaskTracker crashes, the JobTracker will reassign the map task to a different TaskTracker node, which will rerun all the assigned splits. If the reduce phase TaskTracker crashes, the JobTracker will rerun the reduce at a different TaskTracker.<\/p>\n<\/p>\n<p><strong><u>Step 7<\/u><\/strong> \u2013 After both phases are complete, the JobTracker will unblock the client program.<\/p>\n<\/p>\n<p><span style=\"text-decoration: underline; font-size: 12pt;\"><strong>What are the advantages of MapReduce?<\/strong><\/span><\/p>\n<p><strong>Resilient to Failure<\/strong> \u2013 an application master watch mapper and reducer tasks on each partition.<\/p>\n<p>Data processing is easy to <strong>scale<\/strong> over multiple computing nodes.<\/p>\n<p><strong>Parallel processing<\/strong> \u2013 In MapReduce, jobs are divided among multiple nodes and each node works with a part of the job simultaneously and hence helps to process the data using different machines. As the data is processed by different machines in parallel, the time taken to process the data gets reduced by a tremendous amount.<\/p>\n<p><strong>Cost-effective solution &#8211;<\/strong> <span>Hadoop\u2019s highly scalable structure also implies that it comes across as a very cost-effective solution for businesses that need to store ever growing data dictated by today\u2019s requirements. Hadoop\u2019s scale-out architecture with MapReduce, allows the storage and processing of data in a very affordable manner.<\/span><\/p>\n<p><strong>Fast \u2013<\/strong> The tools of data processing; here MapReduce are often on the same servers where the data is located, resulting in faster data processing.<\/p>\n<\/p>\n<p>In the next blog, we shall discuss about YARN which is another key feature of Hadoop. Stay tuned.<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:756046\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Aafrin Dabhoiwala Hadoop &#8211; MapReduce in an easy way In the previous blog, we discussed about HDFS, one of the main components of Hadoop. [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/07\/hadoop-for-beginners-part-2\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1029,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1028"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1028"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1028\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1029"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1028"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1028"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1028"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}