{"id":476,"date":"2018-05-15T14:44:00","date_gmt":"2018-05-15T14:44:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/05\/15\/data-or-algorithms-which-is-more-important\/"},"modified":"2018-05-15T14:44:00","modified_gmt":"2018-05-15T14:44:00","slug":"data-or-algorithms-which-is-more-important","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/05\/15\/data-or-algorithms-which-is-more-important\/","title":{"rendered":"Data or Algorithms \u2013 Which is More Important?"},"content":{"rendered":"<p>Author: William Vorhies<\/p>\n<div>\n<p><b><i>Summary:<\/i><\/b> <i>\u00a0Which is more important, the data or the algorithms?\u00a0 This chicken and egg question led me to realize that it\u2019s the data, and specifically the way we store and process the data that has dominated data science over the last 10 years.\u00a0 And it all leads back to Hadoop.<\/i><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt9veMbR8UPZ4cxETQRnEWcV-mbDXqgql-mllrUO-ERfGYFFmzxAgXlkB-yID4*3Ee7YWls4BOUURgm-NNep*L9i\/chickenegg1.jpg\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt9veMbR8UPZ4cxETQRnEWcV-mbDXqgql-mllrUO-ERfGYFFmzxAgXlkB-yID4*3Ee7YWls4BOUURgm-NNep*L9i\/chickenegg1.jpg?width=250\" width=\"250\" class=\"align-right\"><\/a>Recently I was challenged to speak on the role of data in data science.\u00a0 This almost sounds like a chicken and egg problem.\u00a0 How can you have one without the other?\u00a0 But as I reflected on how to explain this it also struck me that almost everything in the press today is about advances in algorithms.\u00a0 That\u2019s mostly deep learning and reinforcement learning which are driving our chatbots, image apps, and self-driving cars.<\/p>\n<p>So if you are fairly new to data science, say within the last five or six years you may have missed the fact that it is and was the data, or more specifically how we store and process the data that 
was the single most important factor in the explosion of data science over the last decade.\u00a0 In fact, there was a single innovation that enabled data lakes, recommenders, IoT, natural language processing, image and video recognition, AI, and reinforcement learning.<\/p>\n<p>Essentially all of these areas of major innovation can be traced back to a single enabler: NoSQL and Hadoop.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt9KhQiewdjHTOcydZHqm6jmEYpdPdcJgDh-kChAaPG9nJAaLMo8oqSFvObXKqOER-FAPIHyZIfpiDNPBNsY0Dg6\/nosql1.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt9KhQiewdjHTOcydZHqm6jmEYpdPdcJgDh-kChAaPG9nJAaLMo8oqSFvObXKqOER-FAPIHyZIfpiDNPBNsY0Dg6\/nosql1.png?width=600\" width=\"600\" class=\"align-center\"><\/a><\/p>\n<p>It was in 2006 that Doug Cutting and his team took the ideas Google had published about its proprietary systems to the Apache Software Foundation and created open source Hadoop.<\/p>\n<p>Most of you will recognize that this was also the birth of the era of Big Data, because Hadoop for the first time gave us a reasonable way to store, retrieve, and analyze anything.\u00a0 The addition of unstructured and semi-structured data like text, speech, image, and video created the possibilities of AI that we have today.\u00a0 It also let us store volumes of ordinary data like web logs or big transactional files that were previously simply too messy to store.<\/p>\n<p>What you may not know, and what I heard Doug Cutting himself say at last spring\u2019s Strata Conference in San Jose, is that the addition of unstructured and semi-structured data is not the most important feature of Hadoop.\u00a0 <b>The most important feature is that it allowed many ordinary computers to function as a single computer.<\/b>\u00a0 This was the birth of Massively Parallel Processing (MPP) on commodity hardware.\u00a0 If it hadn\u2019t been for MPP, the hardware we have today would never have evolved and today\u2019s data science simply would not and could not 
exist.<\/p>\n<p>It\u2019s interesting to track the impact that this has had on each of the major data science innovations over the last decade:<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Predictive Analytics<\/b><\/span><\/p>\n<p>I have personally been practicing predictive analytics since 2001.\u00a0 As valuable as that discipline was becoming to any major company with a large B2C market, we were restricted basically to numerical data.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt908O04nR3GO-UZcyCB4khUkKvy0b6EQ4PZC1VLasNUkvS-mvq3JDAanRBkcc5e6e*ap9RMe79j20BSxvQx0E6q\/nosql2.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt908O04nR3GO-UZcyCB4khUkKvy0b6EQ4PZC1VLasNUkvS-mvq3JDAanRBkcc5e6e*ap9RMe79j20BSxvQx0E6q\/nosql2.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>As we move through this history, I\u2019ll use this graphic to help locate the impact of the \u2018data\u2019 versus the innovation it enables.\u00a0 On the vertical axis we have the domains of structured through unstructured data.\u00a0 The horizontal axis describes whether that data science technique delivers very specific insights or just more directional guidance.<\/p>\n<p>For the most part, in predictive modeling we were restricted to what we could extract from an RDBMS like a BI warehouse, or with much more effort from transactional systems.\u00a0 A few of our algorithms like decision trees could directly handle standardized alpha fields like state abbreviations, but pretty much everything had to be converted to numeric.<\/p>\n<p>Predictive models, on the other hand, deliver business insights that are extremely specific about consumer behavior or the future value of a target variable.\u00a0 Generally, predictive models continue to deliver predictions in the range of 70% to 90% accuracy about questions like who will buy or what the spot price of oil will be next 
month.<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Data Lakes<\/b><\/span><\/p>\n<p>One of the first applications of our newfound compute power and flexibility was Data Lakes.\u00a0 These are the ad hoc repositories where you can place a lot of data without having to predefine a schema or get IT involved.\u00a0 These are the data scientist\u2019s playground where we can explore hypotheses and look for patterns without a lot of cost or time.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt*pKoH3GoOuvG-SZT5rb9pc75Jcm3r5O8eA8zZbu7tssE0DEL2Yw78W031jtLUypAnj4bBVA7imVpvdQsEd-qp6\/nosql3.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt*pKoH3GoOuvG-SZT5rb9pc75Jcm3r5O8eA8zZbu7tssE0DEL2Yw78W031jtLUypAnj4bBVA7imVpvdQsEd-qp6\/nosql3.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>Data Lakes in Hadoop could be established in a matter of hours and mostly without waiting for IT to help.\u00a0 These really sped up the predictive modeling process since the volume of data that could be processed was rapidly expanding thanks to MPP.\u00a0 They also gave us a place to begin developing our techniques for NLP and image processing.<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Recommenders<\/b><\/span><\/p>\n<p>Now that we could handle the volume and complexity of web logs and large transactional files, the field of recommenders took off.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt-xYXYNKqoPlM3CWN*n7Q-YEH7UtUEUII6zdfHYRHJN65XoMbUPjXfqJbHv-vn5l5YDzoi3SBiNWN*7m8KorRUa\/nosql4.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt-xYXYNKqoPlM3CWN*n7Q-YEH7UtUEUII6zdfHYRHJN65XoMbUPjXfqJbHv-vn5l5YDzoi3SBiNWN*7m8KorRUa\/nosql4.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>Recommender insights are directional in nature but answer really important questions on the minds of non-data scientists 
like:<\/p>\n<ul>\n<li>What should we buy?<\/li>\n<li>What should we watch or read?<\/li>\n<li>Who should we date or marry?<\/li>\n<\/ul>\n<p>The evolution of Recommenders underlies all of search and ecommerce.<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Natural Language Processing<\/b><\/span><\/p>\n<p>As we move into roughly the last five years, the more important features of Big Data enabled by Hadoop and NoSQL have become its support for unstructured data and data in motion.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt-xtsNYEaGphwSjDncL8U9EIH-uL5Wfh1GhdA77xT0vg5lozC34DvRtBiDkPundStRtFKgZhRo6iZjhyLCDOpp*\/nosql5.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt-xtsNYEaGphwSjDncL8U9EIH-uL5Wfh1GhdA77xT0vg5lozC34DvRtBiDkPundStRtFKgZhRo6iZjhyLCDOpp*\/nosql5.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>This is Alexa, Siri, Cortana, Google Assistant, and the thousands of chatbots that have started emerging just since 2015.\u00a0 NLP took several years to evolve and now requires deep learning algorithms like recurrent neural nets.\u00a0 Our deep learning algorithms wouldn\u2019t be able to find these patterns without millions of data items to examine and without MPP to keep the training time within human time frames.<\/p>\n<p>Chatbots, operating in both text and spoken language, have emerged so rapidly over just the last three years that while in 2015 only 25% of surveyed companies had heard of them, by 2017 75% of companies were reported to be building them.<\/p>\n<p>An interesting feature emerging from NLP is that we have learned to take unstructured text and convert it to features in our predictive models alongside our traditional variables to create more accurate models.<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Internet of Things (IoT)<\/b><\/span><\/p>\n<p>IoT has created an industry of its own by taking the third capability of Hadoop and Big 
Data, the ability to process data in motion, and turning that relatively straightforward capability into an unbelievable variety of applications.<\/p>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt8V-Snk4r*w4fUyW7gbf2tuuVyR*33dRdOMK4mqWsUpYfeagacW1BtZ2GBghCc3aheT7lJnDXt*HSBVt-tuKtyU\/nosql6.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt8V-Snk4r*w4fUyW7gbf2tuuVyR*33dRdOMK4mqWsUpYfeagacW1BtZ2GBghCc3aheT7lJnDXt*HSBVt-tuKtyU\/nosql6.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>Hadoop allows us to look at semi-structured data streaming in from sensors and act on it before it has even been stored.\u00a0 This dramatically speeds up response time when compared to the previous paradigm of store-analyze-deploy.<\/p>\n<p>IoT systems lead us back to the very accurate and specific end of the insight scale.\u00a0 Some of their actions can be driven by complex predictive models, but others may simply compare a sensor reading to a standard value and issue a message.\u00a0 These can be as simple as \u201cuh-oh, the dog has left the yard\u201d or as sophisticated as \u201cget a doctor to patient Jones, who is likely to have a heart attack within the next 5 minutes\u201d.<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"font-size-3\"><b>Image Processing, Reinforcement Learning, and Other Deep Learning Techniques<\/b><\/span><\/p>\n<p>The most emergent of our new data science capabilities are those that have been loosely branded \u2018artificial intelligence\u2019.\u00a0 NLP, which has evolved from simple sentiment analysis and word clouds to full-fledged conversational ability, should also be included in this category.\u00a0 Taken together, they are the eyes, ears, arms, and legs of our many robots, including self-driving cars.<\/p>\n<p>\u00a0<a 
href=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt8I-JBHx2z0UJM1NZmE-QObOT3QgvrjTRccDaQFgs1h7ZDKpuyRT5ppE7buk-3cupw3GEIrKzCVPcblKnkgHu2y\/nosql7.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/c6kyI4oiDt8I-JBHx2z0UJM1NZmE-QObOT3QgvrjTRccDaQFgs1h7ZDKpuyRT5ppE7buk-3cupw3GEIrKzCVPcblKnkgHu2y\/nosql7.png?width=500\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>Like NLP, image processing relies on deep neural nets, mostly in the class of convolutional neural nets.\u00a0 Reinforcement learning is still evolving a common tool set but relies just as deeply on MPP of huge unstructured data sets.<\/p>\n<p>Of course, there have been other advancements, but they are more in the nature of refinements.\u00a0 Hadoop has largely been replaced by Spark, which continues all of its prior capabilities, only better and faster.\u00a0 CPUs used in MPP are being paired with or replaced by GPUs or FPGAs to create horizontal process scaling that allows commercial projects to take advantage of supercomputer speeds.<\/p>\n<p>All of data science as we know it today, all of these innovations we\u2019ve seen over the last 10 years, continues to grow out of the not-so-simple revolution in how we store and process data with NoSQL and Hadoop.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>About the author:\u00a0 Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.\u00a0 He can be reached at:<\/p>\n<p><a href=\"mailto:Bill@DataScienceCentral.com\">Bill@DataScienceCentral.com<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:658233\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: William Vorhies Summary: \u00a0Which is more important, the data or the algorithms?\u00a0 This chicken and egg question led me to realize that it\u2019s the [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" 
href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/05\/15\/data-or-algorithms-which-is-more-important\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":473,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/476"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=476"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/476\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/467"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=476"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=476"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=476"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}