Author: William Vorhies
Summary: Which is more important, the data or the algorithms? This chicken and egg question led me to realize that it’s the data, and specifically the way we store and process the data that has dominated data science over the last 10 years. And it all leads back to Hadoop.
Recently I was challenged to speak on the role of data in data science. This almost sounds like a chicken and egg problem. How can you have one without the other? But as I reflected on how to explain this it also struck me that almost everything in the press today is about advances in algorithms. That’s mostly deep learning and reinforcement learning which are driving our chatbots, image apps, and self-driving cars.
So if you are fairly new to data science, say within the last five or six years, you may have missed the fact that it is and was the data, or more specifically how we store and process the data, that was the single most important factor in the explosion of data science over the last decade. In fact, there was a single innovation that enabled data lakes, recommenders, IoT, natural language processing, image and video recognition, AI, and reinforcement learning.
Essentially all of these areas of major innovation can be traced back to a single enabler, NoSQL Hadoop.
It was in 2006 that Doug Cutting and his team took the proprietary work done at Google to the Apache Software Foundation and created open source Hadoop.
Most of you will recognize that this was also the birth of the era of Big Data, because Hadoop for the first time gave us a reasonable way to store, retrieve, and analyze anything. The addition of unstructured and semi-structured data like text, speech, image, and video created the possibilities of AI that we have today. It also let us store volumes of ordinary data like web logs or big transactional files that were previously simply too messy to store.
What you may not know, and what I heard Doug Cutting himself say at last spring’s Strata Conference in San Jose, is that the addition of unstructured and semi-structured data is not the most important feature of Hadoop. The most important feature is that it allowed many ordinary computers to function as a single computer. This was the birth of Massively Parallel Processing (MPP). If it hadn’t been for MPP, the hardware we have today would never have evolved and today’s data science simply would not and could not exist.
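To make the idea of many machines acting as one a little more concrete, here is a minimal sketch of the map-and-reduce pattern that Hadoop popularized. It is not Hadoop code: the worker threads stand in for separate computers, and the function names (map_partition, merge_counts, word_count) and the sample log lines are all assumptions made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two partitions."""
    return left + right


def word_count(lines, n_workers=3):
    # Split the input into partitions, one per (hypothetical) worker node.
    partitions = [lines[i::n_workers] for i in range(n_workers)]
    # Each worker processes only its own partition, in parallel.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    # Merge the partial results into a single answer.
    return reduce(merge_counts, partials, Counter())


# Hypothetical web log lines, invented for this example.
logs = ["GET /home", "GET /cart", "POST /cart", "GET /home"]
totals = word_count(logs)
```

The point of the pattern is that no single worker ever needs to see all of the data, so adding more ordinary machines lets the same job handle more data.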
It’s interesting to track the impact that this has had on each of the major data science innovations over the last decade:
Predictive Analytics
I have personally been practicing in predictive analytics since 2001. As valuable as that discipline was becoming to any major company with a large B2C market, we were restricted to basically numerical data.
As we move through this history I’ll use this graphic to help locate the impact of the ‘data’ versus the innovation it enables. On the vertical axis we have the domains of structured through unstructured data. On the horizontal axis, a description of whether that data science technique delivers very specific insights or just more directional guidance.
For the most part, in predictive modeling we were restricted to what we could extract from RDBMS systems like a BI warehouse, or with much more effort from transactional systems. A few of our algorithms like decision trees could directly handle standardized alpha fields like state abbreviations, but pretty much everything had to be converted to numeric.
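As a small illustration of what converting everything to numeric meant in practice, the sketch below one-hot encodes a column of state abbreviations. One-hot encoding is only one common way to do this conversion, and the helper name one_hot and the sample values are assumptions for the example.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns."""
    # One output column per distinct label, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical alpha field, such as a customer's state.
states = ["CA", "NY", "CA", "TX"]
categories, encoded = one_hot(states)
# categories is ["CA", "NY", "TX"]; the first row of encoded is [1, 0, 0]
```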
Predictive models, on the other hand, deliver business insights that are extremely specific about consumer behavior or the future value of a target variable. Generally, predictive models continue to deliver predictions in the range of 70% to 90% accuracy about questions like who will buy or what the spot price of oil will be next month.
Data Lakes
One of the first applications of our newfound compute power and flexibility was Data Lakes. These are the ad hoc repositories where you can place a lot of data without having to predefine a schema or get IT involved. These are the data scientist’s playground where we can explore hypotheses and look for patterns without a lot of cost or time.
Data Lakes in Hadoop could be established in a matter of hours and mostly without waiting for IT to help. These really sped up the predictive modeling process since the volume of data that could be processed was rapidly expanding thanks to MPP. It also gave us a place to begin developing our techniques for NLP and image processing.
Recommenders
Now that we could handle the volume and complexity of web logs and large transactional files, the field of recommenders took off.
Recommender insights are directional in nature but answer really important questions on the minds of non-data scientists like:
- What should we buy?
- What should we watch or read?
- Who should we date or marry?
The evolution of Recommenders underlies all of search and ecommerce.
Natural Language Processing
As we move forward into about the last five years, the more important features of Big Data enabled by Hadoop and NoSQL have become the ability to support unstructured data and data in motion.
This is Alexa, Siri, Cortana, Google Assistant, and the thousands of chatbots that have started emerging just since 2015. NLP took several years to evolve and now requires deep learning algorithms like recurrent neural nets. Our deep learning algorithms wouldn’t be able to find these patterns without millions of data items to examine and MPP to keep the training time within human time frames.
Chatbots, operating in both text and spoken language, have emerged so rapidly over just the last three years that in 2015 only 25% of surveyed companies had heard of them, while by 2017 75% of companies were reported to be building them.
An interesting feature emerging from NLP is that we have learned to take unstructured text and convert it to features in our predictive models alongside our traditional variables to create more accurate models.
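One simple way to picture this is a bag-of-words count appended to the usual numeric columns. The sketch below is only an illustration under that assumption; production NLP pipelines use much richer representations, and the function names and sample customer data here are invented for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect every distinct word across the texts, in sorted order."""
    return sorted({word for text in texts for word in text.lower().split()})


def text_to_counts(text, vocabulary):
    """Count how often each vocabulary word appears in one text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def combine(numeric_row, text, vocabulary):
    """Append the word counts to the traditional numeric variables."""
    return list(numeric_row) + text_to_counts(text, vocabulary)


# Hypothetical data: [age, prior purchases] plus a free-text comment.
numeric = [[42, 3], [35, 1]]
comments = ["late delivery", "great service great price"]

vocabulary = build_vocabulary(comments)
features = [combine(row, text, vocabulary)
            for row, text in zip(numeric, comments)]
```

Each row of features can then be fed to the same predictive model that previously saw only the numeric columns.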
Internet of Things (IoT)
IoT has created an industry of its own by taking the third capability of Hadoop and Big Data, the ability to process data in motion, and turning that relatively straightforward capability into an unbelievable variety of applications.
Hadoop allows us to examine semi-structured data streaming in from sensors and act on it before it has even been stored. This dramatically speeds up response time compared to the previous paradigm of store-analyze-deploy.
IoT systems lead us back to the very accurate and specific end of the insight scale. Some of its actions can be driven by complex predictive models but others may simply compare a sensor reading to a standard value and issue a message. These can be as simple as “oh, oh, the dog has left the yard” or as sophisticated as “get a doctor to patient Jones who is about to have a heart attack in the next 5 minutes”.
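The simplest case, comparing a sensor reading to a standard value and issuing a message, can be sketched in a few lines. This is a toy illustration rather than a real streaming framework: the generator name alerts, the sensor IDs, and the limit are all assumptions, and an ordinary Python iterator stands in for the live stream.

```python
def alerts(readings, limit):
    """Yield a message for each reading above the limit, as it arrives."""
    for sensor_id, value in readings:
        # Act on the reading immediately; nothing is stored first.
        if value > limit:
            yield f"ALERT {sensor_id}: {value} exceeds {limit}"


# Hypothetical stream of (sensor_id, heart_rate) readings.
stream = iter([("bed-1", 72), ("bed-2", 140), ("bed-1", 75), ("bed-3", 155)])
messages = list(alerts(stream, limit=120))
```

Because alerts is a generator, each message is produced as soon as its reading is consumed, which mirrors the act-before-store idea described above.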
Image Processing, Reinforcement Learning, and Other Deep Learning Techniques
The most emergent of our new data science capabilities are those that have been loosely branded ‘artificial intelligence’. NLP, which has evolved from simple sentiment analysis and word clouds to full-fledged conversational ability, should also be included in this category. Taken together they are the eyes, ears, arms and legs of our many robots including self-driving cars.
Like NLP, image processing relies on deep neural nets, mostly in the class of convolutional neural nets. Reinforcement learning is still evolving a common tool set but relies just as deeply on MPP of huge unstructured data sets.
Of course there have been other advancements but they are more in the nature of refinements. Hadoop has largely been replaced by Spark, which continues all of its prior capabilities, only better and faster. CPUs used in MPP are being paired with or replaced by GPUs or FPGAs to create horizontal process scaling that allows commercial projects to take advantage of supercomputer speeds.
All of data science as we know it today, all of these innovations we’ve seen over the last 10 years, continues to grow out of the not-so-simple revolution in how we store and process data with NoSQL and Hadoop.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001. He can be reached at: