Author: Vincent Granville
In this article, I describe the various steps involved in managing a machine learning process from beginning to end. Depending on which company you work for, you may or may not be involved in all the steps. In larger companies, you typically focus on one or two specialized aspects of a project. In small companies, you may be involved in all the steps. Here the focus is on large projects, such as developing a taxonomy, as opposed to ad-hoc or one-time analyses. I also mention all the people involved, besides machine learning professionals.
Steps involved in machine learning projects
In chronological order, here are the main steps. Sometimes it is necessary to recognize errors in the process, move back, and start again at an earlier step. This is by no means a linear process; it is more like trial-and-error experimentation.
1. Defining the problem and the metrics (also called features) that we want to track. Assessing the data available (internal and third-party sources) or the databases that need to be created, as well as the database architecture for optimal storage and processing. Discussing cloud architectures to choose from, data volume (potential future scaling issues), and data flows. Do we need real-time data? How much can safely be outsourced? Do we need to hire some staff? Discussing costs, ROI, vendors, and timeframe. Decision makers and business analysts are heavily involved, and data scientists and engineers may participate in the discussion.
2. Defining goals and types of analyses to be performed. Can we monetize the data? Are we going to use the data for segmentation, customer profiling and better targeting, to optimize some processes such as pricing or supply chain, for fraud detection, taxonomy creation, to increase sales, for competitive or marketing intelligence, or to improve user experience, for instance via a recommendation engine or better search capabilities? What are the most relevant goals? Who will be the main users?
3. Collecting the data. Assessing who has access to the data (and which parts of the data, such as summary tables versus live databases), and in what capacity. Here privacy and security issues are also discussed. The IT team, legal team and data engineers are typically involved. Dashboard design is also discussed, with the purpose of designing good dashboards for end-users such as decision makers, product or marketing teams, or customers.
4. Exploratory data analysis. Here data scientists are more heavily involved, though this step should be automated as much as possible. You need to detect missing data and decide how to handle it (for instance using imputation methods), identify outliers and what they mean, summarize and visualize the data, find erroneously coded data and duplicates, find correlations, perform preliminary analyses, and find the best predictive features and optimum binning techniques (see section 4 in this article). This could lead to the discovery of data flaws, and may force you to go back to a previous step and start again from there, to fix any significant issue.
5. The true machine learning / modeling step. At this point, we assume that the data collected is stable enough, and can be used for its original purpose. Predictive models are tested, and neural networks or other algorithms / models are trained, with goodness-of-fit tests and cross-validation. The data is available for various analyses, such as post-mortem, fraud detection, or proof of concept. Algorithms are prototyped, automated and eventually implemented in production mode. Output data is stored in auxiliary tables for further use, such as email alerts or to populate dashboards. External data sources may be added and integrated. At this point, major data issues have been fixed.
6. Creation of the end-user platform. Typically, it comes as dashboards featuring visualizations and summary data that can be exported in standardized formats, even spreadsheets. This provides the insights that can be acted upon by decision makers. The platform can be used for A/B testing. It can also come as a system of email alerts sent to decision makers, customers, or anyone who needs to be informed.
7. Maintenance. The models need to be adapted to changing data, changing patterns, or changing definitions of core metrics. Some satellite database tables must be updated, for instance every six months. Maybe a new platform able to store more data is needed, and data migration must be planned. Audits are performed to keep the system sound. New metrics may be introduced, as new sources of data are collected. Old data may be archived. Now we should get a good idea of the long-term yield (ROI) of the project, what works well and what needs to be improved.
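To make step 4 (exploratory data analysis) more concrete, here is a minimal sketch of three of the checks mentioned there: imputing missing values, flagging outliers, and removing duplicates. It uses only the Python standard library. The function names, the choice of median imputation, and the 1.5 x IQR outlier rule are assumptions made for illustration; they are not prescribed by the process described above, and a real project would likely rely on a data-frame library and on domain knowledge to choose the methods.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values.

    Median imputation is only one possible choice (an assumption here).
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    This is Tukey's rule of thumb; k=1.5 is a convention, not a law.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    # Hypothetical toy data: (customer_id, amount) pairs.
    rows = [(1, 10), (2, None), (3, 12), (3, 12), (4, 11), (5, 95)]
    rows = drop_duplicates(rows)
    amounts = impute_median([amount for _, amount in rows])
    print("amounts after imputation:", amounts)
    print("flagged as outliers:", iqr_outliers(amounts))
```

A flagged value is not necessarily an error. As noted in step 4, the point is to understand what the outliers mean before deciding whether to keep, correct, or discard them.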
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at DataShaping.com, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent’s articles and books, here.