Common Errors in Machine Learning due to Poor Statistics Knowledge

Author: Vincent Granville

Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables and, say, 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. Even if every variable is pure noise, you are almost guaranteed to find some above 0.99. This is best illustrated in my article How to Lie with P-values, which also discusses how to handle and fix the problem.
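The effect is easy to reproduce on a small scale. The sketch below is only an illustration, not code from the article mentioned above: it assumes independent Gaussian noise, uses 500 variables instead of 100,000 so that it runs in a second or so, and the function names are made up for the example. Even at that reduced size, the largest correlation found among variables that are unrelated by construction should typically exceed 0.9.

```python
import math
import random


def standardize(values):
    """Center a list of numbers and scale it to unit Euclidean norm."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def max_spurious_correlation(n_variables=500, n_observations=10, seed=0):
    """Return the largest |Pearson correlation| among independent noise variables."""
    rng = random.Random(seed)
    # Every variable is independent Gaussian noise, so the true correlation is 0.
    variables = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_observations)])
        for _ in range(n_variables)
    ]
    best = 0.0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            # For centered, unit-norm vectors the dot product is the correlation.
            r = sum(a * b for a, b in zip(variables[i], variables[j]))
            best = max(best, abs(r))
    return best


if __name__ == "__main__":
    print("largest |correlation| among pure noise:", max_spurious_correlation())
```

Increasing n_variables or decreasing n_observations pushes the maximum closer to 1, which is why the number of comparisons has to be accounted for before any single correlation is trusted.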

This is being done on such a large scale that I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or what they hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these “bad stats” end up being featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, and observations and variables are carefully chosen just to make a (wrong) point.

Trusting data is another big source of errors. What’s the point of building a 99% accurate model if your data is 20% faulty or, worse, you failed to gather the right kind of data, or the right predictors, to start with? Also, models with no sound cross-validation are bound to fail. In Fintech, you can do back-testing to check a model. But it is useless: what you need to do is called walk-forward, a process in which your past data is split into two sets: older data, on which the model is trained, and the most recent data, on which it is tested. Walk-forward is akin to testing your model on future data that is already in your possession; it is a form of cross-validation in machine learning lingo. And then, you need to do it right: if the two sets are too similar, overfitting issues can go unnoticed.
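One common way to organize a walk-forward test is with an expanding training window: train on everything up to a cutoff, test on the block that follows, then move the cutoff forward. The sketch below only generates the index splits, under the assumption that observations are already sorted by time; the function name and its parameters are hypothetical, chosen for the illustration.

```python
def walk_forward_splits(n_samples, initial_train_size, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window always ends right before the test block, so the
    model is never evaluated on data older than what it was trained on.
    """
    train_end = initial_train_size
    while train_end + test_size <= n_samples:
        train = list(range(0, train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += test_size  # expand the training window and move forward


if __name__ == "__main__":
    for train, test in walk_forward_splits(10, 4, 2):
        print("train:", train, "test:", test)
```

With 10 observations, an initial training window of 4 and test blocks of 2, this yields three splits, the last one training on observations 0 to 7 and testing on 8 and 9.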

Trusting the R-squared is another source of potential problems. It depends on your sample size, so you can’t compare results for two data sets of different sizes, and it is sensitive to outliers. Google alternatives to R-squared to find a solution. Also, using the normal distribution as a panacea leads to many problems when dealing with data that has a different tail, or that is not unimodal or not symmetric. Sometimes a simple transformation, such as a logistic map or a logarithmic transform, will fix the issue.
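The sensitivity to outliers can be shown with a toy example. The sketch below uses made-up numbers and relies on the fact that, for a simple linear regression with an intercept, the R-squared equals the squared Pearson correlation. Eight points with essentially no linear relationship give an R-squared below 0.05; adding a single extreme point raises it above 0.97.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (with intercept).

    For this model it equals the squared Pearson correlation of x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Made-up data: y just alternates between 2 and 1, with no real trend in x.
X_CLEAN = [1, 2, 3, 4, 5, 6, 7, 8]
Y_CLEAN = [2, 1, 2, 1, 2, 1, 2, 1]

# Same data plus one extreme observation at (50, 50).
X_OUTLIER = X_CLEAN + [50]
Y_OUTLIER = Y_CLEAN + [50]

if __name__ == "__main__":
    print("R-squared without outlier:", round(r_squared(X_CLEAN, Y_CLEAN), 3))
    print("R-squared with one outlier:", round(r_squared(X_OUTLIER, Y_OUTLIER), 3))
```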

Even the choice of metrics can have huge consequences and lead to different conclusions based on the same data set. If your conclusions should be the same regardless of whether you use miles or yards, then choose scale-invariant modeling techniques.
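As an illustration, a nearest-neighbor search based on the plain Euclidean distance is not scale-invariant. In the sketch below (made-up points with a distance feature and an age feature, hypothetical function names), the nearest neighbor of the query changes when the distance is expressed in yards instead of miles. Standardizing each feature first is one possible remedy among others, and it makes the answer independent of the unit.

```python
import math

YARDS_PER_MILE = 1760


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(query, candidates):
    """Return the name of the candidate closest to the query."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


def standardize(points):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
        for c, m in zip(columns, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def nearest_after_standardizing(query, candidates):
    """Nearest neighbor computed on standardized features."""
    names = list(candidates)
    scaled = standardize([query] + [candidates[n] for n in names])
    return nearest(scaled[0], dict(zip(names, scaled[1:])))


def to_yards(point):
    """Convert the first feature from miles to yards; leave the age unchanged."""
    return (point[0] * YARDS_PER_MILE, point[1])


# Made-up data: (distance, age in years).
QUERY_MILES = (1.0, 30.0)
CANDIDATES_MILES = {"A": (1.5, 45.0), "B": (3.0, 32.0)}
QUERY_YARDS = to_yards(QUERY_MILES)
CANDIDATES_YARDS = {k: to_yards(v) for k, v in CANDIDATES_MILES.items()}

if __name__ == "__main__":
    print("raw, miles:", nearest(QUERY_MILES, CANDIDATES_MILES))
    print("raw, yards:", nearest(QUERY_YARDS, CANDIDATES_YARDS))
    print("standardized, miles:", nearest_after_standardizing(QUERY_MILES, CANDIDATES_MILES))
    print("standardized, yards:", nearest_after_standardizing(QUERY_YARDS, CANDIDATES_YARDS))
```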

Missing data can be handled inappropriately, being replaced by averages computed on available observations, even though better imputation techniques exist. But what if that data is missing precisely because it behaves differently from your average? Think about surveys or Amazon reviews. Who writes reviews, and who does not? Of course the two categories of people are very different, and what’s more, the vast majority of people never write reviews: so reviews are based on a tiny, skewed sample of the users. The fix here is to have a few professional reviews blended with those from regular users, and to score the users correctly to give the reader a better picture. If you fail to do it, soon enough all readers will know that your reviews are not trustworthy, and you might as well remove all reviews from your website and get rid of the data scientists working on the project, saving a lot of money and improving your business brand.
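A small simulation shows why averaging the available observations misleads when data is not missing at random. The setup below is entirely hypothetical: each user has a true satisfaction score from 1 to 5, and the probability of writing a review is assumed to grow with that score (4% per point). Under these assumptions only about one user in eight writes a review, and the average of the reviews overstates the true average by roughly two thirds of a point. Filling in the missing scores with the observed mean would carry the same bias over to the full data set.

```python
import random


def simulate_reviews(n_users=50_000, seed=0):
    """Return (true_mean, observed_mean, response_rate) for simulated reviews."""
    rng = random.Random(seed)
    # Hypothetical true satisfaction score of every user, from 1 to 5.
    true_scores = [rng.randint(1, 5) for _ in range(n_users)]
    # Assumption: happier users are more likely to write a review
    # (probability 0.04 * score), so the missingness depends on the value.
    observed = [s for s in true_scores if rng.random() < 0.04 * s]
    true_mean = sum(true_scores) / n_users
    observed_mean = sum(observed) / len(observed)
    return true_mean, observed_mean, len(observed) / n_users


if __name__ == "__main__":
    true_mean, observed_mean, rate = simulate_reviews()
    print("share of users who wrote a review:", round(rate, 3))
    print("true mean score:", round(true_mean, 2))
    print("mean of observed reviews:", round(observed_mean, 2))
```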

Much of this is discussed (with fixes) in my recent book Statistics: new foundations, toolbox, and machine learning recipes, available (for free) here.
