Comparing Model Evaluation Techniques Part 1: Statistical Tools & Tests

Author: Stephanie Glen

Evaluating a model is just as important as creating the model in the first place. Even if you use the most statistically sound tools to create your model, the end result may not be what you expected. Which metric you use to test your model depends on the type of data you’re working with and your comfort level with statistics.

Model evaluation techniques answer three main questions:

  1. How well does your model match your data (in other words, what is the goodness of fit)?
  2. Assuming you’ve created multiple models, which one is the best? (Note that “best” can have different criteria depending on the situation and personal preferences; what is best for one situation may not be best for another.)
  3. Will your model predict new observations for your data set with accuracy?

The following summary of model evaluation techniques is by no means exhaustive; it’s intended to be a starting point if you’re unfamiliar with the available techniques. In part 1, I discuss some of the common Statistical Tools and Tests. Part 2 will cover tools for Clustering and Classification.


1. Confidence Interval

A confidence interval is a measure of how reliable a statistical estimate is. For example, you might have calculated a mean or standard deviation for your data set. But how reliable is that estimate? A confidence interval gives you an easy-to-understand boundary. For example, if your calculated mean is 10 feet, a 99% confidence interval might run from 9 feet to 11 feet (i.e. 1 foot either side of the mean). This means that the method used to build the interval captures the true mean in about 99% of repeated samples; it does not mean that 99% of individual results fall in that range.
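As a rough illustration of the idea above, here is a minimal sketch that builds a confidence interval for a mean using only the Python standard library. The measurements and the function name `mean_confidence_interval` are made up for the example, and the interval uses a normal approximation; for small samples a t-based critical value would usually be more appropriate.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.99):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, based on the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value (about 2.576 for 99% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical measurements, in feet.
heights = [9.1, 10.4, 10.9, 9.6, 10.2, 9.8, 10.7, 9.3, 10.1, 9.9,
           10.5, 9.7, 10.3, 9.5, 10.0, 10.6, 9.4, 10.8, 9.2, 10.0]

low, high = mean_confidence_interval(heights, confidence=0.99)
print(f"mean = {statistics.fmean(heights):.2f} ft, "
      f"99% CI = ({low:.2f}, {high:.2f}) ft")
```

Asking for a lower confidence level (say 95%) gives a narrower interval from the same data, which is the usual trade-off between confidence and precision.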

2. Root Mean Square Error (RMSE)

RMSE is a measure of how spread out residuals are. In other words, it tells you how concentrated the data is around the line of best fit. It’s one of the most popular metrics for evaluating continuous data, and is widely used in Excel. However, it’s a lot trickier to understand than simple statistics like confidence intervals because of the calculations involved. It also penalizes larger differences more heavily (the errors are squared), meaning that it’s sensitive to outliers. Even if you don’t understand the calculations behind RMSE, Excel will still spit out an answer, leaving you to puzzle over the significance of the result. To complicate matters further, RMSE (and other similar metrics like bias and correlation coefficients) can really only be fully understood if you are very familiar with the underlying data and model.
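The calculation itself is short once written out: square each residual, average the squares, and take the square root. A minimal sketch, with made-up observed and predicted values:

```python
import math

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observations and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(rmse(actual, predicted))  # every prediction is off by 0.5, so this prints 0.5
```

The result is in the same units as the data, which is part of why the metric is popular.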

L^1 version of RMSE

RMSE is sensitive to outliers, which is one reason why it’s fallen out of favor with many data scientists. An alternative is a more modern variant, such as an L^1 version.
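The specific L^1 version the article points to is not reproduced here. As a stand-in, the sketch below uses mean absolute error (MAE), the standard L^1 counterpart of RMSE, to show why an absolute-value metric reacts less strongly to a single outlier. The residuals are made up for illustration.

```python
import math

def rmse(residuals):
    """L^2-style error: square root of the mean squared residual."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def mae(residuals):
    """L^1-style error (stand-in for an L^1 version): mean absolute residual."""
    return sum(abs(r) for r in residuals) / len(residuals)

# Hypothetical residuals: ten errors of size 1, then the same set with one outlier.
clean = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
with_outlier = clean[:-1] + [-20.0]

for name, residuals in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:>12}: RMSE = {rmse(residuals):.3f}, MAE = {mae(residuals):.3f}")
```

On the clean residuals both metrics agree. With the outlier, the squared term dominates RMSE, while MAE grows only in proportion to the size of the error.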



While basic statistical tools (like those listed above) are fairly easy for the non-statistician to understand, delving into the arena of statistical tests requires much more in-depth knowledge, not only of statistics, but of your data and model. The major advantage of running a statistical test is that it gives you great confidence in your results. For many professional arenas and publications, statistical tests are an absolute must. The downside is that you really have to know your data inside and out in order to interpret the results from these tests; otherwise, it’s easy to misinterpret them.

1. Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Goodness of Fit Test (K-S test) is a distribution-free test that compares your data with a known distribution (usually a normal distribution) and lets you know if they have the same distribution. The fact that the test statistic behaves the same way whichever distribution you test against is a great advantage, but the test has several drawbacks, including the fact that you have to specify the location, scale, and shape parameters; these cannot be estimated from the data, as doing so invalidates the test. Another big disadvantage is that it usually can’t be used for discrete data without some particularly cumbersome calculations or a software add-on.
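To make the mechanics concrete, here is a minimal standard-library sketch of a one-sample K-S test. The statistic is the largest gap between the empirical distribution of the data and the hypothesized distribution; the p-value uses the asymptotic Kolmogorov series, which is only an approximation for moderate-to-large samples. The function names and the idealized sample are assumptions made for the example, and the reference distribution's mean and standard deviation are fixed in advance rather than estimated from the data.

```python
import math
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue_asymptotic(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the p-value is indistinguishable from 1 in this range
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))

# Hypothetical, idealized sample: evenly spaced quantiles of Normal(10, 1).
n = 50
sample = [NormalDist(10.0, 1.0).inv_cdf((i + 0.5) / n) for i in range(n)]

d_match = ks_statistic(sample, NormalDist(10.0, 1.0).cdf)  # correct reference
d_shift = ks_statistic(sample, NormalDist(11.0, 1.0).cdf)  # wrong location
p_match = ks_pvalue_asymptotic(d_match, n)
p_shift = ks_pvalue_asymptotic(d_shift, n)

print(f"Normal(10, 1): D = {d_match:.3f}, p = {p_match:.3g}")
print(f"Normal(11, 1): D = {d_shift:.3f}, p = {p_shift:.3g}")
```

In practice a library routine such as `scipy.stats.kstest` would typically be used instead of hand-rolled code; the sketch is only meant to show what the test measures.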

2. Lilliefors Test

The Lilliefors test is a corrected version of the K-S test for normality, and it generally gives a more accurate approximation of the test statistic’s distribution. This is especially true if you don’t know the population mean and standard deviation (which is usually the case). Many statistical packages (like SPSS) combine the two tests as a “Lilliefors corrected” K-S test.
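One way to see what the correction does is to simulate it. The sketch below is a Monte Carlo approximation, not the table-based procedure a statistics package would use: it computes the K-S distance against a normal distribution whose mean and standard deviation are estimated from the sample, then compares that distance with the distances obtained from simulated normal samples treated the same way. The function names, sample data, simulation count, and seed are all assumptions made for the example.

```python
import math
import random
import statistics
from statistics import NormalDist

def ks_normal_estimated(sample):
    """K-S distance to a normal with mean and sd estimated from the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def lilliefors_pvalue(sample, simulations=1000, seed=0):
    """Monte Carlo p-value for the Lilliefors test of normality."""
    rng = random.Random(seed)
    n = len(sample)
    observed = ks_normal_estimated(sample)
    exceed = 0
    for _ in range(simulations):
        # Under the null the statistic does not depend on location or scale,
        # so simulating from a standard normal is sufficient.
        simulated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if ks_normal_estimated(simulated) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (simulations + 1)

# Hypothetical samples: idealized normal quantiles and idealized exponential quantiles.
n = 100
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]

d_normal, p_normal = lilliefors_pvalue(normal_like)
d_skewed, p_skewed = lilliefors_pvalue(skewed)
print(f"normal-like: D = {d_normal:.3f}, p = {p_normal:.3f}")
print(f"skewed:      D = {d_skewed:.3f}, p = {p_skewed:.3f}")
```

Because the parameters are estimated from the same data, the distances tend to be smaller than in the plain K-S setting, which is why the ordinary K-S p-value would be too lenient here. For real work, `statsmodels.stats.diagnostic.lilliefors` provides a ready-made version of the test.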

3. Chi-Square

The Chi-Square goodness-of-fit test is similar to Kolmogorov-Smirnov, but it works on counts of observations grouped into categories or bins. This means that you have to specify the distribution you are testing against in order to calculate the expected counts. While the K-S test performs poorly when you estimate population parameters, chi-square can be successfully run with estimates. That said, a downside is that it doesn’t work well with small sample sizes.

In general, K-S is usually preferred for its higher power. However, when you don’t know a certain population parameter (i.e. the one you’re trying to estimate, like the mean), chi-square may be the better choice.
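For comparison with the K-S sketches, here is a minimal chi-square goodness-of-fit calculation. It compares observed counts in each category with the counts expected under the hypothesized distribution. The die-roll counts are made up, and the decision uses the tabulated 5% critical value for 5 degrees of freedom (about 11.07) instead of computing a p-value, to keep the example within the standard library.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 120 rolls of a six-sided die.
observed = [25, 17, 15, 23, 24, 16]
expected = [sum(observed) / 6] * 6  # a fair die gives 20 per face

stat = chi_square_statistic(observed, expected)

# Tabulated 5% critical value for 6 - 1 = 5 degrees of freedom.
CRITICAL_5PCT_DF5 = 11.070
reject_fair_die = stat > CRITICAL_5PCT_DF5

print(f"chi-square = {stat:.2f}, reject fairness at 5%: {reject_fair_die}")
```

When parameters of the hypothesized distribution are estimated from the data, the degrees of freedom are reduced by one for each estimated parameter. A common rule of thumb is to keep every expected count at 5 or more, which is where small samples run into trouble.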


Comparison of the Goodness-of-Fit Tests: the Pearson Chi-square and Kolmogorov-Smirnov Tests

11 Important Model Evaluation Techniques Everyone Should Know
