{"id":772,"date":"2018-07-08T19:00:36","date_gmt":"2018-07-08T19:00:36","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/08\/how-to-calculate-nonparametric-rank-correlation-in-python\/"},"modified":"2018-07-08T19:00:36","modified_gmt":"2018-07-08T19:00:36","slug":"how-to-calculate-nonparametric-rank-correlation-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/08\/how-to-calculate-nonparametric-rank-correlation-in-python\/","title":{"rendered":"How to Calculate Nonparametric Rank Correlation in Python"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Correlation is a measure of the association between two variables.<\/p>\n<p>It is easy to calculate and interpret when both variables have a well-understood Gaussian distribution. When we do not know the distribution of the variables, we can use nonparametric rank correlation methods instead.<\/p>\n<p>In this tutorial, you will discover rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How rank correlation methods work and which methods are available.<\/li>\n<li>How to calculate and interpret Spearman\u2019s rank correlation coefficient in Python.<\/li>\n<li>How to calculate and interpret Kendall\u2019s rank correlation coefficient in Python.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into four parts; they are:<\/p>\n<ol>\n<li>Rank Correlation<\/li>\n<li>Test Dataset<\/li>\n<li>Spearman\u2019s Rank Correlation<\/li>\n<li>Kendall\u2019s Rank Correlation<\/li>\n<\/ol>\n
<h2>Rank Correlation<\/h2>\n<p>Correlation refers to the association between the observed values of two variables.<\/p>\n<p>The variables may have a positive association, meaning that as the values of one variable increase, so do the values of the other variable. The association may also be negative, meaning that as the values of one variable increase, the values of the other decrease. Finally, the association may be neutral, meaning that the variables are not associated.<\/p>\n<p>Correlation quantifies this association, often as a value between -1 and 1, where -1 indicates a perfect negative correlation and 1 a perfect positive correlation. 
The calculated correlation is referred to as the \u201c<em>correlation coefficient<\/em>.\u201d The coefficient can then be interpreted to describe the strength and direction of the association.<\/p>\n<p>See the table below for help with interpreting the correlation coefficient.<\/p>\n<div id=\"attachment_5754\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-5754\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/04\/Table-of-Correlation-Coefficient-Values-and-Their-Interpretation-1024x344.png\" alt=\"Table of Correlation Coefficient Values and Their Interpretation\" width=\"1024\" height=\"344\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Table-of-Correlation-Coefficient-Values-and-Their-Interpretation-1024x344.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Table-of-Correlation-Coefficient-Values-and-Their-Interpretation-300x101.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Table-of-Correlation-Coefficient-Values-and-Their-Interpretation-768x258.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Table-of-Correlation-Coefficient-Values-and-Their-Interpretation.png 1054w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/p>\n<p class=\"wp-caption-text\">Table of Correlation Coefficient Values and Their Interpretation<br \/>Taken from \u201cNonparametric Statistics for Non-Statisticians: A Step-by-Step Approach\u201d.<\/p>\n<\/div>\n<p>The correlation between two variables that each have a Gaussian distribution can be calculated using standard methods such as Pearson\u2019s correlation. Pearson\u2019s correlation measures linear association and its significance test assumes Gaussian data, so it is a poor fit for data that does not have a Gaussian distribution. 
Instead, rank correlation methods can be used.<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Rank_correlation\">Rank correlation<\/a> refers to methods that quantify the association between variables using the ordinal relationship between the values rather than the specific values. Ordinal data is data that has label values and an order or rank relationship; for example: \u2018<em>low<\/em>\u2019, \u2018<em>medium<\/em>\u2019, and \u2018<em>high<\/em>\u2019.<\/p>\n<p>Rank correlation can also be calculated for real-valued variables. This is done by first converting the values of each variable into rank data: the values are ordered and each is assigned an integer rank. Rank correlation coefficients can then be calculated to quantify the association between the two ranked variables.<\/p>\n<p>Because no distribution for the values is assumed, rank correlation methods are referred to as distribution-free correlation or nonparametric correlation. Interestingly, rank correlation measures are often used as the basis for other statistical hypothesis tests, such as determining whether two samples were likely drawn from the same (or different) population distributions.<\/p>\n<p>Rank correlation methods are often named after the researcher or researchers who developed the method. Four examples of rank correlation methods are as follows:<\/p>\n<ul>\n<li>Spearman\u2019s Rank Correlation.<\/li>\n<li>Kendall\u2019s Rank Correlation.<\/li>\n<li>Goodman and Kruskal\u2019s Rank Correlation.<\/li>\n<li>Somers\u2019 Rank Correlation.<\/li>\n<\/ul>\n<p>In the following sections, we will take a closer look at two of the more common rank correlation methods: Spearman\u2019s and Kendall\u2019s.<\/p>\n<h2>Test Dataset<\/h2>\n<p>Before we demonstrate rank correlation methods, we must first define a test problem.<\/p>\n<p>In this section, we will define a simple two-variable dataset where each variable is built from uniformly distributed samples (i.e. 
non-Gaussian) and the values of the second variable depend on the values of the first variable.<\/p>\n<p>Specifically, a sample of 1,000 random floating point values is drawn from a uniform distribution and scaled to the range 0 to 20. A second sample of 1,000 random floating point values is drawn from a uniform distribution between 0 and 10 and added to the values in the first sample to create an association.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare data\r\ndata1 = rand(1000) * 20\r\ndata2 = data1 + (rand(1000) * 10)<\/pre>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># generate related variables\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\nfrom matplotlib import pyplot\r\n# seed random number generator\r\nseed(1)\r\n# prepare data\r\ndata1 = rand(1000) * 20\r\ndata2 = data1 + (rand(1000) * 10)\r\n# plot\r\npyplot.scatter(data1, data2)\r\npyplot.show()<\/pre>\n<p>Running the example generates the data sample and graphs the points on a scatter plot.<\/p>\n<p>We can see that the first variable is spread evenly across its range and that neither variable has a Gaussian distribution (the second variable, being the sum of two uniform samples, is not strictly uniform). The positive association between the variables is visible in the diagonal grouping of the points from the bottom left to the top right of the plot.<\/p>\n<div id=\"attachment_5755\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-5755\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/04\/Scatter-Plot-of-Associated-Variables-Drawn-From-a-Uniform-Distribution-1024x768.png\" alt=\"Scatter Plot of Associated Variables Drawn From a Uniform Distribution\" width=\"1024\" height=\"768\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Scatter-Plot-of-Associated-Variables-Drawn-From-a-Uniform-Distribution-1024x768.png 1024w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Scatter-Plot-of-Associated-Variables-Drawn-From-a-Uniform-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Scatter-Plot-of-Associated-Variables-Drawn-From-a-Uniform-Distribution-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/04\/Scatter-Plot-of-Associated-Variables-Drawn-From-a-Uniform-Distribution.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Associated Variables Drawn From a Uniform Distribution<\/p>\n<\/div>\n<h2>Spearman\u2019s Rank Correlation<\/h2>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Spearman%27s_rank_correlation_coefficient\">Spearman\u2019s rank correlation<\/a> is named for Charles Spearman.<\/p>\n<p>It may also be called Spearman\u2019s correlation coefficient and is denoted by the lowercase Greek letter rho (\u03c1). As such, it may be referred to as Spearman\u2019s rho.<\/p>\n<p>This statistical method quantifies the degree to which ranked variables are associated by a monotonic function, meaning an increasing or decreasing relationship. As a statistical hypothesis test, the null hypothesis (H0) is that the samples are uncorrelated.<\/p>\n<blockquote>\n<p>The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement.<\/p>\n<\/blockquote>\n<p>\u2014 Page 124, <a href=\"https:\/\/amzn.to\/2HevldG\">Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach<\/a>, 2009.<\/p>\n<p>The intuition for Spearman\u2019s rank correlation is that it calculates a Pearson\u2019s correlation (a parametric measure of correlation) using the rank values instead of the real values. 
Pearson\u2019s correlation is the covariance between the two variables (the expected product of their deviations from their means) normalized by the product of their standard deviations.<\/p>\n<p>Spearman\u2019s rank correlation can be calculated in Python using the <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.spearmanr.html\">spearmanr() SciPy function<\/a>.<\/p>\n<p>The function takes two real-valued samples as arguments and returns both the correlation coefficient, in the range -1 to 1, and the p-value for interpreting the significance of the coefficient.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate spearman's correlation\r\ncoef, p = spearmanr(data1, data2)<\/pre>\n<p>We can demonstrate Spearman\u2019s rank correlation on the test dataset. We know that there is a strong association between the variables in the dataset and we would expect the Spearman\u2019s test to find this association.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the spearman's correlation between two variables\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\nfrom scipy.stats import spearmanr\r\n# seed random number generator\r\nseed(1)\r\n# prepare data\r\ndata1 = rand(1000) * 20\r\ndata2 = data1 + (rand(1000) * 10)\r\n# calculate spearman's correlation\r\ncoef, p = spearmanr(data1, data2)\r\nprint('Spearmans correlation coefficient: %.3f' % coef)\r\n# interpret the significance\r\nalpha = 0.05\r\nif p > alpha:\r\n\tprint('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)\r\nelse:\r\n\tprint('Samples are correlated (reject H0) p=%.3f' % p)<\/pre>\n<p>Running the example calculates the Spearman\u2019s correlation coefficient between the two variables in the test dataset.<\/p>\n<p>The statistical test reports a strong positive correlation with a value of 0.9. 
The p-value is close to zero, which means that data like this would be very unlikely if the samples were uncorrelated. Because it is below the chosen significance level of 0.05, we can reject the null hypothesis that the samples are uncorrelated.<\/p>\n<pre class=\"crayon-plain-tag\">Spearmans correlation coefficient: 0.900\r\nSamples are correlated (reject H0) p=0.000<\/pre>\n<\/p>\n<h2>Kendall\u2019s Rank Correlation<\/h2>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Kendall_rank_correlation_coefficient\">Kendall\u2019s rank correlation<\/a> is named for Maurice Kendall.<\/p>\n<p>It is also called Kendall\u2019s correlation coefficient, and the coefficient is often denoted by the lowercase Greek letter tau (\u03c4). In turn, the test may be called Kendall\u2019s tau.<\/p>\n<p>The intuition for the test is that it calculates a normalized score for the number of matching or concordant rankings between the two samples. As such, the test is also referred to as Kendall\u2019s concordance test.<\/p>\n<p>Kendall\u2019s rank correlation coefficient can be calculated in Python using the <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.kendalltau.html\">kendalltau() SciPy function<\/a>. The function takes the two data samples as arguments and returns the correlation coefficient and the p-value. 
As a statistical hypothesis test, the null hypothesis (H0) is that there is no association between the two samples.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate kendall's correlation\r\ncoef, p = kendalltau(data1, data2)<\/pre>\n<p>We can demonstrate the calculation on the test dataset, where we do expect a significant positive association to be reported.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the kendall's correlation between two variables\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\nfrom scipy.stats import kendalltau\r\n# seed random number generator\r\nseed(1)\r\n# prepare data\r\ndata1 = rand(1000) * 20\r\ndata2 = data1 + (rand(1000) * 10)\r\n# calculate kendall's correlation\r\ncoef, p = kendalltau(data1, data2)\r\nprint('Kendall correlation coefficient: %.3f' % coef)\r\n# interpret the significance\r\nalpha = 0.05\r\nif p > alpha:\r\n\tprint('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)\r\nelse:\r\n\tprint('Samples are correlated (reject H0) p=%.3f' % p)<\/pre>\n<p>Running the example calculates the Kendall\u2019s correlation coefficient as 0.709, which indicates a strong positive correlation.<\/p>\n<p>The p-value is close to zero (and printed as zero), as with the Spearman\u2019s test, meaning that we can confidently reject the null hypothesis that the samples are uncorrelated.<\/p>\n<pre class=\"crayon-plain-tag\">Kendall correlation coefficient: 0.709\r\nSamples are correlated (reject H0) p=0.000<\/pre>\n<\/p>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li>List three examples where calculating a nonparametric correlation coefficient might be useful during a machine learning project.<\/li>\n<li>Update each example to calculate the correlation between uncorrelated data samples drawn from a non-Gaussian distribution.<\/li>\n<li>Load a standard machine learning dataset and calculate the pairwise 
nonparametric correlation between all variables.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/2HevldG\">Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach<\/a>, 2009.<\/li>\n<li><a href=\"https:\/\/amzn.to\/2GCKnfW\">Applied Nonparametric Statistical Methods<\/a>, Fourth Edition, 2007.<\/li>\n<li><a href=\"https:\/\/amzn.to\/2JofYzY\">Rank Correlation Methods<\/a>, 1990.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.spearmanr.html\">scipy.stats.spearmanr() API<\/a><\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.kendalltau.html\">scipy.stats.kendalltau() API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Nonparametric_statistics\">Nonparametric statistics on Wikipedia<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Rank_correlation\">Rank correlation on Wikipedia<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Spearman%27s_rank_correlation_coefficient\">Spearman\u2019s rank correlation coefficient on Wikipedia<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Kendall_rank_correlation_coefficient\">Kendall rank correlation coefficient on Wikipedia<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Goodman_and_Kruskal%27s_gamma\">Goodman and Kruskal\u2019s gamma on Wikipedia<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Somers%27_D\">Somers\u2019 D on Wikipedia<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How rank correlation 
methods work and which methods are available.<\/li>\n<li>How to calculate and interpret Spearman\u2019s rank correlation coefficient in Python.<\/li>\n<li>How to calculate and interpret Kendall\u2019s rank correlation coefficient in Python.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-calculate-nonparametric-rank-correlation-in-python\/\">How to Calculate Nonparametric Rank Correlation in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-calculate-nonparametric-rank-correlation-in-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Correlation is a measure of the association between two variables. It is easy to calculate and interpret when both variables have a [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/08\/how-to-calculate-nonparametric-rank-correlation-in-python\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":773,"comment_status":"registered_only","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/772"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=772"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/772\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/773"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=772"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=772"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=772"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}