{"id":849,"date":"2018-07-29T19:00:33","date_gmt":"2018-07-29T19:00:33","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/29\/how-to-code-the-students-t-test-from-scratch-in-python\/"},"modified":"2018-07-29T19:00:33","modified_gmt":"2018-07-29T19:00:33","slug":"how-to-code-the-students-t-test-from-scratch-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/29\/how-to-code-the-students-t-test-from-scratch-in-python\/","title":{"rendered":"How to Code the Student\u2019s t-Test from Scratch in Python"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Perhaps one of the most widely used statistical hypothesis tests is the Student\u2019s t test.<\/p>\n<p>Because you may use this test yourself someday, it is important to have a deep understanding of how the test works. As a developer, this understanding is best achieved by implementing the hypothesis test yourself from scratch.<\/p>\n<p>In this tutorial, you will discover how to implement the Student\u2019s t-test statistical hypothesis test from scratch in Python.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The Student\u2019s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.<\/li>\n<li>How to implement the Student\u2019s t-test from scratch for two independent samples.<\/li>\n<li>How to implement the paired Student\u2019s t-test from scratch for two dependent samples.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_5881\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5881\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/How-to-Code-the-Students-t-Test-from-Scratch-in-Python.jpg\" alt=\"How to Code the Student's t-Test from Scratch in Python\" width=\"640\" height=\"427\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/How-to-Code-the-Students-t-Test-from-Scratch-in-Python.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/How-to-Code-the-Students-t-Test-from-Scratch-in-Python-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Code the Student\u2019s t-Test from Scratch in Python<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/62400641@N07\/33385804523\/\">n1d<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Student\u2019s t-Test<\/li>\n<li>Student\u2019s t-Test for Independent Samples<\/li>\n<li>Student\u2019s t-Test for Dependent Samples<\/li>\n<\/ol>\n<h2>Student\u2019s t-Test<\/h2>\n<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test\">Student\u2019s t-Test<\/a> is a statistical hypothesis test for testing whether two samples are expected to have been drawn from the same population.<\/p>\n<p>It is named for the pseudonym \u201c<em>Student<\/em>\u201d used by William Gosset, who developed the test.<\/p>\n<p>The test works by checking the means from two samples to see if they are significantly different from each other. It does this by dividing the difference between the means by the standard error of that difference; the resulting statistic indicates how likely the observed difference is if the two samples have the same mean (the null hypothesis).<\/p>\n<p>The t statistic calculated by the test can be interpreted by comparing it to critical values from the t-distribution. The critical value can be calculated using the degrees of freedom and a significance level with the percent point function (PPF).<\/p>\n<p>We can interpret the statistic value in a two-tailed test, meaning that if we reject the null hypothesis, it could be because the first mean is smaller or greater than the second mean. To do this, we can calculate the absolute value of the test statistic and compare it to the positive (right-tailed) critical value, evaluated at 1 - alpha\/2 so that the significance level is split across both tails, as follows:<\/p>\n<ul>\n<li><strong>If abs(t-statistic) &lt;= critical value<\/strong>: Fail to reject the null hypothesis that the means are equal.<\/li>\n<li><strong>If abs(t-statistic) &gt; critical value<\/strong>: Reject the null hypothesis that the means are equal.<\/li>\n<\/ul>\n<p>We can also retrieve the cumulative probability of observing the absolute value of the t-statistic using the cumulative distribution function (CDF) of the t-distribution in order to calculate a p-value. 
The p-value can then be compared to a chosen significance level (alpha) such as 0.05 to determine if the null hypothesis can be rejected:<\/p>\n<ul>\n<li><strong>If p &gt; alpha<\/strong>: Fail to reject the null hypothesis that the means are equal.<\/li>\n<li><strong>If p &lt;= alpha<\/strong>: Reject the null hypothesis that the means are equal.<\/li>\n<\/ul>\n<p>In working with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. The version of the test presented here also assumes that the samples have the same variance and the same size, although there are corrections to the test if these assumptions do not hold. For example, see <a href=\"https:\/\/en.wikipedia.org\/wiki\/Welch%27s_t-test\">Welch\u2019s t-test<\/a>.<\/p>\n<p>There are two main versions of Student\u2019s t-test:<\/p>\n<ul>\n<li><strong>Independent Samples<\/strong>. The case where the two samples are unrelated.<\/li>\n<li><strong>Dependent Samples<\/strong>. The case where the samples are related, such as repeated measures on the same subjects. Also called a paired test.<\/li>\n<\/ul>\n<p>Both the independent and the dependent Student\u2019s t-tests are available in Python via the <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.ttest_ind.html\">ttest_ind()<\/a> and <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.ttest_rel.html\">ttest_rel()<\/a> SciPy functions respectively.<\/p>\n<p><strong>Note<\/strong>: I recommend using these SciPy functions to calculate the Student\u2019s t-test for your applications, if they are suitable. The library implementations will be faster and less prone to bugs. 
I would only recommend implementing the test yourself for learning purposes or in the case where you require a modified version of the test.<\/p>\n<p>We will use the SciPy functions to confirm the results from our own version of the tests.<\/p>\n<p>Note, for reference, all calculations presented in this tutorial are taken directly from Chapter 9 \u201c<em>t Tests<\/em>\u201d in \u201c<a href=\"https:\/\/amzn.to\/2J2Qibd\">Statistics in Plain English<\/a>\u201c, Third Edition, 2010. I mention this because you may see the equations with different forms, depending on the reference text that you use.<\/p>\n<h2>Student\u2019s t-Test for Independent Samples<\/h2>\n<p>We\u2019ll start with the most common form of the Student\u2019s t-test: the case where we are comparing the means of two independent samples.<\/p>\n<h3>Calculation<\/h3>\n<p>The calculation of the t-statistic for two independent samples is as follows:<\/p>\n<pre class=\"crayon-plain-tag\">t = observed difference between sample means \/ standard error of the difference between the means<\/pre>\n<p>or<\/p>\n<pre class=\"crayon-plain-tag\">t = (mean(X1) - mean(X2)) \/ sed<\/pre>\n<p>Where <em>X1<\/em> and <em>X2<\/em> are the first and second data samples and <em>sed<\/em> is the standard error of the difference between the means.<\/p>\n<p>The standard error of the difference between the means can be calculated as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sed = sqrt(se1^2 + se2^2)<\/pre>\n<p>Where <em>se1<\/em> and <em>se2<\/em> are the standard errors for the first and second datasets.<\/p>\n<p>The standard error of a sample can be calculated as:<\/p>\n<pre class=\"crayon-plain-tag\">se = std \/ sqrt(n)<\/pre>\n<p>Where <em>se<\/em> is the standard error of the sample, <em>std<\/em> is the sample standard deviation, and <em>n<\/em> is the number of observations in the sample.<\/p>\n<p>These calculations make the following assumptions:<\/p>\n<ul>\n<li>The samples are drawn from a Gaussian 
distribution.<\/li>\n<li>The size of each sample is approximately equal.<\/li>\n<li>The samples have the same variance.<\/li>\n<\/ul>\n<h3>Implementation<\/h3>\n<p>We can implement these equations easily using functions from the Python standard library, NumPy and SciPy.<\/p>\n<p>Let\u2019s assume that our two data samples are stored in the variables <em>data1<\/em> and <em>data2<\/em>.<\/p>\n<p>We can start off by calculating the mean for these samples as follows:<\/p>\n<pre class=\"crayon-plain-tag\"># calculate means\r\nmean1, mean2 = mean(data1), mean(data2)<\/pre>\n<p>We\u2019re halfway there.<\/p>\n<p>Now we need to calculate the standard error.<\/p>\n<p>We can do this manually, first by calculating the sample standard deviations (<em>ddof=1<\/em> selects the n - 1 divisor):<\/p>\n<pre class=\"crayon-plain-tag\"># calculate sample standard deviations\r\nstd1, std2 = std(data1, ddof=1), std(data2, ddof=1)<\/pre>\n<p>And then the standard errors:<\/p>\n<pre class=\"crayon-plain-tag\"># calculate standard errors\r\nn1, n2 = len(data1), len(data2)\r\nse1, se2 = std1\/sqrt(n1), std2\/sqrt(n2)<\/pre>\n<p>Alternatively, we can use the <em>sem()<\/em> SciPy function to calculate the standard error directly.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate standard errors\r\nse1, se2 = sem(data1), sem(data2)<\/pre>\n<p>We can use the standard errors of the samples to calculate the \u201c<em>standard error on the difference between the samples<\/em>\u201d:<\/p>\n<pre class=\"crayon-plain-tag\"># standard error on the difference between the samples\r\nsed = sqrt(se1**2.0 + se2**2.0)<\/pre>\n<p>We can now calculate the t statistic:<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the t statistic\r\nt_stat = (mean1 - mean2) \/ sed<\/pre>\n<p>We can also calculate some other values to help interpret and present the statistic.<\/p>\n<p>The number of degrees of freedom for the test is calculated as the total number of observations in both samples, minus two.<\/p>\n<pre class=\"crayon-plain-tag\"># degrees of freedom\r\ndf = 
n1 + n2 - 2<\/pre>\n<p>The critical value can be calculated using the percent point function (PPF) for a given significance level, such as 0.05 (95% confidence). Because the test is two-tailed, the significance level is split between the two tails, so the PPF is evaluated at 1 - alpha\/2.<\/p>\n<p>This function is available for the t distribution in SciPy, as follows:<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the two-tailed critical value\r\nalpha = 0.05\r\ncv = t.ppf(1.0 - alpha \/ 2.0, df)<\/pre>\n<p>The p-value can be calculated using the cumulative distribution function on the t-distribution, again in SciPy.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the p-value\r\np = (1 - t.cdf(abs(t_stat), df)) * 2<\/pre>\n<p>Here, we assume a two-tailed test, where the rejection of the null hypothesis could be interpreted as the first mean is either smaller or larger than the second mean.<\/p>\n<p>We can tie all of these pieces together into a simple function for calculating the t-test for two independent samples:<\/p>\n<pre class=\"crayon-plain-tag\"># function for calculating the t-test for two independent samples\r\ndef independent_ttest(data1, data2, alpha):\r\n\t# calculate means\r\n\tmean1, mean2 = mean(data1), mean(data2)\r\n\t# calculate standard errors\r\n\tse1, se2 = sem(data1), sem(data2)\r\n\t# standard error on the difference between the samples\r\n\tsed = sqrt(se1**2.0 + se2**2.0)\r\n\t# calculate the t statistic\r\n\tt_stat = (mean1 - mean2) \/ sed\r\n\t# degrees of freedom\r\n\tdf = len(data1) + len(data2) - 2\r\n\t# calculate the two-tailed critical value\r\n\tcv = t.ppf(1.0 - alpha \/ 2.0, df)\r\n\t# calculate the p-value\r\n\tp = (1.0 - t.cdf(abs(t_stat), df)) * 2.0\r\n\t# return everything\r\n\treturn t_stat, df, cv, p<\/pre>\n<\/p>\n<h3>Worked Example<\/h3>\n<p>In this section, we will calculate the t-test on some synthetic data samples.<\/p>\n<p>First, let\u2019s generate two samples of 100 Gaussian random numbers with the same standard deviation of 5 and differing means of 50 and 51 respectively. 
We expect the test to reject the null hypothesis and find a significant difference between the samples:<\/p>\n<pre class=\"crayon-plain-tag\"># seed the random number generator\r\nseed(1)\r\n# generate two independent samples\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51<\/pre>\n<p>We can calculate the t-test on these samples using the built-in SciPy function <em>ttest_ind()<\/em>. This will give us a t-statistic value and a p-value to compare to, to ensure that we have implemented the test correctly.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Student's t-test for independent samples\r\nfrom numpy.random import seed\r\nfrom numpy.random import randn\r\nfrom scipy.stats import ttest_ind\r\n# seed the random number generator\r\nseed(1)\r\n# generate two independent samples\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51\r\n# compare samples\r\nstat, p = ttest_ind(data1, data2)\r\nprint('t=%.3f, p=%.3f' % (stat, p))<\/pre>\n<p>Running the example, we can see a t-statistic value and p-value.<\/p>\n<p>We will use these as our expected values for the test on these data.<\/p>\n<pre class=\"crayon-plain-tag\">t=-2.262, p=0.025<\/pre>\n<p>We can now apply our own implementation on the same data, using the function defined in the previous section.<\/p>\n<p>The function will return a t-statistic value and a critical value. We can use the critical value to interpret the t statistic to see if the finding of the test is significant and that indeed the means are different as we expected.<\/p>\n<pre class=\"crayon-plain-tag\"># interpret via critical value\r\nif abs(t_stat) <= cv:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')<\/pre>\n<p>The function also returns a p-value. 
We can interpret the p-value using a significance level (alpha), such as 0.05, to determine if the finding of the test is significant and that indeed the means are different as we expected.<\/p>\n<pre class=\"crayon-plain-tag\"># interpret via p-value\r\nif p > alpha:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')<\/pre>\n<p>Because the critical value and the p-value are both two-tailed, both interpretations will always match.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># t-test for independent samples\r\nfrom math import sqrt\r\nfrom numpy.random import seed\r\nfrom numpy.random import randn\r\nfrom numpy import mean\r\nfrom scipy.stats import sem\r\nfrom scipy.stats import t\r\n\r\n# function for calculating the t-test for two independent samples\r\ndef independent_ttest(data1, data2, alpha):\r\n\t# calculate means\r\n\tmean1, mean2 = mean(data1), mean(data2)\r\n\t# calculate standard errors\r\n\tse1, se2 = sem(data1), sem(data2)\r\n\t# standard error on the difference between the samples\r\n\tsed = sqrt(se1**2.0 + se2**2.0)\r\n\t# calculate the t statistic\r\n\tt_stat = (mean1 - mean2) \/ sed\r\n\t# degrees of freedom\r\n\tdf = len(data1) + len(data2) - 2\r\n\t# calculate the two-tailed critical value\r\n\tcv = t.ppf(1.0 - alpha \/ 2.0, df)\r\n\t# calculate the p-value\r\n\tp = (1.0 - t.cdf(abs(t_stat), df)) * 2.0\r\n\t# return everything\r\n\treturn t_stat, df, cv, p\r\n\r\n# seed the random number generator\r\nseed(1)\r\n# generate two independent samples\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51\r\n# calculate the t test\r\nalpha = 0.05\r\nt_stat, df, cv, p = independent_ttest(data1, data2, alpha)\r\nprint('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))\r\n# interpret via critical value\r\nif abs(t_stat) <= cv:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')\r\n# interpret via p-value\r\nif p > 
alpha:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')<\/pre>\n<p>Running the example first calculates the test.<\/p>\n<p>The results of the test are printed, including the t-statistic, the degrees of freedom, the critical value, and the p-value.<\/p>\n<p>We can see that both the t-statistic and p-value match the outputs of the SciPy function. The test appears to be implemented correctly.<\/p>\n<p>The t-statistic and the p-value are then used to interpret the results of the test. We find that, as expected, there is sufficient evidence to reject the null hypothesis, finding that the sample means are likely different.<\/p>\n<pre class=\"crayon-plain-tag\">t=-2.262, df=198, cv=1.972, p=0.025\r\nReject the null hypothesis that the means are equal.\r\nReject the null hypothesis that the means are equal.<\/pre>\n<\/p>\n<h2>Student\u2019s t-Test for Dependent Samples<\/h2>\n<p>We can now look at the case of calculating the Student\u2019s t-test for dependent samples.<\/p>\n<p>This is the case where we collect some observations on a sample from the population, then apply some treatment, and then collect observations from the same sample.<\/p>\n<p>The result is two samples of the same size where the observations in each sample are related or paired.<\/p>\n<p>The t-test for dependent samples is referred to as the paired Student\u2019s t-test.<\/p>\n<h3>Calculation<\/h3>\n<p>The calculation of the paired Student\u2019s t-test is similar to the case with independent samples.<\/p>\n<p>The main difference is in the calculation of the denominator.<\/p>\n<pre class=\"crayon-plain-tag\">t = (mean(X1) - mean(X2)) \/ sed<\/pre>\n<p>Where <em>X1<\/em> and <em>X2<\/em> are the first and second data samples and <em>sed<\/em> is the standard error of the difference between the means.<\/p>\n<p>Here, <em>sed<\/em> is calculated as:<\/p>\n<pre class=\"crayon-plain-tag\">sed = sd \/ 
sqrt(n)<\/pre>\n<p>Where <em>sd<\/em> is the standard deviation of the difference between the dependent sample means and <em>n<\/em> is the total number of paired observations (e.g. the size of each sample).<\/p>\n<p>The calculation of <em>sd<\/em> first requires the calculation of the sum of the squared differences between the samples:<\/p>\n<pre class=\"crayon-plain-tag\">d1 = sum (X1[i] - X2[i])^2 for i in n<\/pre>\n<p>It also requires the sum of the (non squared) differences between the samples:<\/p>\n<pre class=\"crayon-plain-tag\">d2 = sum (X1[i] - X2[i]) for i in n<\/pre>\n<p>We can then calculate sd as:<\/p>\n<pre class=\"crayon-plain-tag\">sd = sqrt((d1 - (d2**2 \/ n)) \/ (n - 1))<\/pre>\n<p>That\u2019s it.<\/p>\n<h3>Implementation<\/h3>\n<p>We can implement the calculation of the paired Student\u2019s t-test directly in Python.<\/p>\n<p>The first step is to calculate the means of each sample.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate means\r\nmean1, mean2 = mean(data1), mean(data2)<\/pre>\n<p>Next, we will require the number of pairs (<em>n<\/em>). 
We will use this in a few different calculations.<\/p>\n<pre class=\"crayon-plain-tag\"># number of paired samples\r\nn = len(data1)<\/pre>\n<p>Next, we must calculate the sum of the squared differences between the samples, as well as the sum differences.<\/p>\n<pre class=\"crayon-plain-tag\"># sum squared difference between observations\r\nd1 = sum([(data1[i]-data2[i])**2 for i in range(n)])\r\n# sum difference between observations\r\nd2 = sum([data1[i]-data2[i] for i in range(n)])<\/pre>\n<p>We can now calculate the standard deviation of the difference between means.<\/p>\n<pre class=\"crayon-plain-tag\"># standard deviation of the difference between means\r\nsd = sqrt((d1 - (d2**2 \/ n)) \/ (n - 1))<\/pre>\n<p>This is then used to calculate the standard error of the difference between the means.<\/p>\n<pre class=\"crayon-plain-tag\"># standard error of the difference between the means\r\nsed = sd \/ sqrt(n)<\/pre>\n<p>Finally, we have everything we need to calculate the t statistic.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the t statistic\r\nt_stat = (mean1 - mean2) \/ sed<\/pre>\n<p>The only other key difference between this implementation and the implementation for independent samples is the calculation of the number of degrees of freedom.<\/p>\n<pre class=\"crayon-plain-tag\"># degrees of freedom\r\ndf = n - 1<\/pre>\n<p>As before, we can tie all of this together into a reusable function. 
The function will take two paired samples and a significance level (alpha) and calculate the t-statistic, number of degrees of freedom, critical value, and p-value.<\/p>\n<p>The complete function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># function for calculating the t-test for two dependent samples\r\ndef dependent_ttest(data1, data2, alpha):\r\n\t# calculate means\r\n\tmean1, mean2 = mean(data1), mean(data2)\r\n\t# number of paired samples\r\n\tn = len(data1)\r\n\t# sum squared difference between observations\r\n\td1 = sum([(data1[i]-data2[i])**2 for i in range(n)])\r\n\t# sum difference between observations\r\n\td2 = sum([data1[i]-data2[i] for i in range(n)])\r\n\t# standard deviation of the difference between means\r\n\tsd = sqrt((d1 - (d2**2 \/ n)) \/ (n - 1))\r\n\t# standard error of the difference between the means\r\n\tsed = sd \/ sqrt(n)\r\n\t# calculate the t statistic\r\n\tt_stat = (mean1 - mean2) \/ sed\r\n\t# degrees of freedom\r\n\tdf = n - 1\r\n\t# calculate the two-tailed critical value\r\n\tcv = t.ppf(1.0 - alpha \/ 2.0, df)\r\n\t# calculate the p-value\r\n\tp = (1.0 - t.cdf(abs(t_stat), df)) * 2.0\r\n\t# return everything\r\n\treturn t_stat, df, cv, p<\/pre>\n<\/p>\n<h3>Worked Example<\/h3>\n<p>In this section, we will use the same dataset as in the worked example for the independent Student\u2019s t-test.<\/p>\n<p>The data samples are not paired, but we will pretend they are. We expect the test to reject the null hypothesis and find a significant difference between the samples.<\/p>\n<pre class=\"crayon-plain-tag\"># seed the random number generator\r\nseed(1)\r\n# generate two independent samples\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51<\/pre>\n<p>As before, we can evaluate the test problem with the SciPy function for calculating a paired t-test. 
In this case, that is the <em>ttest_rel()<\/em> function.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Paired Student's t-test\r\nfrom numpy.random import seed\r\nfrom numpy.random import randn\r\nfrom scipy.stats import ttest_rel\r\n# seed the random number generator\r\nseed(1)\r\n# generate two independent samples\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51\r\n# compare samples\r\nstat, p = ttest_rel(data1, data2)\r\nprint('Statistics=%.3f, p=%.3f' % (stat, p))<\/pre>\n<p>Running the example calculates and prints the t-statistic and the p-value.<\/p>\n<p>We will use these values to validate the calculation of our own paired t-test function.<\/p>\n<pre class=\"crayon-plain-tag\">Statistics=-2.372, p=0.020<\/pre>\n<p>We can now test our own implementation of the paired Student\u2019s t-test.<\/p>\n<p>The complete example, including the developed function and interpretation of the results of the function, is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># t-test for dependent samples\r\nfrom math import sqrt\r\nfrom numpy.random import seed\r\nfrom numpy.random import randn\r\nfrom numpy import mean\r\nfrom scipy.stats import t\r\n\r\n# function for calculating the t-test for two dependent samples\r\ndef dependent_ttest(data1, data2, alpha):\r\n\t# calculate means\r\n\tmean1, mean2 = mean(data1), mean(data2)\r\n\t# number of paired samples\r\n\tn = len(data1)\r\n\t# sum squared difference between observations\r\n\td1 = sum([(data1[i]-data2[i])**2 for i in range(n)])\r\n\t# sum difference between observations\r\n\td2 = sum([data1[i]-data2[i] for i in range(n)])\r\n\t# standard deviation of the difference between means\r\n\tsd = sqrt((d1 - (d2**2 \/ n)) \/ (n - 1))\r\n\t# standard error of the difference between the means\r\n\tsed = sd \/ sqrt(n)\r\n\t# calculate the t statistic\r\n\tt_stat = (mean1 - mean2) \/ sed\r\n\t# degrees of freedom\r\n\tdf = n - 1\r\n\t# calculate the two-tailed critical value\r\n\tcv = t.ppf(1.0 
- alpha \/ 2.0, df)\r\n\t# calculate the p-value\r\n\tp = (1.0 - t.cdf(abs(t_stat), df)) * 2.0\r\n\t# return everything\r\n\treturn t_stat, df, cv, p\r\n\r\n# seed the random number generator\r\nseed(1)\r\n# generate two independent samples (pretend they are dependent)\r\ndata1 = 5 * randn(100) + 50\r\ndata2 = 5 * randn(100) + 51\r\n# calculate the t test\r\nalpha = 0.05\r\nt_stat, df, cv, p = dependent_ttest(data1, data2, alpha)\r\nprint('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))\r\n# interpret via critical value\r\nif abs(t_stat) <= cv:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')\r\n# interpret via p-value\r\nif p > alpha:\r\n\tprint('Fail to reject the null hypothesis that the means are equal.')\r\nelse:\r\n\tprint('Reject the null hypothesis that the means are equal.')<\/pre>\n<p>Running the example calculates the paired t-test on the sample problem.<\/p>\n<p>The calculated t-statistic and p-value match what we expect from the SciPy library implementation. 
This suggests that the implementation is correct.<\/p>\n<p>The interpretation of the t-test statistic with the critical value, and of the p-value with the significance level, both find a significant result, rejecting the null hypothesis that the means are equal.<\/p>\n<pre class=\"crayon-plain-tag\">t=-2.372, df=99, cv=1.984, p=0.020\r\nReject the null hypothesis that the means are equal.\r\nReject the null hypothesis that the means are equal.<\/pre>\n<\/p>\n<h3>Extensions<\/h3>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li>Apply each test to your own contrived sample problem.<\/li>\n<li>Update the independent test and add the correction for samples with different variances and sample sizes.<\/li>\n<li>Perform a code review of one of the tests implemented in the SciPy library and summarize the differences in the implementation details.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/2J2Qibd\">Statistics in Plain English<\/a>, Third Edition, 2010.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.ttest_ind.html\">scipy.stats.ttest_ind API<\/a><\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.ttest_rel.html\">scipy.stats.ttest_rel API<\/a><\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.sem.html\">scipy.stats.sem API<\/a><\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.t.html\">scipy.stats.t API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test\">Student\u2019s t-test on Wikipedia<\/a><\/li>\n<li><a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Welch%27s_t-test\">Welch\u2019s t-test on Wikipedia<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to implement the Student\u2019s t-test statistical hypothesis test from scratch in Python.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The Student\u2019s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.<\/li>\n<li>How to implement the Student\u2019s t-test from scratch for two independent samples.<\/li>\n<li>How to implement the paired Student\u2019s t-test from scratch for two dependent samples.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-code-the-students-t-test-from-scratch-in-python\/\">How to Code the Student\u2019s t-Test from Scratch in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-code-the-students-t-test-from-scratch-in-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Perhaps one of the most widely used statistical hypothesis tests is the Student\u2019s t test. 
Because you may use this test yourself [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/29\/how-to-code-the-students-t-test-from-scratch-in-python\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":850,"comment_status":"registered_only","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/849"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=849"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/849\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/850"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=849"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=849"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=849"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}