{"id":4873,"date":"2021-07-29T06:30:06","date_gmt":"2021-07-29T06:30:06","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/29\/a-simple-regression-problem\/"},"modified":"2021-07-29T06:30:06","modified_gmt":"2021-07-29T06:30:06","slug":"a-simple-regression-problem","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/29\/a-simple-regression-problem\/","title":{"rendered":"A Simple Regression Problem"},"content":{"rendered":"<p>Author: Vincent Granville<\/p>\n<div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326895084?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326895084?profile=RESIZE_710x\" width=\"600\" class=\"align-center\"><\/a><\/p>\n<p>This article is part of a new series featuring problems with solutions, to help you hone your machine learning and pattern recognition skills. Try to solve this problem yourself before looking at the solution. Today&#8217;s problem also has an intriguing mathematical solution, which lets you check whether the solution you found using machine learning techniques is correct. The problem is aimed at beginners.\u00a0<\/p>\n<p>The problem is as follows. Let <em>X<\/em><span style=\"font-size: 8pt;\">1<\/span>, <em>X<\/em><span style=\"font-size: 8pt;\">2<\/span>, <em>X<\/em><span style=\"font-size: 8pt;\">3<\/span> and so on be a sequence recursively defined by <em>X<\/em><span style=\"font-size: 8pt;\"><em>n<\/em>+1<\/span> = Stdev(<em>X<\/em><span style=\"font-size: 8pt;\">1<\/span>, &#8230;, <em>X<span style=\"font-size: 8pt;\">n<\/span><\/em>). Here <em>X<\/em><span style=\"font-size: 8pt;\">1<\/span>, the initial condition, is a positive real number or random variable. 
Thus,<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326797280?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326797280?profile=RESIZE_710x\" width=\"300\" class=\"align-center\"><\/a><\/p>\n<p>Since the standard deviation scales linearly when all values are multiplied by a positive constant, it is clear that <em>X<span style=\"font-size: 8pt;\">n<\/span><\/em> = <em>A<span style=\"font-size: 8pt;\">n<\/span> X<span style=\"font-size: 8pt;\">1<\/span><\/em>, where <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em>\u00a0is a number that does not depend on <em>X<\/em><span style=\"font-size: 8pt;\">1<\/span>. So we can assume, without loss of generality, that <em>X<\/em><span style=\"font-size: 8pt;\">1<\/span> = 1. For instance, <em>A<\/em><span style=\"font-size: 8pt;\">1<\/span> = 1 and <em>A<\/em><span style=\"font-size: 8pt;\">2<\/span> = 0. The purpose here is to study the behavior of <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> (for large <em>n<\/em>) using simple model fitting techniques. The figure below shows the first few values of <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em>: the X-axis represents <em>n<\/em>, and the Y-axis represents <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em>. The question is: how can <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> be approximated by a simple function of <em>n<\/em>? Of course, a linear regression won&#8217;t work. 
What about a polynomial regression?<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326801281?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9326801281?profile=RESIZE_710x\" width=\"500\" class=\"align-center\"><\/a><\/p>\n<p>The first 600 values of <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> are available <a href=\"http:\/\/datashaping.com\/stdv.txt\" target=\"_blank\" rel=\"noopener\">here<\/a>, as a text file.<\/p>\n<p><span style=\"font-size: 14pt;\"><strong>Solution<\/strong><\/span><\/p>\n<p>A tool as basic as Excel is good enough to find the solution. However, if you use Excel, the built-in function STDEV applies a correction factor (it divides by <em>n<\/em> - 1 rather than <em>n<\/em>) that needs to be taken care of, for instance by using STDEV.P instead. Alternatively, you can just use the values of <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> available in my text file mentioned above, to avoid this problem.<\/p>\n<p>If you use Excel, you can try various types of trend lines to approximate the blue curve, and even compute the regression coefficients and the R-squared for each tested model. You will find very quickly that the power trend line is the best model by far, that is, <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> is very well approximated (for large values of <em>n<\/em>) by <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> = <i>b<\/i>\u00a0<em>n<\/em>^<em>c<\/em>. Here <em>n<\/em>^<em>c<\/em> stands for <em>n<\/em> raised to the power <em>c<\/em>, and <em>b<\/em> and <em>c<\/em> are the regression coefficients. In other words, log <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> = log <em>b<\/em> + <em>c<\/em> log <em>n<\/em> (approximately), so the fit reduces to a linear regression of log <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> on log <em>n<\/em>.\u00a0<\/p>\n<p>What is very interesting is that, using some mathematics, you can actually compute the exact value of <em>c<\/em>. 
Indeed, <em>c<\/em> is a solution of the equation <em>c<\/em>^2 = (2<em>c<\/em> + 1) (<em>c<\/em> + 1)^2, see <a href=\"https:\/\/math.stackexchange.com\/questions\/4190405\/asymptotic-behavior-of-recurrence-x-n1-mboxstdevx-1-dots-x-n\" target=\"_blank\" rel=\"noopener\">here<\/a>. This is a polynomial equation of degree 3, so the exact value of <em>c<\/em> can be computed. The approximate value is <em>c<\/em> = -0.3522011. It is, however, very hard to get the exact value of <em>b<\/em>.\u00a0<\/p>\n<p>It would be interesting to plot the residual error for each estimated value of <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em>, and see whether it shows a pattern. This could lead to a better approximation: <em>A<span style=\"font-size: 8pt;\">n<\/span><\/em> = <em>b<\/em> <em>n<\/em>^<em>c<\/em> (1 + <em>d\u00a0<\/em>\/\u00a0<em>n<\/em>), with three parameters: <em>b<\/em>, <em>c<\/em> (unchanged) and <em>d<\/em>.<\/p>\n<\/p>\n<div class=\"postbody\">\n<div class=\"xg_user_generated\">\n<p><span style=\"font-size: 12pt;\"><em>To receive a weekly digest of our new articles, subscribe to our newsletter,\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/check-out-our-dsc-newsletter\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/span><\/p>\n<p><span style=\"font-size: 12pt;\"><em><strong>About the author<\/strong>:\u00a0 Vincent Granville is a d<span class=\"lt-line-clamp__raw-line\">ata science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. 
Vincent is also self-publisher at\u00a0<a href=\"http:\/\/datashaping.com\/\" target=\"_blank\" rel=\"noopener\">DataShaping.com<\/a>, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target).<\/span>\u00a0He recently opened\u00a0<a href=\"https:\/\/www.parisrestaurantandbar.com\/\" target=\"_blank\" rel=\"noopener\">Paris Restaurant<\/a>, in Anacortes. You can access Vincent&#8217;s articles and books,\u00a0<a href=\"http:\/\/datashaping.com\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/span><\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1059902\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Vincent Granville This article is part of a new series featuring problems with solution, to help you hone your machine learning and pattern recognition [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/29\/a-simple-regression-problem\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":475,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4873"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4873"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4873\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/473"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4873"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4873"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4873"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}