{"id":2646,"date":"2019-10-03T06:32:23","date_gmt":"2019-10-03T06:32:23","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/03\/surprising-uses-of-synthetic-random-data-sets\/"},"modified":"2019-10-03T06:32:23","modified_gmt":"2019-10-03T06:32:23","slug":"surprising-uses-of-synthetic-random-data-sets","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/03\/surprising-uses-of-synthetic-random-data-sets\/","title":{"rendered":"Surprising Uses of Synthetic Random Data Sets"},"content":{"rendered":"<p>Author: Vincent Granville<\/p>\n<div>\n<p>I have used synthetic data sets many times for simulation purposes, most recently in my articles <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/six-degrees-of-separation-between-any-two-data-sets\" target=\"_blank\" rel=\"noopener noreferrer\">Six degrees of Separations between any two Datasets<\/a> and <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/how-to-lie-with-p-values\" target=\"_blank\" rel=\"noopener noreferrer\">How to Lie with <em>p<\/em>-values<\/a>. Many applications (including the data sets themselves) can be found in my books <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/fee-book-applied-stochastic-processes\" target=\"_blank\" rel=\"noopener noreferrer\">Applied Stochastic Processes<\/a> and <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/free-book-statistics-new-foundations-toolbox-and-machine-learning\" target=\"_blank\" rel=\"noopener noreferrer\">New Foundations of Statistical Science<\/a>. For instance, these data sets can be used to benchmark some statistical tests of hypothesis (the null hypothesis known to be true or false in advance) and to assess the power of such tests or confidence intervals. In other cases, it is used to simulate clusters and test cluster detection \/ pattern detection algorithms, see <a href=\"https:\/\/www.analyticbridge.datasciencecentral.com\/profiles\/blogs\/how-to-detect-a-pattern-problem-and-solution\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.\u00a0 I also used such data sets to discover two new deep conjectures in number theory (see <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/two-new-deep-conjectures-in-probabilistic-number-theory\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>), to design new Fintech models such as <em>bounded Brownian motions<\/em>, and find new families of statistical distributions (see <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/a-strange-family-of-statistical-distributions\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>).<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641314354?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641314354?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/p>\n<p style=\"text-align: center;\"><em>Goldbach&#8217;s comet (source: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Goldbach%27s_comet\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>)<\/em><\/p>\n<p>In this article, I focus on peculiar random data sets to prove &#8212; heuristically &#8212; two of the most famous math conjectures in number theory, related to prime numbers: the Twin Prime conjecture, and the Goldbach conjecture. The methodology is at the intersection of probability theory, experimental math, and probabilistic number theory. It involves working with infinite data sets, dwarfing any data set found in any business context. While similar ideas have been explored in the past based on the concept of density (see <a href=\"https:\/\/www.dartmouth.edu\/~chance\/chance_news\/for_chance_news\/Riemann\/cramer.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>), in this article they are presented in a more general and simpler framework, using the concept of synthetic random data set. These data sets are truly random with some probability distributions assigned to them. The expectation or variance of some of the data set features (the average gap between two successive values for instance) can many times be computed explicitly, or can lead to some interesting asymptotic results.\u00a0<\/p>\n<p><strong>Synthetic Random Sets and Twin Primes<\/strong><\/p>\n<p>The sets considered here consist of infinitely many positive integers. They are built as follows, using an arbitrary function <em>f<\/em> taking values between 0 and 1, and defined over the set of all positive integers.<\/p>\n<p>Let <em>S<\/em>(<em>f<\/em>) be the random set in question. The positive integer <em>n<\/em> belongs to <em>S<\/em>(<em>f<\/em>) with probability <em>f<\/em>(<em>n<\/em>). In other words, for each integer <em>n<\/em>, generate a uniform deviate <em>U<\/em>(<em>n<\/em>) on [0,1]. If and only if <em>U<\/em>(<em>n<\/em>)\u00a0 <\u00a0\u00a0<em>f<\/em>(<em>n<\/em>) then add <em>n<\/em> to <em>S<\/em>(<em>f<\/em>). The deviates <em>U<\/em>(<em>n<\/em>) are independent. Now consider <em>f<\/em>(<em>n<\/em>) = 1 \/ log <em>n<\/em> if <em>n<\/em> > 2, and <em>f<\/em>(1) = <em>f<\/em>(2) = 0. With this particular choice of <em>f<\/em>, the set <em>S<\/em>(<em>f<\/em>) has the same &#8220;density&#8221; as the set of prime numbers thanks to the <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/simple-proof-of-prime-number-theorem\" target=\"_blank\" rel=\"noopener noreferrer\">prime number theorem<\/a>, and behaves in a very similar way. Let us define the following functions (actually, they are random variables):<\/p>\n<ul>\n<li><em>M<\/em>(<em>n<\/em>) is the number of elements in <em>S<\/em>(<em>f<\/em>) that are smaller than <em>n<\/em>.<\/li>\n<li><em>T<\/em>(<em>n<\/em>) is the number of twin pairs (<em>k<\/em>, <em>k<\/em>+1) with both <em>k<\/em>, <em>k<\/em>+1 belonging to <em>S<\/em>(<em>f<\/em>) and <em>k<\/em> \u2264 <em>n<\/em>.<\/li>\n<\/ul>\n<p>The twins in <em>S<\/em>(<em>f<\/em>) play the same role as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Twin_prime\" target=\"_blank\" rel=\"noopener noreferrer\">twin primes<\/a>. It is is easy to prove that <em>M<\/em>(<em>n<\/em>) (its expectation) is asymptotically equal to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Logarithmic_integral_function\" target=\"_blank\" rel=\"noopener noreferrer\">logarithm integral<\/a> Li(<em>n<\/em>). The same is true for the <a href=\"http:\/\/mathworld.wolfram.com\/PrimeCountingFunction.html\" target=\"_blank\" rel=\"noopener noreferrer\">prime counting function<\/a>. Furthermore, the prime numbers themselves is a particular case of <em>S<\/em>(<em>f<\/em>), a particular realization of the random set <em>S<\/em>(<em>f)<\/em>\u00a0with the same function <em>f<\/em>. (you need to ignore even integers in the construction process, but other than that, it&#8217;s straightforward.)<\/p>\n<p>Then we also have:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641218210?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641218210?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/p>\n<p>Here <strong>E<\/strong> denotes the expectation; <em>C<\/em> and <em>C<\/em>&#8216; are two positive constants.. The same result is conjectured to be true for twin primes. It is easy to show that there are infinitely many twin numbers in <em>S<\/em>(<em>f<\/em>), and to assess whether the series consisting of the reciprocals of these twins, converge (it does). The same arguments could be applied to twin primes, but of course, it would no constitute a proof, just some heuristic argumentation.\u00a0<\/p>\n<p><strong>The Goldbach Conjecture<\/strong><\/p>\n<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Goldbach%27s_conjecture\" target=\"_blank\" rel=\"noopener noreferrer\">Goldbach conjecture<\/a> states that every even integer is the sum of two primes. Using our random sets, with the same function <em>f<\/em>, we can easily obtain a stronger result about the asymptotic value for the expected number of ways an even number <em>n<\/em> can be represented as a sum of two primes. The answer is\u00a0<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641292985?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641292985?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/p>\n<p>In short, we proved that the result is true for the random set that we built, and thus it must (likely) be\u00a0 true for prime numbers as well, which is the topic of the conjecture. The above asymptotic equivalence is based on the following fact. For many functions <em>f<\/em> including the one we are interested in (<em>f<\/em>(<em>k<\/em>) = 1 \/ log <em>k<\/em> if <em>k<\/em> >2), we have:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641295922?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3641295922?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/p>\n<p>Here the implicit constant under the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Big_O_notation\" target=\"_blank\" rel=\"noopener noreferrer\">Landau symbol <em>O<\/em><\/a>\u00a0is equal to 1. See <a href=\"https:\/\/math.stackexchange.com\/questions\/3374248\/the-following-seems-true-for-a-wide-range-of-functions-sum-k-1n-fkfn-k\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a> for details. I could not find a proof of this simple result, but the above sum is the discrete <a href=\"https:\/\/en.wikipedia.org\/wiki\/Convolution_theorem\" target=\"_blank\" rel=\"noopener noreferrer\">self-convolution<\/a> of the function <em>f<\/em>. Another conjecture, more general than Goldbach (indeed also a generalization of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Waring%27s_problem\" target=\"_blank\" rel=\"noopener noreferrer\">Waring&#8217;s problem<\/a>), can be found in one of my previous articles, <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/number-theory-nice-generalization-of-the-waring-conjecture\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. Finally, the concept of random sets can be extended to sets of real numbers, or sets of matrices, polynomials, functions, set of sets, or even nested random sets.<\/p>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/number-theory-nice-generalization-of-the-waring-conjecture\"><\/a><\/p>\n<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:893221\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Vincent Granville I have used synthetic data sets many times for simulation purposes, most recently in my articles Six degrees of Separations between any [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/03\/surprising-uses-of-synthetic-random-data-sets\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":457,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2646"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2646"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2646\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/459"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2646"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2646"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2646"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}