{"id":908,"date":"2018-08-16T06:54:29","date_gmt":"2018-08-16T06:54:29","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/08\/16\/neural-networks-from-a-bayesian-perspective\/"},"modified":"2018-08-16T06:54:29","modified_gmt":"2018-08-16T06:54:29","slug":"neural-networks-from-a-bayesian-perspective","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/08\/16\/neural-networks-from-a-bayesian-perspective\/","title":{"rendered":"Neural Networks from a Bayesian Perspective"},"content":{"rendered":"<p>Author: Yoel Zeldes<\/p>\n<div>\n<div class=\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row\">\n<div class=\"ui-caption postMetaHeader-socialProof2 col u-size12of12 u-paddingBottom20\"><\/div>\n<div class=\"col u-size12of12 js-postMetaLockup\">\n<div class=\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup\">\n<div class=\"u-flex1 u-paddingLeft15 u-overflowHidden\">\n<div class=\"ui-caption postMetaInline js-testPostMetaInlineSupplemental\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"postArticle-content js-postField js-notesSource js-trackedPost\">\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\"><\/div>\n<div class=\"section-inner sectionLayout--fullWidth\">\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*1XBZ8ocyjdSxP5VCiUvXzg.jpeg?width=668\" width=\"687\" height=\"459\"><\/div>\n<\/div>\n<\/div>\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p id=\"ea1a\" class=\"graf graf--p graf-after--figure\">Understanding what a model doesn\u2019t know is important both from the practitioner\u2019s perspective and for the end users of many different 
machine learning applications. In<span>\u00a0<\/span><a href=\"https:\/\/engineering.taboola.com\/using-uncertainty-interpret-model\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">our previous blog post<\/a><span>\u00a0<\/span>we discussed the different types of uncertainty. We explained how we can use it to interpret and debug our models.<\/p>\n<p id=\"0dc3\" class=\"graf graf--p graf-after--p\">In this post we\u2019ll discuss different ways to obtain uncertainty in Deep Neural Networks. Let\u2019s start by looking at neural networks from a Bayesian perspective.<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"8d0b\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Bayesian learning\u00a0101<\/span><\/h3>\n<p id=\"4480\" class=\"graf graf--p graf-after--h3\"><span class=\"markup--quote markup--p-quote is-other\">Bayesian statistics allow us to draw conclusions based on both evidence (data) and our prior knowledge about the world. 
This is often contrasted with frequentist statistics, which only consider evidence.<span>\u00a0<\/span><\/span><span class=\"markup--quote markup--p-quote is-other\">The prior knowledge captures our belief about which model generated the data, or what the weights of that model are.<\/span><span>\u00a0<\/span><span class=\"markup--quote markup--p-quote is-other\">We can represent this belief using a<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">prior distribution<\/em><span>\u00a0<\/span>p(w) over the model\u2019s weights.<\/span><\/p>\n<p id=\"6349\" class=\"graf graf--p graf-after--p\">As we collect more data, we update the prior distribution and turn it into a<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">posterior distribution<\/em><span>\u00a0<\/span>using Bayes\u2019 law, in a process called<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">Bayesian updating<\/em>:<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*iG_jZ05lXo97QeNshpEVGg.png?width=362\" width=\"362\"><\/div>\n<\/div>\n<p><\/p>\n<p id=\"fe94\" class=\"graf graf--p graf-after--figure\">This equation introduces another key player in Bayesian learning\u200a\u2014\u200athe<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">likelihood<\/em>, defined as p(y|x,w). This term represents how likely the data is, given the model\u2019s weights<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">w<\/em>.<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"0434\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Neural networks from a Bayesian perspective<\/span><\/h3>\n<p id=\"d51b\" class=\"graf graf--p graf-after--h3\">A neural network\u2019s goal is to estimate the likelihood p(y|x,w). 
This is true even when you\u2019re not explicitly doing that, e.g.<span>\u00a0<\/span><a href=\"https:\/\/www.jessicayung.com\/mse-as-maximum-likelihood\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">when you minimize MSE<\/a>.<\/p>\n<p id=\"203d\" class=\"graf graf--p graf-after--p\">To find the best model weights we can use<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">Maximum Likelihood Estimation\u00a0<\/em>(MLE):<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*rWY2S4QkHSD5G5C0kXWvRw.png?width=489\" width=\"489\"><\/div>\n<\/div>\n<p><\/p>\n<p id=\"a762\" class=\"graf graf--p graf-after--figure\">Alternatively, we can use our prior knowledge, represented as a prior distribution over the weights, and maximize the posterior distribution. This approach is called<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">Maximum A Posteriori Estimation<\/em><span>\u00a0<\/span>(MAP)<strong class=\"markup--strong markup--p-strong\">:<\/strong><\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*7iYHTt42UvOaCXV5mSMOFA.png\" width=\"564\" height=\"110\"><\/div>\n<\/div>\n<p><\/p>\n<p id=\"aa32\" class=\"graf graf--p graf-after--figure\">The term<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">log P(w)<\/em>, which represents our prior, acts as a regularization term. 
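<p>To make this concrete, here is a toy numpy sketch (our own example, with made-up data and a single-weight linear model): adding the negative log of a zero-mean Gaussian prior to the Gaussian negative log likelihood turns maximum likelihood into MAP, and shrinks the estimate toward the prior mean.</p>

```python
import numpy as np

# Toy sketch (ours): MAP with a zero-mean Gaussian prior on w adds a
# quadratic penalty to the Gaussian negative log likelihood.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=0.5, size=20)  # true slope is 2.0

sigma = 0.5   # likelihood noise std, assumed known
tau = 1.0     # prior std: p(w) = N(0, tau^2)

ws = np.linspace(-1.0, 4.0, 1001)  # grid of candidate weights for y = w*x
nll = 0.5 * np.array([np.sum((y - w * x) ** 2) for w in ws]) / sigma**2
neg_log_prior = 0.5 * ws**2 / tau**2      # -log p(w), up to a constant
neg_log_post = nll + neg_log_prior        # -log p(w|D), up to a constant

w_mle = ws[np.argmin(nll)]                # maximizes the likelihood
w_map = ws[np.argmin(neg_log_post)]       # maximizes the posterior
print(w_mle, w_map)  # the MAP estimate is pulled toward the prior mean 0
```

The prior here is weak (tau = 1), so the shrinkage is small; a tighter prior (smaller tau) pulls the MAP estimate more strongly toward zero.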
Choosing a Gaussian distribution with mean 0 as the prior, you\u2019ll get the mathematical equivalent of L2 regularization.<\/p>\n<p id=\"e7b4\" class=\"graf graf--p graf-after--p\">Now that we start thinking about neural networks as probabilistic creatures, we can let the fun begin. To start with, who says we have to output one set of weights at the end of the training process? What if, instead of learning the model\u2019s weights, we learn a distribution over the weights? This will allow us to estimate uncertainty over the weights. So how do we do that?<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"9914\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Once you go Bayesian, you never go\u00a0back<\/span><\/h3>\n<p id=\"6b12\" class=\"graf graf--p graf-after--h3\">We start again with a prior distribution over the weights and aim at finding their posterior distribution. This time, instead of optimizing the network\u2019s weights directly, we\u2019ll average over all possible weights (referred to as marginalization).<\/p>\n<p id=\"56ed\" class=\"graf graf--p graf-after--p\">At inference, instead of taking the single set of weights that maximized the posterior distribution (or the likelihood, if we\u2019re working with MLE), we consider all possible weights, weighted by their probability. 
This is achieved using an integral:<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><a href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ll9k3E53XDOSSJaEgPQWTA.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ll9k3E53XDOSSJaEgPQWTA.png?width=499\" width=\"499\" class=\"align-center\"><\/a><\/div>\n<\/div>\n<p><\/p>\n<p id=\"b04d\" class=\"graf graf--p graf-after--figure\"><em class=\"markup--em markup--p-em\">x<\/em><span>\u00a0<\/span>is a data point for which we want to infer<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">y<\/em>, and<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">X<\/em>,<em class=\"markup--em markup--p-em\">Y<\/em><span>\u00a0<\/span>are training data. The first term p(y|x,w) is our good old likelihood, and the second term p(w|X,Y) is the posterior probability of the model\u2019s weights given the data.<\/p>\n<p id=\"b8db\" class=\"graf graf--p graf-after--p\">We can think about it as an ensemble of models weighted by the probability of each model. Indeed, this is equivalent to an ensemble of an infinite number of neural networks, with the same architecture but with different weights.<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"8a34\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Are we there\u00a0yet?<\/span><\/h3>\n<p id=\"6307\" class=\"graf graf--p graf-after--h3\">Ay, there\u2019s the rub! It turns out that this integral is intractable in most cases. 
This is because<span>\u00a0<\/span><span class=\"markup--quote markup--p-quote is-other\">the posterior probability cannot be evaluated analytically<\/span>.<\/p>\n<p id=\"e61a\" class=\"graf graf--p graf-after--p\">This problem is not unique to Bayesian Neural Networks. You would run into it in many cases of Bayesian learning, and many methods to overcome it have been developed over the years. We can divide these methods into two families: variational inference and sampling methods.<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><a href=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*IYUzJxzbB_gYLwQynrOJZg.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*IYUzJxzbB_gYLwQynrOJZg.png?width=809\" width=\"626\" class=\"align-center\" height=\"209\"><\/a><\/div>\n<\/div>\n<p><\/p>\n<h4 id=\"f852\" class=\"graf graf--h4 graf-after--figure\"><span style=\"font-size: 14pt;\">Monte Carlo\u00a0sampling<\/span><\/h4>\n<p id=\"8f79\" class=\"graf graf--p graf-after--h4\">We have a problem: the posterior distribution is intractable. What if, instead of computing the integral over the true distribution, we approximate it with the average of samples drawn from it? 
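<p>The principle is easy to see in a setting where we <em>can</em> sample directly (a toy numpy example of ours): an expectation under a distribution is approximated by the average of samples drawn from it.</p>

```python
import numpy as np

# Toy example (ours): approximate E[f(w)] = ∫ f(w) p(w) dw by a sample
# average when we can draw w ~ p. Here p = N(0, 1) and f(w) = w^2,
# so the true expectation is Var(w) = 1.
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)
print(estimate)  # close to 1.0
```

The catch with a posterior is that we usually cannot draw samples from it directly, so we need machinery that generates such samples for us.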
One way to do that is the<span>\u00a0<\/span><a href=\"https:\/\/towardsdatascience.com\/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50\" class=\"markup--anchor markup--p-anchor\" target=\"_blank\" rel=\"noopener\">Markov Chain Monte Carlo<\/a>\u200a\u2014\u200ayou construct a Markov chain with the desired distribution as its equilibrium distribution.<\/p>\n<p class=\"graf graf--p graf-after--h4\">\n<h4 id=\"cb29\" class=\"graf graf--h4 graf-after--p\"><span style=\"font-size: 14pt;\">Variational Inference<\/span><\/h4>\n<p id=\"0b0f\" class=\"graf graf--p graf-after--h4\">Another solution is to approximate the true intractable distribution with a different distribution from a tractable family. To measure the similarity of the two distributions we can use the KL divergence:<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><\/div>\n<p><a href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*vnkWQl8Rqf_lFgt98DStUA.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*vnkWQl8Rqf_lFgt98DStUA.png\" class=\"align-center\"><\/a><\/div>\n<p><\/p>\n<p id=\"bdc1\" class=\"graf graf--p graf-after--figure\">Let<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">q<\/em><span>\u00a0<\/span>be a variational distribution parameterized by \u03b8. 
We want to find the value of \u03b8 that minimizes the KL divergence:<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><a href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*92HfheZdC_yaaW3pAxv5DQ.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*92HfheZdC_yaaW3pAxv5DQ.png?width=601\" width=\"601\" class=\"align-center\"><\/a><\/div>\n<\/div>\n<p><\/p>\n<p id=\"4a5c\" class=\"graf graf--p graf-after--figure\">Look at what we\u2019ve got: the first term is the KL divergence between the variational distribution and the prior distribution. The second term is the likelihood with regards to<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">q<\/em>\u03b8. So we\u2019re looking for<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">q<\/em>\u03b8 that explains the data best, but on the other hand is as close as possible to the prior distribution. This is just another way to introduce regularization into neural networks!<\/p>\n<p id=\"c767\" class=\"graf graf--p graf-after--p\">Now that we have<span>\u00a0<\/span><em class=\"markup--em markup--p-em\">q<\/em>\u03b8 we can use it to make predictions:<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"aspectRatioPlaceholder-fill\"><\/div>\n<p><a href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*F0MjIP31zdlQ9Jrni_z5YA.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*F0MjIP31zdlQ9Jrni_z5YA.png\" class=\"align-center\"><\/a><\/div>\n<p><\/p>\n<p id=\"ebeb\" class=\"graf graf--p graf-after--figure\">The above formulation comes from a<span>\u00a0<\/span><a href=\"http:\/\/proceedings.mlr.press\/v37\/blundell15.html\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">work by DeepMind<\/a><span>\u00a0<\/span>in 2015. 
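<p>In practice the predictive integral over <em>q</em>\u03b8 is itself approximated by sampling: draw weight sets from <em>q</em>\u03b8, run the model with each, and average. Here is a minimal sketch (ours) with a factorized Gaussian <em>q</em>\u03b8 over the two weights of a tiny linear model; the variational parameters below are made up, standing in for the result of the optimization.</p>

```python
import numpy as np

# Sketch (ours): approximate p(y|x) by (1/T) Σ_t p(y|x, w_t), w_t ~ q_θ.
# q_θ is a diagonal Gaussian over the weights (w1, w0) of y = w1*x + w0.
rng = np.random.default_rng(0)
mu = np.array([2.0, 0.5])      # made-up variational means for (w1, w0)
sigma = np.array([0.1, 0.05])  # made-up variational stds

def predict(x, n_samples=1000):
    w = rng.normal(mu, sigma, size=(n_samples, 2))  # w_t ~ q_θ
    preds = w[:, 0] * x + w[:, 1]                   # one prediction per sample
    return preds.mean(), preds.std()                # predictive mean and spread

mean, std = predict(3.0)
print(mean, std)  # mean near 2*3 + 0.5; std reflects the weight uncertainty
```

Note how the predictive spread comes for free: it is just the dispersion of the sampled models' outputs.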
Similar ideas were presented by<span>\u00a0<\/span><a href=\"http:\/\/proceedings.mlr.press\/v37\/blundell15.html\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">Graves<\/a><span>\u00a0<\/span>in 2011 and go back to<span>\u00a0<\/span><a href=\"http:\/\/www.cs.toronto.edu\/~fritz\/absps\/colt93.pdf\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">Hinton and van Camp<\/a><span>\u00a0<\/span>in 1993. The<span>\u00a0<\/span><a href=\"https:\/\/www.youtube.com\/watch?v=FD8l2vPU5FY\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">keynote<\/a><span>\u00a0<\/span>at the NIPS Bayesian Deep Learning workshop gave a very nice overview of how these ideas evolved over the years.<\/p>\n<p id=\"7732\" class=\"graf graf--p graf-after--p\">OK, but what if we don\u2019t want to train a model from scratch? What if we have a trained model that we want to get uncertainty estimates from? Can we do that?<\/p>\n<p id=\"6849\" class=\"graf graf--p graf-after--p\">It turns out that if we used dropout during training, we actually can.<\/p>\n<p><\/p>\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"progressiveMedia js-progressiveMedia graf-image is-canvasLoaded\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*jRSBtNk1G02Q2KqrC5tv_Q.jpeg\" width=\"597\" height=\"398\"><\/div>\n<\/div>\n<p>Professional data scientists contemplating the uncertainty of their model\u200a\u2014\u200aan illustration<\/p>\n<h4 id=\"a4a8\" class=\"graf graf--h4 graf-after--figure\"><span style=\"font-size: 14pt;\">Dropout as a means for uncertainty<\/span><\/h4>\n<p id=\"ba1c\" class=\"graf graf--p graf-after--h4\"><span class=\"markup--quote markup--p-quote is-other\"><a href=\"http:\/\/jmlr.org\/papers\/v15\/srivastava14a.html\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" 
target=\"_blank\">Dropout<\/a><span>\u00a0<\/span>is a well used practice as a regularizer.<\/span><span>\u00a0<\/span>In training time, you randomly sample nodes and drop them out, that is\u200a\u2014\u200aset their output to 0. The motivation? You don\u2019t want to over rely on specific nodes, which might imply overfitting.<\/p>\n<p id=\"549d\" class=\"graf graf--p graf-after--p\">In 2016,<span>\u00a0<\/span><a href=\"http:\/\/proceedings.mlr.press\/v48\/gal16.pdf\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">Gal and Ghahramani<\/a><span>\u00a0<\/span>showed that if you apply dropout at inference time as well, you can easily get an uncertainty estimator:<\/p>\n<ol class=\"postList\">\n<li id=\"22ad\" class=\"graf graf--li graf-after--p\">Infer y|x multiple times, each time sample a different set of nodes to drop out.<\/li>\n<li id=\"9c6f\" class=\"graf graf--li graf-after--li\">Average the predictions to get the final prediction E(y|x).<\/li>\n<li id=\"0f9a\" class=\"graf graf--li graf-after--li\">Calculate the sample variance of the predictions.<\/li>\n<\/ol>\n<p id=\"521d\" class=\"graf graf--p graf-after--li\">That\u2019s it! You got an estimate of the variance! The<span>\u00a0<\/span><a href=\"http:\/\/www.cs.ox.ac.uk\/people\/yarin.gal\/website\/blog_3d801aa532c1ce.html\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">intuition<\/a><span>\u00a0<\/span>behind this approach is that the training process can be thought of as training 2^m different models simultaneously\u200a\u2014\u200awhere m is the number of nodes in the network: each subset of nodes that is not dropped out defines a new model. All models share the weights of the nodes they don\u2019t drop out. At every batch, a randomly sampled set of these models is trained.<\/p>\n<p id=\"0406\" class=\"graf graf--p graf-after--p\">After training, you have in your hands an ensemble of models. 
If you use this ensemble at inference time as described above, you get the ensemble\u2019s uncertainty.<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"2f46\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Sampling methods vs Variational Inference<\/span><\/h3>\n<p id=\"f889\" class=\"graf graf--p graf-after--h3\">In terms of the<span>\u00a0<\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Bias%E2%80%93variance_tradeoff\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">bias-variance tradeoff<\/a>, variational inference has high bias because we choose the family of distributions. This is a strong assumption that we\u2019re making, and like any strong assumption, it introduces bias. However, it\u2019s stable, with low variance.<\/p>\n<p id=\"ee8b\" class=\"graf graf--p graf-after--p\">Sampling methods, on the other hand, have low bias, because we don\u2019t make assumptions about the distribution. This comes at the price of high variance, since the result is dependent on the samples we draw.<\/p>\n<p class=\"graf graf--p graf-after--p\">\n<h3 id=\"0fec\" class=\"graf graf--h3 graf-after--p\"><span style=\"font-size: 18pt;\">Final thoughts<\/span><\/h3>\n<p id=\"7515\" class=\"graf graf--p graf-after--h3\">Being able to estimate model uncertainty is a hot topic. It\u2019s important to be aware of it in high-risk applications such as medical assistants and self-driving cars. It\u2019s also a valuable tool for understanding which data could benefit the model, so we can go and get it.<\/p>\n<p id=\"bc11\" class=\"graf graf--p graf-after--p\">In this post we covered some of the approaches to obtaining model uncertainty estimates. 
There are many more methods out there, so if you feel highly uncertain about it, go ahead and look for more data \ud83d\ude42<\/p>\n<p id=\"e1c4\" class=\"graf graf--p graf-after--p\">In the next post we\u2019ll show you how to use uncertainty in recommender systems, and specifically\u200a\u2014\u200ahow to tackle the<span>\u00a0<\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Multi-armed_bandit\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">exploration-exploitation challenge<\/a>. Stay tuned.<\/p>\n<p id=\"53fa\" class=\"graf graf--p graf-after--p\"><em class=\"markup--em markup--p-em\">This is the second post of a series related to a paper we\u2019re presenting in a workshop at this year\u2019s KDD conference:<\/em><a href=\"https:\/\/arxiv.org\/abs\/1711.02487\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\"><span>\u00a0<\/span><em class=\"markup--em markup--p-em\">deep density networks and uncertainty in recommender systems<\/em><\/a><em class=\"markup--em markup--p-em\">.<\/em><\/p>\n<p id=\"1242\" class=\"graf graf--p graf-after--p graf--trailing\"><em class=\"markup--em markup--p-em\">The first post can be found<span>\u00a0<\/span><\/em><a href=\"https:\/\/engineering.taboola.com\/using-uncertainty-interpret-model\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\"><em class=\"markup--em markup--p-em\">here<\/em><\/a><em class=\"markup--em markup--p-em\">.<\/em><\/p>\n<p class=\"graf graf--p graf-after--p graf--trailing\">\n<\/div>\n<\/div>\n<div class=\"section-divider\">\n<hr class=\"section-divider\"><\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p graf--leading graf--trailing\">\n<p id=\"41d3\" class=\"graf graf--p graf--leading graf--trailing\">This is a joint post with\u00a0<a href=\"https:\/\/medium.com\/@inbarnaor\" class=\"markup--user markup--p-user\" target=\"_blank\" rel=\"noopener\">Inbar 
Naor<\/a>. Originally published at<span>\u00a0<\/span><a href=\"https:\/\/engineering.taboola.com\/neural-networks-bayesian-perspective\" class=\"markup--anchor markup--p-anchor\" rel=\"noopener\" target=\"_blank\">engineering.taboola.com<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:751446\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Yoel Zeldes Understanding what a model doesn\u2019t know is important both from the practitioner\u2019s perspective and for the end users of many different machine [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/08\/16\/neural-networks-from-a-bayesian-perspective\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":468,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/908"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=908"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/908\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/471"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=908"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=908"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=908"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}