{"id":827,"date":"2018-07-24T06:55:02","date_gmt":"2018-07-24T06:55:02","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/24\/when-variable-reduction-doesnt-work\/"},"modified":"2018-07-24T06:55:02","modified_gmt":"2018-07-24T06:55:02","slug":"when-variable-reduction-doesnt-work","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/24\/when-variable-reduction-doesnt-work\/","title":{"rendered":"When Variable Reduction Doesn\u2019t Work"},"content":{"rendered":"<p>Author: William Vorhies<\/p>\n<div>\n<p><strong><em>Summary:<\/em><\/strong><em>\u00a0 Exceptions sometimes make the best rules.\u00a0 Here\u2019s an example of well-accepted variable reduction techniques resulting in an inferior model, and a case for dramatically expanding the number of variables we start with.<\/em><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wqtnowdqSWYTxbiW5p7RH0eJtYjWiuVHQSNXOG042ek9lS8az8bL747rQegHxP8CGuHgeVC1nE1AnVbe0mC8Esy\/noexception.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wqtnowdqSWYTxbiW5p7RH0eJtYjWiuVHQSNXOG042ek9lS8az8bL747rQegHxP8CGuHgeVC1nE1AnVbe0mC8Esy\/noexception.png?width=300\" width=\"300\" class=\"align-right\"><\/a>One of the things that keeps us data scientists on our toes is that well-established rules of thumb don\u2019t always work.\u00a0 Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables.\u00a0 And woe to you who violate this rule.\u00a0 Your model will overfit, include spurious random correlations, or at the very least be judged slow and clunky.<\/p>\n<p>Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis\/Nexis that lays out a case where this clearly isn\u2019t true.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>A Little 
Background<\/strong><\/span><\/p>\n<p>In heavily regulated industries like insurance and lending, both the variables allowed for use and the modeling techniques themselves are tightly constrained.\u00a0 Techniques are generally limited to those that are highly explainable, mostly GLM and simple decision trees.\u00a0 Data can\u2019t include anything that is overtly discriminatory under the law, so, for example, race, sex, and age can\u2019t be used, or at least not directly.\u00a0 All of this works against model accuracy.<\/p>\n<p>Traditionally, what agencies could use to build risk models has been defined as \u2018traditional data\u2019: that which the consumer has submitted with their application, plus the data that can be added from the major credit rating agencies.\u00a0 In this last case, Experian and the others offer some 250 different variables, and except for those specifically excluded by law, this seems like a pretty good-sized inventory of predictive features.<\/p>\n<p>But in the US, and especially abroad, the market contains many \u2018thin-file\u2019 or \u2018no-file\u2019 consumers who would like to borrow but for whom traditional data sources simply don\u2019t exist.\u00a0 Millennials feature in this group because their cohort is young and doesn\u2019t yet have much borrowing or credit history.\u00a0 But also in this group are the folks judged to be marginal credit risks, some of whom could be good customers if only we knew how to judge the risk.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Enter the World of Alternative Data<\/strong><\/span><\/p>\n<p>\u2018Alternative data\u2019 is considered to be any data not directly related to the consumer\u2019s credit behavior, basically anything other than the application data and consumer credit bureau data.\u00a0 A variety of agencies are prepared to provide it, and it can include:<\/p>\n<ol>\n<li>Transaction data (e.g. 
checking account data)<\/li>\n<li>Telecom\/utility\/rent data<\/li>\n<li>Social profile data<\/li>\n<li>Social network data<\/li>\n<li>Clickstream data<\/li>\n<li>Audio and text data<\/li>\n<li>Survey data<\/li>\n<li>Mobile app data<\/li>\n<\/ol>\n<p>As it turns out, lenders have been embracing alternative data for the last several years and have seen real improvements in their credit models, particularly at the low end of the scores.\u00a0 Even the CFPB has provisionally endorsed this to bring credit to the underserved.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong><a href=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wohwe7iFLV4XnZ7IbNkdQLh-kxZuwuMsMO3nQ8R5swuvrgpHyahdSHJrGK1lHj3o3R5em7RtL-j-5uWn4I6bl6A\/modeldevleft.jpg\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wohwe7iFLV4XnZ7IbNkdQLh-kxZuwuMsMO3nQ8R5swuvrgpHyahdSHJrGK1lHj3o3R5em7RtL-j-5uWn4I6bl6A\/modeldevleft.jpg?width=200\" width=\"200\" class=\"align-right\"><\/a>From a Data Science Perspective<\/strong><\/span><\/p>\n<p>From a data science perspective, in this example we started out with on the order of 250 candidate features from \u2018traditional data\u2019, and now, using \u2018alternative data\u2019, we can add a further 1,050 features.\u00a0 What\u2019s the first thing you do when you have 1,300 candidate variables?\u00a0 You go through the steps necessary to identify only the most predictive variables and discard the rest.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Here\u2019s Where It Gets Interesting<\/strong><\/span><\/p>\n<p>Lexis\/Nexis, the provider of the alternative data, set out to demonstrate that a credit model built on all 1,300 features was superior to one built on only the 250 traditional features.\u00a0 The data was drawn from a full-file auto lending portfolio of just under 11 million instances.\u00a0 You and I might have concluded that even 250 was too many, but in order to keep the test rigorous they 
introduced these constraints.\u00a0<\/p>\n<ol>\n<li>The technique was limited to forward stepwise logistic regression. This provided clear univariate feedback on the importance of each variable.<\/li>\n<li>Only two models would be compared, one with the top 250 most predictive attributes and the other with all 1,300 attributes. This eliminated any bias from variable selection that might be introduced by the modeler.<\/li>\n<li>The variables for the 250-variable model were selected by ranking the predictive power of each variable\u2019s correlation with the dependent variable. As it happened, all of the alternative variables fell outside the top 250, the highest of them ranking 296<sup>th<\/sup>.<\/li>\n<li>The models were created with the same overall data prep procedures, such as binning rules.<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>What Happened<\/strong><\/span><\/p>\n<p>As you might expect, the first and most important variable was the same for both models, but the models began to diverge at the second variable.\u00a0 The second variable in the 1,300-variable model had actually ranked 296<sup>th<\/sup> in the earlier predictive power analysis.\u00a0<\/p>\n<p>When the model was completed, the alternative data accounted for 25% of the model\u2019s accuracy, although none of it would have been included based on the top 250 predictive variables.<\/p>\n<p>The KS (Kolmogorov-Smirnov) statistic was 4.3% better for the 1,300-variable model than for the 250-variable model.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>The Business Importance<\/strong><\/span><\/p>\n<p>The distributions of scores and charge-offs for the two models were very similar, but in the bottom 5% of scores things changed.\u00a0 There was a 6.4% increase in the number of predicted charge-offs in this bottom group.\u00a0<\/p>\n<p>Since the distributions are essentially the same, this can be seen as borrowers who might otherwise have been rated creditworthy migrating into the lowest categories of creditworthiness, allowing 
better decisions about denial or pricing based on risk.\u00a0 Conversely, it appears that some of the lowest-rated borrowers were given a boost with the additional data.<\/p>\n<p>That also translates to a competitive advantage for those using the alternative data compared to those who don\u2019t.\u00a0 You can see the original study <a href=\"http:\/\/images.solutions.lexisnexis.com\/Web\/LexisNexis\/%7Ba73d4c61-eb8d-4256-8ca0-073cb0781421%7D_LexisNexis_Risk_Solutions_-_Modeling_Blended_Alternative_and_Traditional_Data.pdf\"><em><u>here<\/u><\/em><\/a>.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>There Are Four Lessons for Data Scientists Here<\/strong><\/span><\/p>\n<ol>\n<li><strong>Think outside the box and consider the value of a large number of variables when first developing or refining your model.<\/strong> It wasn\u2019t until just a few years ago that the insurance industry started looking at alternative data, and on the margin it has increased accuracy in important ways.\u00a0 FICO published this chart showing the relative value of each category of alternative data, strongly supporting the use of more variables.<\/li>\n<\/ol>\n<p>\u00a0<a href=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wqqw3hbcSfGUJiUgk9DXU4*xmYj0UIlLPvu1u-3cZ7B3lSIp4XQlXUdKfzswFeqRQ9VmYizgQFK6hymN6Qmy2x9\/FICOalternativedata.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/GOtZVJLj-wqqw3hbcSfGUJiUgk9DXU4*xmYj0UIlLPvu1u-3cZ7B3lSIp4XQlXUdKfzswFeqRQ9VmYizgQFK6hymN6Qmy2x9\/FICOalternativedata.png?width=550\" width=\"550\" class=\"align-center\"><\/a><\/p>\n<ol start=\"2\">\n<li><strong>Be careful about using \u2018tried and true\u2019 variable selection techniques.<\/strong> In the Lexis\/Nexis case, starting the modeling process with variable selection based on univariate correlation with the dependent variable was misleading.\u00a0 There are a variety of other techniques they could have tried.<\/li>\n<li><strong>Depending on the amount of prep, it still 
may not be worthwhile expanding your variables so dramatically.<\/strong> More data means more prep, and more prep means more time, which in a commercial environment you may not have.\u00a0 Still, be open to exploration.<\/li>\n<li><strong>Adding \u2018alternate source\u2019 data to your decision-making can be a two-edged sword.<\/strong> In India, measures as obscure as how often a user charges his cell phone, or its average charge level, have proven to be predictive.\u00a0 In that credit-starved environment, these innovative measures are welcomed when they provide greater access to credit.<\/li>\n<\/ol>\n<p>On the other hand, just this week a major newspaper in England published an expos\u00e9 of comparative auto insurance rates in which it discovered that individuals applying with a Hotmail account were paying as much as 7% more than those with Gmail accounts.\u00a0 Apparently British insurers had found a legitimate correlation between risk and this alternative data.\u00a0 It did not sit well with the public, and the companies are now on the defensive.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>About the author:\u00a0 Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.\u00a0 He can be reached at:<\/p>\n<p><a href=\"mailto:Bill@DataScienceCentral.com\">Bill@DataScienceCentral.com<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:688194\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: William Vorhies Summary:\u00a0 Exceptions sometimes make the best rules.\u00a0 Here\u2019s an example of well-accepted variable reduction techniques resulting in an inferior model and [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/07\/24\/when-variable-reduction-doesnt-work\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":473,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/827"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=827"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/827\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=827"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=827"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=827"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}