{"id":1311,"date":"2018-11-20T06:34:26","date_gmt":"2018-11-20T06:34:26","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/11\/20\/matching-the-exact-matching-of-matchit\/"},"modified":"2018-11-20T06:34:26","modified_gmt":"2018-11-20T06:34:26","slug":"matching-the-exact-matching-of-matchit","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/11\/20\/matching-the-exact-matching-of-matchit\/","title":{"rendered":"Matching the Exact Matching of MatchIt"},"content":{"rendered":"<p>Author: Steve Miller<\/p>\n<div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/135848432?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/135848432?profile=original\" class=\"align-full\"><\/a><\/p>\n<p>I started a series on<span>\u00a0<\/span><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/mixing-amp-matching-in-r-for-data-science\">causal inference for data science<\/a><span>\u00a0<\/span>a few weeks back. I think CI methodologies offer great potential for the DS discipline, given that much of our data is observational, i.e. outside experimental control.<\/p>\n<p>As I noted then, &#8220;The platinum design for causal inference is the experiment where subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z&#8217;s) should, within sampling limitations, be &#8216;equal&#8217; or &#8216;balanced&#8217; across treatments or values of X. In such settings of &#8216;controlled&#8217; Z&#8217;s, the analyst is much more confident that a correlation between X and Y actually indicates causality.<\/p>\n<p>But what about the in-situ data-gathering schemes generally seen in the DS world, where data are observational, and confounders are free to roam? What is one to do? 
The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment.&#8221;<\/p>\n<p>In that blog, I introduced data from the<span>\u00a0<\/span><a href=\"https:\/\/www.census.gov\/programs-surveys\/acs\/data.html\">American Community Survey<\/a>. Details of data set construction can be found<span>\u00a0<\/span><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/mixing-amp-matching-in-r-for-data-science\">there<\/a>.<\/p>\n<p>The question I purported to address with that data is what income difference, if any, it makes for individuals to hold a terminal master&#8217;s degree vs a terminal bachelor&#8217;s. Since we can&#8217;t conduct an experiment in which the population is assigned at random to either master&#8217;s or bachelor&#8217;s degree &#8220;treatments&#8221;, it made sense to consider a CI technique such as matching to see if we could untangle the effects of the education &#8220;treatment&#8221; from uncontrolled covariates\/confounders such as age, sex, marital status, and race that might differ between the education groups out of the gate.<\/p>\n<p>The technique I deployed was nearest neighbor matching using the results of a propensity model detailing if\/how the &#8220;treatment&#8221; covaried with the confounders. The results indicated that, if all impactful confounders had been included &#8212; a critical assumption &#8212; there was indeed a meaningful difference in income between the two education levels. Moreover, when the matching adjustments were applied, the income difference was smaller, but still off-the-charts significant. This reduction made sense given that master&#8217;s-degreed cases were older and more likely to be married &#8212; indicators that positively related to income on their own.<\/p>\n<p>Though I was pretty happy with the results, I was less enthused about the computational intensity of the chosen technique. 
It took over 70 minutes to complete the calculations against a random subset of 250,000 of the more than 0.5M suitable records. With that kind of performance, such models would be less than ideal for data science work.<\/p>\n<p>I also discovered critiques of<span>\u00a0<\/span><a href=\"https:\/\/gking.harvard.edu\/files\/gking\/files\/psnot.pdf\">propensity model-driven matching by Harvard professor Gary King et al.<\/a>, who are trailblazers in causal inference and authors of the popular R CI package,<span>\u00a0<\/span><a href=\"https:\/\/cran.r-project.org\/web\/packages\/MatchIt\/vignettes\/matchit.pdf\">MatchIt<\/a>.<\/p>\n<p>As a result, I decided for this analysis to try &#8220;exact matching&#8221; (em) on the entire 0.5M+ data file. Exact matching is a much simpler and computationally more benign technique that involves only basic SQL-like wrangling. It turns out that em worked quite well with this data, completing calculations against the full file in under 30 seconds. The code and results are detailed below.<\/p>\n<p>The technology used in the analysis is JupyterLab with Microsoft R Open 3.4.4. For the matching work, the MatchIt, tableone, and data.table packages are deployed.<\/p>\n<p>Next time I&#8217;ll consider coarsened exact matching, an extension to em that promotes a higher matching rate, thus potentially lowering estimate variance.<\/p>\n<p>Find the remainder of the blog<span>\u00a0<\/span><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/135848558?profile=original\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:779313\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Steve Miller I started a series on\u00a0causal inference for data science\u00a0a few weeks back. 
I think CI methodologies offer great potential for the DS [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/11\/20\/matching-the-exact-matching-of-matchit\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":463,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1311"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1311"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1311\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/461"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1311"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}