{"id":2294,"date":"2019-06-25T06:32:18","date_gmt":"2019-06-25T06:32:18","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/06\/25\/writing-reading-large-r-dataframes-datatables\/"},"modified":"2019-06-25T06:32:18","modified_gmt":"2019-06-25T06:32:18","slug":"writing-reading-large-r-dataframes-datatables","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/06\/25\/writing-reading-large-r-dataframes-datatables\/","title":{"rendered":"Writing\/Reading Large R dataframes\/datatables."},"content":{"rendered":"<p>Author: steve miller<\/p>\n<div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3088643838?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3088643838?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<p>I recently downloaded a<span>\u00a0<\/span><a href=\"https:\/\/factfinder.census.gov\/faces\/nav\/jsf\/pages\/searchresults.xhtml?refresh=t#\">5-year Public Use Microdata Sample (PUMS)<\/a><span>\u00a0<\/span>from the latest release of the American Community Survey (ACS) census data. The data contain a wealth of demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables.<\/p>\n<p>In addition to enabling a wealth of demographic analyses, these census data are well suited to performance testing of functions that ingest, munge, and deliver analytics. That is, if your computer has enough firepower: both R and Python-Pandas constrain data structures by the size of memory. 
Fortunately, my Wintel notebook, with 64 GB RAM and 2 TB disk\/solid state storage, is up to the hardware task here.<\/p>\n<p>My focus with this blog is on determining how R&#8217;s dataframe\/data.table read and write capabilities measure up to 15+ GB of raw input. Working with data this size can often deliver more clear-cut benchmarks than smaller tests repeated and aggregated. Indeed, as is the case in this notebook, the analyst can often see order-of-magnitude performance differences between approaches.<\/p>\n<p>In the analyses below, I contrast the elapsed time of writing the 18 GB population dataframe to OS files using three different csv functions, R&#8217;s saveRDS function with and without compression, the interoperable feather library, and the nonesuch fst library. I then, in turn, read these just-produced OS files back into dataframes\/datatables and compare timing results. Each of the seven read\/write approaches produces files that are portable across disparate R platforms. The<span>\u00a0<\/span><a href=\"https:\/\/blog.rstudio.com\/2016\/03\/29\/feather\/\">feather package<\/a>, in addition, interoperates between R and Python-Pandas, a major benefit.<\/p>\n<p>At the conclusion of the performance tests, I outline a generic approach to efficiently sourcing data from R to work in both R and Python-Pandas platforms using functionality from a combination of the fst and feather packages. I demonstrate the approach in R and, using the nifty reticulate package, in Python-Pandas as well.<\/p>\n<p>The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, and R 3.6.0. 
The R data.table, fst, feather, and reticulate packages are featured.<\/p>\n<p>Read the entire post\u00a0<a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3088664811?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\">here.<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:845179\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: steve miller I recently downloaded a\u00a05 year Public Use Microsample (PUMS)\u00a0from the latest release of the American Community Survey (ACS) census data. The data [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/06\/25\/writing-reading-large-r-dataframes-datatables\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":468,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2294"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2294"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2294\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/470"}],"wp:at
tachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}