Writing/Reading Large R dataframes/data.tables — Addendum.

Author: Steve Miller

After posting my most recent blog, which used census data to illustrate handling “large” dataframes in R with the fst and feather file formats, I realized I could have taken the analysis a step further.

Recall how I described the data at the time: “demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables.”

In addition to working with the two individual dataframes/data.tables, one could also consider the merge of household and population. Each household has one or more population records, and each population record belongs to one and only one household. An attribute, serialno, can be used to join household and population into a data.table of almost 15.8M records and over 500 attributes, consuming in excess of 32GB of memory. This is quite large for desktop R.
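To make the shape of that join concrete, here is a minimal sketch on toy data. Only serialno comes from the description above; the other column names and all values are invented for illustration, and the real tables are of course far wider.

```r
library(data.table)

# Toy stand-ins for the real tables. Every column except serialno is hypothetical.
household <- data.table(
  serialno = c(1L, 2L, 3L),
  state    = c("IL", "WI", "MN")
)
population <- data.table(
  serialno = c(1L, 1L, 2L, 3L, 3L),
  age      = c(34L, 36L, 71L, 8L, 41L)
)

# serialno must be unique in household for the join to be one-to-many.
stopifnot(anyDuplicated(household$serialno) == 0L)

# Inner join on the shared key: household attributes are repeated
# onto each matching population record.
joined <- merge(population, household, by = "serialno")

# One output row per population record, with columns from both tables.
stopifnot(nrow(joined) == nrow(population))
print(joined)
```

Because every population record matches exactly one household, the joined table has as many rows as population, which is why the full-size result lands at almost 15.8M records.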

So, I just had to attempt to produce such a structure, despite R’s in-memory limitation, and then write fst and feather files for subsequent use. Alas, the attempt to produce the supersize data.table on my 64GB Wintel notebook failed with a memory allocation error (R’s not the most efficient memory manager). My response? Get the join to work on a 128GB notebook, then “save” the resulting data in fst and feather files. Once produced, transport those files to the 64GB machine to see if they can be read. It turns out that approach works quite well.
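The save-then-transport step might look roughly like the sketch below. The table contents, file names, and the compression setting are assumptions for illustration, not the values used for the real data; a temporary directory stands in for wherever the files would actually be written.

```r
library(data.table)
library(fst)
library(feather)

# Small stand-in for the joined data.table (hypothetical columns and values).
joined <- data.table(
  serialno = c(1L, 1L, 2L),
  age      = c(34L, 36L, 71L),
  state    = c("IL", "IL", "WI")
)

# Hypothetical output paths in a temporary directory.
fst_path     <- file.path(tempdir(), "joined.fst")
feather_path <- file.path(tempdir(), "joined.feather")

# Write both formats. compress = 50 is fst's default level, shown explicitly.
write_fst(joined, fst_path, compress = 50)
write_feather(joined, feather_path)

# On the second machine, the files are read back into data.tables.
from_fst     <- read_fst(fst_path, as.data.table = TRUE)
from_feather <- as.data.table(read_feather(feather_path))

stopifnot(nrow(from_fst) == nrow(joined), nrow(from_feather) == nrow(joined))
```

Both files are self-contained, so copying them to another machine is all the transport step requires.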

The remainder of this notebook splits attention between the 128GB RAM and 64GB RAM machines, producing the joined data.table on the former and demoing access to the data from fst and feather files on the latter. Once the grand fst file is built, it can be redeployed to a machine with less memory and accessed much like an in-memory data.table, with projection (reading only selected columns) and filtering (reading only selected rows). The performance is excellent.
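Projection and filtering against an fst file can be sketched as follows, again on invented toy data with hypothetical column names. The sketch uses the fst package's read_fst() arguments and its fst() proxy object; it is one plausible way to do this, not necessarily how the full notebook does it.

```r
library(data.table)
library(fst)

# Toy joined table written to a temporary fst file (hypothetical columns).
joined <- data.table(
  serialno = rep(1:4, each = 2),
  age      = c(34L, 36L, 71L, 68L, 8L, 41L, 29L, 77L),
  state    = rep(c("IL", "WI", "MN", "IA"), each = 2)
)
path <- file.path(tempdir(), "joined_demo.fst")
write_fst(joined, path)

# Projection: read only the named columns from disk.
slim <- read_fst(path, columns = c("serialno", "age"), as.data.table = TRUE)

# Row range: read only rows 1 through 4.
first_rows <- read_fst(path, from = 1, to = 4, as.data.table = TRUE)

# Filtering through the fst() proxy: read one column to build the row
# selection, then pull just those rows and columns from the file.
ft      <- fst(path)
keep    <- which(ft$age >= 65)
seniors <- ft[keep, c("serialno", "age")]

print(seniors)
```

The point of the pattern is that only the requested columns are loaded into memory, which is what lets a file built on the larger machine be queried on the smaller one.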

The technology used is JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, and R 3.6.0. The R tidyverse, data.table, fst, and feather packages are featured.

See the entire blog here.
