{"id":3422,"date":"2020-05-07T06:33:49","date_gmt":"2020-05-07T06:33:49","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/07\/johns-hopkins-covid-19-data-and-r-part-i-data-table-handling\/"},"modified":"2020-05-07T06:33:49","modified_gmt":"2020-05-07T06:33:49","slug":"johns-hopkins-covid-19-data-and-r-part-i-data-table-handling","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/07\/johns-hopkins-covid-19-data-and-r-part-i-data-table-handling\/","title":{"rendered":"Johns Hopkins Covid-19 Data and R, Part I &#8212; data.table handling."},"content":{"rendered":"<p>Author: steve miller<\/p>\n<div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/4791262478?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/4791262478?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<\/p>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p><em>Summary: This blog showcases the handling of daily data of cases\/deaths from Covid-19 in the U.S. published by the<span>&nbsp;<\/span><a href=\"https:\/\/github.com\/CSSEGISandData\/COVID-19\/tree\/master\/csse_covid_19_data\/csse_covid_19_time_series\">Center for Systems Science and Engineering<\/a><span>&nbsp;<\/span>at Johns Hopkins University. The technology deployed to manage and explore the data is R along with its splendid data.table package. Analysts with several months of R experience should benefit from the notebook below.<\/em><\/p>\n<p>It&#8217;s pretty hard to consume any analytics&#8217; media these days without seeing explorations of Covid-19 data. I was late to Covid EDA, but am now all in, hoping I can make even a small contribution to the pandemic response. 
A good starting point for Covid data is the<span>&nbsp;<\/span><a href=\"https:\/\/coronavirus.jhu.edu\/map.html\">Center for Systems Science and Engineering<\/a><span>&nbsp;<\/span>at Johns Hopkins University, my alma mater. The CSSE maintains a Covid-19 dashboard and posts<span>&nbsp;<\/span><a href=\"https:\/\/github.com\/CSSEGISandData\/COVID-19\/tree\/master\/csse_covid_19_data\/csse_covid_19_time_series\">confirmed case and fatality files daily for the U.S. and the world<\/a>.<\/p>\n<p>I started looking at that data about a week ago using R, planning later to examine the same data with Python and Julia. The downloadable case and death files resemble spreadsheets: wide tables with an ever-expanding repeating group of date columns holding the cumulative case\/death counts. The granularity of the data is county or other jurisdiction within state, so ultimately a normalized relational structure would key on the combination of state, jurisdiction, and date. A problem with the data, noted on the website, is that &#8220;The time series tables are subject to be updated if inaccuracies are identified in our historical data. The daily reports will not be adjusted in these instances to maintain a record of raw data.&#8221; In other words, there are some anomalies in the data that must be accounted for. I try to work around them as best I can with summarization and moving averages.<\/p>\n<p>Any data management work I do in R is built on the nonpareil<span>&nbsp;<\/span><a href=\"https:\/\/cran.r-project.org\/web\/packages\/data.table\/data.table.pdf\">data.table package<\/a>, which adds immeasurable functionality to R&#8217;s native data.frame. A newbie serious about learning R for analytics should make an investment in data.table. It&#8217;ll take some time, but the rewards are well worth the effort. 
Python programmers are starting to see the<span>&nbsp;<\/span><a href=\"https:\/\/github.com\/h2oai\/datatable\">Python data.table<\/a><span>&nbsp;<\/span>as a competitor to the venerable Pandas.<\/p>\n<p>This is the first of a two-part series on R with the CSSE case\/fatality data. Part I here details the loading\/shaping\/grouping of the data, while Part II will explore the data using ggplot. My hope is that readers will find some of the code useful in their own work.<\/p>\n<p>The supporting platform is a Wintel 10 notebook with 128 GB RAM, along with software JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, fst, and knitr packages are featured, as well as functions from my personal stash, detailed below.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Set options, load packages, include personal functions, and migrate to the working directory. The functions blanks, meta, mykab, and prheadtail are used extensively.<\/p>\n<p>Read the entire blog&nbsp;<a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/4791290285?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\">here.<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:950124\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: steve miller Summary: This blog showcases the handling of daily data of cases\/deaths from Covid-19 in the U.S. 
published by the&nbsp;Center for Systems Science [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/07\/johns-hopkins-covid-19-data-and-r-part-i-data-table-handling\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":471,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3422"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3422"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3422\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3422"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3422"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3422"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}