{"id":4544,"date":"2021-04-05T06:36:54","date_gmt":"2021-04-05T06:36:54","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/04\/05\/what-is-good-data-and-where-do-you-find-it\/"},"modified":"2021-04-05T06:36:54","modified_gmt":"2021-04-05T06:36:54","slug":"what-is-good-data-and-where-do-you-find-it","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/04\/05\/what-is-good-data-and-where-do-you-find-it\/","title":{"rendered":"What is Good Data and Where Do You Find It?"},"content":{"rendered":"<p>Author: Stephanie Glen<\/p>\n<div>\n<ul>\n<li>\n<a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8741078471?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8741078471?profile=RESIZE_710x\" class=\"align-full\"><\/a>Bad data is worse than no data at all.<\/li>\n<li>What is \u201cgood\u201d data and where do you find it?<\/li>\n<li>Best practices for data analysis.<\/li>\n<\/ul>\n<p>There\u2019s no such thing as perfect data, but there are several factors that qualify data as good [1]:<\/p>\n<ul>\n<li>It\u2019s readable and well-documented.<\/li>\n<li>It\u2019s readily available. For example, it\u2019s accessible through a trusted digital repository.<\/li>\n<li>The data is tidy and re-usable by others with a focus on ease of (re-)executability and reliance on deterministically obtained results [2].<\/li>\n<\/ul>\n<p>Following a few best practices will ensure that any data you collect and analyze will be as good as it gets.<\/p>\n<p><strong>1. Collect Data Carefully<\/strong><\/p>\n<p>Good data sets will come with flaws, and these flaws should be readily apparent. For example, an honest data set will have any errors or limitations clearly noted. However, it\u2019s really up to you, the analyst, to make an informed decision about the quality of the data once you have it in hand. 
Use the same due diligence you would in making a major purchase: once you\u2019ve found your \u201cperfect\u201d data set, perform more web searches with the goal of uncovering any flaws.<\/p>\n<p>Some key questions to consider [3]:<\/p>\n<ul>\n<li>Where did the numbers come from? What do they mean?<\/li>\n<li>How was the data collected?<\/li>\n<li>Is the data current?<\/li>\n<li>How accurate is the data?<\/li>\n<\/ul>\n<p>Three great sources to collect data from:<\/p>\n<p><strong>U.S. Census Bureau<\/strong><\/p>\n<p>U.S. Census Bureau data is available to anyone for free. To download a CSV file:<\/p>\n<ul>\n<li>Go to data.census.gov [4].<\/li>\n<li>Search for the topic you\u2019re interested in.<\/li>\n<li>Select the \u201cDownload\u201d button.<\/li>\n<\/ul>\n<p>The wide range of good data held by the Census Bureau is staggering. For example, I typed \u201cInstitutional\u201d to bring up the population in institutional facilities by sex and age, while data scientist Emily Kubiceka used U.S. Census Bureau data to compare hearing and deaf Americans [5].<\/p>\n<p><strong>Data.gov<\/strong><\/p>\n<p>Data.gov [6] contains data from many different U.S. government agencies on topics including climate, food safety, and government budgets. There&#8217;s a staggering amount of information to be gleaned. As an example, I found <span>40,261 datasets for &#8220;covid-19&#8221;, including:<\/span><\/p>\n<ul>\n<li>Louisville Metro Government estimated expenditures related to COVID-19.<\/li>\n<li>State of Connecticut statistics for Connecticut correctional facilities.<\/li>\n<li>Locations offering COVID-19 testing in Chicago.<\/li>\n<\/ul>\n<p><strong>Kaggle<\/strong><\/p>\n<p>Kaggle [7] is a huge repository for public and private data. 
It\u2019s where you\u2019ll find data from the University of California, Irvine\u2019s Machine Learning Repository, data on the Zika virus outbreak, and even data on people attempting to buy firearms. Unlike with the government websites listed above, you&#8217;ll need to check the license information for re-use of a particular dataset. Plus, not all data sets are wholly reliable: check your sources carefully before use.<\/p>\n<p><strong>2. Analyze with Care<\/strong><\/p>\n<p>So, you\u2019ve found the ideal data set, and you\u2019ve checked it to make sure it\u2019s not riddled with flaws. Your analysis is going to be passed along to many people, most (or all) of whom aren\u2019t mind readers. They may not know what steps you took in analyzing your data, so make sure your steps are clear with the following best practices [3]:<\/p>\n<ul>\n<li>\n<strong>Don\u2019t<\/strong> use X, Y or Z for variable names or units. <strong>Do<\/strong> use descriptive names like \u201c2020 prison population\u201d or \u201cNumber of ice creams sold.\u201d<\/li>\n<li>\n<strong>Don\u2019t<\/strong> guess which models fit. <strong>Do<\/strong> perform exploratory data analysis, check residuals, and validate your results with out-of-sample testing when possible.<\/li>\n<li>\n<strong>Don\u2019t<\/strong> create visual puzzles. <strong>Do<\/strong> create well-scaled graphs with appropriate titles and labels. Other tips [8]: use readable fonts, keep legends small and neat, and avoid overlapping text.<\/li>\n<li>\n<strong>Don\u2019t<\/strong> assume that regression is a magic tool. <strong>Do<\/strong> test for linearity and normality, transforming variables if necessary.<\/li>\n<li>\n<strong>Don\u2019t<\/strong> pass on a model unless you know exactly what it means. <strong>Do<\/strong> be prepared to explain the logic behind the model, including any assumptions made.<\/li>\n<li>\n<strong>Don\u2019t<\/strong> leave out uncertainty. 
<strong>Do<\/strong> report your standard errors and confidence intervals.<\/li>\n<li>\n<strong>Don\u2019t<\/strong> delete your modeling scratch paper. <strong>Do<\/strong> leave a paper trail, like annotated files, for others to follow. Your successor (when you\u2019ve moved along to better pastures) will thank you.<\/li>\n<\/ul>\n<p><strong>3. Don\u2019t be the weak link in the chain<\/strong><\/p>\n<p>Bad data doesn\u2019t appear from nowhere. That data set you started with was created by someone, possibly several people, in several different stages. If they too have followed these best practices, then the result will be a helpful piece of data analysis. But if you introduce errors and fail to account for them, those errors will be compounded as the data gets passed along.<\/p>\n<p><strong>References<\/strong><\/p>\n<p>Data set image: Pro8055, CC BY-SA 4.0 via Wikimedia Commons<\/p>\n<p>[1] <a href=\"https:\/\/researchdata.wisc.edu\/events\/good-data-examples-love-your-data-week-2017\/\" target=\"_blank\" rel=\"noopener\">Message of the day<\/a><\/p>\n<p>[2] <a href=\"https:\/\/royalsocietypublishing.org\/doi\/pdf\/10.1098\/rsta.2020.0069\" target=\"_blank\" rel=\"noopener\">Learning from reproducing computational results: introducing three principles and the Reproduction Package<\/a><\/p>\n<p>[3] <a href=\"https:\/\/people.duke.edu\/~rnau\/notroubl.htm\" target=\"_self\" rel=\"noopener\">How to avoid trouble: principles of good data analysis<\/a><\/p>\n<p>[4] <a href=\"https:\/\/data.census.gov\/cedsci\/\" target=\"_blank\" rel=\"noopener\">United States Census Bureau<\/a><\/p>\n<p>[5] <a href=\"https:\/\/www.sciencenewsforstudents.org\/article\/better-data-forecasts-computer-science-census-weather\">Better data lead to better forecasts<\/a><\/p>\n<p>[6] <a href=\"http:\/\/data.gov\/\" target=\"_blank\" rel=\"noopener\">Data.gov<\/a><\/p>\n<p>[7] <a href=\"http:\/\/kaggle.com\/\" target=\"_blank\" 
rel=\"noopener\">Kaggle<\/a><\/p>\n<p>[8]<a href=\"https:\/\/robjhyndman.com\/hyndsight\/graphics\/\" target=\"_self\" rel=\"noopener\">Twenty rules for good graphics<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1046025\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Stephanie Glen Bad data is worse than no data at all. What is \u201cgood\u201d data and where do you find it? Best practices for [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/04\/05\/what-is-good-data-and-where-do-you-find-it\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":459,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4544"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4544"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4544\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/468"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4544"}],"wp:term":[{"taxonomy":"category",
"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}