{"id":1115,"date":"2018-10-03T06:42:56","date_gmt":"2018-10-03T06:42:56","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/03\/lots-of-free-open-source-datasets-to-make-your-ai-better\/"},"modified":"2018-10-03T06:42:56","modified_gmt":"2018-10-03T06:42:56","slug":"lots-of-free-open-source-datasets-to-make-your-ai-better","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/03\/lots-of-free-open-source-datasets-to-make-your-ai-better\/","title":{"rendered":"Lots of Free Open Source Datasets to Make Your AI Better"},"content":{"rendered":"<p>Author: William Vorhies<\/p>\n<div>\n<p><strong><em>Summary:<\/em><\/strong><em>\u00a0 There are several approaches to reducing the cost of training data for AI, one of which is to get it for free.\u00a0 Here are some excellent sources.<\/em><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/nipT839RZyOGwDrhaV*9CDdAYsh7mflxIc2QRj1-Jpr7jWfsWuYeKpEeS710YlzhwGOWxL-L9j1fVQeI2C4vXcvvl32MI0a6\/freestuff2.28.06PM1.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/nipT839RZyOGwDrhaV*9CDdAYsh7mflxIc2QRj1-Jpr7jWfsWuYeKpEeS710YlzhwGOWxL-L9j1fVQeI2C4vXcvvl32MI0a6\/freestuff2.28.06PM1.png?width=250\" width=\"250\" class=\"align-right\"><\/a>Recently we wrote that training data (not just data in general) is the new oil.\u00a0 It\u2019s the difficulty and expense of acquiring labeled training data that causes many deep learning projects to be abandoned.\u00a0<\/p>\n<p>It also matters a great deal just how good you want your new deep learning app to be.\u00a0 The <a href=\"https:\/\/www.deeplearningbook.org\/\"><em><u>2016 textbook<\/u><\/em><\/a> by Goodfellow, Bengio and Courville offers a rough rule of thumb: you can get \u2018acceptable\u2019 performance with about 5,000 labeled examples per category, BUT it takes a dataset of <strong>at least 10 million labeled examples<\/strong> to \u201cmatch or exceed human performance\u201d.\u00a0<\/p>\n<p>There are a number 
of technologies coming up through research now that promise more accurate auto labeling to make creating training data less costly and time consuming.\u00a0 <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/breaking-through-the-cost-barrier-to-deep-learning\"><em><u>Snorkel<\/u><\/em><\/a> from the Stanford DAWN project is one we covered recently.\u00a0 This area is getting a lot of research attention.<\/p>\n<p>Another approach is to build on someone else\u2019s work using publicly available datasets.\u00a0 You can begin by building your model on the borrowed set, you can blend your data with the borrowed data, or you can use the <a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/transfer-learning-deep-learning-for-everyone\"><em><u>transfer learning approach<\/u><\/em><\/a> to repurpose the front end of an existing model to train on your more limited data.<\/p>\n<p>Whatever your strategy, building on publicly available datasets is always something you\u2019ll want to consider, so your ability to find them becomes key.\u00a0<\/p>\n<p>Here are some notes on where you might start your search.\u00a0 These won\u2019t all be labeled image and text data, but a lot of them are.\u00a0 And for those of you looking to use ML and statistical techniques, there\u2019s plenty here for you too.\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Google<\/strong><\/span><\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/nipT839RZyMQQ9UFu6sst78vxThrOW2y2Md56fNppBAHWCqY*aSsPn7ptSTqQzYlyx*AnQDme3l92e0IC9LIuaPttKEN362X\/searchdatasets.jpg\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/nipT839RZyMQQ9UFu6sst78vxThrOW2y2Md56fNppBAHWCqY*aSsPn7ptSTqQzYlyx*AnQDme3l92e0IC9LIuaPttKEN362X\/searchdatasets.jpg?width=250\" width=\"250\" class=\"align-right\"><\/a>Wouldn\u2019t it be delightful to just Google the type and subject of datasets we want?\u00a0 You may already have your favorites, for example 
NOAA and NASA for weather.\u00a0 But until early September, Google search didn\u2019t include metadata search for datasets.\u00a0 However, thanks to Schema.org, the open markup vocabulary that Google helped found, the metadata for datasets is now recognized by Google\u2019s knowledge graph.\u00a0 This is in beta.\u00a0 You can find it here:<\/p>\n<p><a href=\"https:\/\/toolbox.google.com\/datasetsearch\"><em><u>https:\/\/toolbox.google.com\/datasetsearch<\/u><\/em><\/a><\/p>\n<p>Google staff say they\u2019ve already indexed more than a million items that appear to be datasets, but there\u2019s a way to go before the index is clean.\u00a0 There are some refinements already available.\u00a0 Here\u2019s a subsidiary site just for truly public datasets.<\/p>\n<p><u><a href=\"https:\/\/cloud.google.com\/public-datasets\/\"><em>https:\/\/cloud.google.com\/public-datasets\/<\/em><\/a><\/u><\/p>\n<p>This page will also lead you to some special subsets like:<\/p>\n<ul>\n<li><a href=\"https:\/\/cloud.google.com\/bigquery\/public-data\/\"><em><u>Google BigQuery Public Datasets<\/u><\/em><\/a> (the first terabyte download is free but charges apply after that).<\/li>\n<li>Google Genomics Public Datasets<\/li>\n<li>Geo Imagery Datasets<\/li>\n<\/ul>\n<p>There is too much interesting material here to list, but it includes GitHub, Medicare data, public IRS forms data, and about 9 million URLs to labeled images spanning 6,000 categories.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Microsoft<\/strong><\/span><\/p>\n<p>Not to be outdone, Microsoft recently launched a similar site called Microsoft Research Open Data, also in beta.<\/p>\n<p><em><u><a href=\"https:\/\/msropendata.com\/\">https:\/\/msropendata.com\/<\/a><\/u><\/em><\/p>\n<p>MS Research Open Data doesn\u2019t search the entire web, but rather makes available 53 previously proprietary datasets, all in the realm of deep learning, both text\/speech and image.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 
12pt;\"><strong>Academic Torrents<\/strong><\/span><\/p>\n<p>This smaller not-for-profit offers just under 2,000 datasets totaling about 28 terabytes.\u00a0 It is a distributed system for sharing very large datasets covering a very eclectic range of topics.\u00a0 It is searchable, though perhaps not as comprehensively as the Google site.\u00a0 In addition to downloading, you might want to consider uploading your own dataset to this site for others to use.<\/p>\n<p><a href=\"http:\/\/academictorrents.com\/\"><em><u>http:\/\/academictorrents.com\/<\/u><\/em><\/a><\/p>\n<p><em><u>\u00a0<\/u><\/em><\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Skymind<\/strong><\/span><\/p>\n<p>Skymind is a commercial platform to rapidly prototype, deploy, maintain, and retrain machine learning models.\u00a0 They offer 101 datasets from a variety of sources that cover Natural-Image, Geospatial, Facial, Video, Text, Question answering, Sentiment, Recommendation and ranking systems, Networks and Graphs, Speech Datasets, Symbolic Music, Health &#038; Biology, and Government &#038; statistical data sets.<\/p>\n<p><a href=\"https:\/\/skymind.ai\/wiki\/open-datasets\"><em><u>https:\/\/skymind.ai\/wiki\/open-datasets<\/u><\/em><\/a><\/p>\n<p><em><u>\u00a0<\/u><\/em><\/p>\n<p><span style=\"font-size: 12pt;\"><strong>GitHub \/ Kaggle \/ Federal Government Sources<\/strong><\/span><\/p>\n<p>We should never forget our tried and true traditional sources:<\/p>\n<p>\u00a0<\/p>\n<p><strong>GitHub:<\/strong>\u00a0 565 data sets.<\/p>\n<p><em><u><a href=\"https:\/\/github.com\/awesomedata\/awesome-public-datasets\">https:\/\/github.com\/awesomedata\/awesome-public-datasets<\/a><\/u><\/em><\/p>\n<p>\u00a0<\/p>\n<p><strong>Kaggle Public Datasets:<\/strong> 10,992 current listings.<\/p>\n<p><a href=\"https:\/\/www.kaggle.com\/datasets\"><em><u>https:\/\/www.kaggle.com\/datasets<\/u><\/em><\/a><\/p>\n<p>\u00a0<\/p>\n<p><strong>Data.Gov.<\/strong> The home of the US Government\u2019s open data.\u00a0 
Currently 302,944 datasets.<\/p>\n<p><u><a href=\"https:\/\/www.data.gov\/\"><em>https:\/\/www.data.gov\/<\/em><\/a><\/u><\/p>\n<p><em><u>\u00a0<\/u><\/em><\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Figure Eight<\/strong><\/span><\/p>\n<p>This commercial provider of human-in-the-loop data currently offers only eight datasets.\u00a0 The reason for their inclusion here is unique.\u00a0 Figure Eight makes its reputation by providing accurate data, especially by enhancing the accuracy of its clients\u2019 data.\u00a0<\/p>\n<p>What we have not discussed above is the issue of accuracy, and you should rightly be concerned before you accept a public dataset as the foundation for your new AI application.<\/p>\n<p><em><u><a href=\"https:\/\/www.figure-eight.com\/datasets\/\">https:\/\/www.figure-eight.com\/datasets\/<\/a><\/u><\/em><\/p>\n<p>The other interesting aspect of Figure Eight is their promotion of active learning techniques.\u00a0 The phrase \u2018Active Learning\u2019 can be a bit misleading.\u00a0 It\u2019s actually a method for incrementally improving training data quality without committing all the training data for human review, approaching a statistically supportable point of \u2018best accuracy\u2019 balanced against lower cost.\u00a0<\/p>\n<p>Incidentally, Figure Eight makes a convincing case that, especially in the area of NLP where you might be training chatbots, it\u2019s worth the investment to have multiple reviewers for each item, drawn from different demographics, in order to avoid cultural bias in the interpretation.\u00a0 You know, is that a hoagie, sub, hero, po-boy, grinder, torpedo, etc.?\u00a0 If active learning is of interest, there\u2019s a <a href=\"https:\/\/www.datasciencecentral.com\/video\/dsc-webinar-series-the-essentials-of-training-data-for-machine\"><em><u>good DSC webinar here<\/u><\/em><\/a> with Figure Eight\u2019s leading expert in the field.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><a 
href=\"https:\/\/www.datasciencecentral.com\/profiles\/blog\/list?user=0h5qapp2gbuf8\"><em><u>Other articles by Bill Vorhies.<\/u><\/em><\/a><\/p>\n<p>\u00a0<\/p>\n<p>About the author:\u00a0 Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.\u00a0 He can be reached at:<\/p>\n<p><a href=\"mailto:Bill@Data-Magnum.com\">Bill@Data-Magnum.com<\/a> <span>or<\/span> <a href=\"mailto:Bill@DataScienceCentral.com\">Bill@DataScienceCentral.com<\/a><\/p>\n<p><span>\u00a0<\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:764964\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: William Vorhies Summary:\u00a0 There are several approaches to reducing the cost of training data for AI, one of which is to get it for [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/03\/lots-of-free-open-source-datasets-to-make-your-ai-better\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":463,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1115"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1115"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1115\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/472"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1115"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1115"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1115"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}