{"id":1818,"date":"2019-03-05T17:00:00","date_gmt":"2019-03-05T17:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/05\/doing-our-part-to-share-open-data-responsibly\/"},"modified":"2019-03-05T17:00:00","modified_gmt":"2019-03-05T17:00:00","slug":"doing-our-part-to-share-open-data-responsibly","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/05\/doing-our-part-to-share-open-data-responsibly\/","title":{"rendered":"Doing our part to share open data responsibly"},"content":{"rendered":"<p>Author: <\/p>\n<div>\n<div class=\"block-paragraph\">\n<div class=\"rich-text\">\n<p>This past weekend marked <a href=\"https:\/\/opendataday.org\/\">Open Data Day<\/a>, an annual celebration of making data freely available to everyone. Communities around the world <a href=\"https:\/\/blog.okfn.org\/2019\/02\/21\/these-are-the-grantees-of-the-open-data-day-2019-mini-grant-scheme\/\">organized events<\/a>, and we\u2019re taking a moment here at Google to share our own perspective on the importance of open data. More accessible data can meaningfully help people and organizations, and we\u2019re doing our part by opening datasets, providing access to APIs and aggregated product data, and developing tools to make data more accessible and useful.<\/p>\n<h2>Responsibly opening datasets<\/h2>\n<\/p>\n<p>Sharing datasets is increasingly important as more people adopt machine learning through open frameworks like TensorFlow. We\u2019ve released over 50 <a href=\"https:\/\/ai.google\/tools\/datasets\/\">open datasets<\/a> for other developers and researchers to use. 
These include <a href=\"https:\/\/research.google.com\/youtube8m\/\">YouTube 8M<\/a>, a corpus of annotated videos used externally for video understanding; the <a href=\"https:\/\/hdrplusdata.org\/\">HDR+ Burst Photography dataset<\/a>, which helps others experiment with the technology that powers <a href=\"https:\/\/ai.googleblog.com\/2017\/10\/portrait-mode-on-pixel-2-and-pixel-2-xl.html\">Pixel features like Portrait Mode<\/a>; and <a href=\"https:\/\/storage.googleapis.com\/openimages\/web\/index.html\">Open Images<\/a>, along with the <a href=\"https:\/\/ai.google\/tools\/datasets\/open-images-extended-crowdsourced\/\">Open Images Extended<\/a> dataset which increases photo diversity.<\/p>\n<p>Just because data is open doesn\u2019t mean it will be useful, however. First, a dataset needs to be cleaned so that any insights developed from it are based on well-structured and accurate examples. Cleaning a large dataset is no small feat; before opening up our own, we spend hundreds of hours standardizing data and validating quality. Second, a dataset should be shared in a machine-readable format that\u2019s easy for others to use, such as JSON rather than PDF. Finally, consider whether the dataset is representative of the intended content. Even if data is usable and representative of some situations, it may not be appropriate for every application. For instance, if a dataset contains mostly North American animal images, it may help you classify a deer, but not a giraffe. Tools like <a href=\"https:\/\/pair-code.github.io\/facets\/\">Facets<\/a> can help you analyze the makeup of a dataset and evaluate the best ways to put it to use. We\u2019re also working to build more <a href=\"https:\/\/ai.googleblog.com\/2018\/12\/adding-diversity-to-images-with-open.html\">representative datasets<\/a> through interfaces like the <a href=\"https:\/\/play.google.com\/store\/apps\/details?id=com.google.android.apps.village.boond\">Crowdsource application<\/a>. 
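Facets does this kind of audit interactively; as a rough, hand-rolled sketch of the same idea, the snippet below tallies how often each label appears in a dataset and flags classes that fall below a chosen share of the examples. The `composition_report` helper, the 5% threshold and the animal labels are all hypothetical, not part of any Google tool:

```python
from collections import Counter

def composition_report(labels, min_share=0.05):
    """Summarize class balance in a list of example labels and
    flag classes that fall below a minimum share of the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for label, n in counts.most_common():
        share = n / total
        report[label] = (n, share, share < min_share)
    return report

# Toy labels standing in for a mostly North American animal dataset.
labels = ["deer"] * 90 + ["moose"] * 8 + ["giraffe"] * 2
for label, (n, share, underrepresented) in composition_report(labels).items():
    flag = "  <-- underrepresented" if underrepresented else ""
    print(f"{label}: {n} examples ({share:.0%}){flag}")
```

A report like this is only a first pass; it says nothing about label quality or correlations, which is where a richer tool such as Facets earns its keep.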
To guide others\u2019 use of your own dataset, consider publishing a data card which denotes authorship, composition and suggested use cases (<a href=\"https:\/\/ai.google\/static\/documents\/datasets\/open-images-extended-crowdsourced.pdf\">here\u2019s<\/a> an example from our Open Images Extended release). <\/p>\n<h2>Making data findable and useful<\/h2>\n<p>It\u2019s not enough to just make good data open, though; it also needs to be findable. Researchers, developers, journalists and other curious data-seekers often struggle to locate data scattered across the web\u2019s thousands of repositories. Our <a href=\"https:\/\/toolbox.google.com\/datasetsearch\">Dataset Search<\/a> tool helps people find data sources wherever they\u2019re hosted, as long as the data is described in a way that search engines can locate. Since the tool launched a few months ago, we\u2019ve seen the number of unique datasets on the platform double to 10 million, including contributions from the U.S. National Oceanic and Atmospheric Administration (NOAA), the National Institutes of Health (NIH), the Federal Reserve, the European Data Portal, the World Bank and government portals from every continent.<\/p>\n<p>What makes data useful is how easily it can be analyzed. Though there\u2019s more open data today, data scientists spend significant time analyzing it across multiple sources. To help solve that problem, we\u2019ve created <a href=\"https:\/\/www.datacommons.org\/\">Data Commons<\/a>. It\u2019s a knowledge graph of data sources that lets users treat various datasets of interest\u2014regardless of source and format\u2014as if they are all in a single local database. Anyone can contribute datasets or build applications powered by the infrastructure. For people using the platform, that means less time engineering data and more time generating insights. We\u2019re already seeing exciting use cases of Data Commons. 
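In practice, being "described in a way that search engines can locate" means publishing schema.org `Dataset` markup on the dataset's landing page. A minimal, hypothetical JSON-LD description, in which every name and URL is a placeholder, looks roughly like this:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "North American Wildlife Images (example)",
  "description": "Annotated photographs of North American animals for image classification research.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": {"@type": "Organization", "name": "Example Research Lab"},
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "application/json",
    "contentUrl": "https://example.org/datasets/wildlife.json"
  }
}
```

Embedding a block like this in a `<script type="application/ld+json">` tag is typically enough for crawlers to index the dataset alongside its license and download format.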
In one UC Berkeley <a href=\"http:\/\/www.ds100.org\/\">data science course<\/a> taught by Josh Hug and Fernando Perez, students used Census, CDC and Bureau of Labor Statistics data to <a href=\"http:\/\/datacommons.org\/colab\">correlate<\/a> obesity levels across U.S. cities with other health and economic factors. Typically, that analysis would take days or weeks; using Data Commons, students were able to build high-fidelity models in less than an hour. We hope to partner with other educators and researchers\u2014if you\u2019re interested, reach out to collaborate@datacommons.org.<\/p>\n<h2>Balancing trade-offs<\/h2>\n<p>There are trade-offs to opening up data, and we aim to balance various sensitivities with the potential benefits of sharing. One consideration is that broad data openness can facilitate uses that don\u2019t align with <a href=\"https:\/\/ai.google\/principles\/\">our AI Principles<\/a>. For instance, we recently made <a href=\"https:\/\/blog.google\/outreach-initiatives\/google-news-initiative\/advancing-research-fake-audio-detection\/\">synthetic speech data<\/a> available only to researchers participating in the 2019 ASVspoof Challenge, to ensure that the data can be used to develop tools to detect deepfakes, while limiting misuse.<\/p>\n<p>Extreme data openness can also risk exposing user or proprietary information, causing privacy breaches or threatening the security of our platforms. We allow third party developers to build on services like Maps, Gmail and more via APIs, so they can build their own products while user data is kept safe. 
We also publish aggregated product data like <a href=\"https:\/\/trends.google.com\/trends\/\">Search Trends<\/a> to share information of public interest in a privacy-preserving way.<\/p>\n<p>While there can be benefits to using sensitive data in controlled and principled ways, like <a href=\"https:\/\/ai.googleblog.com\/2018\/05\/deep-learning-for-electronic-health.html\">predicting medical conditions or events<\/a>, it\u2019s critical that safeguards are in place so that training machine learning models doesn\u2019t compromise individual privacy. Emerging research provides promising new avenues to learn from sensitive data. One is <a href=\"https:\/\/ai.googleblog.com\/2017\/04\/federated-learning-collaborative.html\">Federated Learning<\/a>, a technique for training global ML models without data ever leaving a person\u2019s device, which we\u2019ve recently made available open-source with <a href=\"https:\/\/www.tensorflow.org\/federated\">TensorFlow Federated<\/a>. Another is <a href=\"https:\/\/github.com\/tensorflow\/privacy\">Differential Privacy<\/a>, which can offer strong guarantees that training data details aren\u2019t inappropriately exposed in ML models. Additionally, researchers are experimenting more and more with using small training datasets and zero-shot learning, as we demonstrated in our recent <a href=\"https:\/\/ai.googleblog.com\/2018\/11\/improved-grading-of-prostate-cancer.html\">prostate cancer detection research<\/a> and <a href=\"https:\/\/ai.googleblog.com\/2016\/11\/zero-shot-translation-with-googles.html\">work on Google Translate<\/a>.<\/p>\n<p>We hope that our efforts will help people access and learn from clean, useful, relevant and privacy-preserving open data from Google to solve the problems that matter to them. 
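The differential-privacy guarantee mentioned above can be illustrated with the classic Laplace mechanism, which libraries like TensorFlow Privacy apply to model training in far more sophisticated forms. This is a toy sketch, not any library's actual API; `dp_count`, the ages and the epsilon value are all made up for illustration:

```python
import numpy as np

def dp_count(values, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism: a count query has sensitivity 1 (adding or
    removing one person changes it by at most 1), so noise drawn
    from Laplace(scale=1/epsilon) masks any individual's presence."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Hypothetical example: how many ages in a small dataset exceed 40,
# released under a privacy budget of epsilon = 0.5.
ages = [23, 45, 31, 62, 58, 37, 49, 29]
noisy = dp_count(ages, lambda a: a > 40, epsilon=0.5)
print(f"noisy count: {noisy:.1f}")
```

Smaller epsilon means more noise and stronger privacy; the same accounting idea, applied to gradients rather than counts, is what lets models train on sensitive data without memorizing individuals.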
We also encourage other organizations to consider how they can contribute\u2014whether by opening their own datasets, facilitating usability by cleaning them before release, using schema.org metadata standards to increase findability, enhancing transparency through data cards or considering trade-offs like user privacy and misuse. To everyone who has come together over the past week to celebrate open data: we look forward to seeing what you build.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.blog.google\/technology\/ai\/sharing-open-data\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: This past weekend marked Open Data Day, an annual celebration of making data freely available to everyone. Communities around the world organized events, and [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/05\/doing-our-part-to-share-open-data-responsibly\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":465,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1818"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1818"}],"version-history":[{"count":0,"href":"https:\/\/www.
aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1818\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/465"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1818"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1818"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1818"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}