{"id":5206,"date":"2021-11-15T06:33:53","date_gmt":"2021-11-15T06:33:53","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/11\/15\/common-mistakes-when-outsourcing-private-data-and-how-to-avoid-them\/"},"modified":"2021-11-15T06:33:53","modified_gmt":"2021-11-15T06:33:53","slug":"common-mistakes-when-outsourcing-private-data-and-how-to-avoid-them","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/11\/15\/common-mistakes-when-outsourcing-private-data-and-how-to-avoid-them\/","title":{"rendered":"Common Mistakes When Outsourcing Private Data and How to Avoid Them"},"content":{"rendered":"<p>Author: Yulia Gavrilova<\/p>\n<div>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9802327083?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9802327083?profile=RESIZE_710x\" width=\"720\" class=\"align-full\"><\/a><\/span><\/p>\n<p><span style=\"font-weight: 400;\">The majority of companies choose to outsource their ML solution. It makes sense since AI development demands unique and hard-to-obtain expertise and experience. That is why it&#8217;s better to work with a team that specializes in this kind of development.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, if you want to make a custom machine learning model, you need to provide the outsourcing team with your data. And this is where problems start. How do you pass your sensitive data to external parties without putting the security and privacy of your clients at risk?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At <a href=\"https:\/\/serokell.io\/\" target=\"_blank\" rel=\"noopener\">Serokell<\/a>, we often meet clients concerned about how their data is going to be used. 
I talked to Ivan Markov, Head of the Data Science department, to prepare this guide, which answers the most common questions and helps you feel protected when working with external teams.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Three common data-related scenarios when working with clients<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">First, let\u2019s talk about how machine learning works. In ML, we use algorithms that run on data and learn from it. It&#8217;s easy to understand that data is essential in ML \u2015 without it, you will not obtain the result you want.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When working with clients, ML teams often face one of three unpleasant scenarios:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Data doesn&#8217;t exist.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Data is open-source.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Data is confidential.\u00a0<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The first scenario is the most common one and also the most complicated one. As people say, data is the new gold. You can&#8217;t just find what you need on the internet for a custom model tailored to one specific business. Unfortunately, when we face a situation like this, we have to decline the project.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The open-source scenario is slightly better. The data is already there, and anyone can use it. But let&#8217;s say you decided to google photos of random people. If you&#8217;re training a model just for fun and won&#8217;t tell anyone, any AI ethicist will tell you that&#8217;s morally wrong, but it&#8217;s hard for the authorities to know that you&#8217;re doing it. But what if you want to create a commercial face recognition system? 
These people didn&#8217;t give their consent for you to train your face recognition model on their photos, and you and your company could end up in serious trouble. Even Facebook had to face legal consequences and delete its database of<\/span> <a href=\"https:\/\/arxiv.org\/pdf\/1805.00932.pdf\"><span style=\"font-weight: 400;\">scraped Instagram photos<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So when you take open-source data, it&#8217;s always important to know what kind of license is protecting it. Depending on the license, using openly available data for commercial purposes can be illegal. Of course, if your code isn&#8217;t made open-source, somebody would have to prove that you used this data illegally, and it&#8217;s not that easy to catch you. But still, getting caught will stain your reputation forever. We don&#8217;t recommend that.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Finally, there is a third option. The client comes to you, and they have data. But they ask you to build a model without transmitting this data. That&#8217;s extremely hard, as you can guess, and not a lot of data scientists can or are willing to do it. There can be various reasons for this approach. The client may have sensitive data, may be trying to protect their customers&#8217; privacy, or may have something to hide. We don&#8217;t know. The problem is that it&#8217;s hard to build a model that gives reproducible results without seeing the data. You have to be sure that the data you&#8217;re training the model on is similar or identical to the client&#8217;s real data. 
Otherwise, it won&#8217;t work.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">What is the alternative to these terrible situations?<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">There are several things that can help you deal with each of these scenarios successfully.\u00a0<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Know what personal data is\u00a0<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Educating both sides usually helps. The developer should be transparent about what they are going to do with the data. And the client needs to know how to protect themselves in case something goes wrong. Usually, a well-made contract and an NDA are what you need. In this case, both sides understand that if the client&#8217;s private information leaks onto the internet, there will be lawsuits. In the contract, it is necessary to specify precisely where this personal data is. Quite often, at this stage, both sides discover that private data is not needed at all! The ML team doesn&#8217;t need your customers&#8217; names, gender, or age \u2015 all this can be extracted from transactions in anonymized form!\u00a0<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Learn how to do anonymization well<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">How does anonymization work? Let&#8217;s take retail, for example. It is necessary to anonymize loyalty or credit card numbers. A good solution can be a keyed cryptographic hash function, which represents card numbers as strings of digits and letters; only the client holds the key, so nobody else can link the strings back to the original numbers. The ML team cannot associate these strings with a real person.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are cases when models need actual personal data, for example, in medicine. It is possible to infer a patient&#8217;s sex from an MRI scan, but inferring age is a more complicated task. And for diagnosis, you usually need it. There is a way out: divide people into age groups. 
For example, 18-24, 25-36, and so on: every patient&#8217;s age falls into one of the classes. You don&#8217;t even need to label these groups openly; call them a, b, and c. This is enough for the model to take age information into account. But you still need the patient&#8217;s formal consent (usually, patients sign this form at check-in).\u00a0<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Learn to use remote server access well<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Many companies rely on remote server access. In this case, they give access via SSH, and the developer can only execute commands there, with no Internet access. For the team, this is highly inconvenient. You do not see the screen, but for an ML engineer, it&#8217;s essential to see the data for the speed of development and visualization. But you will probably find people who will agree to that. The problem is that setting up remote access correctly is quite tricky. You need to make sure that the communication only goes one way, and you need to know what you are doing to fine-tune everything. Meanwhile, this is usually not required if you did the anonymization right.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">So, summing up, what are the major mistakes when outsourcing private data?\u00a0<\/span><\/p>\n<ol>\n<li><span style=\"font-weight: 400;\">Badly done anonymization. Double-check that every field matches its expected type.\u00a0<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Messed-up remote access. Traffic control is either expensive or complicated, so don&#8217;t do it if you&#8217;re unsure you can do it right.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Overdone anonymization. In this case, you can&#8217;t learn anything from the data.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Poorly drafted contract. 
Write down what counts as disclosure, what kind of data is not allowed, and what the turnover is. Consult a specialist who will advise you on how to do it right.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">If you use data illegally, you cannot share it. If someone tells on you, then that&#8217;s it for you and your business. In medicine, the law does not even allow you to give it to trustees, even if it&#8217;s not open-source.<\/span><\/li>\n<\/ol>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1076231\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Yulia Gavrilova The majority of companies choose to outsource their ML solution. It makes sense since AI development demands unique and hard-to-obtain expertise and [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/11\/15\/common-mistakes-when-outsourcing-private-data-and-how-to-avoid-them\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":461,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5206"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5206"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5206\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/469"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}