{"id":2922,"date":"2019-12-13T06:33:48","date_gmt":"2019-12-13T06:33:48","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/13\/the-rise-of-fake-news-a-machine-learning-challenge\/"},"modified":"2019-12-13T06:33:48","modified_gmt":"2019-12-13T06:33:48","slug":"the-rise-of-fake-news-a-machine-learning-challenge","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/13\/the-rise-of-fake-news-a-machine-learning-challenge\/","title":{"rendered":"The Rise of Fake News. A Machine Learning challenge!"},"content":{"rendered":"<p>Author: Faruqui Ismail<\/p>\n<div>\n<p>By Faruqui Ismail and NookaRaju Garimella<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765079479?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765079479?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 8pt;\">Reporters with various forms of &#8220;fake news&#8221; from an 1894 illustration by <a href=\"https:\/\/en.wikipedia.org\/wiki\/Frederick_Burr_Opper\">Frederick Burr Opper<\/a><\/span><\/p>\n<p>\u00a0<\/p>\n<p>We\u2019ve always pictured the rise of artificial intelligence as the end of civilization, at least from watching movies like \u2018<em>Terminator 2: Judgment Day<\/em>\u2019. We could not have imagined that something as seemingly insignificant as misinformation would lead to the collapse of organisations, the start of wars and even mass suicides.<\/p>\n<p>\u00a0<\/p>\n<p>What we regard as \u201cFake\u201d news spans a broad spectrum. Consider an article published in 2001, which was true at the time. Now suppose that same article is republished without its date, giving it the appearance of describing recent events. 
It would be regarded as \u201cmisinformation\u201d or \u201cFake\u201d.<\/p>\n<p>\u00a0<\/p>\n<p>In summary, we saw a need to distinguish truth from misinformation and created a product to help us do that. We began by building two scraping robots using BeautifulSoup (bs4) and Selenium; these robots extracted data from various sites that Wikipedia lists as fake news sites. We then supplemented this data with GitHub data (refer to acknowledgements).<\/p>\n<p>\u00a0<\/p>\n<p>After cleaning and reworking the data using some Natural Language Processing (NLP) techniques, we proceeded to create features by asking the question: what makes a fake news article different from a non-fake news article? We agreed on the following:<\/p>\n<\/p>\n<ul>\n<li>The % of punctuation marks in an article (when \u2018over-dramatizing\u2019 events, people tend to use more punctuation than usual)<\/li>\n<li>The % of capital letters in an article (once again, this captures over-dramatization, e.g. \u201cDID YOU KNOW\u201d)<\/li>\n<li>Whether the article came from a website known for publishing sensational\/fake stories, as tracked by Wikipedia<\/li>\n<li>Finally, we looked at poor sentence construction: overly long sentences usually indicate that the article was not written by a journalist<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<p>To increase the overall accuracy of the final prediction, these features were then checked to confirm that they were not too strongly correlated and that the sub-contents of some of these features did not overlap, e.g.:<\/p>\n<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765117179?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765117179?profile=RESIZE_710x\" class=\"align-center\" width=\"261\" height=\"200\"><\/a><\/p>\n<p>To avoid over-fitting the model, we applied feature transformation. 
This normalized the features, which helped prevent over-fitting. The visual below is an example of the transformation applied to the % of upper-case letters in an article:<\/p>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765124808?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765124808?profile=RESIZE_710x\" class=\"align-center\" width=\"576\" height=\"152\"><\/a><\/p>\n<p>These small changes increased the final prediction precision by 9.63%.<\/p>\n<p>\u00a0<\/p>\n<p>Once these features were created, we dove into NLP. We removed all stop words, tokenized and stemmed the data, stripped all punctuation from the text, etc.<\/p>\n<p>With prediction times in mind, we preferred Porter stemming over lemmatizing, since NLP generally creates a massive quantity of features.<\/p>\n<p>\u00a0<\/p>\n<p>Again, balancing precision against the run time of the program was a key consideration in choosing a vectorizer. GridSearchCV to the rescue: we ran a TF-IDF vectorizer as well as a count vectorizer over a set of parameters and recorded their fit times and prediction scores:<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765127278?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3765127278?profile=RESIZE_710x\" class=\"align-center\" width=\"559\" height=\"312\"><\/a><\/p>\n<\/p>\n<p>RandomForest was a strong candidate for our prediction, so we used it. We then needed to identify the best possible parameters for the machine learning algorithm. 
A grid was constructed which provided the optimal n_est and depth which would yield the highest precision, accuracy and recall.<\/p>\n<p>\u00a0<\/p>\n<table width=\"604\">\n<tbody>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 50<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 10<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.6921<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.4833<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.4769<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 50<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 30<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8405<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8166<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.7923<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 50<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 90<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8479<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8416<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.8461<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 50<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: None<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8143<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.7916<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span 
style=\"font-size: 8pt;\">Accuracy: 0.8076<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 100<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 10<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.7159<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.6416<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.6153<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 100<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 30<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8352<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.7923<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 100<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 90<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8685<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8583<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.8615<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 100<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: None<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8936<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.9166<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.9076<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span 
style=\"font-size: 8pt;\">Est: 150<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 10<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.7066<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.6<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.5615<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 150<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 30<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8398<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8333<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.8230<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 150<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: 90<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8613<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8583<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.8461<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"54\">\n<p><span style=\"font-size: 8pt;\">Est: 150<\/span><\/p>\n<\/td>\n<td width=\"115\">\n<p><span style=\"font-size: 8pt;\">\u00a0Depth: None<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Precision: 0.8786<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Recall: 0.8833<\/span><\/p>\n<\/td>\n<td width=\"138\">\n<p><span style=\"font-size: 8pt;\">Accuracy: 0.8769<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> This entire project was then packaged into a web framework using Django. The view showed whether the data was e.g. 
unreliable, junk science, fake, true etc. It is in the process of being published on the web.\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>Authors and Creators:<\/p>\n<div class=\"LI-profile-badge\"><a class=\"LI-simple-link\" href=\"https:\/\/za.linkedin.com\/in\/faruqui?trk=profile-badge\">Faruqui Ismail<\/a><\/div>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/nookaraju-garimella-49538110\/\" target=\"_blank\" rel=\"noopener noreferrer\">Nooka Raju Garimella<\/a><\/p>\n<\/p>\n<p><strong>Acknowledgements<\/strong>:<\/p>\n<p><span style=\"font-size: 8pt;\"><em>GitHub data: <a href=\"https:\/\/github.com\/several27\/FakeNewsCorpus\">https:\/\/github.com\/several27\/FakeNewsCorpus<\/a><\/em><\/span><\/p>\n<p><span style=\"font-size: 8pt;\"><em>Wikipedia fake news list: <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_fake_news_websites\">https:\/\/en.wikipedia.org\/wiki\/List_of_fake_news_websites<\/a><\/em><\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:914320\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Faruqui Ismail By Faruqui Ismail and NookaRaju Garimella Reporters with various forms of &#8220;fake news&#8221; from an 1894 illustration by Frederick Burr Opper \u00a0 [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/13\/the-rise-of-fake-news-a-machine-learning-challenge\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":472,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2922"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2922"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2922\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/457"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2922"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2922"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2922"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}