{"id":2954,"date":"2019-12-20T17:55:01","date_gmt":"2019-12-20T17:55:01","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/20\/finding-a-good-read-among-billions-of-choices\/"},"modified":"2019-12-20T17:55:01","modified_gmt":"2019-12-20T17:55:01","slug":"finding-a-good-read-among-billions-of-choices","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/20\/finding-a-good-read-among-billions-of-choices\/","title":{"rendered":"Finding a good read among billions of choices"},"content":{"rendered":"<p>Author: Kim Martineau | MIT Quest for Intelligence<\/p>\n<div>\n<p>With billions of books, news stories, and documents online, there\u2019s never been a better time to be reading \u2014 if you have time to sift through all the options. \u201cThere\u2019s a ton of text on the internet,\u201d says\u00a0<a href=\"http:\/\/people.csail.mit.edu\/jsolomon\/\">Justin Solomon<\/a>, an assistant professor at MIT. \u201cAnything to help cut through all that material is extremely useful.\u201d<\/p>\n<p>With the\u00a0<a href=\"https:\/\/mitibmwatsonailab.mit.edu\/\">MIT-IBM Watson AI Lab<\/a>\u00a0and his\u00a0<a href=\"http:\/\/groups.csail.mit.edu\/gdpgroup\/\">Geometric Data Processing Group<\/a>\u00a0at MIT, Solomon recently presented a new technique for cutting through massive amounts of text at the\u00a0<a href=\"https:\/\/nips.cc\/\">Conference on Neural Information Processing Systems<\/a> (NeurIPS). Their method combines three popular text-analysis tools \u2014 topic modeling, word embeddings, and optimal transport \u2014 to deliver better, faster results than competing methods on a popular benchmark for classifying documents.<\/p>\n<p>If an algorithm knows what you liked in the past, it can scan the millions of possibilities for something similar. 
As natural language processing techniques improve, those \u201cyou might also like\u201d suggestions are getting speedier and more relevant.\u00a0<\/p>\n<p>In the method presented at NeurIPS, an algorithm summarizes a collection of, say, books, into topics based on commonly used words in the collection. It then divides each book into its five to 15 most important topics, with an estimate of how much each topic contributes to the book overall.\u00a0<\/p>\n<p>To compare books, the researchers use two other tools: word embeddings, a technique that turns words into lists of numbers to reflect their similarity in popular usage, and optimal transport, a framework for calculating the most efficient way of moving objects \u2014 or data points \u2014 among multiple destinations.\u00a0<\/p>\n<p>Word embeddings make it possible to leverage optimal transport twice: first to compare topics within the collection as a whole, and then, within any pair of books, to measure how closely common themes overlap.\u00a0<\/p>\n<p>The technique works especially well when scanning large collections of books and lengthy documents. In the study, the researchers offer the example of Frank Stockton\u2019s \u201cThe Great War Syndicate,\u201d a 19th-century American novel that anticipated the rise of nuclear weapons. If you\u2019re looking for a similar book, a topic model would help identify the dominant themes shared with other books \u2014 in this case, nautical, elemental, and martial.\u00a0<\/p>\n<p>But a topic model alone wouldn\u2019t identify Thomas Huxley\u2019s 1863 lecture,\u00a0\u201c<a href=\"http:\/\/www.gutenberg.org\/files\/2922\/2922-h\/2922-h.htm#linkimage-0002\">The Past Condition of Organic Nature<\/a>,\u201d as a good match. The writer was a champion of Charles Darwin\u2019s theory of evolution, and his lecture, peppered with mentions of fossils and sedimentation, reflected emerging ideas about geology. 
When the themes in Huxley\u2019s lecture are matched with Stockton\u2019s novel via optimal transport, some cross-cutting motifs emerge: Huxley\u2019s geography, flora\/fauna, and knowledge themes map closely to Stockton\u2019s nautical, elemental, and martial themes, respectively.<\/p>\n<p>Modeling books by their representative topics, rather than individual words, makes high-level comparisons possible. \u201cIf you ask someone to compare two books, they break each one into easy-to-understand concepts, and then compare the concepts,\u201d says the study\u2019s lead author\u00a0<a href=\"https:\/\/moonfolk.github.io\/\">Mikhail Yurochkin<\/a>, a researcher at IBM.\u00a0<\/p>\n<p>The result is faster, more accurate comparisons, the study shows. The researchers compared 1,720 pairs of books in the Gutenberg Project dataset in one second \u2014 more than 800 times faster than the next-best method.<\/p>\n<p>The technique also does a better job of accurately sorting documents than rival methods \u2014 for example, grouping books in the Gutenberg dataset by author, product reviews on Amazon by department, and BBC sports stories by sport. In a series of visualizations, the authors show that their method neatly clusters documents by type.<\/p>\n<p>In addition to categorizing documents more quickly and accurately, the method offers a window into the model\u2019s decision-making process. 
Through the list of topics that appear, users can see why the model is recommending a document.<\/p>\n<p>The study\u2019s other authors are\u00a0<a href=\"https:\/\/www.csail.mit.edu\/person\/sebastian-claici\">Sebastian Claici<\/a>\u00a0and\u00a0<a href=\"http:\/\/people.csail.mit.edu\/eddchien\/\">Edward Chien<\/a>, a graduate student and a postdoc, respectively, at MIT\u2019s Department of Electrical Engineering and Computer Science and Computer Science and Artificial Intelligence Laboratory, and\u00a0<a href=\"https:\/\/researcher.watson.ibm.com\/researcher\/view.php?person=ibm-Farzaneh\">Farzaneh Mirzazadeh<\/a>, a researcher at IBM.<\/p>\n<\/div>\n<p><a href=\"http:\/\/news.mit.edu\/2019\/finding-good-read-among-billions-of-choices-1220\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Kim Martineau | MIT Quest for Intelligence With billions of books, news stories, and documents online, there\u2019s never been a better time to be [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/20\/finding-a-good-read-among-billions-of-choices\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":459,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2954"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2954"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2954\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/460"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2954"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2954"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2954"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}