{"id":2531,"date":"2019-09-04T06:34:23","date_gmt":"2019-09-04T06:34:23","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/04\/nlp-picks-bestsellers-a-lesson-in-using-nlp-for-hidden-feature-extraction\/"},"modified":"2019-09-04T06:34:23","modified_gmt":"2019-09-04T06:34:23","slug":"nlp-picks-bestsellers-a-lesson-in-using-nlp-for-hidden-feature-extraction","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/04\/nlp-picks-bestsellers-a-lesson-in-using-nlp-for-hidden-feature-extraction\/","title":{"rendered":"NLP Picks Bestsellers \u2013 A Lesson in Using NLP for Hidden Feature Extraction"},"content":{"rendered":"<p>Author: William Vorhies<\/p>\n<div>\n<p><strong><em>Summary:<\/em><\/strong><em>\u00a0 99% of our application of NLP has to do with chatbots or translation.\u00a0 This is a very interesting story about expanding the bounds of NLP and feature creation to predict bestselling novels.\u00a0 The authors created over 20,000 NLP features, about 2,700 of which proved to be predictive with a 90% accuracy rate in predicting NYT bestsellers.<\/em><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3515945869?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3515945869?profile=RESIZE_710x\" width=\"300\" class=\"align-right\"><\/a>It\u2019s a pretty rare individual who hasn\u2019t had a personal experience with NLP (Natural Language Processing).\u00a0 About 99% of those experiences are in the form of chatbots or translators, either text or speech in, and text or speech out.<\/p>\n<p>This has proved to be one of the hottest and most economically valuable applications of deep learning but it\u2019s not the whole story.<\/p>\n<p>I recently picked up a copy of a 2016 book entitled <em>\u201cThe Bestseller Code \u2013 Anatomy of the Blockbuster Novel\u201d<\/em> which 
promised a story about using NLP and machine learning to predict which US fiction novels would make the New York Times Best Sellers list and which would not.<\/p>\n<p>There are about 55,000 new works of fiction published each year (and that doesn\u2019t count self-published).\u00a0 Less than \u00bd% or about 200 to 220 make the NYT Bestseller list in a year.\u00a0 Only 3 or 4 of those will sell more than a million copies.<\/p>\n<p>The authors, Jodie Archer (background in publishing), and Matt Jockers (cofounder of the Stanford Literary Lab) write about their model which has an astounding 90% success rate in predicting which books will make the NYT list using a corpus of 5,000 novels from the last 30 years which included 500 NYT Bestsellers.<\/p>\n<p>The book, which I heartily recommend, is not a data science book, nor is it a how-to-write-a-bestseller.\u00a0 And while it has elements of both it\u2019s mostly reporting about the most interesting finds among the 20,000 extracted features they developed, about 2,800 of which proved to be predictive.\u00a0 More on that later.<\/p>\n<p>What struck me was the potential this field of \u2018stylometrics\u2019 has for extracting hidden features for almost any problem which has a large amount of text as one of its data sources.\u00a0 Could be CSR logs of customer interaction, could be doctor\u2019s notes, blogs, or warranty repair descriptions where we\u2019re really only scratching the surface with word clouds and sentiment analysis.<\/p>\n<p>This is a great data science story because it illustrates just how deep you can go in extracting features (20,000) from text that can then be used alone or in conjunction with structured or semi-structured data features to enhance the accuracy of predictive models.<\/p>\n<p>Not all their techniques applied to novels will translate to more pedestrian business problems but it ought to at least spark your imagination.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 
12pt;\"><strong>Stylometrics and Digital Humanities<\/strong><\/span><\/p>\n<p>A little preface.\u00a0 Apparently the academic world is fast embracing NLP through a set of techniques called \u2018stylometrics\u2019, part of the digital humanities movement.\u00a0 The best known applications so far are in author attribution.\u00a0 Who contributed to the Federalist Papers?\u00a0 Was the Book of Mormon written by a single individual?\u00a0 Did William Shakespeare really write all of those plays?\u00a0 The results so far have been quite good if not exactly headline-grabbing.<\/p>\n<p>I gather that the authors\u2019 creation of this Bestsellerometer may be by far the most far-reaching application.<\/p>\n<p>I will not do justice to their findings in these brief notes.\u00a0 This will only give you an inkling of the many fun and counterintuitive findings about what makes a bestseller.\u00a0 Read the book.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Theme \/ Plot \/ Style \/ Character<\/strong><\/span><\/p>\n<p>If you remember your college lit courses you were almost certainly instructed to evaluate works of fiction along these four categories: theme, plot, style, and character.\u00a0 But how these four major variable groupings come together to predict bestsellers is in no way obvious.\u00a0<\/p>\n<p>Don\u2019t readers, for example, care more about the genre?\u00a0 Aren\u2019t romance, mystery, sci-fi, or historical fiction and the trends and memes generated around them more powerful predictors?\u00a0 What about author reputation?\u00a0 Does anyone stand a chance whose name isn\u2019t Grisham, King, James, Larson, or another equally famous perpetual bestseller?\u00a0 And isn\u2019t it all about the marketing budgets the big publishers put behind proven names?<\/p>\n<p>According to Archer and Jockers, none of this is true.\u00a0 The ability of a first-time author to break out into instant bestseller status is based on factors they discover in the text itself, not in genre, 
reputation, or marketing budget.\u00a0 \u201c<em>The Girl Who Kicked the Hornet\u2019s Nest<\/em>\u201d, \u201c<em>Fifty Shades of Grey<\/em>\u201d, \u201c<em>Gone Girl<\/em>\u201d, and \u201c<em>The Da Vinci Code<\/em>\u201d are only a few examples of unexpected breakouts.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Theme<\/strong><\/span><\/p>\n<p>Theme or topic as the authors define it has two aspects that can be examined with NLP.\u00a0 Using the topic modeling technique Latent Dirichlet Allocation (LDA), topics were distilled down to about 500 categories.<\/p>\n<p>It\u2019s important to note that there are likely to be many topics per story.\u00a0 For example, although John Grisham is noted for his courtroom-centered novels, not every page is about a courtroom.\u00a0 Topics could range from sex, drugs, and rock and roll, to the more likely marriage, intimate conversations, family life, and so forth.<\/p>\n<p>Archer and Jockers discovered several interesting things here.\u00a0 First, if you\u2019re going to write a bestseller you obviously have to appeal to a very wide audience.\u00a0 This means topics with very wide appeal like family life, human connection, or our relationship to technology.<\/p>\n<p>Second, bestsellers allocate about 30% of their text to just one or two topics and 40% to more than three topics.\u00a0 Books that don\u2019t make the list spread themselves out over six or more topics.<\/p>\n<p>Topics that sell best are emotional and ethical topics.\u00a0 Inflammatory topics are kept to a minimum.\u00a0 The best predictors of success are: \u201chuman closeness and human connection: people communicating in moments of shared intimacy, shared chemistry, shared bonds\u201d.<\/p>\n<p>Of course every story needs a conflict to be resolved so topics need to support that.\u00a0 For example: \u201cchildren and guns, faith and sex, love and vampires\u201d.<\/p>\n<p>You thought \u201c50 Shades of Grey\u201d was about kinky sex.\u00a0 Think again.\u00a0 
NLP shows it hews to the rules for bestsellers by incorporating 21% human closeness, 13% intimate conversation (those two topics make more than 30%), 13% sex, seduction, and the female body (so the third topic takes us past 40%).<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Plot<\/strong><\/span><\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3515949235?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3515949235?profile=RESIZE_710x\" width=\"250\" class=\"align-right\"><\/a>Using time series analysis in conjunction with NLP it\u2019s possible to tell when the mood of the story turns from positive to negative and back again.\u00a0 In other words, where are the plot\u2019s highs and lows?<\/p>\n<p>Archer and Jockers find seven major plotline trajectories and to be a bestseller you need to follow one of these.\u00a0 Also, you had better get to the first reversal early in the book to hook your readers.\u00a0 Here\u2019s the plotline for \u201c<em>50 Shades of Grey<\/em>\u201d as determined by NLP (the light grey line) overlaid with the authors\u2019 interpretation.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Style<\/strong><\/span><\/p>\n<p>Great writing style for most of us is like great art: we know it when we see it.\u00a0 It turns out, however, that writing style as interpreted by NLP is a powerful predictor.\u00a0 The authors found that just 148 features, based only on the most common filler words and punctuation marks (no nouns, adjectives, syntax, or sentence data included), could predict bestsellers 68% of the time.<\/p>\n<p>This points to another interesting difference with common approaches to NLP.\u00a0 While in typical NLP we go to great lengths to remove or ignore punctuation, filler words and the like, in stylometrics everything counts and gets counted.<\/p>\n<p>Not only do we need to count every 
comma, colon, and exclamation mark, but every use of filler words like \u2018the\u2019, \u2018of\u2019, \u2018a\u2019, \u2018and\u2019, \u2018but\u2019, syntax, sentence length, parts of speech, and counts for common verbs, nouns, adjectives, and adverbs.\u00a0 Noun to adjective ratio among others is important.<\/p>\n<p>This technique alone is at the heart of author attribution.\u00a0 Apparently, like fingerprints, our writing styles based on word use and punctuation are quite unique.<\/p>\n<p>As for bestsellers, the authors observe that the word \u201cdo\u201d is \u201ctwice as likely to appear in a bestseller than in a book that never hit the list.\u00a0 The word \u2018very\u2019 is only about half as common in bestsellers as in books that don\u2019t make it.\u201d<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Character<\/strong><\/span><\/p>\n<p>You would think that defining character with NLP would be the most elusive challenge but the authors found predictive variables in looking at the way characters behave, based on their action verbs.<\/p>\n<p>Taking direct action is much more powerful than thinking, or considering, or pondering.<\/p>\n<p>\u201cRegardless of whether the character is male or female, bestselling protagonists have and express their needs.\u00a0 They want things and express it.\u00a0 They know control and express agency. Verbs are clean and self-assured.\u201d\u00a0 Verbs like grab, do, think, ask, look, hold, tells, likes, sees, hears, smiles, reaches define bestselling characters. 
\u00a0Characters who demand, who seem, who wait, who interrupt, do not.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>About the Data Science<\/strong><\/span><\/p>\n<p>Archer and Jockers started this project in 2008 with their first model and a more limited training corpus, achieving 70% to 80% accuracy.\u00a0 That\u2019s pretty amazing in itself considering that Hadoop and the NoSQL databases were fresh out of the box at the time.<\/p>\n<p>The current book is based on work that occurred mostly in 2015 which many of you will recognize as a period when our ML techniques were still improving fast.<\/p>\n<p>Jockers, the data scientist, describes using KNN (K Nearest Neighbor), NSC (Nearest Shrunken Centroids), and SVMs (Support Vector Machines) for their ML classification model.\u00a0 Best results by a wide margin were had with KNN.<\/p>\n<p>I was rather hoping the results might have been revisited with some more modern or powerful techniques.\u00a0 GBM, XGBoost, or genetic programs come to mind.\u00a0 However, their 90% accuracy rate is certainly good.<\/p>\n<p>And if you were wondering, yes there was one and only one book out of 5,000 in their corpus that received a perfect 100 score predicting it would be a bestseller, \u201c<em>The Circle<\/em>\u201d by Dave Eggers.\u00a0 It\u2019s on my to do list.\u00a0 Haven\u2019t read it yet.\u00a0<\/p>\n<p>Meanwhile, I hope this story about approaches to creating features from text that go well beyond our typical NLP has spurred your imagination for your next text-heavy project.\u00a0 If you want to pursue stylometrics, Matt Jockers has a book out, \u201c<em>Text Analysis with R for Students of Literature<\/em>\u201d.<\/p>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blog\/list?user=0h5qapp2gbuf8\"><em><u>Other articles by Bill Vorhies<\/u><\/em><\/a><\/p>\n<p>\u00a0<\/p>\n<p>About the author:\u00a0 Bill is Contributing Editor for Data Science 
Central.\u00a0 Bill is also President &#038; Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001.\u00a0 His articles have been read more than 2 million times.<\/p>\n<p>He can be reached at:<\/p>\n<p><a href=\"mailto:Bill@DataScienceCentral.com\">Bill@DataScienceCentral.com<\/a> <span>or<\/span> <a href=\"mailto:Bill@Data-Magnum.com\">Bill@Data-Magnum.com<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:882054\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: William Vorhies Summary:\u00a0 99% of our application of NLP has to do with chatbots or translation.\u00a0 This is a very interesting story about expanding [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/04\/nlp-picks-bestsellers-a-lesson-in-using-nlp-for-hidden-feature-extraction\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":468,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2531"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2531"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/po
sts\/2531\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/471"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2531"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2531"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2531"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}