{"id":4096,"date":"2020-11-16T06:36:32","date_gmt":"2020-11-16T06:36:32","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/11\/16\/from-data-collection-to-text-interpretation-an-interview-on-exploring-techniques-and-use-cases-for-text-mining\/"},"modified":"2020-11-16T06:36:32","modified_gmt":"2020-11-16T06:36:32","slug":"from-data-collection-to-text-interpretation-an-interview-on-exploring-techniques-and-use-cases-for-text-mining","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/11\/16\/from-data-collection-to-text-interpretation-an-interview-on-exploring-techniques-and-use-cases-for-text-mining\/","title":{"rendered":"From Data Collection to Text Interpretation. An interview on exploring techniques and use cases for text mining"},"content":{"rendered":"<p>Author: Rosaria Silipo<\/p>\n<div>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8162389289?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8162389289?profile=RESIZE_710x\" class=\"align-full\"><\/a>Meet two text mining experts in today&rsquo;s interview, which explores some of the common issues faced by data scientists in text analytics. <a href=\"https:\/\/www.linkedin.com\/in\/dursun-delen-27890729\/\" target=\"_blank\" rel=\"noopener noreferrer\">Prof. 
Dursun Delen<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/scottfincher\/\" target=\"_blank\" rel=\"noopener noreferrer\">Scott Fincher<\/a> are the teachers of the &ldquo;<\/span><a href=\"https:\/\/www.knime.com\/about\/events\/l4-tp-introduction-to-text-processing-online-nov-16-2020\"><span style=\"font-weight: 400;\">[L4-TP] Introduction to Text Processing<\/span><\/a><span style=\"font-weight: 400;\">&rdquo; course regularly run by<\/span> <a href=\"https:\/\/www.knime.com\/\"><span style=\"font-weight: 400;\">KNIME<\/span><\/a><span style=\"font-weight: 400;\">. This course is based on the<\/span> <a href=\"https:\/\/www.knime.com\/knime-text-processing\"><span style=\"font-weight: 400;\">KNIME Text Processing Extension<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Prof.<\/span> <a href=\"https:\/\/www.linkedin.com\/in\/dursun-delen-27890729\/\"><span style=\"font-weight: 400;\">Dursun Delen<\/span><\/a> <span style=\"font-weight: 400;\">is the Holder of William S. 
Spears Endowed Chair in Business Administration, Patterson Family Chair in Business Analytics, Dir<\/span><span style=\"font-weight: 400;\">ector of Research for the Center for Health Systems Innovation, and also Regents Professor of Management Science and Information Systems in the Spears School of Business at Oklahoma State University (OSU).&nbsp;<\/span><\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/scottfincher\/\"><span style=\"font-weight: 400;\">Scott Fincher<\/span><\/a> <span style=\"font-weight: 400;\">is a data scientist on the Evangelism team at KNIME and one of the biggest contributors to the<\/span> <a href=\"https:\/\/forum.knime.com\/\"><span style=\"font-weight: 400;\">KNIME Forum<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Text Mining is currently experiencing a surge in popularity, mainly due to the development of more advanced chatbots, advances in deep learning architectures applied to free text generation, and the abundance of text data generated every day from web applications, e-commerce, and social media.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this interview, I would like to dig deeper into some common problems that data scientists face when analyzing text documents. 
I will be talking to Dursun and Scott about blending data sources, language specific processing, minimum amounts of data, deployment, and other commonly posed questions.<\/span><\/p>\n<h2><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8162392052?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8162392052?profile=RESIZE_710x\" class=\"align-center\"><\/a><span style=\"font-weight: 400;\">Collecti<\/span><span style=\"font-weight: 400;\">ng the Data<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">I wo<\/span><\/i><i><span style=\"font-weight: 400;\">uld like to start from the beginning of all data science projects. Where can I get data for a text mining project?&nbsp; In particular, how<\/span><\/i> <i><span style=\"font-weight: 400;\">would you proceed if the information is not contained in one location only, but distributed across many websites and blog posts?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">There are many repositories for text data, often organized around specific topics. One great data repository for beginners is probably<\/span> <a href=\"https:\/\/www.kaggle.com\/\"><span style=\"font-weight: 400;\">Kaggle<\/span><\/a><span style=\"font-weight: 400;\">. There are a number of datasets in there that can be used to take the first steps in text mining.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now for the second question. 
Depending on where your da<\/span><span style=\"font-weight: 400;\">ta is located, you have several options for bringing them into a KNIME workflow.&nbsp;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For flat files, you can use the<\/span> <a href=\"https:\/\/kni.me\/n\/8Mn_4LtrzrbzyAti\"><span style=\"font-weight: 400;\">File Reader<\/span><\/a> <span style=\"font-weight: 400;\">node as a start for individual files, and the<\/span> <a href=\"https:\/\/kni.me\/n\/laHYOV8XTfTg9D6n\"><span style=\"font-weight: 400;\">Tika Parser<\/span><\/a> <span style=\"font-weight: 400;\">node for reading large groups of documents &#8211; for example if you have folders full of Word or PDF files. The Tika Parser node, especially, is very flexible and can read a large variety of data files and formats.&nbsp;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If your data is stored in a database, KNIME Analytics Platform has nodes to access most of them via the<\/span> <a href=\"https:\/\/kni.me\/e\/0mIkLShcIfCXS7Y-\"><span style=\"font-weight: 400;\">Database Extension<\/span><\/a><span style=\"font-weight: 400;\">.&nbsp;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can also access data stored on web sites using the<\/span> <a href=\"https:\/\/kni.me\/n\/_8Xu5w4lsg_0hzf3\"><span style=\"font-weight: 400;\">Webpage Retriever<\/span><\/a><span style=\"font-weight: 400;\">, or via<\/span> <i><span style=\"font-weight: 400;\">REST API<\/span><\/i> <span style=\"font-weight: 400;\">with a<\/span> <a href=\"https:\/\/kni.me\/n\/jrLgNd_aIVOrfPsF\"><span style=\"font-weight: 400;\">GET Request<\/span><\/a> <span style=\"font-weight: 400;\">node. There are even nodes for accessing tweets directly from the<\/span> <i><span style=\"font-weight: 400;\">Twitter API<\/span><\/i><span style=\"font-weight: 400;\">. 
One of the strengths of KNIME Analytics Platform is its ability to pull data in and blend it from a wide variety of sources.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <span style=\"font-weight: 400;\">&nbsp;<\/span><i><span style=\"font-weight: 400;\">Do you have examples for grabbing data, for example from pdf and docx files or from web scraping?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">Sure! We have several workflows available to help you get started. As an example, we have a workflow that demonstrates accessing data from both local MS Word files and remote web pages, and how you might combine the two together. It&rsquo;s available on the&nbsp;<\/span> <a href=\"https:\/\/hub.knime.com\/\"><span style=\"font-weight: 400;\">KNIME Hub<\/span><\/a> <span style=\"font-weight: 400;\">as &ldquo;<\/span><a href=\"https:\/\/kni.me\/w\/igO2cb4VhPHvmiQF\"><span style=\"font-weight: 400;\">Will they blend? MS Words meets Web Crawling<\/span><\/a><span style=\"font-weight: 400;\">&rdquo; (<\/span><a href=\"https:\/\/kni.me\/w\/igO2cb4VhPHvmiQF\"><span style=\"font-weight: 400;\">https:\/\/kni.me\/w\/igO2cb4VhPHvmiQF<\/span><\/a><span style=\"font-weight: 400;\">), along with lots of other examples. The<\/span> <a href=\"https:\/\/hub.knime.com\/\"><span style=\"font-weight: 400;\">KNIME Hub<\/span><\/a> <span style=\"font-weight: 400;\">is a place to access workflows, components, and extensions built by both KNIME and the wider user community. You can search on keywords for particular use cases, algorithms, or whatever is of interest to you. 
Maybe the best part is that, if you come up with a really great solution that you think would benefit everyone, you can upload it yourself and share it with the community!<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <span style=\"font-weight: 400;\">&nbsp;<\/span><i><span style=\"font-weight: 400;\">And what about connecting to repositories on the cloud, like S3?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">We also have several nodes available to help you easily access cloud resources, whether you prefer Amazon, Azure, or Google Cloud. Because accessing different sources is such a common need, we recently updated and published a collection of blog posts in the<\/span> <a href=\"https:\/\/www.knime.com\/knimepress\/download-will-they-blend\"><span style=\"font-weight: 400;\">&ldquo;Will they blend?&rdquo;<\/span><\/a> <span style=\"font-weight: 400;\">booklet on our<\/span> <a href=\"https:\/\/www.knime.com\/knimepress\"><span style=\"font-weight: 400;\">KNIME Press<\/span><\/a> <span style=\"font-weight: 400;\">site. 
This book is freely available, and focuses on interesting ways to combine all kinds of datasets from a variety of sources, with plenty of example workflows for you to download and explore.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Text Processing and Dictionaries<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <span style=\"font-weight: 400;\">&nbsp;<\/span><i><span style=\"font-weight: 400;\">Let&rsquo;s move on to the next step in a text mining project: text processing.&nbsp; What does &ldquo;tokenization&rdquo; refer to?&nbsp;<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Tokenization is the process of separating a textual document into its low-level units, such as words, numbers, characters, etc.&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <span style=\"font-weight: 400;\">&nbsp;<\/span><i><span style=\"font-weight: 400;\">What tokenizer should I use then?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">The basic tokenizer algorithms are available in the<\/span> <a href=\"https:\/\/kni.me\/n\/WvmYbuy-7UmcfIpO\"><span style=\"font-weight: 400;\">Strings to Document<\/span><\/a> <span style=\"font-weight: 400;\">node (OpenNLP English Word Tokenizer, OpenNLP WhiteSpace Tokenizer, Stanford NLP Tokenizer, etc.). Some of these simple tokenizers are specific to English; others are language-agnostic and driven by whitespace. More sophisticated tokenization algorithms can be found only for some languages. My recommendation is to use the basic ones if nothing else is available.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Is there a node that can remove punctuation characters?&nbsp;<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">Yes &#8211; it&rsquo;s the<\/span> <a href=\"https:\/\/kni.me\/n\/MHwpGtMX1Fgfz31v\"><span style=\"font-weight: 400;\">Punctuation Erasure<\/span><\/a> <span style=\"font-weight: 400;\">node. 
This is one of the easiest nodes to use in the<\/span> <a href=\"https:\/\/kni.me\/e\/PH_ptBLLdL1Mich2\"><span style=\"font-weight: 400;\">KNIME Textprocessing<\/span><\/a> <span style=\"font-weight: 400;\">extension because there&rsquo;s almost no configuration for it &#8211; you just apply it to your documents and continue with your processing. You may find sometimes that you still have hyphens, for example, in some of your tokenized words even after punctuation removal &#8211; if that&rsquo;s the case, you may need to revisit your tokenizer algorithm selection.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">How do you create a dictionary, like for example a stop word dictionary?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">The<\/span> <a href=\"https:\/\/kni.me\/n\/m_pfQ69Osv0v6D15\"><span style=\"font-weight: 400;\">Stop Word Filter<\/span><\/a> <span style=\"font-weight: 400;\">node comes with its own internal dictionary for several languages, but if you want to provide your own, that&rsquo;s no problem. Just create a table consisting of a single column, with one stop word per row, and feed that into the optional input port of the node. 
You can use this same approach to create custom dictionaries for other nodes that might need them &#8211; for example, the<\/span> <a href=\"https:\/\/kni.me\/n\/ID_gQ3dJyH7lByBn\"><span style=\"font-weight: 400;\">Dictionary Tagger<\/span><\/a> <span style=\"font-weight: 400;\">node.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Use Cases<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">This is a classic question: What are the most common use cases requiring text mining?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Probably the most popular use cases for text mining nowadays are<\/span> <b>Sentiment Analysis<\/b> <span style=\"font-weight: 400;\">and<\/span> <b>Topic Detection<\/b><span style=\"font-weight: 400;\">. Use of text mining to extract knowledge from<\/span> <b>social media<\/b> <span style=\"font-weight: 400;\">(tweets) or from<\/span> <b>customer reviews of products and services<\/b> <span style=\"font-weight: 400;\">is very popular. Also of interest is<\/span> <b>classification of documents<\/b> <span style=\"font-weight: 400;\">&#8211; this is similar to the supervised version of sentiment analysis, but the classes are not just limited to degrees of positive and negative feeling. One area where text mining is showing up in academic circles is<\/span> <b>Literature Mining<\/b><span style=\"font-weight: 400;\">. Here, the idea is to discover latent topics from a collection of papers and then look at their dominance over time and at their relationship to journals and disciplines.&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Would you recommend a few use cases for beginners?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">Sure!<\/span> <b>Sentiment analysis<\/b> <span style=\"font-weight: 400;\">is a good starting point. 
Conceptually it&rsquo;s fairly easy to grasp, but at the same time can be treated with different degrees of complexity: running a<\/span> <a href=\"https:\/\/kni.me\/w\/zp_hhUROHNXToZHX\"><span style=\"font-weight: 400;\">lexicon-based analysis<\/span><\/a><span style=\"font-weight: 400;\">, implementing a classic<\/span> <a href=\"https:\/\/kni.me\/w\/ZHAExldZ5M7q6hdG\"><span style=\"font-weight: 400;\">supervised machine learning approach<\/span><\/a><span style=\"font-weight: 400;\">, or, for more expert users, using a<\/span> <a href=\"https:\/\/kni.me\/w\/NHJpmqsAJ3Ib-thH\"><span style=\"font-weight: 400;\">deep learning network<\/span><\/a><span style=\"font-weight: 400;\">. These three linked examples all analyze IMDB movie reviews and try to classify them as positive or negative.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another use case that could be approached by a beginner is<\/span> <b>topic extraction<\/b><span style=\"font-weight: 400;\">, maybe with the help of the LDA algorithm. Here is an example from the KNIME Hub titled &ldquo;<\/span><a href=\"https:\/\/kni.me\/w\/c3zQ6nHBoAI-ALUP\"><span style=\"font-weight: 400;\">Topic Extraction<\/span><\/a><span style=\"font-weight: 400;\">&rdquo;, using the Parallel LDA node to mine topics from Tripadvisor restaurant reviews.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you&rsquo;re interested in importing data from Twitter and doing some exploratory visualization using a<\/span> <b>word cloud<\/b><span style=\"font-weight: 400;\">, this other simple workflow, titled &ldquo;<\/span><a href=\"https:\/\/kni.me\/w\/eKanRrgP51ARSrT5\"><span style=\"font-weight: 400;\">Interactive Tag Cloud from Twitter Search<\/span><\/a><span style=\"font-weight: 400;\">&rdquo;, can show you how that might be done.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Let&rsquo;s take topic detection, then. 
What is the most common algorithm used to extract topics (unsupervised) from a text?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">This is probably LDA, where LDA stands for Latent Dirichlet Allocation. LDA produces a pre-determined number of latent (or hidden) topics from a collection of documents. The output is two-fold: first, a distribution of topics that defines a document, and second, a distribution of words\/terms that defines a topic. The<\/span> <a href=\"https:\/\/kni.me\/n\/w7Vr1wY8Bu8Gfpv7\"><span style=\"font-weight: 400;\">Topic Extractor (Parallel LDA)<\/span><\/a> <span style=\"font-weight: 400;\">node produces both of these outputs as separate tables.<\/span><\/p>\n<\/p>\n<p><b>[Rosaria]<\/b> <i><span style=\"font-weight: 400;\">Talking about LDA, here&rsquo;s a question that comes up often. How large does the document collection need to be to apply LDA?&nbsp;<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">The short answer is &ldquo;<\/span><i><span style=\"font-weight: 400;\">large enough but not too large<\/span><\/i><span style=\"font-weight: 400;\">&rdquo;. If the computational power is not an issue, and the documents are coming from a representative application domain, then larger is better. As a rule of thumb, in text mining and in topic detection, large quantities of smaller size documents tend to produce more reliable and consistent results.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&nbsp;<\/span><\/p>\n<p><b>[Rosaria]<\/b> <i><span style=\"font-weight: 400;\">Are there alternatives to LDA, in case LDA has limited success?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">LDA has been the latest and greatest of all topic detection methods until recently. 
Nowadays there are newer methods such as Word2Vec and other word embeddings, and deep learning (using RNNs\/LSTMs), that take text mining and topic modeling to a new dimension by including the contextual\/positional information from the sequential nature of language. About 15 years ago, I used simple clustering on raw term-document matrices (TDM) and then repeated it on SVD (Singular Value Decomposition) values for literature mining. It produced pretty good results for that time and the<\/span> <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S0957417407000486\"><span style=\"font-weight: 400;\">study was published in a high impact academic journal<\/span><\/a><span style=\"font-weight: 400;\">.&nbsp;&nbsp;<\/span><\/p>\n<\/p>\n<p><b>[Rosaria]<\/b> <i><span style=\"font-weight: 400;\">Let&rsquo;s move to other use cases. For example, what would be an approach to mining contract language to understand terms and conditions?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">I have not done anything in the domain of contracting. I have heard of interesting applications in patent mining, text mining of law\/case records, literature mining in genomics\/biomedicine, and in COVID-19. Within 3 months, almost 200K published articles on COVID-19 have appeared. You cannot manually process such a large collection of articles. With a few colleagues of mine, we are text mining this large corpus for extraction\/discovery of meta-information.&nbsp;&nbsp;&nbsp;<\/span><\/p>\n<\/p>\n<p><b>[Rosaria]<\/b> <i><span style=\"font-weight: 400;\">Any advice for classification of an email corpus?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Email was the first practical application for text mining. Spam filtering, subject and priority detection, topic-based classification and labeling of email have been the main applications. 
There were also more sophisticated attempts, where text-based deception detection is used to discern truthful emails from fraudulent ones.&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">What is the best approach to interpret the extracted topics? For example, by checking the movie topics, can I extract their genre?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">This is the art part of text mining and topic modeling. The common practice is to look at the most dominant (or highest weighted) words in the word distribution of a topic and label it with some meaningful description. Here, knowledge from domain experts is the key to distilling the high-level concept from a few dominant keywords.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Let&rsquo;s move on to sentiment analysis. What are the classic approaches for sentiment analysis? And what are the advantages\/disadvantages of each one of them?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">The simplest method is one we might call the lexicon-based approach, where we take lists of positive and negative words and use them to tag the words in our documents. Then we can count the tagged words to calculate an index score that would indicate how negative or positive the documents are in a relative sense. The advantage of this approach is that it does not need labeled (or supervised) data ahead of time &#8211; we can simply apply our lists as tags and calculate. The downside is that, because of its simplicity, it doesn&rsquo;t always produce the most accurate results.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If we do have labeled data, we can use a machine learning approach, where we preprocess our documents and transform them into a numerical representation &#8211; a term-document matrix. 
Once we have this, any of the standard Machine Learning (ML) algorithms can be used for classification, like decision trees or XGBoost. This method often performs relatively well, so long as labeled data is available.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Given enough computing resources, we could even take a deep learning approach to classification, by building out a multi-layer neural network. This requires some dedicated processing of the data and fairly large datasets to perform well. In some cases, provided your project meets the criteria, this can perform exceptionally well, but it will definitely take longer to set up and execute than a standard ML approach.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Examples of all three of these methods for text classification are available for you to explore on the<\/span> <a href=\"https:\/\/hub.knime.com\/\"><span style=\"font-weight: 400;\">KNIME Hub<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><a href=\"https:\/\/kni.me\/w\/zp_hhUROHNXToZHX\"><span style=\"font-weight: 400;\">Lexicon Based Approach for Sentiment Analysis<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Machine Learning Approaches:<\/span> <a href=\"https:\/\/kni.me\/w\/ZHAExldZ5M7q6hdG\"><span style=\"font-weight: 400;\">Decision Trees<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Deep Learning Approaches:<\/span> <a href=\"https:\/\/kni.me\/w\/NHJpmqsAJ3Ib-thH\"><span style=\"font-weight: 400;\">multi-layer neural network<\/span><\/a><\/li>\n<\/ol>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">The evergreen question. What to choose: statistical traditional methods or deep learning methods? How can we choose the best method?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <b>&nbsp;<\/b><span style=\"font-weight: 400;\">Tough one! 
You never know until you try &#8211; each dataset and corpus is different. In practice, you often find that the extra data pre-processing and computational expense required for deep learning doesn&rsquo;t justify the marginal improvement in model performance. But if you really do need those few extra percentage points of accuracy and like a challenge, give it a shot!<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Languages and Specific Domains<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">&nbsp;Some frequently asked questions are about applications in specific domains or specific languages. Here is one such question. Most work in text mining is for the English language. If the text is in Spanish, would LDA still work? Or do I have to translate it first?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">I have never used text mining on languages other than English. That said, text mining is mainly syntactic, that is, it does not capture the semantic meaning of the words or sentences. So if the right enrichment and preprocessing nodes are applied to identify the value-adding words and exclude language-specific stop words and punctuation, then LDA should work fine.&nbsp;&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">How can you best apply text mining practices in healthcare\/electronic health records for better patient-care?&nbsp;<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">I have been conducting analytics research in the fields of healthcare\/medicine for over 15 years. My first highly cited paper in this domain was published in 2005 (<\/span><a href=\"https:\/\/pubmed.ncbi.nlm.nih.gov\/15894176\/\"><span style=\"font-weight: 400;\">Predicting breast cancer survivability: a comparison of three data mining methods<\/span><\/a><span style=\"font-weight: 400;\">). 
Although I have used all kinds of healthcare\/EHR data (publicly available and proprietary, the latter obtained from Cerner and covering over 70M unique patient records), I have never used textual data, including doctor or nurse notes. The main reason is that this type of data is very hard to sanitize from private patient information (e.g., required for HIPAA and other governmental regulations). One exception is a new project that I am working on with a medical doctor from NY.&nbsp; We are mining the doctors&rsquo; and nurses&rsquo; clinical notes from emergency room visits to identify patterns to accurately predict the patients who are likely to come back to the ER within 72 hours (similar to readmission, but it is commonly called &ldquo;bounceback&rdquo;).&nbsp;&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">&nbsp;Is there an easy way to do domain-specific named entity recognition?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Ultimately you will need a domain-specific dictionary.&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">How do you address NLP when your dataset contains many languages?<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">You could perhaps do language translation first so that the whole collection of textual content would be in the same language (most likely in English), and then mine it using NLP functions. Or, you can just use it as is, since text mining and most NLP tasks are syntactic in nature. This means that they are agnostic to the semantic nature of the language or meaning. In that case, though, the interpretation of the output information will perhaps be harder because it will include terms and concepts from multiple languages. In our COVID-19 literature mining project, we have been facing this exact problem. Some literature was published in languages other than English. 
We tried both approaches, and finally, we settled on converting and using all of the textual content in English.&nbsp;<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Deployment<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Is there a way in KNIME for quick and safe deployment of the text mining workflows?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">We&rsquo;ve recently released the<\/span> <a href=\"https:\/\/www.knime.com\/integrated-deployment\"><span style=\"font-weight: 400;\">Integrated Deployment<\/span><\/a> <span style=\"font-weight: 400;\">feature, which allows you to automatically create production workflows safely from the prototype workflow. Essentially, Integrated Deployment captures those portions of the prototype workflow needed for deployment, and these captured portions are automatically replicated and sewn back together to create the deployment workflow. In this way, the prototype workflow and the deployment workflow are always in sync, and you don&rsquo;t have to manually copy-and-paste nodes from one workflow to the other. For more information, you can check out our ongoing<\/span> <a href=\"https:\/\/www.knime.com\/integrated-deployment-knime-blog-series\"><span style=\"font-weight: 400;\">Integrated Deployment post series<\/span><\/a><span style=\"font-weight: 400;\">, along with the corresponding examples on the KNIME Hub.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Is it possible to export workflows as REST services or web-applications?&nbsp;<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">Once your deployment workflow is ready, you can run it standalone on<\/span> <a href=\"https:\/\/www.knime.com\/knime-server\"><span style=\"font-weight: 400;\">KNIME Server<\/span><\/a> <span style=\"font-weight: 400;\">either manually, or via scheduling or triggering options. 
You can even call KNIME workflows from other external applications using the KNIME Server REST API.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">How stable is a text mining model over time?&nbsp;<\/span><\/i><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Well, the world changes, so at some point the<\/span> <a href=\"https:\/\/technopreneurph.wordpress.com\/2020\/08\/06\/what-happens-to-ai-after-graduating-from-the-lab-by-rosaria-silipo\/\"><span style=\"font-weight: 400;\">model will not fit anymore<\/span><\/a><span style=\"font-weight: 400;\">. Some models may work fine for a few months or even years, and then they start to falter. Consider analyzing movie-related data and reviews to predict the financial success of a movie project as an investment: the average movie costs $200M to make and the failure rate is high (65% of movies lose money), so model accuracy is important &#8211; hence the models should be built with the latest data and ML techniques.&nbsp;<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">How can I adapt the model then to the new world?&nbsp;<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">With KNIME software you have options for model monitoring, so that, for example, if you see a model dropping below an accuracy threshold you&rsquo;ve set, you could trigger automated retraining to try to improve performance. You can also set up a champion\/challenger paradigm, using different algorithms or hyperparameters, to test a variety of models against your current best performer to see if it can be beaten. Sometimes, no amount of automated retraining will improve performance sufficiently, so you can also set up alerts to let your data scientist know when manual intervention is needed. 
All of this is done the usual KNIME way &#8211; with workflows and components!<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Conclusions<\/span><\/h2>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">A few more questions to conclude. Do you have any advice for beginners who are now starting to learn about text mining?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">To start, try to get a good grasp on the terminology. Even within the wider world of data science, text mining has its own specific idiosyncrasies. Also, spend some time getting as familiar as you can with the common pre-processing steps in a text mining process, since you will need to implement these over and over again. Perhaps begin with a simple sentiment analysis use case before moving on to more complex projects.<\/span><\/p>\n<p><b>[Dursun]<\/b> <span style=\"font-weight: 400;\">Text mining has its own set of terms that may sound like a foreign language to a beginner, so some reading on the foundational concepts and theories is needed. Then, since as the saying goes &ldquo;nothing can replace hands-on experience,&rdquo; one should start investigating the existing models and then build one&rsquo;s own models. KNIME Analytics Platform is a great tool to quickly learn the process, test out workflow concepts, and apply them to your own data.<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Any forecasted updates for the KNIME Textprocessing Extension?<\/span><\/i><\/p>\n<p><b>[Scott]<\/b> <span style=\"font-weight: 400;\">We have a few things in the works, but we&rsquo;ll keep the specifics of those under wraps until they&rsquo;re ready for release &#8211; or at least ready for the next KNIME Labs extension update! 
But in the meantime we are very interested to hear from users of KNIME about what&rsquo;s working for you and what isn&rsquo;t in the Textprocessing extension &#8211; what are your pain points? What features do you find yourself wishing were available? Direct feedback from our users helps us prioritize future development, so come tell us your thoughts in the<\/span> <a href=\"https:\/\/forum.knime.com\/\"><span style=\"font-weight: 400;\">KNIME forum<\/span><\/a><span style=\"font-weight: 400;\">. And who knows &#8211; you may find your username in a set of upcoming patch notes if we incorporate your suggestion!<\/span><\/p>\n<\/p>\n<p><b><i>[Rosaria]<\/i><\/b> <i><span style=\"font-weight: 400;\">Last question. What&#8217;s the process for getting KNIME beginner certification?<\/span><\/i><\/p>\n<p><b>[Rosaria]<\/b> <span style=\"font-weight: 400;\">This is the perfect question to end this panel discussion. The next&nbsp;<\/span><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.knime.com\/certification-program\" target=\"_blank\" rel=\"noopener noreferrer\">certification exam<\/a> for levels L1 and L2 of KNIME Software usage will take place on Nov 18-19. To prepare for the certification exams, you can either take an<\/span> <a href=\"https:\/\/www.knime.com\/learning\/events?tab=online-course\"><span style=\"font-weight: 400;\">instructor-led course<\/span><\/a> <span style=\"font-weight: 400;\">or one of our<\/span> <a href=\"https:\/\/www.knime.com\/knime-self-paced-courses\"><span style=\"font-weight: 400;\">e-learning self-paced courses<\/span><\/a><span style=\"font-weight: 400;\">. The content is the same. The only difference is whether an instructor is looking over your shoulder.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">And with this we conclude this panel discussion. 
Thanks to Dursun and Scott for the time and the answers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">See you at the next interview!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&#8212;&#8212;&#8212;&#8211;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Some of the questions and answers reported in this interview originate from the webinar &ldquo;<a href=\"https:\/\/youtu.be\/6UbbLHpXTws\">Text Mining &#8211; Panel Discussion<\/a>&rdquo; run by <a href=\"https:\/\/www.knime.com\/\">KNIME<\/a> on Sep 14, 2020.<\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1001481\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Rosaria Silipo Meet two text mining experts in today&rsquo;s interview, which explores some of the common issues faced by data scientists in text analytics. [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/11\/16\/from-data-collection-to-text-interpretation-an-interview-on-exploring-techniques-and-use-cases-for-text-mining\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":475,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4096"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4096"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4096\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/467"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4096"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4096"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4096"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}