{"id":4952,"date":"2021-08-25T06:34:54","date_gmt":"2021-08-25T06:34:54","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/08\/25\/solving-the-parsing-dilemma\/"},"modified":"2021-08-25T06:34:54","modified_gmt":"2021-08-25T06:34:54","slug":"solving-the-parsing-dilemma","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/08\/25\/solving-the-parsing-dilemma\/","title":{"rendered":"Solving the Parsing Dilemma"},"content":{"rendered":"<p>Author: Julius Cerniauskas<\/p>\n<div>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9407300264?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9407300264?profile=RESIZE_710x\" width=\"720\" class=\"align-full\"><\/a><\/span><\/p>\n<\/p>\n<p><span style=\"font-weight: 400;\">There\u2019s a much maligned topic in web scraping &#8211; data parsing. Building scrapers would be a lot easier if the data presented through HTML wasn\u2019t intended for browsers. However, that is the case, which means that the data extraction process has to go through several hoops before delivering results.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Parsing is part of the process. Unfortunately, it\u2019s one of the most resource-intensive parts of the entire web scraping chain. In fact, developing a parser for a specific website is not enough. Maintaining it over time is required. Even then, that might not be the end as some complex websites might need numerous parsers to work the data out of the source.<\/span><\/p>\n<\/p>\n<h2><span style=\"font-weight: 400;\">The dilemma<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Any sufficiently large scraping project has to develop their own parsers. That means dedicated time and resources to a, comparatively, low-skill task. 
Most of the time, developing and maintaining parsers falls to junior developers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Junior developers, however, are a highly valuable resource. Time spent writing and maintaining parsers barely improves their skills; in fact, it might even breed a certain level of annoyance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">On the other hand, parsing is a critical part of the scraping process. Most of the time, the data acquired is messy and unusable without intervention. Since the end goal of all web scraping, whether for personal or commercial use, is to provide data for analysis, parsing is a necessity.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In short, we have a necessary process that takes up a significant portion of resources and time while being neither particularly challenging nor instructive for the individual. In other words, it\u2019s a resource sink. Solving this challenge would free up a lot of highly skilled hands and brains for greater work.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">A look towards automation<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">If you were to approach any sensible CXO, or businessperson in general, with an idea that would save developers significant time, they would accept the suggestion with open arms. There\u2019s rarely anything better than saving resources through automation.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, automating parsing isn\u2019t as simple as it may seem, partly because of the frequent maintenance required. The need usually arises because websites change their layouts; when they do, the parser breaks.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Yet predicting future layout and code changes is simply impossible, so no rule-based approach is truly viable. Classical programming is of little help here. 
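To make the brittleness concrete, here is a minimal sketch of a rule-based extractor; the HTML snippets and the "price" class are hypothetical, not taken from any real site. A parser keyed to one class name works until a routine restyling renames it, and then it silently returns nothing:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Rule-based extractor: capture the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

def extract_price(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.price

# Yesterday's layout: the rule matches.
print(extract_price('<div><span class="price">$19.99</span></div>'))          # $19.99

# After a "minor UI refresh" renames the class, the same rule finds nothing.
print(extract_price('<div><span class="product-price">$19.99</span></div>'))  # None
```

In production the failure is worse than a None: the scraper keeps running and quietly ships empty fields, which is why every layout tweak turns into a maintenance ticket.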
And manual work, as mentioned previously, is a huge time and resource sink.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">One option remains, and it has built up a lot of hype over the past decade or so: machine learning. Parsing seems to be the perfect way to test the mettle of machine learning engineers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since HTML pages within a given category share a similar structure, the visual changes are decidedly small. Additionally, layout changes aren\u2019t usually massive overhauls of an entire website; they\u2019re mostly incremental UX and UI improvements. While that may add to a developer\u2019s annoyance, it makes parsing a great candidate for a stochastic algorithm looking for similarities between training data and new data.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Preparing for adaptive parsing<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Before engaging in any machine learning project, at least these questions should be answered beforehand:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What will be the limits of the model?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What type of learning will be needed?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What type of data (labeled or unlabeled) will be used?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">How will the data be acquired?<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Luckily, for our<\/span> <a href=\"https:\/\/oxylabs.io\/blog\/ml-based-adaptive-parser-release\"><span style=\"font-weight: 400;\">Adaptive Parser project at Oxylabs<\/span><\/a><span style=\"font-weight: 400;\">, we had the easiest answers to the last three questions. 
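The earlier observation, that layout changes are mostly incremental and leave most of the page intact, can be sanity-checked with a crude character-level similarity measure. This uses Python's difflib purely as an illustration; it is not the model discussed here, and the markup snippets are invented:

```python
import difflib

def similarity(a, b):
    # Ratio in [0, 1]: proportion of matching characters between two snippets.
    return difflib.SequenceMatcher(None, a, b).ratio()

# Invented product markup before and after an incremental "UI refresh".
before = '<div class="item"><span class="price">$19.99</span><h2>Blue Mug</h2></div>'
after = '<div class="product"><span class="product-price">$19.99</span><h2>Blue Mug</h2></div>'

# A completely different part of the page, for scale.
unrelated = '<nav><ul><li>Home</li><li>About</li><li>Contact</li></ul></nav>'

# The refreshed layout stays far closer to the original than unrelated markup,
# so a model trained on the old pages still has plenty of structure to latch onto.
print(similarity(before, after) > similarity(before, unrelated))  # True
```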
Since we already knew what we would be looking at and for (data from specific types of pages), we could use labeled data. That meant supervised learning, one of the most practical and easiest approaches to execute, could be used.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, the true difficulty lies in answering the first question, as the rest, at least partly, depend on it. Since all resources are finite, the machine learning model should be as narrow as possible and as wide as required. For us, that meant looking at how our clients use our solutions (e.g. Real-Time Crawler) and making a decision based on data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As we discovered through our research, e-commerce product pages were the most painful ones to parse. Generally, the source can be a bit wonky for parsing purposes. Additionally, there are usually near-identical fields that are only sometimes present (e.g. \u201cnew price\u201d\/\u201cold price\u201d).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Due to their similarity, these fields can confuse machine learning models as well. However, answering the question about limits let us set proper expectations for accuracy and for the amount of data required. Clearly, we would need quite a bit of labeled data, as we would have<\/span> <i><span style=\"font-weight: 400;\">at least<\/span><\/i> <span style=\"font-weight: 400;\">one problematic field.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Answering the final question was somewhat easier. We already knew where to pick up our examples and could quite quickly collect a large number of e-commerce pages. The strenuous part is labeling: it\u2019s quite easy to get your hands on large amounts of unlabeled data, but labels are another matter.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Labeling data and training<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Every supervised learning dataset has to be labeled. 
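As a toy illustration of what those labels buy you (a deliberately crude stand-in, not the actual Adaptive Parser; the snippets, cue words, and field names are all invented), a handful of labeled markup examples is enough for even a trivial classifier to start separating the near-identical "new price" and "old price" fields:

```python
# Tiny supervised-learning sketch: labeled HTML snippets train a classifier
# that separates two near-identical fields. Everything here is invented.

LABELED = [
    ('<span class="price price--sale">$15.99</span>', "new_price"),
    ('<span class="price price--old strike">$19.99</span>', "old_price"),
    ('<ins class="current-price">$8.50</ins>', "new_price"),
    ('<del class="was-price">$12.00</del>', "old_price"),
]

CUES = ["sale", "old", "strike", "was", "del", "ins", "current"]

def features(snippet):
    # Bag of cue words present in the markup: a crude stand-in for real features.
    s = snippet.lower()
    return {cue for cue in CUES if cue in s}

def train(labeled):
    # Record which cues co-occurred with each label in the training set.
    model = {}
    for snippet, label in labeled:
        model.setdefault(label, set()).update(features(snippet))
    return model

def predict(model, snippet):
    # Pick the label whose training-time cues overlap the snippet's cues most.
    cues = features(snippet)
    return max(model, key=lambda label: len(model[label] & cues))

model = train(LABELED)
print(predict(model, '<del class="old-price">$30.00</del>'))      # old_price
print(predict(model, '<span class="price--sale">$24.99</span>'))  # new_price
```

The point is not the classifier, which is trivial, but the workflow: once fields carry labels, telling them apart becomes a statistics problem instead of a hand-written rule, and each new labeled page improves the model rather than the backlog.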
In our case, that meant providing labels for most fields on every e-commerce page, and it had to be done at least partly manually. If it could be fully automated, someone would already have created an adaptive parser.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In order to save time and in-house resources, we took a two-pronged approach. First, we hired a few helping hands to label fields from our soon-to-be training set. Second, we spent some time developing a GUI-based labeling application to speed up the process. The idea is simple &#8211; we spend more financial resources on repetitive manual tasks to free up our machine learning engineers\u2019 time for cognitive ones.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After getting our hands on enough labeled data to start training our Adaptive Parser, the process was really a lot of trial and error with some strategizing peppered in between. Sometimes the model would struggle with specific parts, and some logic-based nudging was required (or at least sped up the process).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Many months and hundreds of tests later, we have a solution that automatically parses fields on e-commerce product pages and adapts to changes with reasonable accuracy. Of course, maintenance will now be the challenge, but we have shown that it\u2019s possible to automate parsing.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Automating parsing in web scraping isn\u2019t just about saving resources; it\u2019s also about increasing the speed, efficiency, and accuracy of data over time. All of these factors influence the way businesses engage with external data. 
Primarily, less time goes to working around the data and more to working with it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">More discussions of pressing web scraping topics, industry trends, and expert tips will be shared at Oxycon, an annual web scraping conference. It will take place online on August 25&#8211;26, and<\/span> <a href=\"https:\/\/oxylabs.io\/oxycon\"><span style=\"font-weight: 400;\">registration<\/span><\/a> <span style=\"font-weight: 400;\">is free of charge.<\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1062256\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Julius Cerniauskas There\u2019s a much-maligned topic in web scraping &#8211; data parsing. Building scrapers would be a lot easier if the data presented [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/08\/25\/solving-the-parsing-dilemma\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":470,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4952"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4952"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4952\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/458"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4952"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4952"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4952"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}