{"id":1099,"date":"2018-09-28T06:42:47","date_gmt":"2018-09-28T06:42:47","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/28\/data-engineers-nobody-puts-baby-in-a-corner\/"},"modified":"2018-09-28T06:42:47","modified_gmt":"2018-09-28T06:42:47","slug":"data-engineers-nobody-puts-baby-in-a-corner","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/28\/data-engineers-nobody-puts-baby-in-a-corner\/","title":{"rendered":"Data Engineers: Nobody Puts Baby in a Corner!"},"content":{"rendered":"<p>Author: Bill Schmarzo<\/p>\n<div>\n<p>Oh, the lowly data engineer.<span>\u00a0<\/span> <span>Harvard Business Review declared the role of the data scientist<\/span> as \u201c<span><a href=\"https:\/\/hbr.org\/2012\/10\/data-scientist-the-sexiest-job-of-the-21st-century\">the sexiest job of the 21<sup>st<\/sup> century<\/a><\/span>.\u201d<span>\u00a0<\/span>But the data engineer labors away in near obscurity, acquiring, transforming, enriching, munging and preparing data for the data scientist to do their black magic.<\/p>\n<p>Beyond building data pipelines, who do you think operationalizes the data science?<span>\u00a0<\/span> Again, it\u2019s the data engineer.<span>\u00a0<\/span><\/p>\n<p>We have helped data engineers deploy machine learning models and operationalize data science for some time.<span>\u00a0<\/span> Using the Pentaho Data Pipeline, we enable the lowly data engineers to leverage existing data and feature engineering efforts, thereby significantly reducing time-to-deployment. With embeddable APIs, organizations can also include the full power of Pentaho within existing applications.<\/p>\n<p>Good news for data engineers: you can now be even faster and better organized at getting things done! Now you have a tool to dance the tango and mambo with the data scientists like Johnny and Baby. 
At Hitachi NEXT today in San Diego, I had a chance to attend some of the Pentaho sessions and visit their booths at the exhibit hall.<span>\u00a0<\/span> I was excited to see new capabilities such as integration with <span>Jupyter<\/span> notebooks (an advanced data science development tool), orchestration of analytic models written using the TensorFlow and Keras machine learning libraries, and simplified analytic model management.<\/p>\n<p>Here is what I learned:<\/p>\n<h1>1) Integration with Jupyter Notebooks<\/h1>\n<p>Data scientists are most comfortable working in their IDEs and spend a lot of time writing scripts to prepare data to feed the models they are exploring.<span>\u00a0<\/span>To address this, we have validated a best practice for data engineers to access, cleanse, integrate and deliver data as a service for use by data scientists.<\/p>\n<p>Rather than manually creating and maintaining one-off scripts to access, massage and wrangle data assets, data scientists can now focus on the more intellectually rewarding part of their jobs: model exploration. They can <span>concentrate<\/span> on developing insightful and accurate analysis in the familiar IDE of a Jupyter Notebook, an incredibly powerful and popular tool for <span>contemporary<\/span> data science, and leave data preparation and integration to data engineers. Using the drag-and-drop interface in Pentaho Data Integration (PDI), data engineers can create transformations that result in governed data sources that they can <span>register<\/span> in an enterprise data catalog to promote reuse across data engineering and data science teams, fostering a more collaborative working relationship.<\/p>\n<p>Data <span>scientists can<\/span> access fresh production data rather than older test data for further model exploration and tuning to keep accuracy high. 
Once the data scientist is ready for their model to be operationalized in a production environment, the data engineer can make minor modifications to the pipelines created in a development environment to make them production ready.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/IH6i6n81EshPpIf2WSfOc-vqeLfouOQ6v2zfs2s3WXDdwFyzmbrsPoTr6uZS8z61CqwufR8pOvTVxIH1F8TmiG6myWB7Foax\/1.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/IH6i6n81EshPpIf2WSfOc-vqeLfouOQ6v2zfs2s3WXDdwFyzmbrsPoTr6uZS8z61CqwufR8pOvTVxIH1F8TmiG6myWB7Foax\/1.png?width=750\" width=\"750\" class=\"align-full\"><\/a><\/p>\n<p><strong>Figure 1. Integrate PDI Transformations with Jupyter Notebooks<\/strong><\/p>\n<h1><span>2) Orchestrating TensorFlow and Keras Models<\/span><\/h1>\n<p>While data engineers have <span>in-depth<\/span> knowledge of and expertise in data warehousing, SQL, NoSQL, <span>and<\/span> Hadoop technologies, in most cases they do not have Python or R coding skills.<span>\u00a0<\/span>They most likely also lack the advanced math and statistics skills required to tune machine learning and deep learning models and <span>get<\/span> the most accurate models into production faster.<span>\u00a0<\/span> We recognized this <span>and<\/span> have now added <span>an enterprise-grade<\/span> transformation step that helps data engineers embed deep learning models into data pipelines without coding knowledge.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/IH6i6n81Esjc*AfLfC0nYar*B--*-CA3gGTWVMPFJmLCjjk*0vNyzF8qIPJaVsaijDP9extbgc5scY4ePavyv5NMiGAeS2ni\/2.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/IH6i6n81Esjc*AfLfC0nYar*B--*-CA3gGTWVMPFJmLCjjk*0vNyzF8qIPJaVsaijDP9extbgc5scY4ePavyv5NMiGAeS2ni\/2.png?width=750\" width=\"750\" class=\"align-full\"><\/a><\/p>\n<p><strong>Figure 2. 
Using Python executor steps to <span>orchestrate<\/span> TensorFlow and Keras Models<\/strong><\/p>\n<h1><span>3) Improved Model Management<\/span><\/h1>\n<p>Typically, models degrade in accuracy as soon as they hit production data. With our new Python execution step, users can make updates to models using production data.<span>\u00a0<\/span> Data engineers can gain insight into model usage, run champion-challenger tests, review model accuracy statistics and easily swap in the models with the highest accuracy.<span>\u00a0<\/span> By keeping the most accurate models in production, organizations will make better decisions and reduce risk.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/IH6i6n81EsjhdbbBui*x0-zlc7*3XwtFlfmL-kEZbJ2XELtFUdJzajdb3xK06tozY-8aIuHX6qnBt3S80Hbpv8zm6gXN7Py7\/3.png\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/IH6i6n81EsjhdbbBui*x0-zlc7*3XwtFlfmL-kEZbJ2XELtFUdJzajdb3xK06tozY-8aIuHX6qnBt3S80Hbpv8zm6gXN7Py7\/3.png?width=750\" width=\"750\" class=\"align-full\"><\/a><\/p>\n<p><strong>Figure 3. Model Management Reference Architecture<\/strong><\/p>\n<p>The reference architecture <span>in Figure<\/span> 3 above outlines the steps involved <span>in managing a model effectively in an enterprise setting<\/span>. It begins with the data scientist looking for data to create the model and asking the data engineer to provide them with a governed data source. The data engineer <span>establishes<\/span> this source, makes minor adjustments to the pipeline and reuses it when it comes time to operationalize the model. 
Data scientists and data engineers collaborate closely to manage and catalog champion and challenger models, creating enterprise assets that can be reused in the future.<\/p>\n<h1>Summary:<span>\u00a0<\/span><\/h1>\n<p>Analytics and data science are the monetization engines of the future.\u00a0 We all know that data-driven and, increasingly, model-driven firms will\u00a0<span><a href=\"https:\/\/www.wsj.com\/articles\/models-will-run-the-world-1534716720\">run the world<\/a>.<\/span>\u00a0\u00a0However, taking analytics <span>mainstream<\/span> poses data operations challenges for data engineers.<span>\u00a0<\/span><\/p>\n<p>Having a platform that drives collaboration between your data engineers, data scientists and business stakeholders is one of the keys to helping organizations become more effective at using data to drive innovation, strong business outcomes and entirely new business models.<\/p>\n<p>Give the data engineers what they need to support the data science monetization efforts, and everyone will have the time of their lives!<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:763512\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Bill Schmarzo Oh, the lowly data engineer.\u00a0 Harvard Business Review declared the role of the data scientist as \u201cthe sexiest job of the 21st century.\u201d\u00a0But the [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/28\/data-engineers-nobody-puts-baby-in-a-corner\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":475,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1099"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1099"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1099\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1099"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1099"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1099"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}