{"id":4826,"date":"2021-07-13T06:33:58","date_gmt":"2021-07-13T06:33:58","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/13\/load-data-into-google-bigquery-on-dataproc-and-automl\/"},"modified":"2021-07-13T06:33:58","modified_gmt":"2021-07-13T06:33:58","slug":"load-data-into-google-bigquery-on-dataproc-and-automl","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/13\/load-data-into-google-bigquery-on-dataproc-and-automl\/","title":{"rendered":"Load Data Into Google BigQuery on Dataproc and AutoML"},"content":{"rendered":"<p>Author: Dash<\/p>\n<div>\n<h2><span><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9238637677?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9238637677?profile=RESIZE_710x\" width=\"720\" class=\"align-full\"><\/a><\/span><\/h2>\n<h2><span>What is Dataproc?<\/span><\/h2>\n<p><span>Dataproc is a low-cost, Google Cloud Platform integrated, easy to use managed Spark and Hadoop service that can be leveraged for batch processing, streaming, and machine learning use cases.<\/span><\/p>\n<h2><span>What is Google BigQuery?<\/span><\/h2>\n<p><span>BigQuery is an enterprise grade data warehouse that enables high-performance SQL queries using the processing power of Google\u2019s infrastructure.<\/span><\/p>\n<h2><span>Load Data Into Google BigQuery and AutoML<\/span><\/h2>\n<h3>Use Case<\/h3>\n<p><span>In this blog, we will review ETL data pipeline in StreamSets Transformer,\u00a0<\/span><a href=\"https:\/\/streamsets.com\/products\/dataops-platform\/transformer-etl\/\"><span>a Spark ETL engine<\/span><\/a><span>, to ingest real-world data from Fire Department of New York (FDNY) stored in Google Cloud Storage (GCS), transform it, and store the curated data in Google BigQuery.\u00a0<\/span><\/p>\n<p><span>Once the transformed data is made available in Google BigQuery, it will be used in AutoML to train a machine learning model to predict the average incident response time for the FDNY.<\/span><\/p>\n<h3>Sample Data<\/h3>\n<p><span>The dataset is made available through the\u00a0<\/span><a href=\"https:\/\/opendata.cityofnewyork.us\/\"><span>NYC Open Data<\/span><\/a><span>\u00a0website. The 2009-2018 historical dataset contains average response times of the FDNY. The data is partitioned by incident type (False Alarm, Medical Emergency, and so on), borough, and the number of incidents during a particular month.<\/span><\/p>\n<p>Here\u2019s what the sample FDNY data looks like:<\/p>\n<h3>Data Pipeline Overview<\/h3>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190941269?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190941269?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<p><b>Data Source And Dataset<\/b><\/p>\n<ul>\n<li><span>Data in CSV format is loaded from GCS using Google Cloud Storage (GCS) origin. To load data from GCS, all you need to provide is the path to the bucket, data format, and file name pattern.<\/span><\/li>\n<\/ul>\n<p><b>Data Transformations<\/b><\/p>\n<ul>\n<li>\n<span>Raw data is transformed using\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/Filter.html#concept_fqx_mzb_chb\"><span>Filter<\/span><\/a><span>,\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/FieldRemover.html#concept_svw_dxf_fhb\"><span>Field Remover<\/span><\/a><span>, and\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/SparkSQLExp.html#concept_akj_gsz_mhb\"><span>Spark SQL Expression<\/span><\/a><span>\u00a0processors in a format that is suitable for machine learning. (See details below.)<\/span>\n<\/li>\n<\/ul>\n<p><b>Data Storage<\/b><\/p>\n<ul>\n<li>\n<span>Transformed data is stored in a\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Origins\/GBigQuery.html#concept_dtx_hyf_14b\"><span>Google BigQuery<\/span><\/a><span>\u00a0table.\u00a0<\/span><i><span>Note<\/span><\/i><span>: if the table doesn\u2019t already exist, it will be created automatically by StreamSets Transformer.<\/span>\n<\/li>\n<\/ul>\n<p><b>Cluster Type<\/b><\/p>\n<ul>\n<li>\n<span>In this example, the data pipeline is designed to run on an existing or ephemeral\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Clusters\/Dataproc.html#concept_rcv_34q_5mb\"><span>Google Dataproc<\/span><\/a><span>\u00a0cluster.\u00a0<\/span><i><span>Note<\/span><\/i><span>: Other supported cluster types in StreamSets Transformer include\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Clusters\/Databricks.html#concept_bkm_31c_4hb\"><span>Databricks<\/span><\/a><span>,\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Clusters\/EMR.html#concept_yjs_gzt_vkb\"><span>Amazon EMR<\/span><\/a><span>, Azure for HDInsight,\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Clusters\/Hadoop.html#concept_lnz_xnj_2hb\"><span>Hadoop YARN<\/span><\/a><span>, and\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Clusters\/SQLServerBDC.html#concept_w5n_frw_zjb\"><span>SQL Server 2019 Big Data Cluster<\/span><\/a><span>.<\/span>\n<\/li>\n<\/ul>\n<h2><span>Data Pipeline Preview<\/span><\/h2>\n<p><span>Before running the\u00a0<\/span><a href=\"https:\/\/streamsets.com\/products\/dataops-platform\/transformer-etl\/\"><span>Spark ETL<\/span><\/a><span>\u00a0pipeline in StreamSets Transformer, you can preview the pipeline against the configured Dataproc cluster to examine the data structure, data types, and verify the transformations at every stage. This is also a great way to\u00a0<\/span><a href=\"https:\/\/streamsets.com\/getting-started\/download-install-transformer-etl\/#1607012180138-5c9bcd0f-a9f2\"><span>debug data pipelines<\/span><\/a><span>. For more information on pipeline preview, refer to the\u00a0<a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Preview\/Preview-Title.html#concept_zlk_byt_cmb\" target=\"_blank\" rel=\"noopener noreferrer\">documentation<\/a>.<\/span><\/p>\n<h3><span>Filter<\/span><\/h3>\n<p><span>Using a\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/Filter.html#concept_fqx_mzb_chb\"><span>Filter<\/span><\/a><span>\u00a0processor we will filter out incidents where\u00a0<\/span><em><span>INCIDENTCLASSIFICATION<\/span><\/em><i><span>\u00a0== \u201c<\/span><\/i><span>All Fire\/Emergency Incidents<\/span><i><span>\u201c<\/span><\/i><span>\u00a0or\u00a0<\/span><em><span>INCIDENTBOROUGH<\/span><\/em><i><span>\u00a0== \u201c<\/span><\/i><span>Citywide<\/span><i><span>\u201d.<\/span><\/i><\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190942475?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190942475?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h3><span>Remove Future Information<\/span><\/h3>\n<p><span>Because this is a historical dataset and we\u2019re using it to train a machine learning model, we need to remove information that would not be known at the beginning of the month. In this case, that is\u00a0<em>INCIDENTCOUNT<\/em><i>.\u00a0<\/i>To remove this field from every record, we\u2019ll use a\u00a0<a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/FieldRemover.html#concept_svw_dxf_fhb\">Field Remover<\/a>\u00a0processor.<\/span><\/p>\n<h3><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190942898?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190942898?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/h3>\n<h3><span>Data Transformations<\/span><\/h3>\n<p><span>Labels or target variables in machine learning models are of numeric data type. In this case, the field value of\u00a0<\/span><em><span>AVERAGERESPONSETIME<\/span><\/em><span>\u00a0is transformed in the following steps:<\/span><\/p>\n<ul>\n<li>\n<span>Remove \u201c:\u201d using\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/SparkSQLExp.html#concept_akj_gsz_mhb\"><span>Spark SQL<\/span><\/a><span>\u00a0expression \u2014\u00a0<\/span><strong><em>replace(AVERAGERESPONSETIME,\u201d:\u201d,\u201d\u201d)<\/em><\/strong>\n<\/li>\n<li>Convert from time to seconds and from string datatype to integer using<span>\u00a0<\/span><a href=\"https:\/\/streamsets.com\/documentation\/transformer\/latest\/help\/transformer\/Processors\/SparkSQLExp.html#concept_akj_gsz_mhb\">Spark SQL<\/a><span>\u00a0<\/span>expression \u2014<span>\u00a0<\/span><strong><em>round((AVERAGERESPONSETIME \/ 100) * 60 + (AVERAGERESPONSETIME % 100))<\/em><\/strong>\n<\/li>\n<\/ul>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190945874?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190945874?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h2><span>Data Pipeline Execution<\/span><\/h2>\n<p><span>Running the\u00a0<\/span><a href=\"https:\/\/streamsets.com\/products\/dataops-platform\/transformer-etl\/\"><span>StreamSets Transformer<\/span><\/a><span>\u00a0data pipeline displays various metrics in real-time. For example, batch processing time taken by each stage as shown below. This is a great way to start looking into fine tuning the processing and transformations.<\/span><\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190946288?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190946288?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h2><span>Google BigQuery<\/span><\/h2>\n<p><span>Once the pipeline runs successfully, the Google BigQuery table is auto-created, if it doesn\u2019t already exists, and the transformed data is inserted into the table. This dataset is then readily available for querying as shown below.<\/span><\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190946674?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190946674?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h2><span>AutoML<\/span><\/h2>\n<p><span>The transformed data stored can then be imported directly from the BigQuery table for training a machine learning model in\u00a0<a href=\"https:\/\/cloud.google.com\/automl\/docs\" target=\"_blank\" rel=\"noopener noreferrer\">AutoML<\/a>.\u00a0<\/span><\/p>\n<p>Using AutoML you can build on Google\u2019s machine learning capabilities and create custom machine learning models.<\/p>\n<h3><span>Import Data<\/span><\/h3>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190947257?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190947257?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h3><span>Select Target Column<\/span><\/h3>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190947294?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190947294?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<h3><span>Train Machine Learning Model<\/span><\/h3>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190948055?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190948055?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<div class=\"vcex-module vcex-divider vcex-divider-solid wpex-max-w-100 wpex-mx-auto wpex-block wpex-h-0 wpex-border-b wpex-border-solid wpex-border-gray-200\"><\/div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190948463?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/9190948463?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<\/p>\n<p><span>That\u2019s it! We went from loading raw, real-world data into Google BigQuery to creating a machine learning model in AutoML without any coding or scripting!<\/span><\/p>\n<h2><span>Build Your Spark ETL and ML Data Pipelines<\/span><\/h2>\n<p><span>It goes without saying that training models, evaluating them, model versioning, and serving different versions of the model are non-trivial undertakings and that is not the focus of this post. That said, however, StreamSets Transformer makes it really easy to load data into Google BigQuery and AutoML.<\/span><\/p>\n<p><span>Checkout these helpful resources and get started quickly with running your\u00a0<\/span><a href=\"https:\/\/streamsets.com\/getting-started\/download-install-transformer-etl\/#1603730763837-927d0686-eb52\"><span>Spark ETL data pipelines<\/span><\/a><span>.\u00a0<\/span><\/p>\n<p><span>Learn more about\u00a0<\/span><a href=\"https:\/\/streamsets.com\/solutions\/streamsets-for-google\/\"><span>StreamSets For Google Cloud Platform<\/span><\/a><span>.<\/span><\/p>\n<p>\u00a0<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1055710\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Dash What is Dataproc? Dataproc is a low-cost, Google Cloud Platform integrated, easy to use managed Spark and Hadoop service that can be leveraged [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/07\/13\/load-data-into-google-bigquery-on-dataproc-and-automl\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":463,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4826"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4826"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4826\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4826"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4826"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}