{"id":3761,"date":"2020-08-13T06:33:52","date_gmt":"2020-08-13T06:33:52","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/13\/architectures-every-data-scientist-and-big-data-engineer-should-know\/"},"modified":"2020-08-13T06:33:52","modified_gmt":"2020-08-13T06:33:52","slug":"architectures-every-data-scientist-and-big-data-engineer-should-know","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/13\/architectures-every-data-scientist-and-big-data-engineer-should-know\/","title":{"rendered":"Architectures Every Data Scientist And Big Data Engineer Should Know"},"content":{"rendered":"<p>Author: Sharmistha Chatterjee<\/p>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div class=\"block-editor-block-list__layout\">\n<div id=\"block-7462bd35-ce67-491d-abd7-f9d871a40b63\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h3 class=\"rich-text block-editor-rich-text__editable\"><em><a href=\"https:\/\/images.unsplash.com\/photo-1493946740644-2d8a1f1a6aff?ixlib=rb-1.2.1&amp;ixid=eyJhcHBfaWQiOjEyMDd9&amp;auto=format&amp;fit=crop&amp;w=800&amp;q=60\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/images.unsplash.com\/photo-1493946740644-2d8a1f1a6aff?ixlib=rb-1.2.1&amp;ixid=eyJhcHBfaWQiOjEyMDd9&amp;auto=format&amp;fit=crop&amp;w=800&amp;q=60&amp;profile=RESIZE_710x\" class=\"align-full\"><\/a><\/em><\/h3>\n<h3 class=\"rich-text block-editor-rich-text__editable\"><em><a href=\"https:\/\/unsplash.com\/photos\/tjX_sniNzgQ\">Source<\/a><\/em><\/h3>\n<h3 class=\"rich-text block-editor-rich-text__editable\"><em>Comprehensive and Comparative List of Feature Store Architectures for Data Scientists and Big Data Professionals<\/em><\/h3>\n<\/div>\n<div id=\"block-c988746c-6e78-4acd-90d5-8ea60a4c4ee6\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div 
id=\"block-d4b519d9-0ebf-4dff-8a8e-b44bc7d34939\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Introduction &amp; Motivation &#8211; Why Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-4019cc0a-2966-4b2d-81a0-6bacfdaf0318\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The feature store has become an important component for organizations developing predictive services in any industry domain. Some of the earlier challenges in deploying ML solutions at scale include:<\/p>\n<\/div>\n<div id=\"block-4f293b25-38c6-44cf-bb86-119f834518bd\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li><strong>Developing<\/strong> and <strong>maintaining customized systems<\/strong> by individual teams with little or no coordination.<\/li>\n<li>No <strong>collaborative<\/strong> system for sharing features among similar ML models (models from a similar domain, or models addressing the same business use cases or customer domains).<\/li>\n<li>Increased <strong>cognitive burden<\/strong> with limited scope for scalability.<\/li>\n<li>Limited <strong>integration<\/strong> with big-data ecosystems.<\/li>\n<li>Limited scope for <strong>model retraining, comparison, model governance, and traceability<\/strong>, limiting the agile development life-cycle.<\/li>\n<li>Difficulty in <strong>tracking and retraining models<\/strong> that exhibit seasonality.<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-8dc07b0f-d93f-47bd-afd5-08c6d4014c74\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">To overcome the above limitations, Architects, 
Data scientists, Big Data, and Analytics professionals have felt the need to come together under one roof, with one unified framework that makes collaboration and the sharing of data, results, and reports easier.<\/p>\n<\/div>\n<div id=\"block-9d264d4a-e649-498d-825b-9fd11f8f0510\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Departments, teams, and organizations shared similar notions of feature engineering:<\/p>\n<\/div>\n<div id=\"block-38078d2b-91c6-4885-be59-3e96ae670720\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>Feature engineering is <strong>expensive<\/strong>, and its cost is <strong>amortized<\/strong> over time and across models.<\/li>\n<li>The increase in cost is <strong>non-linear\/exponential with the increase<\/strong> in the number of features.<\/li>\n<li>The number of triggers\/alerts raised by the addition or removal of a feature is high.<\/li>\n<li>Most often, dependencies are not <strong>documented\/tracked<\/strong>, which results in <strong>an increase<\/strong> of <strong>implicit<\/strong> and <strong>explicit dependencies<\/strong> being added over time.<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-bc1d18d9-1961-463a-bf1d-e6aa04d9f6b6\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Given these shared views, it became easier to come together and create a <strong>Unified Framework<\/strong> called the <strong>Feature Store.<\/strong> It speeds up the ML model <strong>deployment life-cycle<\/strong> and supports proper documentation, <strong>required version analysis,<\/strong> and <strong>model performance<\/strong> tracking, saving time and effort.<\/p>\n<\/div>\n<div id=\"block-3e675eb1-b8c7-4f06-b295-d2954f476d37\" class=\"wp-block block-editor-block-list__block 
has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">In this blog, we highlight the features supported by different Feature Store frameworks, which are primarily developed by leading industry giants.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-8f0e107d-5210-4a91-8c4d-baedd9080381\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Advantages of Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-cf4afd3e-855f-4dad-abc4-57c38c04e163\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>Ability to <strong>re-use and discover features<\/strong> between teams across the organization.<\/li>\n<li>Ability to <strong>govern<\/strong> features through capabilities such as <strong>access control and versioning<\/strong>.<\/li>\n<li>Ability to <strong>precompute<\/strong> and <strong>automatically backfill features<\/strong>, including online computation and offline aggregation.<\/li>\n<li>A <strong>collaborative environment<\/strong> for data scientists and big data engineers.<\/li>\n<li><strong>Save effort<\/strong> and <strong>cost<\/strong> by sharing not only features but also related <strong>artifacts<\/strong>, documents, and marketing insights of models developed from these features.<\/li>\n<li>Enable <strong>consistency<\/strong> between training and serving.<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-002b76ec-043c-4ddf-9ba8-a73078815013\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Michelangelo from Uber<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-0aceed66-324c-4d03-8874-0470b669fc57\" class=\"wp-block 
block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-2-1024x574.png\" alt=\"This image has an empty alt attribute; its file name is image-2-1024x574.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/eng.uber.com\/michelangelo-machine-learning-platform\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-c41f41bb-e143-40c3-a378-bf2d55d58f78\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>Michelangelo<\/strong> is a framework developed by <strong>Uber<\/strong> that allows feature integration\/joining in both offline and online pipelines. Here, <strong>Hive (Offline)<\/strong> and <strong>Cassandra (Online)<\/strong> act as the main storage units for raw\/transformed features. It provides a <strong>horizontally scalable multi-tenant<\/strong> architecture for multiple models with suitable scaling and monitoring. Training jobs can be configured and managed through a web UI or an API, <a rel=\"noreferrer noopener\" href=\"http:\/\/jupyter.org\/\" target=\"_blank\">via a Jupyter notebook.<\/a><\/p>\n<\/div>\n<div id=\"block-d78ceb7c-44e7-4d1b-9487-3a6d1fb9d055\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">It also lets users define a hierarchical partitioning scheme and train one model per partition; the resulting models can be deployed as a single logical model. 
This provides easy <strong>bootstrapping<\/strong> and helps to overcome challenges when several models need to be trained based on the hierarchical structure of the data.<\/p>\n<\/div>\n<div id=\"block-e051c1af-26f3-47dd-8d1b-ff99aee7218c\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">At serving time, it routes each request to the best model for each node. It is also known for its ability to support continuous learning, its integration with <strong>AutoML<\/strong>, and its support for <strong>distributed deep learning<\/strong>.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-65442490-55a4-44b1-9ef2-1a124f0d5e97\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Feast Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-c1692248-f4dd-4c32-8c23-a823475aec2d\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-1024x378.png\" alt=\"This image has an empty alt attribute; its file name is image-1024x378.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-13dd1073-145c-4059-ae8c-8e93d837e96c\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>Feast<\/strong>, released by Gojek in collaboration with Google Cloud, is primarily built around <strong>Google Cloud services: BigQuery (offline), Bigtable (online), and Redis (low-latency)<\/strong>, with <strong>Apache Beam<\/strong> for feature 
engineering. It allows a clear separation between data engineering and model development. It also allows feature sharing among teams, with strong <strong>consistency<\/strong> between model training and serving.<\/p>\n<\/div>\n<div id=\"block-2c503343-fe19-4fdd-9a3c-7c98b81b35d1\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Feast also comes with centralized <strong>feature management, discovery, feature validation,<\/strong> and feature aggregation. The feature columns reside inside wide entity tables. In addition, the composite entities separate individual features.<\/p>\n<\/div>\n<div id=\"block-1530f206-db21-4d4b-9729-bd1847d09392\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Wix Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-39caf684-d992-42c4-811d-6838be05a8b5\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-7-1024x501.png\" alt=\"This image has an empty alt attribute; its file name is image-7-1024x501.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-d3eb61a2-b3a8-4f49-98cd-df5e91dcad18\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>Wix<\/strong> provides a platform for sharing features across different ML models for both <strong>batch<\/strong> and <strong>real-time<\/strong> datasets. 
It supports a pre-configured set of feature families at the site and user levels for both training and serving models. The different stages of data management, model training, and deployment are shown in the figure above. It further uses <strong>S3<\/strong> to store features extracted in real time.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-2dcf13a3-7a49-4357-a36b-cb130df6cdf0\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>FeatureStore from Comcast<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-ea5a7fa8-3cc8-4e40-b09f-04f3f6c4d368\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-1.png?fit=525,286&amp;ssl=1\" alt=\"This image has an empty alt attribute; its file name is image-1.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-a70a596d-d4ee-46fe-b7e6-47f42e969753\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The <strong>Feature Store<\/strong> developed by <strong>Comcast<\/strong> helps data scientists to reuse versioned features, upload online (real-time)\/streaming data, and <strong>review feature metrics<\/strong> by model. The product is made up of multiple pluggable feature store components. The built-in model repository contains artifacts related to <strong>data pre-processing (normalization, scaling)<\/strong> and displays the mapping to the features needed to execute the model. 
Further, the architecture is built using Spark on Alluxio (an open source data orchestration layer that brings data close to compute for big data and AI\/ML workloads in the cloud), <strong>S3, HDFS, RDBMS, Kafka, and Kinesis<\/strong>. Model deployment with <strong>Kubeflow<\/strong> helps to build resilient, highly available distributed systems with support for <strong>rate-limiting, shadow deployments, and auto-scaling<\/strong>.<\/p>\n<\/div>\n<div id=\"block-12885303-ee41-4592-b1e4-434a072222ca\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The integration with the Data Lake through suitable APIs helps data scientists to use SQL and create <strong>training\/validation\/test datasets<\/strong> that can be <strong>versioned<\/strong> and integrated into the full model pipeline. In addition, the framework supports <strong>Seldon Inference Graphs for A\/B Testing, Ensembles, Multi-armed bandits, and Custom combinations<\/strong>. The end-to-end system not only provides traceability across use cases, models, features, model-to-feature mappings, versioned datasets, the model training codebase, model deployment containers, and prediction\/outcome sinks, but is also known for its integration with the <strong>Feature-Store, Container Repository, and Git to integrate data, code and run-time artifacts for CI\/CD integration<\/strong>.<\/p>\n<\/div>\n<div id=\"block-b1addd32-f60e-4b5b-9400-f8349ab15154\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Like the other architectures, it supports continuous feature aggregation on <strong>streaming data + on-demand<\/strong> features. 
The Online Feature Store follows this sequence before returning a prediction:<\/p>\n<\/div>\n<div id=\"block-d9e9f367-54bd-46e6-9950-6f86512bbd2e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>The payload contains only the Model Name &amp; Account Number<\/li>\n<li>Model Metadata informs which features are needed for the model<\/li>\n<li>Pull the required features by Account Number<\/li>\n<li>Pass the full set of assembled features for model execution<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-c26f2ea4-2296-4e08-9628-bb1916dc57f4\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/Screenshot-2020-08-02-at-1.01.55-AM-1024x857.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-08-02-at-1.01.55-AM-1024x857.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/hopsworks.readthedocs.io\/en\/1.1\/featurestore\/featurestore.html\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-c9193693-3699-4f19-a1e3-44de0b7acf2e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><a href=\"https:\/\/www.slideshare.net\/dowlingjim\/the-feature-store-in-hopsworks\">Hopsworks Enterprise Edition<\/a> is a multi-tenant architecture that integrates AWS SageMaker, Databricks, Kubernetes, and Jupyter Notebook. 
It also supports integration with authentication frameworks like <strong>LDAP, Kerberos, and OAuth2<\/strong>.<\/p>\n<\/div>\n<div id=\"block-3d3be320-981d-4a10-b325-254c3cd7783f\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The batch \/ live streaming functionality is facilitated by <strong>Apache Beam, Apache Flink, and Apache Spark,<\/strong> whereas the model governance and monitoring pipelines are built using Kafka and Spark Streaming.<\/p>\n<\/div>\n<div id=\"block-b3515070-0c01-4a7b-986e-894bef25be11\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The architecture is composed of several building blocks, namely:<\/p>\n<\/div>\n<div id=\"block-ce538ceb-7bc8-4c65-932c-b903df01d75b\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>The Feature Store API &#8211; For reading from and writing to the feature store<\/li>\n<li>The Feature Store Registry &#8211; User interface to discover features<\/li>\n<li>Feature Metadata &#8211; Documentation, analysis, and versioning<\/li>\n<li>Feature Engineering Job &#8211; For computation<\/li>\n<li>Storage Layer &#8211; For feature storage<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-2e93262d-4eb7-4f2b-b264-84f0db630061\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><a href=\"https:\/\/netflixtechblog.com\/distributed-time-travel-for-feature-generation-389cccdd3907\">Netflix Feature <strong>Store<\/strong><\/a><\/span><\/h2>\n<\/div>\n<div id=\"block-5f624004-d286-4083-b3f0-916baed9c219\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" 
src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-3-1024x596.png\" alt=\"This image has an empty alt attribute; its file name is image-3-1024x596.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-015a3e68-7e4a-4abc-a9c3-de7f44690483\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-4-1024x482.png\" alt=\"This image has an empty alt attribute; its file name is image-4-1024x482.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/research.netflix.com\/research-area\/machine-learning-platform\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-16502fbf-1c6f-4716-bf3a-6118bc023c77\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The feature store developed by Netflix supports both online and offline model training and development. The online micro-services enable the framework to collect the data elements required by the feature encoders in a model. It then passes this data downstream for later use in offline predictions. 
The <strong>Fact Logging<\/strong> service of Netflix logs <strong>user-related, video-related, and computation-specific features<\/strong> in a <strong>serialized<\/strong> format in appropriate storage units (<strong>S3<\/strong>).<\/p>\n<\/div>\n<div id=\"block-25c6a28e-1b20-481b-8af7-c1adbd13aa7e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The unique point of this architecture is the presence of components that help to:<\/p>\n<\/div>\n<div id=\"block-9a7e9184-149c-444c-a743-c4da335b8bca\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>Develop\/Create contexts to snapshot<\/li>\n<li>Snapshot data of various micro-services for the selected context<\/li>\n<li>Build APIs to serve this data for a given time coordinate in the past<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-e49d87f7-af93-471d-81da-e1107e120f1e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">As snapshotting data for all contexts (e.g., all member profiles, devices, times of day) would incur overhead and cost, Netflix relies on selecting samples of contexts to <strong>snapshot<\/strong> periodically (at regular intervals &#8211; daily\/twice daily), through different algorithms. 
It achieves this through Spark, by training on data from different distributions and by using <strong>stratified samples<\/strong> based on properties such as viewing patterns, devices, time spent on the service, and region.<\/p>\n<\/div>\n<div id=\"block-b583223f-550c-49c6-837c-8bbd78c70d1a\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Netflix embraces a fine-grained service-oriented architecture for its cloud-based deployment model.<\/p>\n<\/div>\n<div id=\"block-c17a75d1-4735-4791-aa3a-ecc992a3c138\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>FBLearner from Facebook<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-eff50492-eb1f-450d-9d09-7bf3952a64fb\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-27-at-4.44.02-PM-1024x418.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-07-27-at-4.44.02-PM-1024x418.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/www.matroid.com\/scaledml\/2018\/yangqing.pdf\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-32277d10-aa27-4f7d-ba3d-e5d0efcbf672\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">FBLearner, designed by Facebook, is a framework for <strong>AI Workflow<\/strong> with <strong>Model Management and Deployment<\/strong>. It is mainly composed of three components: FB Learner Feature Store (runs on CPU), <strong>FB Learner Flow (runs on CPU + GPU), and FB Learner Predictor (runs on CPU). 
<\/strong>It supports building all kinds of <strong>deep learning models (Caffe2, PyTorch, TensorFlow, MXNet, CNTK)<\/strong>, and models can be stored in the ONNX format, which <a rel=\"noreferrer noopener\" href=\"https:\/\/onnx.ai\/supported-tools\" target=\"_blank\">standardizes portability<\/a> across converters, runtimes, compilers, and visualizers, and across different hardware\/software platforms.<\/p>\n<\/div>\n<div id=\"block-0119d959-b0c7-4c7c-abcc-a261b37d7510\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The above broad categories can be seen as logical units spanning from hardware to application software.<\/p>\n<\/div>\n<div id=\"block-34f385d3-bfc0-4615-8f67-a95bd6a2ef8e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li><strong>Frameworks<\/strong> <strong>(FB Learner Feature Store)<\/strong> needed to create, migrate, and train models,<\/li>\n<li><strong>Platforms<\/strong> <strong>(FB Learner Flow)<\/strong> for model deployment and management, and<\/li>\n<li><strong>Infrastructure<\/strong> <strong>(FB Learner Predictor)<\/strong> needed to compute workloads and store data.<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-34becb74-3f3e-41a9-8013-a9d40ab55f22\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Facebook also follows the principle of separating development and deployment (production) environments.<\/p>\n<\/div>\n<div id=\"block-d84c3291-30b7-4011-b4f1-4952a95194b8\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Pinterest Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-a524d3c2-7f44-40bf-8821-0482b9c99afa\" class=\"wp-block 
block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/Screenshot-2020-08-04-at-12.17.53-AM-1024x713.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-08-04-at-12.17.53-AM-1024x713.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/www.slideshare.net\/Alluxio\/pinterest-big-data-machine-learning-platform-at-pinterest\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-dde75bd4-d384-45eb-8c42-4eb110d563e5\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Pinterest&#8217;s Big Data Machine Learning platform is an example of a high-speed, high-quality system that is <strong>scalable<\/strong>, <strong>reliable<\/strong>, and <strong>secure<\/strong>. This <strong>Metadata-driven framework<\/strong> is built using open-source technology, with individual building blocks that support reusability. 
It also provides <strong>governance: enforcement &amp; tracking<\/strong>.<\/p>\n<\/div>\n<div id=\"block-53ace897-fe2b-445c-bba9-a084bb3e6b90\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The uniqueness of this architecture lies in <strong>capturing relationships and interactions (clicks made by users)<\/strong> between <strong>pins (how objects<\/strong> are <strong>organized<\/strong> into <strong>collections)<\/strong>.<\/p>\n<\/div>\n<div id=\"block-43b976c9-011e-4c2d-a8cf-0e66a80ba4d6\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The figure below illustrates the different components of the model governance and development architecture.<\/p>\n<\/div>\n<div id=\"block-1811c543-5fc4-4a49-882f-ec69963d5871\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/Screenshot-2020-08-07-at-1.10.41-AM-1024x455.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-08-07-at-1.10.41-AM-1024x455.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/www.slideshare.net\/Alluxio\/pinterest-big-data-machine-learning-platform-at-pinterest\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-2edda24f-f689-41b9-a7c4-45885b39873f\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Zipline from Airbnb<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-bbd1259f-b0f0-4c77-b919-e80583866280\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img 
decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-5-1024x534.png\" alt=\"This image has an empty alt attribute; its file name is image-5-1024x534.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/databricks.com\/session_eu19\/zipline-airbnbs-declarative-feature-engineering-framework\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-71938147-bd5b-493e-8c0e-e0cf287503ba\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The predictive system Zipline, created by Airbnb, relies on a <strong>scoring service<\/strong> based on <strong>features<\/strong> gathered across <strong>time<\/strong> and <strong>space<\/strong>. The scoring log (which acts as a debug\/audit log) is computed\/updated daily to ensure feature consistency and a single feature definition both when training ML models and when deploying them to production. 
In addition, it ensures <strong>Data Quality monitoring<\/strong> and <strong>feature back-filling<\/strong>, and makes <strong>features searchable<\/strong> and <strong>shareable.<\/strong><\/p>\n<\/div>\n<div id=\"block-5868c5d1-b26b-4f99-b8bd-603337a86a0b\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The architecture integrates with data sources (Hive tables, databases, and the Jitney event bus) and uses Apache Spark (batch) and Flink (streaming), with Lambda as the serving point. The uniqueness of this platform lies in:<\/p>\n<\/div>\n<div id=\"block-56b5bff2-f681-44b2-b763-8c733ccd059d\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li>Reduction of custom pipeline creation<\/li>\n<li>Reduction of data leaks in custom aggregations<\/li>\n<li>Feature distribution observability<\/li>\n<li>Improved model iteration workflow<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-bf08b10e-a9d8-4a01-8d8f-f17da878783a\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>TFX<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-3258f290-69f6-4fc3-b54f-84675cb0d5f6\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-27-at-4.47.51-PM-1024x331.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-07-27-at-4.47.51-PM-1024x331.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/stevenwhang.com\/tfx_paper.pdf\">Source<\/a><\/p>\n<div 
class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-b8022a86-b2ff-4283-91ef-ab5e50aa4f2a\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>TensorFlow Extended (TFX)<\/strong>, a TensorFlow-based general-purpose machine learning platform, provides <strong>orchestration<\/strong> of many components: a learner for generating models from training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. The platform is particularly known for continuously training, validating, visualizing, and deploying freshly trained models to production relatively quickly. The individual components can share utilities that allow them to communicate and share assets. With fast training-data <strong>deserialization<\/strong>, teams and the community can share their data, models, tools, visualizations, optimizations, and other techniques.<\/p>\n<\/div>\n<div id=\"block-b5fed034-dcce-4edd-b7d2-563161b90db5\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The components also gather statistics over feature values: for <strong>continuous features<\/strong>, the <strong>statistics include quantiles, equi-width histograms<\/strong>, the mean, and the standard deviation, whereas for discrete features they include the <strong>top-K values<\/strong> by <strong>frequency<\/strong>. In addition, the components support the computation of model metrics on slices of data (e.g., on negative and positive examples in a binary classification problem) and <strong>cross-feature statistics<\/strong> like <strong>correlation<\/strong> and <strong>covariance<\/strong> between features. 
These statistics give users insight into the shape of each dataset.<\/p>\n<\/div>\n<div id=\"block-38a1e95e-be77-456d-82e1-2b2a4b5d0c10\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Further, the architecture provides a configuration-free validation setup enabled for all users, <strong>multi-tenancy<\/strong> to serve <strong>multiple machine-learned models concurrently<\/strong>, and <strong>soft model-isolation<\/strong> to <strong>increase<\/strong> model performance.<\/p>\n<\/div>\n<div id=\"block-c5979880-d025-4333-8dbf-e2c1511f4a6f\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><\/h2>\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Apache Airflow<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-9b3feefb-158d-4690-b45b-cc48b5ec1e70\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-27-at-4.51.58-PM-1-1024x496.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-07-27-at-4.51.58-PM-1-1024x496.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p>Apache Airflow: <a href=\"https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/demo_7.pdf\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-83030234-fa6e-4c5d-a10f-9285e08206af\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>Apache Airflow&#8217;s<\/strong> entire architecture is based on the concept of a <strong>DAG<\/strong> (Directed Acyclic Graph), which captures the dependencies between its tasks. 
Its principal responsibility is to ensure that all things happen at the right time and in the right order. Each <strong>DAG<\/strong> defines a single logical workflow and is defined in a Python file.<\/p>\n<\/div>\n<div id=\"block-366e4858-2aba-4073-828b-e387fc0d82c5\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Further, it supports <strong>Airflow Operators<\/strong>, which state what steps are executed over time (e.g., download or transfer operators such as GoogleCloudStorageDownloadOperator). One such operator is the GoogleCloudStorageObjectSensor, which pauses execution until an object appears in a Google Cloud Storage bucket.<\/p>\n<\/div>\n<div id=\"block-13c4066e-ec44-4fd8-9dfe-d23cb4668089\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Apache Airflow emphasizes <strong>Idempotence<\/strong> (re-executing any step produces the same end result, irrespective of the number of executions), <strong>Atomicity<\/strong>, and <strong>Metadata Exchange<\/strong>. Data exchange between different components of this distributed architecture is facilitated by <strong>XCom (cross-communication)<\/strong>, which provides an exchange of small pieces of metadata. For large volumes of data, however, it relies on shared network storage, a data lake (S3), or URI-based exchange through <strong>XCom<\/strong>.<\/p>\n<\/div>\n<div id=\"block-d0bf26a3-b401-4e92-a09c-32b2c85172c9\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">A parameterized operator becomes a task in the DAG, and each run of a task at a particular instant of time spawns a TaskInstance. 
Further, the task instances of an Apache Airflow DAG for a given run are grouped into a <strong>DagRun<\/strong>.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-2439be28-5892-45ca-b53f-c4019e18281a\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Zomato Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-bce64e91-3633-4872-b021-ed7ffabf63b6\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-8-1024x677.png\" alt=\"This image has an empty alt attribute; its file name is image-8-1024x677.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-2cf3918d-f282-4c6a-808a-addd9e97c397\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Zomato&#8217;s restaurant business relies heavily on stream data processing to compute the running orders at a restaurant at any given point in time. The architecture uses <strong>Apache Flink<\/strong>, which provides job-level isolation for each ML model: the features of each ML model maintain their own separate space for <strong>research, analysis, and logging<\/strong> and do not interact with features from other ML models.<\/p>\n<\/div>\n<div id=\"block-b4912416-94c4-470c-8920-8a8552d7ae9f\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">In addition to streaming and online feature extraction, the life-cycle management of ML models is provided by <strong>MLflow<\/strong>. 
The ML models are served to the external world via an <strong>API Gateway<\/strong> by means of <strong>AWS SageMaker endpoints<\/strong>.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-dc316ea6-dd94-4052-be5a-b0e48ea26ba0\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Overton from Apple<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-625c74e3-55e6-4804-b616-22e8d58e9dae\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-30-at-11.28.13-PM-1-1024x476.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-07-30-at-11.28.13-PM-1-1024x476.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1909.05372.pdf\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-f06da9da-38d7-4732-a2bf-5aa551c9ce7d\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\"><strong>Overton<\/strong> automates the life cycle of <strong>model construction, deployment, and monitoring<\/strong> by providing a set of novel high-level, declarative abstractions. 
It supports <strong>multi-task learning<\/strong>, so that a model can predict several tasks <strong>concurrently<\/strong> in both <strong>real-time<\/strong> and <strong>backend<\/strong> production applications.<\/p>\n<\/div>\n<div id=\"block-fc7dc979-be29-4bfd-8738-9a7badb1760a\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Further, the architecture separates the model from the data using two components: tasks, which capture what the model needs to accomplish, and payloads, which represent sources of data, such as tokens or entity embeddings.<\/p>\n<\/div>\n<div id=\"block-96108612-eed4-4daf-9ff4-6b06064ada4c\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Model training is governed by a <strong>schema<\/strong> file, which acts as a guide to compile a TensorFlow model and to describe its output for <strong>downstream<\/strong> use. Overton also embeds raw data into a payload, which is then used as input to a task or to another payload. The payloads are <strong>singletons (e.g., a query)<\/strong>, <strong>sequences (e.g. 
a query tokenized into words or characters),<\/strong> and <strong>sets (e.g., a set of candidate entities)<\/strong>.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-bf1ecedf-a234-4053-b59a-9e62b22ba379\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>StreamSQL Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-21f9c165-1625-4666-8af0-ab52c5b72afe\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-10-1024x552.png\" alt=\"This image has an empty alt attribute; its file name is image-10-1024x552.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-8d52e824-225d-4ec9-a250-89b2e68233fa\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The <strong>StreamSQL<\/strong> Feature Store is a low-latency model development framework with <strong>high-throughput serving<\/strong>. It allows new model features to be deployed confidently and easily, with <strong>versioning<\/strong>. Feature definitions ensure consistent feature deployment across training, serving, and production.<\/p>\n<\/div>\n<div id=\"block-1f81af52-0c29-49cf-98f3-331f8ea959f8\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The architecture is also known for its ability to increase model performance by integrating third-party features. 
It combines batch and stream processing with an <strong>immutable ledger<\/strong>, where each event is appended to the end of the ledger. Further, the framework allows, at any point, the addition of new data sources\/transformations (from <strong>Flink and Spark; files, tables, and streams<\/strong>), the modification or creation of a set of features, and even the <strong>analysis\/discovery<\/strong> of features from the feature registry.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-28d952d8-b968-436b-950a-4393b80c7a4b\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Hybrid Feature Store<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-bdf7905c-4e4d-433a-84bc-777bcc881e26\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/Screenshot-2020-08-10-at-2.05.43-PM-1024x476.png\" alt=\"This image has an empty alt attribute; its file name is Screenshot-2020-08-10-at-2.05.43-PM-1024x476.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-1c933cda-4099-4aa4-b6e8-4146f05fe1b8\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The above figure illustrates a Hybrid Feature Store with a data pipeline and BI platforms (Tableau), using Apache Airflow, S3, the Hopsworks Feature Store, and data lakes from Cloudera. 
The platform is capable of ingesting raw, event, or SQL data as input.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-81af669c-1c2b-46c3-8aed-04b664348b5d\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Feature Store from Tecton<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-c7d503f2-6b9b-4542-87ad-d0ae7022498e\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/i2.wp.com\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-12.png?fit=525,317&amp;ssl=1\" alt=\"This image has an empty alt attribute; its file name is image-12.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-18c87a70-93c2-4ed9-bfa4-e858b0b6b761\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Tecton has come up with a unified architecture to <strong>develop, deploy, curate\/govern<\/strong>, and <strong>monitor<\/strong> a platform built to standardize <strong>high-quality features, labels, and data sets<\/strong> for ML models in production, ensuring the safe operation of models over time, with proper <strong>reproducibility, lineage, and logging<\/strong>.<\/p>\n<\/div>\n<div id=\"block-b9a42ae4-0048-4a12-9fda-affcb6735a8d\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The Tecton platform consists of:<\/p>\n<\/div>\n<div id=\"block-65b64fae-1422-4c64-bc40-94056dfbc3d0\" 
class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ul class=\"rich-text block-editor-rich-text__editable\">\n<li><strong>Feature Pipelines<\/strong>&nbsp;for transforming your raw data into features or labels<\/li>\n<li>A&nbsp;<strong>Feature Store<\/strong>&nbsp;for storing historical feature and label data<\/li>\n<li>A&nbsp;<strong>Feature Server<\/strong>&nbsp;for serving the latest feature values in production<\/li>\n<li>An&nbsp;<strong>SDK<\/strong>&nbsp;for retrieving training data and manipulating feature pipelines<\/li>\n<li>A&nbsp;<strong>Web UI<\/strong>&nbsp;for managing and tracking features, labels, and data sets<\/li>\n<li>A&nbsp;<strong>Monitoring Engine<\/strong>&nbsp;for detecting data quality or drift issues and alerting<\/li>\n<\/ul>\n<\/div>\n<div id=\"block-e49966ae-29a3-49e0-a0f5-13830b308fb1\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Feature Store from Scribble Data<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-b5a4b87a-ff2f-4e9e-968f-7d8ff382a7c4\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<div>\n<div class=\"components-resizable-box__container\"><img decoding=\"async\" src=\"https:\/\/techairesearch.com\/wp-content\/uploads\/2020\/08\/image-11-1024x581.png\" alt=\"This image has an empty alt attribute; its file name is image-11-1024x581.png\"><\/div>\n<div class=\"__resizable_base__\"><\/div>\n<\/div>\n<p><a href=\"http:\/\/featurestore.org\/\">Source<\/a><\/p>\n<div class=\"components-drop-zone\"><\/div>\n<\/div>\n<div id=\"block-3e378a44-1b7d-4402-a144-7d4ec227c6a2\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">The Feature Store provided by Scribble Data places strong emphasis on input <strong>Data Correctness<\/strong> and <strong>Completeness<\/strong> (gaps, duplicates, 
exceptions, invalid values), as these are known to affect ML models&#8217; predictions. Hence it recommends a <strong>continuous check\/early warning system<\/strong> to prevent poor-quality data from coming into the system. On the reactive side, the system undertakes a continuous process to improve ML operations over time.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-21157fa2-c5cd-4b79-9a6a-3b7a3a31e27b\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>Conclusion<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-a4a3bc61-3323-4834-9c2a-1687a2d7c4e0\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">Here we have discussed different architectural frameworks that use Big Data tools (some of them open source), ML model training and serving tools, and an orchestration layer (such as Kubernetes). 
Each of the components is equally important, and they go hand in hand to create a real-time, end-to-end predictive system.<\/p>\n<p class=\"rich-text block-editor-rich-text__editable wp-block-paragraph\">\n<\/div>\n<div id=\"block-db015d97-4e10-4461-a768-699591e421c2\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<h2 class=\"rich-text block-editor-rich-text__editable\"><span style=\"font-size: 18pt;\"><strong>References<\/strong><\/span><\/h2>\n<\/div>\n<div id=\"block-0af2200f-9f21-4dae-b645-33b78f050ff5\" class=\"wp-block block-editor-block-list__block has-selected-ui\">\n<ol class=\"rich-text block-editor-rich-text__editable\">\n<li>FBLearner &#8211; <a href=\"https:\/\/www.matroid.com\/scaledml\/2018\/yangqing.pdf\">https:\/\/www.matroid.com\/scaledml\/2018\/yangqing.pdf<\/a><\/li>\n<li>FBlearner <a href=\"https:\/\/medium.com\/@jamal.robinson\/how-facebook-scales-artificial-intelligence-machine-learning-693706ae296f\">https:\/\/medium.com\/@jamal.robinson\/how-facebook-scales-artificial-intelligence-machine-learning-693706ae296f<\/a><\/li>\n<li>MetaFlow by Netflix <a href=\"https:\/\/netflixtechblog.com\/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9\">https:\/\/netflixtechblog.com\/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9<\/a><\/li>\n<li>Tensorflow Extended <a href=\"http:\/\/stevenwhang.com\/tfx_paper.pdf\">http:\/\/stevenwhang.com\/tfx_paper.pdf<\/a><\/li>\n<li>Apache Airflow: <a href=\"https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/demo_7.pdf\">https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/demo_7.pdf<\/a><\/li>\n<li>Survey Monkey:<a href=\"http:\/\/snurran.sics.se\/surveymonkey.pdf\">http:\/\/snurran.sics.se\/surveymonkey.pdf<\/a><\/li>\n<li>Overton: A Data System for Monitoring and Improving Machine-Learned Products:<a href=\"https:\/\/arxiv.org\/pdf\/1909.05372.pdf\">https:\/\/arxiv.org\/pdf\/1909.05372.pdf<\/a><\/li>\n<li><a 
href=\"https:\/\/www.slideshare.net\/Alluxio\/pinterest-big-data-machine-learning-platform-at-pinterest\">https:\/\/www.slideshare.net\/Alluxio\/pinterest-big-data-machine-learning-platform-at-pinterest<\/a><\/li>\n<li><a href=\"https:\/\/www.bigabid.com\/blog\/data-the-importance-of-having-a-feature-store\">https:\/\/www.bigabid.com\/blog\/data-the-importance-of-having-a-feature-store<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/mlops-with-a-feature-store-816cfa5966e9\">https:\/\/towardsdatascience.com\/mlops-with-a-feature-store-816cfa5966e9<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/EthicalML\/awesome-production-machine-learning#feature-stores\">https:\/\/github.com\/EthicalML\/awesome-production-machine-learning#feature-stores<\/a><\/li>\n<li><a href=\"http:\/\/featurestore.org\/\">http:\/\/featurestore.org\/<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/logicalclocks\/hopsworks\">https:\/\/github.com\/logicalclocks\/hopsworks<\/a><\/li>\n<li><a href=\"https:\/\/gist.github.com\/mserranom\/10aaac360617d58e00f1c380db22592e\">https:\/\/gist.github.com\/mserranom\/10aaac360617d58e00f1c380db22592e<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/quantopian\/zipline\">https:\/\/github.com\/quantopian\/zipline<\/a><\/li>\n<li><a href=\"https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/demo_7.pdf\">https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/demo_7.pdf<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/uploads-ssl.webflow.com\/5c9b9758feba5a6f9e8a6dda\/5d92b35b15962a46c7ce9c5f_feature%20store%20whitepaper%201-0.pdf\" target=\"_blank\">The Hopsworks Feature Store<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/content.logicalclocks.com\/hubfs\/research\/sysml_2019_demo_paper.pdf\" target=\"_blank\">Ormenisan et al, Horizontally scalable ML pipelines with a Feature Store<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/storage.googleapis.com\/pub-tools-public-publication-data\/pdf\/45742.pdf\" 
target=\"_blank\">Sculley et al, What&rsquo;s your ML Test Score? A rubric for ML production systems<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"http:\/\/stevenwhang.com\/tfx_paper.pdf\" target=\"_blank\">Baylor et al, TFX: A TensorFlow-Based Production-Scale Machine Learning Platform<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/databricks.com\/blog\/2019\/09\/18\/productionizing-machine-learning-from-deployment-to-drift-detection.html\" target=\"_blank\">Mewald et al, Drift detection for production machine learning<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/cdfoundation\/sig-mlops\" target=\"_blank\">CDF Special Interest Group: MLOps<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/martinfowler.com\/articles\/cd4ml.html\" target=\"_blank\">Continuous Delivery for Machine Learning<\/a><\/li>\n<li><a rel=\"noreferrer noopener\" href=\"https:\/\/www.gitops.tech\/\" target=\"_blank\">GitOps<\/a><\/li>\n<li>Metaflow (Netflix) <a href=\"https:\/\/github.com\/Netflix\/metaflow\/tree\/master\/test\">https:\/\/github.com\/Netflix\/metaflow\/tree\/master\/test<\/a><\/li>\n<li>Hopsworks <a href=\"https:\/\/www.slideshare.net\/dowlingjim\/the-feature-store-in-hopsworks\">https:\/\/www.slideshare.net\/dowlingjim\/the-feature-store-in-hopsworks<\/a><\/li>\n<li><a href=\"https:\/\/www.tecton.ai\/blog\/data-platform-ml\/\">https:\/\/www.tecton.ai\/blog\/data-platform-ml\/<\/a><\/li>\n<\/ol>\n<\/div>\n<div class=\"block-list-appender\">\n<div class=\"wp-block block-editor-default-block-appender\">\n<div class=\"components-dropdown block-editor-inserter\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:976437\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Sharmistha Chatterjee Source Comprehensive and Comparative List of Feature Store Architectures for 
Data Scientists and Big Data Professionals Introduction &amp; Motivation &#8211; Why Feature [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/13\/architectures-every-data-scientist-and-big-data-engineer-should-know\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":473,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3761"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3761"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3761\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/461"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3761"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3761"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3761"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}