{"id":5461,"date":"2022-03-03T06:29:23","date_gmt":"2022-03-03T06:29:23","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/03\/03\/data-science-notebook-life-hacks-i-learned-from-ploomber\/"},"modified":"2022-03-03T06:29:23","modified_gmt":"2022-03-03T06:29:23","slug":"data-science-notebook-life-hacks-i-learned-from-ploomber","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/03\/03\/data-science-notebook-life-hacks-i-learned-from-ploomber\/","title":{"rendered":"Data Science Notebook Life-Hacks I Learned From Ploomber"},"content":{"rendered":"<p>Author: N\u00f3ra Balogh<\/p>\n<div>\n<div id=\"attachment_13289\" style=\"width: 1034px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-scaled.jpg\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13289\" loading=\"lazy\" class=\"wp-image-13289 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-1024x683.jpg\" alt=\"\" width=\"1024\" height=\"683\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-1024x683.jpg 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-768x512.jpg 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-1536x1024.jpg 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-2048x1365.jpg 2048w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/hannah-wei-aso6SYJZGps-unsplash-600x400.jpg 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/a><\/p>\n<p id=\"caption-attachment-13289\" class=\"wp-caption-text\">Photo by <a 
href=\"https:\/\/unsplash.com\/@herlifeinpixels\" target=\"_blank\" rel=\"noopener\">Hannah Wei<\/a> on <a href=\"https:\/\/unsplash.com\/photos\/aso6SYJZGps\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a>.<\/p>\n<\/div>\n<p style=\"text-align: right;\"><em>Sponsored Post<\/em><\/p>\n<p>Me, a data scientist, and Jupyter notebooks. Well, our relationship started back then when I began to learn Python. Jupyter notebooks were my refuge when I wanted to make sure that my code worked. Nowadays, I teach coding and work on several data science projects, and notebooks are still the best tools for interactive coding and experimentation. Unfortunately, when you use notebooks in data science projects, things can get out of control quickly. Experimentation tends to produce monolithic notebooks that are hard to maintain and modify. And yes, it\u2019s very time-consuming to do the work twice: first experiment, then convert your code into Python scripts. Not to mention that it\u2019s painful to test such code, and version control is also a problem. This is the point where you think: there has to be a better way! Luckily for me, the answer is not to avoid my beloved Jupyter notebooks.<\/p>\n<p><span style=\"font-weight: 400;\">Follow me and get to know some awesome ideas from <\/span><a href=\"https:\/\/blancas.io\/\"><span style=\"font-weight: 400;\">Eduardo Blancas<\/span><\/a><span style=\"font-weight: 400;\"> and his project, <\/span><a href=\"https:\/\/ploomber.io\/\"><span style=\"font-weight: 400;\">Ploomber<\/span><\/a><span style=\"font-weight: 400;\">, on how to run better data science projects and how to create and use Jupyter notebooks wisely, even in production.<\/span><\/p>\n<h1><span style=\"font-weight: 400;\">Popular Jupyter notebooks<\/span><\/h1>\n<p><span style=\"font-weight: 400;\">Jupyter is a free and open-source web tool in which you write code in cells; each cell is sent to the back-end \u2018kernel\u2019, and you immediately get the results. 
One of my colleagues says it\u2019s like an old-school messenger application with code. The popularity of Jupyter notebooks has exploded in the past few years, thanks to their ability to combine software code, computational output, explanatory text, and multimedia resources in a single document [1]. Among other things, notebooks can be used for scientific computing, data exploration, tutorials, and interactive manuals. What is more, notebooks can speak dozens of languages (Jupyter got its name from Julia, Python, and R). <\/span><a href=\"https:\/\/github.com\/parente\/nbestimate\"><span style=\"font-weight: 400;\">One analysis<\/span><\/a><span style=\"font-weight: 400;\"> of the code-sharing site GitHub counted more than 7.5 million public Jupyter notebooks in January 2022. As a data scientist, I mainly use Jupyter notebooks for data wrangling with Python and R, and I also teach students Python basics via Jupyter notebooks.<\/span><\/p>\n<h1><span style=\"font-weight: 400;\">What\u2019s the problem with notebooks?<\/span><\/h1>\n<p><span style=\"font-weight: 400;\">Despite their popularity, many data scientists (including me) face problems with Jupyter notebooks [2]. I could not summarize them better myself, so I quote the words of <\/span><a href=\"https:\/\/www.youtube.com\/watch?v=7jiPeIFXb6U&amp;ab_channel=O%27Reilly\"><span style=\"font-weight: 400;\">Joel Grus<\/span><\/a><span style=\"font-weight: 400;\">, who explained some of the problems with notebooks [1].<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201cI have seen programmers get frustrated when notebooks don\u2019t behave as expected, usually because they inadvertently run code cells out of order. 
Jupyter notebooks also encourage poor coding practice by making it difficult to organize code logically, break it into reusable modules and develop tests to ensure the code is working properly.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Notebooks are hard to debug and test, and I have spent a lot of time in my career refactoring notebook code into scripts and functions that can be used in production. Version control is a problem too: notebooks are JSON files, so git produces a barely readable diff between versions, which makes the changes hard to follow [2]. <\/span><a href=\"https:\/\/ploomber.io\/blog\/nbs-production\/\"><span style=\"font-weight: 400;\">Here<\/span><\/a><span style=\"font-weight: 400;\"> you can find a more detailed summary and explanation of the problems with Jupyter notebooks.<\/span><\/p>\n<h1><span style=\"font-weight: 400;\">The quest for modularization<\/span><\/h1>\n<p><span style=\"font-weight: 400;\">The problems listed above could have been enough to lead me to <\/span><a href=\"https:\/\/ploomber.io\/\"><span style=\"font-weight: 400;\">Ploomber<\/span><\/a><span style=\"font-weight: 400;\">, but I discovered this awesome project through my quest for modularization. What I needed was a tool to easily create and run tasks or code snippets in a defined order without asking my data engineer colleagues for help. What I needed is called a pipeline. With a pipeline, one can split work into smaller components and automate them. Pipelines come in many shapes and sizes; one can even create pipelines in sklearn and pandas [3].<\/span><\/p>\n<p><a href=\"https:\/\/ploomber.io\/\"><span style=\"font-weight: 400;\">Ploomber<\/span><\/a><span style=\"font-weight: 400;\"> is an open-source project initiated by Eduardo Blancas to create Python pipelines. I found it an easy-to-use tool with which I could quickly define my tasks and their execution order and break my analysis into modular parts. 
Ploomber comes with several <\/span><a href=\"https:\/\/github.com\/ploomber\/projects\"><span style=\"font-weight: 400;\">sample projects<\/span><\/a><span style=\"font-weight: 400;\"> where you can find great examples of how to use the tool. I also share my experiments with Ploomber in <\/span><a href=\"https:\/\/github.com\/norabalogh\/ploomber-pipeline-demo\"><span style=\"font-weight: 400;\">this repo<\/span><\/a><span style=\"font-weight: 400;\">. What I especially like about Ploomber are the <\/span><a href=\"https:\/\/ploomber.io\/blog\/\"><span style=\"font-weight: 400;\">blog<\/span><\/a><span style=\"font-weight: 400;\"> and the <\/span><a href=\"https:\/\/ploomber.io\/community\"><span style=\"font-weight: 400;\">community on Slack<\/span><\/a><span style=\"font-weight: 400;\">, where I can ask anything about the project.<\/span><\/p>\n<h1><span style=\"font-weight: 400;\">Life-hacks from Eduardo Blancas<\/span><\/h1>\n<p><span style=\"font-weight: 400;\">Okay, I found a great tool to modularize my data science projects, but how did it help with my constant struggle with notebooks?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Well, Ploomber comes with <\/span><a href=\"https:\/\/github.com\/mwouts\/jupytext\"><span style=\"font-weight: 400;\">Jupytext<\/span><\/a><span style=\"font-weight: 400;\">, a package that lets us save notebooks as .py files but still interact with them as notebooks. That solved the version-control problem.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Then comes the refactoring and modularization problem. One does not have to get rid of notebooks, because Ploomber can handle notebooks as pipeline units. This way, I only have to clean up my notebooks, and I save the time I would otherwise spend converting them to a completely different code structure and architecture. It is also possible to mix notebooks and scripts in pipeline tasks. 
There\u2019s a <\/span><a href=\"https:\/\/ploomber.io\/blog\/refactor-nb-i\/\"><span style=\"font-weight: 400;\">blog post series<\/span><\/a><span style=\"font-weight: 400;\"> about how to break down monolithic notebooks into smaller parts. What I always tell students, and what Eduardo also suggests, is to write your notebook so that you can always restart the kernel and run all of your code from top to bottom. If a notebook takes a long time to run on a lot of data, add a sample parameter that selects a subset, so you can quickly check that your code runs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Besides the modularization life-hacks, another very important takeaway that I read on <\/span><a href=\"https:\/\/ploomber.io\/blog\/clean-nbs\/\"><span style=\"font-weight: 400;\">Ploomber\u2019s blog<\/span><\/a><span style=\"font-weight: 400;\"> and now apply at work is to lock the project\u2019s dependencies and to package the project so that its code can be imported from other notebooks. I have run into package-version problems in a few projects, so I can assure you that this can save you a few hours.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A project made of several shorter, cleaner notebooks instead of a few monolithic ones is easier to reproduce, understand, and modify. It also makes it possible to design a <\/span><a href=\"https:\/\/ploomber.io\/blog\/ml-testing-i\/\"><span style=\"font-weight: 400;\">testing strategy<\/span><\/a><span style=\"font-weight: 400;\"> for ML code. Several posts about why machine learning projects fail mention the difficulty of <\/span><a href=\"https:\/\/www.kdnuggets.com\/2021\/01\/top-5-reasons-why-machine-learning-projects-fail.html\"><span style=\"font-weight: 400;\">updating code<\/span><\/a><span style=\"font-weight: 400;\"> and time-consuming maintenance. 
With shorter, cleaner code, locked dependencies, and appropriate version control, maintenance and collaboration become easier and faster.<\/span><\/p>\n<h1><span style=\"font-weight: 400;\">Summary<\/span><\/h1>\n<p><span style=\"font-weight: 400;\">The ideas above are just a few of the main points I found useful on Ploomber\u2019s blog. They have given me a toolbox for splitting notebooks into modular parts and turning them into a pipeline in smaller projects. I like to share and teach ideas on how to write better notebooks and code, and these coding practices are worth considering.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you\u2019re interested in further details of Ploomber and how to work more efficiently with notebooks, make sure to check out Eduardo Blancas\u2019s talk about his project at the <\/span><a href=\"https:\/\/reinforceconf.com\/\"><span style=\"font-weight: 400;\">Reinforce AI Conference<\/span><\/a><span style=\"font-weight: 400;\"> this March! Who could tell us more than the CEO and Co-founder of Ploomber himself?<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">References<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">[1]<\/span> <a href=\"https:\/\/www.nature.com\/articles\/d41586-018-07196-1#author-0\"><span style=\"font-weight: 400;\">Jeffrey M. Perkel<\/span><\/a><span style=\"font-weight: 400;\"> (2018). <\/span><a href=\"https:\/\/www.nature.com\/articles\/d41586-018-07196-1#author-0\"><span style=\"font-weight: 400;\">Why Jupyter is data scientists\u2019 computational notebook of choice<\/span><\/a><span style=\"font-weight: 400;\">. Nature 563, 145-146.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">[2] Eduardo Blancas (2021). <\/span><a href=\"https:\/\/ploomber.io\/blog\/nbs-production\/\"><span style=\"font-weight: 400;\">Why (and how) to put notebooks in production<\/span><\/a><span style=\"font-weight: 400;\">. 
Ploomber.io blog.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">[3] Anouk Dutr\u00e9e (2021). <\/span><a href=\"https:\/\/towardsdatascience.com\/data-pipelines-what-why-and-which-ones-1f674ba49946\"><span style=\"font-weight: 400;\">Data pipelines: What, why and which ones<\/span><\/a><span style=\"font-weight: 400;\">. Towards Data Science blog.<\/span><\/p>\n<p>\u00a0<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/data-science-notebook-life-hacks-i-learned-from-ploomber\/\">Data Science Notebook Life-Hacks I Learned From Ploomber<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/data-science-notebook-life-hacks-i-learned-from-ploomber\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: N\u00f3ra Balogh Photo by Hannah Wei on Unsplash. Sponsored Post Me, a data scientist, and Jupyter notebooks. 
Well, our relationship started back then when [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/03\/03\/data-science-notebook-life-hacks-i-learned-from-ploomber\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":5462,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5461"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5461"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5461\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/5462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5461"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5461"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5461"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}