{"id":5438,"date":"2022-02-22T15:44:45","date_gmt":"2022-02-22T15:44:45","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/02\/22\/easier-experimenting-in-python\/"},"modified":"2022-02-22T15:44:45","modified_gmt":"2022-02-22T15:44:45","slug":"easier-experimenting-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/02\/22\/easier-experimenting-in-python\/","title":{"rendered":"Easier experimenting in Python"},"content":{"rendered":"<p>Author: Adrian Tam<\/p>\n<div>\n<p>When we work on a machine learning project, quite often we need to experiment with multiple alternatives. Some features in Python allows us to try out different options without much effort. In this tutorial, we are going to see some tips to make our experiments faster.<\/p>\n<p>After finishing this tutorial, you will learn<\/p>\n<ul>\n<li>How to leverage on duck-typing feature to easily swapping functions and objects<\/li>\n<li>How making components drop-in replacement with each other can help experiments faster<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_13220\" style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-13220\" class=\"size-full wp-image-13220\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/02\/jake-givens-iR8m2RRo-z4-unsplash-scaled.jpg\" alt=\"\" width=\"800\"><\/p>\n<p id=\"caption-attachment-13220\" class=\"wp-caption-text\">Easier experimenting in Python. Photo by <a href=\"https:\/\/unsplash.com\/photos\/iR8m2RRo-z4\">Jake Givens<\/a>. 
Some rights reserved<\/p>\n<\/div>\n<h2 id=\"Overview\">Overview<\/h2>\n<p>This tutorial is in three parts; they are:<\/p>\n<ul>\n<li>Workflow of a machine learning project<\/li>\n<li>Functions as objects<\/li>\n<li>Caveats<\/li>\n<\/ul>\n<h2 id=\"Workflow-of-a-machine-learning-project\">Workflow of a machine learning project<\/h2>\n<p>Consider a very simple machine learning project, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.svm import SVC\r\n\r\n# Load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.csv\"\r\nnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']\r\ndataset = read_csv(url, names=names)\r\n\r\n# Split-out validation dataset\r\narray = dataset.values\r\nX = array[:,0:4]\r\ny = array[:,4]\r\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)\r\n\r\n# Train\r\nclf = SVC()\r\nclf.fit(X_train, y_train)\r\n\r\n# Test\r\nscore = clf.score(X_val, y_val)\r\nprint(\"Validation accuracy\", score)<\/pre>\n<p>This is a typical machine learning project workflow: we preprocess the data, then train a model, and afterwards evaluate the result. But at each step, we may want to try something different. For example, we may wonder if normalizing the data would improve the result. 
So we may rewrite the code above into the following:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import StandardScaler\r\n\r\n# Load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.csv\"\r\nnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']\r\ndataset = read_csv(url, names=names)\r\n\r\n# Split-out validation dataset\r\narray = dataset.values\r\nX = array[:,0:4]\r\ny = array[:,4]\r\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)\r\n\r\n# Train\r\nclf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])\r\nclf.fit(X_train, y_train)\r\n\r\n# Test\r\nscore = clf.score(X_val, y_val)\r\nprint(\"Validation accuracy\", score)<\/pre>\n<p>So far so good. But what if we keep experimenting with different datasets, different models, or different score functions? Flipping between using a scaler and not using one each time would mean a lot of code changes, and it is quite easy to make mistakes.<\/p>\n<p>Because Python supports duck typing, we can see that the following two classifier models implement the same interface:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">clf = SVC()\r\nclf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])<\/pre>\n<p>Therefore, we can simply select between these two versions and keep everything else intact. 
We can say these two models are\u00a0<strong>drop-in replacements<\/strong>\u00a0for each other.<\/p>\n<p>Making use of this property, we can create a toggle variable to control the design choice we make:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">USE_SCALER = True\r\n\r\nif USE_SCALER:\r\n    clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])\r\nelse:\r\n    clf = SVC()<\/pre>\n<p>By toggling the variable\u00a0<code>USE_SCALER<\/code>\u00a0between\u00a0<code>True<\/code>\u00a0and\u00a0<code>False<\/code>, we can select whether a scaler should be applied. A more complex example would be to select among different scalers and classifier models, such as:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">SCALER = \"standard\"\r\nCLASSIFIER = \"svc\"\r\n\r\nif CLASSIFIER == \"svc\":\r\n    model = SVC()\r\nelif CLASSIFIER == \"cart\":\r\n    model = DecisionTreeClassifier()\r\nelse:\r\n    raise NotImplementedError\r\n\r\nif SCALER == \"standard\":\r\n    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])\r\nelif SCALER == \"maxmin\":\r\n    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])\r\nelif SCALER is None:\r\n    clf = model\r\nelse:\r\n    raise NotImplementedError<\/pre>\n<p>A complete example is as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import StandardScaler, MinMaxScaler\r\n\r\n# toggle between options\r\nSCALER = \"maxmin\"    # \"standard\", \"maxmin\", or None\r\nCLASSIFIER = \"cart\"  # \"svc\" or \"cart\"\r\n\r\n# Load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.csv\"\r\nnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']\r\ndataset = read_csv(url, names=names)\r\n\r\n# Split-out validation dataset\r\narray = dataset.values\r\nX = array[:,0:4]\r\ny = array[:,4]\r\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)\r\n\r\n# Create model\r\nif CLASSIFIER == \"svc\":\r\n    model = SVC()\r\nelif CLASSIFIER == \"cart\":\r\n    model = DecisionTreeClassifier()\r\nelse:\r\n    raise NotImplementedError\r\n\r\nif SCALER == \"standard\":\r\n    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])\r\nelif SCALER == \"maxmin\":\r\n    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])\r\nelif SCALER is None:\r\n    clf = model\r\nelse:\r\n    raise NotImplementedError\r\n\r\n# Train\r\nclf.fit(X_train, y_train)\r\n\r\n# Test\r\nscore = clf.score(X_val, y_val)\r\nprint(\"Validation accuracy\", score)<\/pre>\n<\/p>\n<h2 id=\"Functions-as-objects\">Functions as objects<\/h2>\n<p>In Python, functions are first-class citizens: you can assign a function to a variable. Indeed, functions are objects in Python, as are classes (the classes themselves, not only their instances). Therefore, we can use the same technique as above to experiment with similar functions.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import numpy as np\r\n\r\nDIST = \"normal\"\r\n\r\nif DIST == \"normal\":\r\n    rangen = np.random.normal\r\nelif DIST == \"uniform\":\r\n    rangen = np.random.uniform\r\nelse:\r\n    raise NotImplementedError\r\n\r\nrandom_data = rangen(size=(10,5))\r\nprint(random_data)<\/pre>\n<p>The above is similar to calling\u00a0<code>np.random.normal(size=(10,5))<\/code>, but we hold the function in a variable for the convenience of swapping one function for another. Note that since we call the functions with the same argument, we have to make sure all variations accept it. If one does not, we may need a few additional lines of code to make a wrapper. 
For example, to generate Student\u2019s t distribution, we need an additional parameter for the degrees of freedom:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import numpy as np\r\n\r\nDIST = \"t\"\r\n\r\nif DIST == \"normal\":\r\n    rangen = np.random.normal\r\nelif DIST == \"uniform\":\r\n    rangen = np.random.uniform\r\nelif DIST == \"t\":\r\n    def t_wrapper(size):\r\n        # Student's t distribution with 3 degrees of freedom\r\n        return np.random.standard_t(df=3, size=size)\r\n    rangen = t_wrapper\r\nelse:\r\n    raise NotImplementedError\r\n\r\nrandom_data = rangen(size=(10,5))\r\nprint(random_data)<\/pre>\n<p>This works because\u00a0<code>np.random.normal<\/code>,\u00a0<code>np.random.uniform<\/code>, and the\u00a0<code>t_wrapper<\/code> we defined above are all drop-in replacements for each other.<\/p>\n<h2 id=\"Caveats\">Caveats<\/h2>\n<p>Machine learning differs from other programming projects because there is more uncertainty in the workflow. When you build a web page or a game, you have a picture in your mind of what to achieve, but machine learning projects involve some exploratory work.<\/p>\n<p>In other projects, you will probably use a source code control system such as git or Mercurial to manage your development history. In machine learning projects, however, we are trying out different\u00a0<strong>combinations<\/strong> of many steps. Using git to manage the different variations may not fit well, and can sometimes be overkill. Therefore, using a toggle variable to control the flow lets us try out different things faster. This is especially handy when we are working on our projects in Jupyter notebooks.<\/p>\n<p>However, putting multiple versions of code together makes the program clumsy and less readable. It is better to do some cleanup once we have confirmed what to do. 
This will help with maintenance in the future.<\/p>\n<h2 id=\"Further-reading\">Further reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h4 id=\"Books\">Books<\/h4>\n<ul>\n<li>Fluent Python, second edition, by Luciano Ramalho,\u00a0<a href=\"https:\/\/www.amazon.com\/dp\/1492056359\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.amazon.com\/dp\/1492056359\/<\/a>\n<\/li>\n<\/ul>\n<h2 id=\"Summary\">Summary<\/h2>\n<p>In this tutorial, you\u2019ve seen how the duck-typing property of Python helps us create drop-in replacements. Specifically, you learned:<\/p>\n<ul>\n<li>Duck typing can help us switch between alternatives easily in a machine learning workflow<\/li>\n<li>We can make use of a toggle variable to experiment among alternatives<\/li>\n<\/ul>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/easier-experimenting-in-python\/\">Easier experimenting in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/easier-experimenting-in-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Adrian Tam When we work on a machine learning project, quite often we need to experiment with multiple alternatives. 
Some features in Python allow [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/02\/22\/easier-experimenting-in-python\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":5439,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5438"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5438"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5438\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/5439"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5438"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5438"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5438"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}