{"id":2936,"date":"2019-12-17T18:00:06","date_gmt":"2019-12-17T18:00:06","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/17\/results-for-standard-classification-and-regression-machine-learning-datasets\/"},"modified":"2019-12-17T18:00:06","modified_gmt":"2019-12-17T18:00:06","slug":"results-for-standard-classification-and-regression-machine-learning-datasets","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/17\/results-for-standard-classification-and-regression-machine-learning-datasets\/","title":{"rendered":"Results for Standard Classification and Regression Machine Learning Datasets"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>It is important that beginner machine learning practitioners practice on small real-world datasets.<\/p>\n<p>So-called standard machine learning datasets contain actual observations, fit into memory, and are well studied and well understood. As such, they can be used by beginner practitioners to quickly test, explore, and practice data preparation and modeling techniques.<\/p>\n<p>A practitioner can confirm whether they have the data skills required to achieve a good result on a standard machine learning dataset. 
A good result is one above the 80th or 90th percentile of what may be technically possible for a given dataset.<\/p>\n<p>The skills developed by practitioners on standard machine learning datasets can provide the foundation for tackling larger, more challenging projects.<\/p>\n<p>In this post, you will discover standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.<\/p>\n<p>After reading this post, you will know:<\/p>\n<ul>\n<li>The importance of standard machine learning datasets.<\/li>\n<li>How to systematically evaluate a model on a standard machine learning dataset.<\/li>\n<li>Standard datasets for classification and regression and the baseline and good performance expected on each.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9253\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9253\" class=\"size-full wp-image-9253\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Results-for-Standard-Classification-and-Regression-Machine-Learning-Datasets.jpg\" alt=\"Results for Standard Classification and Regression Machine Learning Datasets\" width=\"800\" height=\"531\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Results-for-Standard-Classification-and-Regression-Machine-Learning-Datasets.jpg 800w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Results-for-Standard-Classification-and-Regression-Machine-Learning-Datasets-300x199.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Results-for-Standard-Classification-and-Regression-Machine-Learning-Datasets-768x510.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-9253\" class=\"wp-caption-text\">Results for Standard Classification and Regression Machine Learning 
Datasets<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/iwaswired\/4562196414\/\">Don Dearing<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Overview<\/h2>\n<p>This tutorial is divided into seven parts; they are:<\/p>\n<ol>\n<li>Value of Small Machine Learning Datasets<\/li>\n<li>Definition of a Standard Machine Learning Dataset<\/li>\n<li>Standard Machine Learning Datasets<\/li>\n<li>Good Results for Standard Datasets<\/li>\n<li>Model Evaluation Methodology<\/li>\n<li>Results for Classification Datasets\n<ol>\n<li>Binary Classification Datasets\n<ol>\n<li>Ionosphere<\/li>\n<li>Pima Indian Diabetes<\/li>\n<li>Sonar<\/li>\n<li>Wisconsin Breast Cancer<\/li>\n<li>Horse Colic<\/li>\n<\/ol>\n<\/li>\n<li>Multiclass Classification Datasets\n<ol>\n<li>Iris Flowers<\/li>\n<li>Glass<\/li>\n<li>Wine<\/li>\n<li>Wheat Seeds<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n<li>Results for Regression Datasets\n<ol>\n<li>Housing<\/li>\n<li>Auto Insurance<\/li>\n<li>Abalone<\/li>\n<li>Auto Imports<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Value of Small Machine Learning Datasets<\/h2>\n<p>There are a number of small machine learning datasets for classification and regression predictive modeling problems that are frequently reused.<\/p>\n<p>Sometimes the datasets are used as the basis for demonstrating a machine learning or data preparation technique. Other times, they are used as a basis for comparing different techniques.<\/p>\n<p>These datasets were collected and made publicly available in the early days of applied machine learning when data and real-world datasets were scarce. As such, they have become standard, or canonized, through their wide adoption and reuse alone, not because the problems themselves are intrinsically interesting.<\/p>\n<p>Finding a good model on one of these datasets does not mean you have \u201c<em>solved<\/em>\u201d the general problem. 
Also, some of the datasets may contain names or indicators that might be considered questionable or culturally insensitive (<em>which was very likely not the intent when the data was collected<\/em>). As such, they are also sometimes referred to as \u201c<em>toy<\/em>\u201d datasets.<\/p>\n<p>Such datasets are not really useful as points of comparison for machine learning algorithms, as most empirical experiments are nearly impossible to reproduce.<\/p>\n<p>Nevertheless, such datasets remain valuable in the field of applied machine learning today, even in the era of standard machine learning libraries, big data, and the abundance of data.<\/p>\n<p>There are three main reasons why they are valuable; they are:<\/p>\n<ol>\n<li>The datasets are <strong>real<\/strong>.<\/li>\n<li>The datasets are <strong>small<\/strong>.<\/li>\n<li>The datasets are <strong>understood<\/strong>.<\/li>\n<\/ol>\n<p><strong>Real datasets<\/strong> are useful as compared to <a href=\"https:\/\/machinelearningmastery.com\/generate-test-datasets-python-scikit-learn\/\">contrived datasets<\/a> because they are messy. They may contain measurement errors, missing values, mislabeled examples, and more. Some or all of these issues must be searched for and addressed, and they are among the properties we may encounter when working on our own projects.<\/p>\n<p><strong>Small datasets<\/strong> are useful as compared to large datasets that may be many gigabytes in size. Small datasets can easily fit into memory and allow many different data visualization, data preparation, and modeling algorithms to be tested and explored easily and quickly. Speed of testing ideas and getting feedback is critical for beginners, and small datasets facilitate exactly this.<\/p>\n<p><strong>Understood datasets<\/strong> are useful as compared to new or newly created datasets. 
The features are well defined, the units of the features are specified, the source of the data is known, and the dataset has been well studied in tens, hundreds, and in some cases, thousands of research projects and papers. This provides a context in which results can be compared and evaluated, a property not available in entirely new domains.<\/p>\n<p>Given these properties, I strongly advocate that machine learning beginners (and practitioners who are new to a specific technique) start with standard machine learning datasets.<\/p>\n<h2>Definition of a Standard Machine Learning Dataset<\/h2>\n<p>I would like to go one step further and define some more specific properties of a \u201c<em>standard<\/em>\u201d machine learning dataset.<\/p>\n<p>A standard machine learning dataset has the following properties:<\/p>\n<ul>\n<li>Less than 10,000 rows (samples).<\/li>\n<li>Less than 100 columns (features).<\/li>\n<li>Last column is the target variable.<\/li>\n<li>Stored in a single CSV file without a header line.<\/li>\n<li>Missing values marked with a question mark character (\u2018?\u2019).<\/li>\n<li>It is possible to achieve a better than naive result.<\/li>\n<\/ul>\n<p>Now that we have a clear definition of a dataset, let\u2019s look at the standard datasets themselves before turning to what a \u201c<em>good<\/em>\u201d result means.<\/p>\n<h2>Standard Machine Learning Datasets<\/h2>\n<p>A dataset is a standard machine learning dataset if it is frequently used in books, research papers, tutorials, presentations, and more.<\/p>\n<p>The best repository for these so-called classical or standard machine learning datasets is the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/index.php\">University of California at Irvine (UCI) machine learning repository<\/a>. 
This website categorizes datasets by type and provides downloads of the data, additional information about each dataset, and references to relevant papers.<\/p>\n<p>I have chosen five or fewer datasets for each problem type as a starting point.<\/p>\n<p>All standard datasets used in this post are available on GitHub here:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/jbrownlee\/Datasets\">Machine Learning Mastery Datasets<\/a><\/li>\n<\/ul>\n<p>Download links are also provided for each dataset and for additional details about the dataset (the so-called \u201c<em>.names<\/em>\u201d file).<\/p>\n<p>Each code example will automatically download a given dataset for you. If this is a problem, you can download the CSV file manually, place it in the same directory as the code example, then change the code example to use the filename instead of the URL.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# load dataset\r\ndataframe = read_csv('ionosphere.csv', header=None)<\/pre>\n<\/p>\n<h2>Good Results for Standard Datasets<\/h2>\n<p>A challenge for beginners when working with standard machine learning datasets is knowing what represents a good result.<\/p>\n<p>In general, a model is skillful if it can demonstrate a performance that is better than a naive method, such as predicting the majority class in classification or the mean value in regression. This is called a baseline model, or a baseline of performance, and it provides a relative measure of performance specific to a dataset. You can learn more about this here:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-know-if-your-machine-learning-model-has-good-performance\/\">How To Know if Your Machine Learning Model Has Good Performance<\/a><\/li>\n<\/ul>\n<p>Given that we now have a method for determining whether a model has skill on a dataset, beginners remain interested in the upper limits of performance for a given dataset. 
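<\/p>
<p>As a minimal sketch of the naive methods just described, the classification and regression baselines can be written with scikit-learn\u2019s dummy estimators. Note the synthetic data from make_classification() and make_regression() here is an illustrative stand-in, not one of the datasets covered below:<\/p>

```python
# Hedged sketch: naive baselines via scikit-learn's "dummy" estimators.
# The synthetic data, sample sizes, and class weights are illustrative
# assumptions, not values used elsewhere in this post.
from numpy import mean
from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import cross_val_score

# classification baseline: always predict the majority class
X, y = make_classification(n_samples=200, weights=[0.7, 0.3], random_state=1)
clf = DummyClassifier(strategy='most_frequent')
print('Baseline accuracy: %.3f' % mean(cross_val_score(clf, X, y, scoring='accuracy', cv=10)))

# regression baseline: always predict the mean of the training targets
# (scikit-learn negates MAE so that larger scores are always better)
X, y = make_regression(n_samples=200, noise=0.1, random_state=1)
reg = DummyRegressor(strategy='mean')
print('Baseline MAE: %.3f' % -mean(cross_val_score(reg, X, y, scoring='neg_mean_absolute_error', cv=10)))
```

<p>A model that cannot improve upon such baseline scores can be said to have no skill on the dataset.<\/p>
<p>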
This is required information to know whether you are \u201c<em>getting good<\/em>\u201d at the process of applied machine learning.<\/p>\n<p>Good does not mean perfect predictions. All models will have prediction errors, and perfect predictions are not possible (or tractable) on real-world datasets.<\/p>\n<p>Defining \u201c<em>good<\/em>\u201d or \u201c<em>best<\/em>\u201d results for a dataset is challenging because it depends generally on the model evaluation methodology and specifically on the versions of the dataset and libraries used in the evaluation.<\/p>\n<p>Good means \u201c<em>good-enough<\/em>\u201d given available resources. Often, this means a skill score that is above the 80th or 90th percentile of what might be possible for a dataset given unbounded skill, time, and computational resources.<\/p>\n<p>In this tutorial, you will discover how to calculate the baseline performance and \u201c<em>good<\/em>\u201d (near-best) performance that is possible on each dataset. You will also see the data preparation and model used to achieve that performance.<\/p>\n<p>Rather than explain how to do this, a short Python code example is given that you can use to reproduce the baseline and good result.<\/p>\n<h2>Model Evaluation Methodology<\/h2>\n<p>The evaluation methodology is simple and fast, and generally recommended when working with small predictive modeling problems.<\/p>\n<p>The evaluation procedure is as follows:<\/p>\n<ul>\n<li>A model is evaluated using 10-fold cross-validation.<\/li>\n<li>The evaluation procedure is repeated three times.<\/li>\n<li>The random seed for the cross-validation split is the repeat number (1, 2, or 3).<\/li>\n<\/ul>\n<p>This results in 30 estimates of model performance from which a mean and standard deviation can be calculated to summarize the performance of a given model.<\/p>\n<p>Using the repeat number as the seed for each cross-validation split ensures that each algorithm evaluated on the dataset gets 
the same splits of the data, ensuring a fair direct comparison.<\/p>\n<p>Using the scikit-learn Python machine learning library, the example below can be used to evaluate a given model (or <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a>). The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold<\/a> class defines the number of folds and repeats for classification, and the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.cross_val_score.html\">cross_val_score() function<\/a> defines the score, performs the evaluation, and returns a list of scores from which a mean and standard deviation can be calculated.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')<\/pre>\n<p>For regression, we can use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedKFold.html\">RepeatedKFold<\/a> class and the mean absolute error (MAE) score.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nscores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')<\/pre>\n<p>The \u201c<em>good<\/em>\u201d scores reported are the best that I can get out of my own personal set of \u201c<em>get a good result fast on a given dataset<\/em>\u201d scripts. I believe they represent good scores that can be achieved on each dataset, perhaps in the 90th or 95th percentile of what is possible, if not better.<\/p>\n<p>That being said, I am not claiming that they are the best possible scores, as I have not performed hyperparameter tuning for the well-performing models. 
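<\/p>
<p>For readers who do want to tune, a hedged sketch of one possible approach uses scikit-learn\u2019s GridSearchCV with the same repeated stratified cross-validation. The synthetic dataset and the grid values below are illustrative assumptions, not the configurations behind the scores reported in this post:<\/p>

```python
# Hedged sketch: hyperparameter tuning with GridSearchCV (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

# synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, random_state=1)
# same evaluation scheme as the rest of the post: 10 folds, 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# small, assumed grid of SVC hyperparameters
grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), grid, scoring='accuracy', cv=cv, n_jobs=-1)
search.fit(X, y)
print('Best: %.3f using %s' % (search.best_score_, search.best_params_))
```

<p>Each candidate configuration is scored with the same 30 cross-validation estimates, so the best configuration remains directly comparable to the untuned results.<\/p>
<p>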
I leave this as an exercise for interested practitioners. The best possible score is not required to show that a practitioner can address a given dataset; a top-percentile score is more than sufficient to demonstrate competence.<\/p>\n<p><strong>Note<\/strong>: I will update the results and models as I improve my own personal scripts and achieve better scores.<\/p>\n<p><strong>Can you get a better score for a dataset?<\/strong><br \/>\nI would love to know. Share your model and score in the comments below and I will try to reproduce it and update the post (<em>and give you full credit!<\/em>)<\/p>\n<p>Let\u2019s dive in.<\/p>\n<h2>Results for Classification Datasets<\/h2>\n<p>Classification is a predictive modeling problem that predicts one label given one or more input variables.<\/p>\n<p>The baseline model for classification tasks is a model that predicts the majority label. This can be achieved in scikit-learn using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class with the \u2018<em>most_frequent<\/em>\u2019 strategy; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = DummyClassifier(strategy='most_frequent')<\/pre>\n<p>The standard evaluation for classification models is classification accuracy, although this is not ideal for imbalanced and some multi-class problems. Nevertheless, for better or worse, this score will be used (for now).<\/p>\n<p>Accuracy is reported as a fraction between 0 (0% or no skill) and 1 (100% or perfect skill).<\/p>\n<p>There are two main types of classification tasks: binary and multi-class classification, divided based on whether the number of labels to be predicted for a given dataset is two or more than two, respectively. 
Given the prevalence of classification tasks in machine learning, we will treat these two subtypes of classification problems separately.<\/p>\n<h3>Binary Classification Datasets<\/h3>\n<p>In this section, we will review the baseline and good performance on the following binary classification predictive modeling datasets:<\/p>\n<ol>\n<li>Ionosphere<\/li>\n<li>Pima Indian Diabetes<\/li>\n<li>Sonar<\/li>\n<li>Wisconsin Breast Cancer<\/li>\n<li>Horse Colic<\/li>\n<\/ol>\n<h4>Ionosphere<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ionosphere.csv\">ionosphere.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ionosphere.names\">ionosphere.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Ionosphere\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.svm import SVC\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ionosphere.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, 
scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = SVC(kernel='rbf', gamma='scale', C=10)\r\nsteps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (351, 34), (351,)\r\nBaseline: 0.641 (0.006)\r\nGood: 0.948 (0.033)<\/pre>\n<\/p>\n<h4>Pima Indian Diabetes<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">pima-indians-diabetes.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.names\">pima-indians-diabetes.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Pima Indian Diabetes\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.linear_model import LogisticRegression\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = 
DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = LogisticRegression(solver='newton-cg',penalty='l2',C=1)\r\nm_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<p>Note: you may see some warnings, but they can be safely ignored.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (768, 8), (768,)\r\nBaseline: 0.651 (0.003)\r\nGood: 0.774 (0.055)<\/pre>\n<\/p>\n<h4>Sonar<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">sonar.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">sonar.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Sonar\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare 
dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = KNeighborsClassifier(n_neighbors=2, metric='minkowski', weights='distance')\r\nsteps = [('p',PowerTransformer()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (208, 60), (208,)\r\nBaseline: 0.534 (0.012)\r\nGood: 0.882 (0.071)<\/pre>\n<\/p>\n<h4>Wisconsin Breast Cancer<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer-wisconsin.csv\">breast-cancer-wisconsin.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer-wisconsin.names\">breast-cancer-wisconsin.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Wisconsin Breast Cancer\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.impute import SimpleImputer\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.svm import 
SVC\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer-wisconsin.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = SVC(kernel='sigmoid', gamma='scale', C=0.1)\r\nsteps = [('i',SimpleImputer(strategy='median')), ('p',PowerTransformer()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<p>Note: you may see some warnings, but they can be safely ignored.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (699, 9), (699,)\r\nBaseline: 0.655 (0.003)\r\nGood: 0.973 (0.019)<\/pre>\n<\/p>\n<h4>Horse Colic<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv\">horse-colic.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.names\">horse-colic.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Horse Colic\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom 
sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.impute import SimpleImputer\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom xgboost import XGBClassifier\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = XGBClassifier(learning_rate=0.1, n_estimators=100, subsample=1, max_depth=3, colsample_bynode=0.4)\r\nsteps = [('i',SimpleImputer(strategy='median')), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (300, 27), (300,)\r\nBaseline: 0.670 (0.007)\r\nGood: 0.852 (0.048)<\/pre>\n<\/p>\n<h3>Multiclass Classification Datasets<\/h3>\n<p>In this section, we will review the baseline and good performance on the following multiclass classification predictive modeling datasets:<\/p>\n<ol>\n<li>Iris Flowers<\/li>\n<li>Glass<\/li>\n<li>Wine<\/li>\n<li>Wheat Seeds<\/li>\n<\/ol>\n<h4>Iris Flowers<\/h4>\n<ul>\n<li>Download: <a 
href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.csv\">iris.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.names\">iris.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Iris\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/iris.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = LinearDiscriminantAnalysis()\r\nsteps = [('p',PowerTransformer()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (150, 
4), (150,)\r\nBaseline: 0.333 (0.000)\r\nGood: 0.980 (0.039)<\/pre>\n<\/p>\n<h4>Glass<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv\">glass.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.names\">glass.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Glass\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = RandomForestClassifier(n_estimators=100,max_features=2)\r\nm_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (214, 9), (214,)\r\nBaseline: 0.356 (0.013)\r\nGood: 0.744 
(0.085)<\/pre>\n<\/p>\n<h4>Wine<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv\">wine.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.names\">wine.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Wine\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = QuadraticDiscriminantAnalysis()\r\nsteps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), 
std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (178, 13), (178,)\r\nBaseline: 0.399 (0.017)\r\nGood: 0.992 (0.020)<\/pre>\n<\/p>\n<h4>Wheat Seeds<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wheat-seeds.csv\">wheat-seeds.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wheat-seeds.names\">wheat-seeds.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Wheat Seeds\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.dummy import DummyClassifier\r\nfrom sklearn.linear_model import RidgeClassifier\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wheat-seeds.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# evaluate naive\r\nnaive = DummyClassifier(strategy='most_frequent')\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = RidgeClassifier(alpha=0.2)\r\nsteps = [('s',StandardScaler()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\nm_scores = 
cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (210, 7), (210,)\r\nBaseline: 0.333 (0.000)\r\nGood: 0.973 (0.036)<\/pre>\n<\/p>\n<h2>Results for Regression Datasets<\/h2>\n<p>Regression is a predictive modeling problem that predicts a numerical value given one or more input variables.<\/p>\n<p>The baseline model for regression tasks is a model that predicts the mean or median value. This can be achieved in scikit-learn using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyRegressor.html\">DummyRegressor<\/a> class with the \u2018<em>median<\/em>\u2019 strategy; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = DummyRegressor(strategy='median')<\/pre>\n<p>The standard evaluation for regression models is mean absolute error (MAE), although this is not ideal for all regression problems. 
Nevertheless, for better or worse, this score will be used (for now).<\/p>\n<p>MAE is reported as an error score that ranges from 0 (perfect skill) upward, with no upper bound (no skill).<\/p>\n<p>In this section, we will review the baseline and good performance on the following regression predictive modeling datasets:<\/p>\n<ol>\n<li>Housing<\/li>\n<li>Auto Insurance<\/li>\n<li>Abalone<\/li>\n<li>Auto Imports<\/li>\n<\/ol>\n<h4>Housing<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.csv\">housing.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.names\">housing.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Housing\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import absolute\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom sklearn.dummy import DummyRegressor\r\nfrom xgboost import XGBRegressor\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = y.astype('float32')\r\n# evaluate naive\r\nnaive = DummyRegressor(strategy='median')\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nn_scores = absolute(n_scores)\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = XGBRegressor(learning_rate=0.1, n_estimators=100, subsample=0.7, max_depth=9, 
colsample_bynode=0.6, objective='reg:squarederror')\r\nm_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nm_scores = absolute(m_scores)\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (506, 13), (506,)\r\nBaseline: 6.660 (0.706)\r\nGood: 1.955 (0.279)<\/pre>\n<\/p>\n<h4>Auto Insurance<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto-insurance.csv\">auto-insurance.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto-insurance.names\">auto-insurance.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Auto Insurance\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import absolute\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.compose import TransformedTargetRegressor\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.dummy import DummyRegressor\r\nfrom sklearn.linear_model import HuberRegressor\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto-insurance.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\nX = X.astype('float32')\r\ny = y.astype('float32')\r\n# evaluate naive\r\nnaive = DummyRegressor(strategy='median')\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(naive, X, y, 
scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nn_scores = absolute(n_scores)\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = HuberRegressor(epsilon=1.0, alpha=0.001)\r\nsteps = [('p',PowerTransformer()), ('m',model)]\r\npipeline = Pipeline(steps=steps)\r\ntarget = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())\r\nm_scores = cross_val_score(target, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nm_scores = absolute(m_scores)\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (63, 1), (63,)\r\nBaseline: 66.624 (19.303)\r\nGood: 28.358 (9.747)<\/pre>\n<\/p>\n<h4>Abalone<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.csv\">abalone.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.names\">abalone.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Abalone\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import absolute\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.compose import TransformedTargetRegressor\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.dummy import DummyRegressor\r\nfrom sklearn.svm import SVR\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.csv'\r\ndataframe = read_csv(url, 
header=None)\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\n# minimally prepare dataset\r\ny = y.astype('float32')\r\n# evaluate naive\r\nnaive = DummyRegressor(strategy='median')\r\ntransform = ColumnTransformer(transformers=[('c', OneHotEncoder(), [0])], remainder='passthrough')\r\npipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nn_scores = absolute(n_scores)\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = SVR(kernel='rbf',gamma='scale',C=10)\r\ntarget = TransformedTargetRegressor(regressor=model, transformer=PowerTransformer(), check_inverse=False)\r\npipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',target)])\r\nm_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nm_scores = absolute(m_scores)\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (4177, 8), (4177,)\r\nBaseline: 2.363 (0.116)\r\nGood: 1.460 (0.075)<\/pre>\n<\/p>\n<h4>Auto Imports<\/h4>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto_imports.csv\">auto_imports.csv<\/a><\/li>\n<li>Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto_imports.names\">auto_imports.names<\/a><\/li>\n<\/ul>\n<p>The complete code example for achieving baseline and a good result on this dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline and good result for Auto Imports\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import absolute\r\nfrom pandas import 
read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.impute import SimpleImputer\r\nfrom sklearn.dummy import DummyRegressor\r\nfrom sklearn.ensemble import RandomForestRegressor\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/auto_imports.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\nprint('Shape: %s, %s' % (X.shape,y.shape))\r\ny = y.astype('float32')\r\n# evaluate naive\r\nnaive = DummyRegressor(strategy='median')\r\ncat_ix = [2,3,4,5,6,7,8,14,15,17]\r\nnum_ix = [0,1,9,10,11,12,13,16,18,19,20,21,22,23,24]\r\nsteps = [('c', Pipeline(steps=[('s',SimpleImputer(strategy='most_frequent')),('oe',OneHotEncoder(handle_unknown='ignore'))]), cat_ix), ('n', SimpleImputer(strategy='median'), num_ix)]\r\ntransform = ColumnTransformer(transformers=steps, remainder='passthrough')\r\npipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nn_scores = absolute(n_scores)\r\nprint('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# evaluate model\r\nmodel = RandomForestRegressor(n_estimators=100,max_features=10)\r\npipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',model)])\r\nm_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nm_scores = absolute(m_scores)\r\nprint('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example, you should see the following 
results.<\/p>\n<pre class=\"crayon-plain-tag\">Shape: (201, 25), (201,)\r\nBaseline: 5880.718 (1197.967)\r\nGood: 1405.420 (317.683)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-know-if-your-machine-learning-model-has-good-performance\/\">How To Know if Your Machine Learning Model Has Good Performance<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/index.php\">UCI Machine Learning Repository<\/a><\/li>\n<li><a href=\"http:\/\/www.is.umk.pl\/~duch\/projects\/projects\/datasets-stat.html\">Statlog Datasets: comparison of results<\/a><\/li>\n<li><a href=\"http:\/\/www.is.umk.pl\/~duch\/projects\/projects\/datasets.html\">Datasets used for classification: comparison of results<\/a><\/li>\n<li><a href=\"https:\/\/amzn.to\/2lDHgeK\">Machine Learning, Neural and Statistical Classification<\/a>, 1994.<\/li>\n<li><a href=\"http:\/\/www1.maths.leeds.ac.uk\/~charles\/statlog\/\">Machine Learning, Neural and Statistical Classification, Homepage<\/a>, 1994.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/datasets\/index.html\">Dataset loading utilities, scikit-learn<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this post, you discovered standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The importance of standard machine learning datasets.<\/li>\n<li>How to systematically evaluate a model on a standard machine learning dataset.<\/li>\n<li>Standard datasets for classification and regression and the baseline and good performance expected on each.<\/li>\n<\/ul>\n<p><strong>Did I miss your favorite dataset?<\/strong><br \/>\nLet me know in the comments and I will calculate a score for it, or perhaps even add it to 
this post.<\/p>\n<p><strong>Can you get a better score for a dataset?<\/strong><br \/>\nI would love to know; share your model and score in the comments below and I will try to reproduce it and update the post (and give you full credit!)<\/p>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Results for Standard Classification and Regression Machine Learning Datasets<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee It is important that beginner machine learning practitioners practice on small real-world datasets. 
So-called standard machine learning datasets contain actual observations, fit [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/17\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2937,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2936"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2936"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2936\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2937"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2936"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2936"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2936"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}