{"id":3292,"date":"2020-03-31T18:00:13","date_gmt":"2020-03-31T18:00:13","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/31\/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost\/"},"modified":"2020-03-31T18:00:13","modified_gmt":"2020-03-31T18:00:13","slug":"gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/31\/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost\/","title":{"rendered":"Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Gradient boosting is a powerful ensemble machine learning algorithm.<\/p>\n<p>It&rsquo;s popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm or one of the main algorithms used in winning solutions to machine learning competitions, like those on Kaggle.<\/p>\n<p>There are many implementations of gradient boosting available, including standard implementations in scikit-learn and efficient third-party libraries. 
Each uses a different interface and even different names for the algorithm.<\/p>\n<p>In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.<\/p>\n<p>Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.<\/li>\n<li>How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.<\/li>\n<li>How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_10094\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10094\" class=\"size-full wp-image-10094\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/04\/Gradient-Boosting-with-Scikit-Learn-XGBoost-LightGBM-and-CatBoost.jpg\" alt=\"Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost\" width=\"800\" height=\"534\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/04\/Gradient-Boosting-with-Scikit-Learn-XGBoost-LightGBM-and-CatBoost.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/04\/Gradient-Boosting-with-Scikit-Learn-XGBoost-LightGBM-and-CatBoost-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/04\/Gradient-Boosting-with-Scikit-Learn-XGBoost-LightGBM-and-CatBoost-768x513.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-10094\" 
class=\"wp-caption-text\">Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/shebalso\/441861081\/\">John<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Gradient Boosting Overview<\/li>\n<li>Gradient Boosting With Scikit-Learn\n<ol>\n<li>Library Installation<\/li>\n<li>Test Problems<\/li>\n<li>Gradient Boosting<\/li>\n<li>Histogram-Based Gradient Boosting<\/li>\n<\/ol>\n<\/li>\n<li>Gradient Boosting With XGBoost\n<ol>\n<li>Library Installation<\/li>\n<li>XGBoost for Classification<\/li>\n<li>XGBoost for Regression<\/li>\n<\/ol>\n<\/li>\n<li>Gradient Boosting With LightGBM\n<ol>\n<li>Library Installation<\/li>\n<li>LightGBM for Classification<\/li>\n<li>LightGBM for Regression<\/li>\n<\/ol>\n<\/li>\n<li>Gradient Boosting With CatBoost\n<ol>\n<li>Library Installation<\/li>\n<li>CatBoost for Classification<\/li>\n<li>CatBoost for Regression<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Gradient Boosting Overview<\/h2>\n<p>Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.<\/p>\n<p>Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.<\/p>\n<p>Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.<\/p>\n<p>Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. 
This gives the technique its name, &ldquo;<em>gradient boosting<\/em>,&rdquo; as the loss gradient is minimized as the model is fit, much like a neural network.<\/p>\n<p>Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.<\/p>\n<p><strong>Note<\/strong>: We will not be going into the theory behind how the gradient boosting algorithm works in this tutorial.<\/p>\n<p>For more on the gradient boosting algorithm, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-gradient-boosting-algorithm-machine-learning\/\">A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning<\/a><\/li>\n<\/ul>\n<p>The algorithm provides hyperparameters that should, and perhaps must, be tuned for a specific dataset. Although there are many hyperparameters to tune, perhaps the most important are as follows:<\/p>\n<ul>\n<li>The number of trees or estimators in the model.<\/li>\n<li>The learning rate of the model.<\/li>\n<li>The row and column sampling rate for stochastic models.<\/li>\n<li>The maximum tree depth.<\/li>\n<li>The minimum tree weight.<\/li>\n<li>The regularization terms alpha and lambda.<\/li>\n<\/ul>\n<p><strong>Note<\/strong>: We will not be exploring how to configure or tune the configuration of gradient boosting algorithms in this tutorial.<\/p>\n<p>For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/configure-gradient-boosting-algorithm\/\">How to Configure the Gradient Boosting Algorithm<\/a><\/li>\n<\/ul>\n<p>There are many implementations of the gradient boosting algorithm available in Python. 
Perhaps the most used implementation is the version provided with the scikit-learn library.<\/p>\n<p>Additional third-party libraries are available that provide computationally efficient alternate implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.<\/p>\n<p><strong>Do you have a different favorite gradient boosting implementation?<\/strong><br \/>\nLet me know in the comments below.<\/p>\n<p>When using gradient boosting on your predictive modeling project, you may want to test each implementation of the algorithm.<\/p>\n<p>This tutorial provides examples of each implementation of the gradient boosting algorithm on classification and regression predictive modeling problems that you can copy-paste into your project.<\/p>\n<p>Let&rsquo;s take a look at each in turn.<\/p>\n<p><strong>Note<\/strong>: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each different implementation. As such, we are using synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.<\/p>\n<p>This tutorial assumes you have Python and SciPy installed. 
If you need help, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/setup-python-environment-machine-learning-deep-learning-anaconda\/\">How to Setup Your Python Environment for Machine Learning with Anaconda<\/a><\/li>\n<\/ul>\n<h2>Gradient Boosting with Scikit-Learn<\/h2>\n<p>In this section, we will review how to use the gradient boosting algorithm implementation in the <a href=\"https:\/\/scikit-learn.org\/\">scikit-learn library<\/a>.<\/p>\n<h3>Library Installation<\/h3>\n<p>First, let&rsquo;s install the library.<\/p>\n<p>Don&rsquo;t skip this step as you will need to ensure you have the latest version installed.<\/p>\n<p>You can install the scikit-learn library using the pip Python installer, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install scikit-learn<\/pre>\n<p>For additional installation instructions specific to your platform, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/install.html\">Installing scikit-learn<\/a><\/li>\n<\/ul>\n<p>Next, let&rsquo;s confirm that the library is installed and you are using a modern version.<\/p>\n<p>Run the following script to print the library version number.<\/p>\n<pre class=\"crayon-plain-tag\"># check scikit-learn version\r\nimport sklearn\r\nprint(sklearn.__version__)<\/pre>\n<p>Running the example, you should see the following version number or higher.<\/p>\n<pre class=\"crayon-plain-tag\">0.22.1<\/pre>\n<\/p>\n<h3>Test Problems<\/h3>\n<p>We will demonstrate the gradient boosting algorithm for classification and regression.<\/p>\n<p>As such, we will use synthetic test problems from the scikit-learn library.<\/p>\n<h4>Classification Dataset<\/h4>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> to create a test binary classification dataset.<\/p>\n<p>The dataset will have 1,000 examples, with 10 input features, five of which will be 
informative and the remaining five redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.<\/p>\n<p>An example of creating and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># test classification dataset\r\nfrom sklearn.datasets import make_classification\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# summarize the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example creates the dataset and confirms the expected number of samples and features.<\/p>\n<pre class=\"crayon-plain-tag\">(1000, 10) (1000,)<\/pre>\n<\/p>\n<h4>Regression Dataset<\/h4>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_regression.html\">make_regression() function<\/a> to create a test regression dataset.<\/p>\n<p>Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative. Unlike the classification dataset, the remaining five are uninformative rather than redundant: they are not used to generate the target.<\/p>\n<pre class=\"crayon-plain-tag\"># test regression dataset\r\nfrom sklearn.datasets import make_regression\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# summarize the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example creates the dataset and confirms the expected number of samples and features.<\/p>\n<pre class=\"crayon-plain-tag\">(1000, 10) (1000,)<\/pre>\n<p>Next, let&rsquo;s look at how we can develop gradient boosting models in scikit-learn.<\/p>\n<h3>Gradient Boosting<\/h3>\n<p>The scikit-learn library provides the GBM algorithm for regression and classification via the <em>GradientBoostingClassifier<\/em> and <em>GradientBoostingRegressor<\/em> classes.<\/p>\n<p>Let&rsquo;s take a closer look at each in turn.<\/p>\n<h4>Gradient Boosting Machine for 
Classification<\/h4>\n<p>The example below first evaluates a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.GradientBoostingClassifier.html\">GradientBoostingClassifier<\/a> on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># gradient boosting for classification in scikit-learn\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.ensemble import GradientBoostingClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# evaluate the model\r\nmodel = GradientBoostingClassifier()\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = GradientBoostingClassifier()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %d' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.915 (0.025)\r\nPrediction: 1<\/pre>\n<\/p>\n<h4>Gradient Boosting Machine for Regression<\/h4>\n<p>The example below first evaluates a <a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.GradientBoostingRegressor.html\">GradientBoostingRegressor<\/a> on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># gradient boosting for regression in scikit-learn\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_regression\r\nfrom sklearn.ensemble import GradientBoostingRegressor\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# evaluate the model\r\nmodel = GradientBoostingRegressor()\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = GradientBoostingRegressor()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %.3f' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset. The MAE is negative because the <em>neg_mean_absolute_error<\/em> scorer negates the error so that larger scores are always better; ignore the sign when interpreting the result.<\/p>\n<pre class=\"crayon-plain-tag\">MAE: -11.854 (1.121)\r\nPrediction: -80.661<\/pre>\n<\/p>\n<h3>Histogram-Based Gradient Boosting<\/h3>\n<p>The scikit-learn library provides an alternate implementation of the gradient boosting algorithm, referred to as histogram-based gradient 
boosting.<\/p>\n<p>This is an alternate approach to implement gradient tree boosting inspired by the LightGBM library (described more later). This implementation is provided via the <em>HistGradientBoostingClassifier<\/em> and <em>HistGradientBoostingRegressor<\/em> classes.<\/p>\n<p>The primary benefit of the histogram-based approach to gradient boosting is speed. These implementations are designed to be much faster to fit on training data.<\/p>\n<p>At the time of writing, this is an experimental implementation and requires that you add the following line to your code to enable access to these classes.<\/p>\n<pre class=\"crayon-plain-tag\">from sklearn.experimental import enable_hist_gradient_boosting<\/pre>\n<p>Without this line, you will see an error like:<\/p>\n<pre class=\"crayon-plain-tag\">ImportError: cannot import name 'HistGradientBoostingClassifier'<\/pre>\n<p>or<\/p>\n<pre class=\"crayon-plain-tag\">ImportError: cannot import name 'HistGradientBoostingRegressor'<\/pre>\n<p>Let&rsquo;s take a close look at how to use this implementation.<\/p>\n<h4>Histogram-Based Gradient Boosting Machine for Classification<\/h4>\n<p>The example below first evaluates a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.HistGradientBoostingClassifier.html\">HistGradientBoostingClassifier<\/a> on the test problem using repeated k-fold cross-validation and reports the mean accuracy. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># histogram-based gradient boosting for classification in scikit-learn\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.experimental import enable_hist_gradient_boosting\r\nfrom sklearn.ensemble import HistGradientBoostingClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# evaluate the model\r\nmodel = HistGradientBoostingClassifier()\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = HistGradientBoostingClassifier()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %d' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.935 (0.024)\r\nPrediction: 1<\/pre>\n<\/p>\n<h4>Histogram-Based Gradient Boosting Machine for Regression<\/h4>\n<p>The example below first evaluates a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.HistGradientBoostingRegressor.html\">HistGradientBoostingRegressor<\/a> on the test problem using repeated k-fold 
cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># histogram-based gradient boosting for regression in scikit-learn\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_regression\r\nfrom sklearn.experimental import enable_hist_gradient_boosting\r\nfrom sklearn.ensemble import HistGradientBoostingRegressor\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# evaluate the model\r\nmodel = HistGradientBoostingRegressor()\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = HistGradientBoostingRegressor()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %.3f' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">MAE: -12.723 (1.540)\r\nPrediction: -77.837<\/pre>\n<\/p>\n<h2>Gradient Boosting With XGBoost<\/h2>\n<p><a href=\"https:\/\/xgboost.ai\/\">XGBoost<\/a>, which is short for &ldquo;<em>Extreme Gradient Boosting<\/em>,&rdquo; is a library that provides an efficient implementation of the gradient boosting algorithm.<\/p>\n<p>The main benefit of the 
XGBoost implementation is computational efficiency and often better model performance.<\/p>\n<p>For more on the benefits and capability of XGBoost, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-xgboost-applied-machine-learning\/\">A Gentle Introduction to XGBoost for Applied Machine Learning<\/a><\/li>\n<\/ul>\n<h3>Library Installation<\/h3>\n<p>You can install the XGBoost library using the pip Python installer, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install xgboost<\/pre>\n<p>For additional installation instructions specific to your platform, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/build.html\">XGBoost Installation Guide<\/a><\/li>\n<\/ul>\n<p>Next, let&rsquo;s confirm that the library is installed and you are using a modern version.<\/p>\n<p>Run the following script to print the library version number.<\/p>\n<pre class=\"crayon-plain-tag\"># check xgboost version\r\nimport xgboost\r\nprint(xgboost.__version__)<\/pre>\n<p>Running the example, you should see the following version number or higher.<\/p>\n<pre class=\"crayon-plain-tag\">1.0.1<\/pre>\n<p>The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the <em>XGBClassifier<\/em> and <em>XGBRegressor<\/em> classes.<\/p>\n<p>Let&rsquo;s take a closer look at each in turn.<\/p>\n<h3>XGBoost for Classification<\/h3>\n<p>The example below first evaluates an <a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBClassifier\">XGBClassifier<\/a> on the test problem using repeated k-fold cross-validation and reports the mean accuracy. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># xgboost for classification\r\nfrom numpy import asarray\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom xgboost import XGBClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# evaluate the model\r\nmodel = XGBClassifier()\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = XGBClassifier()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]\r\nrow = asarray(row).reshape((1, len(row)))\r\nyhat = model.predict(row)\r\nprint('Prediction: %d' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.936 (0.019)\r\nPrediction: 1<\/pre>\n<\/p>\n<h3>XGBoost for Regression<\/h3>\n<p>The example below first evaluates an <a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBRegressor\">XGBRegressor<\/a> on the test problem using repeated k-fold cross-validation and reports the mean absolute error. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># xgboost for regression\r\nfrom numpy import asarray\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_regression\r\nfrom xgboost import XGBRegressor\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# evaluate the model\r\nmodel = XGBRegressor(objective='reg:squarederror')\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = XGBRegressor(objective='reg:squarederror')\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]\r\nrow = asarray(row).reshape((1, len(row)))\r\nyhat = model.predict(row)\r\nprint('Prediction: %.3f' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">MAE: -15.048 (1.316)\r\nPrediction: -93.434<\/pre>\n<\/p>\n<h2>Gradient Boosting With LightGBM<\/h2>\n<p><a href=\"https:\/\/github.com\/microsoft\/LightGBM\">LightGBM<\/a>, short for Light Gradient Boosting Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.<\/p>\n<p>The primary benefit of LightGBM is the changes to the training algorithm that make the process 
dramatically faster, and in many cases, result in a more effective model.<\/p>\n<p>For more technical details on the LightGBM algorithm, see the paper:<\/p>\n<ul>\n<li><a href=\"https:\/\/papers.nips.cc\/paper\/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree\">LightGBM: A Highly Efficient Gradient Boosting Decision Tree<\/a>, 2017.<\/li>\n<\/ul>\n<h3>Library Installation<\/h3>\n<p>You can install the LightGBM library using the pip Python installer, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install lightgbm<\/pre>\n<p>For additional installation instructions specific to your platform, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/Installation-Guide.html\">LightGBM Installation Guide<\/a><\/li>\n<\/ul>\n<p>Next, let&rsquo;s confirm that the library is installed and you are using a modern version.<\/p>\n<p>Run the following script to print the library version number.<\/p>\n<pre class=\"crayon-plain-tag\"># check lightgbm version\r\nimport lightgbm\r\nprint(lightgbm.__version__)<\/pre>\n<p>Running the example, you should see the following version number or higher.<\/p>\n<pre class=\"crayon-plain-tag\">2.3.1<\/pre>\n<p>The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the <em>LGBMClassifier<\/em> and <em>LGBMRegressor<\/em> classes.<\/p>\n<p>Let&rsquo;s take a closer look at each in turn.<\/p>\n<h3>LightGBM for Classification<\/h3>\n<p>The example below first evaluates an <a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/pythonapi\/lightgbm.LGBMClassifier.html\">LGBMClassifier<\/a> on the test problem using repeated k-fold cross-validation and reports the mean accuracy. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># lightgbm for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom lightgbm import LGBMClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# evaluate the model\r\nmodel = LGBMClassifier()\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = LGBMClassifier()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %d' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.934 (0.021)\r\nPrediction: 1<\/pre>\n<\/p>\n<h3>LightGBM for Regression<\/h3>\n<p>The example below first evaluates an <a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/pythonapi\/lightgbm.LGBMRegressor.html\">LGBMRegressor<\/a> on the test problem using repeated k-fold cross-validation and reports the mean absolute error. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># lightgbm for regression\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_regression\r\nfrom lightgbm import LGBMRegressor\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\nfrom matplotlib import pyplot\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# evaluate the model\r\nmodel = LGBMRegressor()\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = LGBMRegressor()\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %.3f' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">MAE: -12.739 (1.408)\r\nPrediction: -82.040<\/pre>\n<\/p>\n<h2>Gradient Boosting With CatBoost<\/h2>\n<p><a href=\"https:\/\/catboost.ai\/\">CatBoost<\/a> is a third-party library developed at <a href=\"https:\/\/en.wikipedia.org\/wiki\/Yandex\">Yandex<\/a> that provides an efficient implementation of the gradient boosting algorithm.<\/p>\n<p>The primary benefit of CatBoost (in addition to computational speed improvements) is support for categorical input variables. 
This gives the library its name CatBoost for &ldquo;<em>Category Gradient Boosting<\/em>.&rdquo;<\/p>\n<p>For more technical details on the CatBoost algorithm, see the paper:<\/p>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1810.11363\">CatBoost: gradient boosting with categorical features support<\/a>, 2017.<\/li>\n<\/ul>\n<h3>Library Installation<\/h3>\n<p>You can install the CatBoost library using the pip Python installer, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install catboost<\/pre>\n<p>For additional installation instructions specific to your platform, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/catboost.ai\/docs\/concepts\/python-installation.html\">CatBoost Installation Guide<\/a><\/li>\n<\/ul>\n<p>Next, let&rsquo;s confirm that the library is installed and you are using a modern version.<\/p>\n<p>Run the following script to print the library version number.<\/p>\n<pre class=\"crayon-plain-tag\"># check catboost version\r\nimport catboost\r\nprint(catboost.__version__)<\/pre>\n<p>Running the example, you should see the following version number or higher.<\/p>\n<pre class=\"crayon-plain-tag\">0.21<\/pre>\n<p>The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the <em>CatBoostClassifier<\/em> and <em>CatBoostRegressor<\/em> classes.<\/p>\n<p>Let&rsquo;s take a closer look at each in turn.<\/p>\n<h3>CatBoost for Classification<\/h3>\n<p>The example below first evaluates a <a href=\"https:\/\/catboost.ai\/docs\/concepts\/python-reference_catboostclassifier.html\">CatBoostClassifier<\/a> on the test problem using repeated k-fold cross-validation and reports the mean accuracy. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># catboost for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom catboost import CatBoostClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)\r\n# evaluate the model\r\nmodel = CatBoostClassifier(verbose=0, n_estimators=100)\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = CatBoostClassifier(verbose=0, n_estimators=100)\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %d' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.931 (0.026)\r\nPrediction: 1<\/pre>\n<\/p>\n<h3>CatBoost for Regression<\/h3>\n<p>The example below first evaluates a <a href=\"https:\/\/catboost.ai\/docs\/concepts\/python-reference_catboostregressor.html\">CatBoostRegressor<\/a> on the test problem using repeated k-fold cross-validation and reports the mean absolute error. 
Then a single model is fit on all available data and a single prediction is made.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># catboost for regression\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_regression\r\nfrom catboost import CatBoostRegressor\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedKFold\r\n# define dataset\r\nX, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)\r\n# evaluate the model\r\nmodel = CatBoostRegressor(verbose=0, n_estimators=100)\r\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\r\n# scikit-learn negates the MAE so that larger scores are better, so the mean printed here is negative\r\nprint('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\r\n# fit the model on the whole dataset\r\nmodel = CatBoostRegressor(verbose=0, n_estimators=100)\r\nmodel.fit(X, y)\r\n# make a single prediction\r\nrow = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]\r\nyhat = model.predict(row)\r\nprint('Prediction: %.3f' % yhat[0])<\/pre>\n<p>Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.<\/p>\n<pre class=\"crayon-plain-tag\">MAE: -9.281 (0.951)\r\nPrediction: -74.212<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/setup-python-environment-machine-learning-deep-learning-anaconda\/\">How to Setup Your Python Environment for Machine Learning with Anaconda<\/a><\/li>\n<li><a 
href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-gradient-boosting-algorithm-machine-learning\/\">A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/configure-gradient-boosting-algorithm\/\">How to Configure the Gradient Boosting Algorithm<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-xgboost-applied-machine-learning\/\">A Gentle Introduction to XGBoost for Applied Machine Learning<\/a><\/li>\n<\/ul>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0167947301000652\">Stochastic Gradient Boosting<\/a>, 2002.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1603.02754\">XGBoost: A Scalable Tree Boosting System<\/a>, 2016.<\/li>\n<li><a href=\"https:\/\/papers.nips.cc\/paper\/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree\">LightGBM: A Highly Efficient Gradient Boosting Decision Tree<\/a>, 2017.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1810.11363\">CatBoost: gradient boosting with categorical features support<\/a>, 2017.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/\">Scikit-Learn Homepage<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.ensemble\">sklearn.ensemble API<\/a>.<\/li>\n<li><a href=\"https:\/\/xgboost.ai\/\">XGBoost Homepage<\/a>.<\/li>\n<li><a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html\">XGBoost Python API<\/a>.<\/li>\n<li><a href=\"https:\/\/github.com\/microsoft\/LightGBM\">LightGBM Project<\/a>.<\/li>\n<li><a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/Python-API.html\">LightGBM Python API<\/a>.<\/li>\n<li><a href=\"https:\/\/catboost.ai\/\">CatBoost Homepage<\/a>.<\/li>\n<li><a href=\"https:\/\/catboost.ai\/docs\/\">CatBoost API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient_boosting\">Gradient boosting, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/XGBoost\">XGBoost, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use gradient boosting models for classification and regression in Python.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.<\/li>\n<li>How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.<\/li>\n<li>How to evaluate and use third-party gradient boosting algorithms including XGBoost, LightGBM and CatBoost.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost\/\">Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Gradient boosting is a powerful ensemble machine learning algorithm. 
It&rsquo;s popular for structured predictive modeling problems, such as classification and regression on [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/31\/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3293,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3292"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3292"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3292\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3293"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3292"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3292"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3292"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}