{"id":1649,"date":"2019-01-29T18:00:06","date_gmt":"2019-01-29T18:00:06","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/29\/how-to-choose-loss-functions-when-training-deep-learning-neural-networks\/"},"modified":"2019-01-29T18:00:06","modified_gmt":"2019-01-29T18:00:06","slug":"how-to-choose-loss-functions-when-training-deep-learning-neural-networks","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/29\/how-to-choose-loss-functions-when-training-deep-learning-neural-networks\/","title":{"rendered":"How to Choose Loss Functions When Training Deep Learning Neural Networks"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm.<\/p>\n<p>As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.<\/p>\n<p>Neural network models learn a mapping from inputs to outputs from examples and the choice of loss function must match the framing of the specific predictive modeling problem, such as classification or regression. 
Further, the configuration of the output layer must also be appropriate for the chosen loss function.<\/p>\n<p>In this tutorial, you will discover how to choose a loss function for your deep learning neural network for a given predictive modeling problem.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to configure a model for mean squared error and variants for regression problems.<\/li>\n<li>How to configure a model for cross-entropy and hinge loss functions for binary classification.<\/li>\n<li>How to configure a model for cross-entropy and KL divergence loss functions for multi-class classification.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6928\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6928\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/How-to-Choose-Loss-Functions-When-Training-Deep-Learning-Neural-Networks.jpg\" alt=\"How to Choose Loss Functions When Training Deep Learning Neural Networks\" width=\"640\" height=\"427\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Choose-Loss-Functions-When-Training-Deep-Learning-Neural-Networks.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Choose-Loss-Functions-When-Training-Deep-Learning-Neural-Networks-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Choose Loss Functions When Training Deep Learning Neural Networks<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/glaciernps\/5346153055\/\">GlacierNPS<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Regression Loss Functions\n<ol>\n<li>Mean Squared Error Loss<\/li>\n<li>Mean Squared Logarithmic Error Loss<\/li>\n<li>Mean 
Absolute Error Loss<\/li>\n<\/ol>\n<\/li>\n<li>Binary Classification Loss Functions\n<ol>\n<li>Binary Cross-Entropy Loss<\/li>\n<li>Hinge Loss<\/li>\n<li>Squared Hinge Loss<\/li>\n<\/ol>\n<\/li>\n<li>Multi-Class Classification Loss Functions\n<ol>\n<li>Multi-Class Cross-Entropy Loss<\/li>\n<li>Sparse Multiclass Cross-Entropy Loss<\/li>\n<li>Kullback Leibler Divergence Loss<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Regression Loss Functions<\/h2>\n<p>A regression predictive modeling problem involves predicting a real-valued quantity.<\/p>\n<p>In this section, we will investigate loss functions that are appropriate for regression predictive modeling problems.<\/p>\n<p>As the context for this investigation, we will use a standard regression problem generator provided by the scikit-learn library in the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_regression.html\">make_regression() function<\/a>. This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.<\/p>\n<p>We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant. A total of 1,000 examples will be randomly generated. The pseudorandom number generator will be seeded with a fixed value to ensure that we get the same 1,000 examples each time the code is run.<\/p>\n<pre class=\"crayon-plain-tag\"># generate regression dataset\r\nX, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)<\/pre>\n<p>Neural networks generally perform better when the real-valued input and output variables are scaled to a sensible range. 
For this problem, each of the input variables and the target variable has a Gaussian distribution; therefore, standardizing the data in this case is desirable.<\/p>\n<p>We can achieve this using the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.StandardScaler.html\">StandardScaler transformer<\/a> class also from the scikit-learn library. On a real problem, we would fit the scaler on the training dataset only and then apply it to both the train and test sets, but for simplicity, we will scale all of the data together before splitting into train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># standardize dataset\r\nX = StandardScaler().fit_transform(X)\r\ny = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]<\/pre>\n<p>Once scaled, the data will be split evenly into train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>A small Multilayer Perceptron (MLP) model will be defined to address this problem and provide the basis for exploring different loss functions.<\/p>\n<p>The model will expect 20 features as input as defined by the problem. The model will have one hidden layer with 25 nodes and will use the rectified linear activation function. 
The output layer will have 1 node, given the one real value to be predicted, and will use the linear activation function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='linear'))<\/pre>\n<p>The model will be fit with stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9, both sensible default values.<\/p>\n<p>Training will be performed for 100 epochs and the model will be evaluated on the test set at the end of each epoch so that we can plot learning curves at the end of the run.<\/p>\n<pre class=\"crayon-plain-tag\">opt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='...', optimizer=opt)\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)<\/pre>\n<p>Now that we have the basis of a problem and model, we can evaluate three common loss functions that are appropriate for a regression predictive modeling problem.<\/p>\n<p>Although an MLP is used in these examples, the same loss functions can be used when training CNN and RNN models for regression.<\/p>\n<h3>Mean Squared Error Loss<\/h3>\n<p>The Mean Squared Error, or MSE, loss is the default loss to use for regression problems.<\/p>\n<p>Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to be evaluated first and only changed if you have a good reason.<\/p>\n<p>Mean squared error is calculated as the average of the squared differences between the predicted and actual values. The result is never negative, regardless of the sign of the predicted and actual values, and a perfect value is 0.0. 
The squaring means that larger mistakes result in more error than smaller mistakes, meaning that the model is punished for making larger mistakes.<\/p>\n<p>The mean squared error loss function can be used in Keras by specifying \u2018<em>mse<\/em>\u2018 or \u2018<em>mean_squared_error<\/em>\u2018 as the loss function when compiling the model.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='mean_squared_error', optimizer=opt)<\/pre>\n<p>It is recommended that the output layer has one node for the target variable and the linear activation function is used.<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(1, activation='linear'))<\/pre>\n<p>A complete example demonstrating an MLP on the described regression problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for regression with mse loss function\r\nfrom sklearn.datasets import make_regression\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate regression dataset\r\nX, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)\r\n# standardize dataset\r\nX = StandardScaler().fit_transform(X)\r\ny = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='linear'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='mean_squared_error', optimizer=opt)\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\ntrain_mse = model.evaluate(trainX, trainy, verbose=0)\r\ntest_mse = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: 
%.3f' % (train_mse, test_mse))\r\n# plot loss during training\r\npyplot.title('Loss \/ Mean Squared Error')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the mean squared error for the model on the train and test datasets.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see that the model learned the problem achieving zero error, at least to three decimal places.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.000, Test: 0.001<\/pre>\n<p>A line plot is also created showing the mean squared error loss over the training epochs for both the train (blue) and test (orange) sets.<\/p>\n<p>We can see that the model converged reasonably quickly and both train and test performance remained equivalent. The performance and convergence behavior of the model suggest that mean squared error is a good match for a neural network learning this problem.<\/p>\n<div id=\"attachment_6917\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6917\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-plot-of-Mean-Squared-Error-Loss-over-Training-Epochs-When-Optimizing-the-Mean-Squared-Error-Loss-Function.png\" alt=\"Line plot of Mean Squared Error Loss over Training Epochs When Optimizing the Mean Squared Error Loss Function\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plot-of-Mean-Squared-Error-Loss-over-Training-Epochs-When-Optimizing-the-Mean-Squared-Error-Loss-Function.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plot-of-Mean-Squared-Error-Loss-over-Training-Epochs-When-Optimizing-the-Mean-Squared-Error-Loss-Function-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plot-of-Mean-Squared-Error-Loss-over-Training-Epochs-When-Optimizing-the-Mean-Squared-Error-Loss-Function-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plot-of-Mean-Squared-Error-Loss-over-Training-Epochs-When-Optimizing-the-Mean-Squared-Error-Loss-Function-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line plot of Mean Squared Error Loss over Training Epochs When Optimizing the Mean Squared Error Loss Function<\/p>\n<\/div>\n<h3>Mean Squared Logarithmic Error Loss<\/h3>\n<p>There may be regression problems in which the target variable spans a wide range of values and, when predicting a large value, you may not want to punish a model as heavily as mean squared error does.<\/p>\n<p>Instead, you can first calculate the natural logarithm of one plus each of the predicted and actual values, then calculate the mean squared error of the results. This is called the Mean Squared Logarithmic Error loss, or MSLE for short.<\/p>\n<p>It has the effect of relaxing the punishing effect of large differences in large predicted values.<\/p>\n<p>As a loss measure, it may be more appropriate when the model is predicting unscaled quantities directly. Nevertheless, we can demonstrate this loss function using our simple regression problem.<\/p>\n<p>The model can be updated to use the \u2018<em>mean_squared_logarithmic_error<\/em>\u2018 loss function and keep the same configuration for the output layer. 
We will also track the mean squared error as a metric when fitting the model so that we can use it as a measure of performance and plot the learning curve.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='mean_squared_logarithmic_error', optimizer=opt, metrics=['mse'])<\/pre>\n<p>The complete example of using the MSLE loss function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for regression with msle loss function\r\nfrom sklearn.datasets import make_regression\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate regression dataset\r\nX, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)\r\n# standardize dataset\r\nX = StandardScaler().fit_transform(X)\r\ny = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='linear'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='mean_squared_logarithmic_error', optimizer=opt, metrics=['mse'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_mse = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_mse = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_mse, test_mse))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot mse during training\r\npyplot.subplot(212)\r\npyplot.title('Mean Squared 
Error')\r\npyplot.plot(history.history['mean_squared_error'], label='train')\r\npyplot.plot(history.history['val_mean_squared_error'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the mean squared error for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see that the model resulted in noticeably worse MSE on both the training and test dataset than the model trained with the MSE loss. MSLE may not be a good fit for this problem as the distribution of the target variable is a standard Gaussian.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.165, Test: 0.184<\/pre>\n<p>A line plot is also created showing the mean squared logarithmic error loss over the training epochs for both the train (blue) and test (orange) sets (top), and a similar plot for the mean squared error (bottom).<\/p>\n<p>We can see that the MSLE converged well over the 100 epochs; it appears that the MSE may be showing signs of overfitting the problem, dropping fast and starting to rise from epoch 20 onwards.<\/p>\n<div id=\"attachment_6918\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6918\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Squared-Logistic-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs.png\" alt=\"Line Plots of Mean Squared Logarithmic Error Loss and Mean Squared Error Over Training Epochs\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Squared-Logistic-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Squared-Logistic-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Squared-Logistic-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Squared-Logistic-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Mean Squared Logarithmic Error Loss and Mean Squared Error Over Training Epochs<\/p>\n<\/div>\n<h3>Mean Absolute Error Loss<\/h3>\n<p>On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers, e.g. large or small values far from the mean value.<\/p>\n<p>The Mean Absolute Error, or MAE, loss is an appropriate loss function in this case as it is more robust to outliers. 
It is calculated as the average of the absolute difference between the actual and predicted values.<\/p>\n<p>The model can be updated to use the \u2018<em>mean_absolute_error<\/em>\u2018 loss function and keep the same configuration for the output layer.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='mean_absolute_error', optimizer=opt, metrics=['mse'])<\/pre>\n<p>The complete example using the mean absolute error as the loss function on the regression test problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for regression with mae loss function\r\nfrom sklearn.datasets import make_regression\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate regression dataset\r\nX, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)\r\n# standardize dataset\r\nX = StandardScaler().fit_transform(X)\r\ny = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='linear'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='mean_absolute_error', optimizer=opt, metrics=['mse'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_mse = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_mse = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_mse, test_mse))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], 
label='test')\r\npyplot.legend()\r\n# plot mse during training\r\npyplot.subplot(212)\r\npyplot.title('Mean Squared Error')\r\npyplot.plot(history.history['mean_squared_error'], label='train')\r\npyplot.plot(history.history['val_mean_squared_error'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the mean squared error for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see that the model learned the problem, achieving a near zero error, at least to three decimal places.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.002, Test: 0.002<\/pre>\n<p>A line plot is also created showing the mean absolute error loss over the training epochs for both the train (blue) and test (orange) sets (top), and a similar plot for the mean squared error (bottom).<\/p>\n<p>In this case, we can see that MAE does converge but shows a bumpy course, although the dynamics of MSE don\u2019t appear greatly affected. 
We know that the target variable is a standard Gaussian with no large outliers, so MAE would not be a good fit in this case.<\/p>\n<p>It might be more appropriate on this problem if we did not scale the target variable first.<\/p>\n<div id=\"attachment_6919\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6919\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Absolute-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs.png\" alt=\"Line plots of Mean Absolute Error Loss and Mean Squared Error over Training Epochs\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Absolute-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Absolute-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Absolute-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-plots-of-Mean-Absolute-Error-Loss-and-Mean-Squared-Error-over-Training-Epochs-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line plots of Mean Absolute Error Loss and Mean Squared Error over Training Epochs<\/p>\n<\/div>\n<h2>Binary Classification Loss Functions<\/h2>\n<p>Binary classification problems are those predictive modeling problems where examples are assigned one of two labels.<\/p>\n<p>The problem is often framed as predicting a value of 0 or 1 for the first or second class and is often implemented as predicting the probability of the example belonging to class value 
1.<\/p>\n<p>In this section, we will investigate loss functions that are appropriate for binary classification predictive modeling problems.<\/p>\n<p>We will generate examples from the circles test problem in scikit-learn as the basis for this investigation. The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_circles.html\">circles problem<\/a> involves samples drawn from two concentric circles on a two-dimensional plane, where points on the outer circle belong to class 0 and points on the inner circle belong to class 1. Statistical noise is added to the samples to add ambiguity and make the problem more challenging to learn.<\/p>\n<p>We will generate 1,000 examples and add Gaussian noise with a standard deviation of 0.1. The pseudorandom number generator will be seeded with the same value to ensure that we always get the same 1,000 examples.<\/p>\n<pre class=\"crayon-plain-tag\"># generate circles\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)<\/pre>\n<p>We can create a scatter plot of the dataset to get an idea of the problem we are modeling. 
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of the circles dataset with points colored by class\r\nfrom sklearn.datasets import make_circles\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\n# generate circles\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# select indices of points with each class label\r\nfor i in range(2):\r\n\tsamples_ix = where(y == i)\r\n\tpyplot.scatter(X[samples_ix, 0], X[samples_ix, 1], label=str(i))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot of the examples, where the input variables define the location of the point and the class value defines the color, with class 0 blue and class 1 orange.<\/p>\n<div id=\"attachment_6920\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6920\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Dataset-for-the-Circles-Binary-Classification-Problem.png\" alt=\"Scatter Plot of Dataset for the Circles Binary Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Dataset-for-the-Circles-Binary-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Dataset-for-the-Circles-Binary-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Dataset-for-the-Circles-Binary-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Dataset-for-the-Circles-Binary-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter 
Plot of Dataset for the Circles Binary Classification Problem<\/p>\n<\/div>\n<p>The points are already reasonably scaled around 0, almost in [-1,1]. We won\u2019t rescale them in this case.<\/p>\n<p>The dataset is split evenly into train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>A simple MLP model can be defined to address this problem. It expects two inputs for the two features in the dataset, and has a hidden layer with 50 nodes that uses the rectified linear activation function and an output layer that will need to be configured for the choice of loss function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='...'))<\/pre>\n<p>The model will be fit using stochastic gradient descent with the sensible default learning rate of 0.01 and momentum of 0.9.<\/p>\n<pre class=\"crayon-plain-tag\">opt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='...', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>We will fit the model for 200 training epochs and evaluate the loss and accuracy of the model at the end of each epoch so that we can plot learning curves.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)<\/pre>\n<p>Now that we have the basis of a problem and model, we can evaluate three common loss functions that are appropriate for a binary classification predictive modeling problem.<\/p>\n<p>Although an MLP is used in these examples, the same loss functions can be used when training CNN and RNN models for binary classification.<\/p>\n<h3>Binary Cross-Entropy Loss<\/h3>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross_entropy\">Cross-entropy<\/a> 
is the default loss function to use for binary classification problems.<\/p>\n<p>It is intended for use with binary classification where the target values are in the set {0, 1}.<\/p>\n<p>Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.<\/p>\n<p>Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized and a perfect cross-entropy value is 0.<\/p>\n<p>Cross-entropy can be specified as the loss function in Keras by specifying \u2018<em>binary_crossentropy<\/em>\u2018 when compiling the model.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The function requires that the output layer is configured with a single node and a \u2018<em>sigmoid<\/em>\u2018 activation in order to predict the probability for class 1.<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(1, activation='sigmoid'))<\/pre>\n<p>The complete example of an MLP with cross-entropy loss for the two circles binary classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the circles problem with cross entropy loss\r\nfrom sklearn.datasets import make_circles\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = 
SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see that the model learned the problem reasonably well, achieving about 83% accuracy on the training dataset and about 85% on the test dataset. The scores are reasonably close, suggesting the model is probably not over or underfit.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.836, Test: 0.852<\/pre>\n<p>A figure is also created showing two line plots, the top with the cross-entropy loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>The plot shows that the training process converged well. 
The plot for loss is smooth, given the continuous nature of the error between the probability distributions, whereas the line plot for accuracy shows bumps, given examples in the train and test set can ultimately only be predicted as correct or incorrect, providing less granular feedback on performance.<\/p>\n<div id=\"attachment_6921\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6921\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png\" alt=\"Line Plots of Cross Entropy Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Cross Entropy Loss and Classification Accuracy over Training Epochs on the Two 
Circles Binary Classification Problem<\/p>\n<\/div>\n<h3>Hinge Loss<\/h3>\n<p>An alternative to cross-entropy for binary classification problems is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hinge_loss\">hinge loss function<\/a>, primarily developed for use with Support Vector Machine (SVM) models.<\/p>\n<p>It is intended for use with binary classification where the target values are in the set {-1, 1}.<\/p>\n<p>The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values.<\/p>\n<p>Reports of performance with the hinge loss are mixed; it sometimes results in better performance than cross-entropy on binary classification problems.<\/p>\n<p>Firstly, the target variable must be modified to have values in the set {-1, 1}.<\/p>\n<pre class=\"crayon-plain-tag\"># change y from {0,1} to {-1,1}\r\ny[where(y == 0)] = -1<\/pre>\n<p>The hinge loss function can then be specified as \u2018<em>hinge<\/em>\u2018 in the <em>compile()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='hinge', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>Finally, the output layer of the network must be configured to have a single node with a hyperbolic tangent activation function capable of outputting a single value in the range [-1, 1].<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(1, activation='tanh'))<\/pre>\n<p>The complete example of an MLP with a hinge loss function for the two circles binary classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the circles problem with hinge loss\r\nfrom sklearn.datasets import make_circles\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# change y from 
{0,1} to {-1,1}\r\ny[where(y == 0)] = -1\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='tanh'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='hinge', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. 
Try running the example a few times.<\/p>\n<p>In this case, we can see slightly worse performance than with cross-entropy: the chosen model configuration achieves less than 80% accuracy on both the train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.792, Test: 0.740<\/pre>\n<p>A figure is also created showing two line plots, the top with the hinge loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>The plot of hinge loss shows that the model has converged and has reasonable loss on both datasets. The plot of classification accuracy also shows signs of convergence, albeit at a lower level of skill than may be desirable on this problem.<\/p>\n<div id=\"attachment_6922\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6922\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png\" alt=\"Line Plots of Hinge Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Hinge Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem<\/p>\n<\/div>\n<h3>Squared Hinge Loss<\/h3>\n<p>The hinge loss function has many extensions, often the subject of investigation with SVM models.<\/p>\n<p>A popular extension is called the squared hinge loss, which simply calculates the square of the hinge loss score. It has the effect of smoothing the surface of the error function and making it numerically easier to work with.<\/p>\n<p>If using a hinge loss does result in better performance on a given binary classification problem, it is likely that a squared hinge loss may also be appropriate.<\/p>\n<p>As with using the hinge loss function, the target variable must be modified to have values in the set {-1, 1}.<\/p>\n<pre class=\"crayon-plain-tag\"># change y from {0,1} to {-1,1}\r\ny[where(y == 0)] = -1<\/pre>\n<p>The squared hinge loss can be specified as \u2018<em>squared_hinge<\/em>\u2018 in the <em>compile()<\/em> function when defining the model.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='squared_hinge', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>And finally, the output layer must use a single node with a hyperbolic tangent activation function capable of outputting continuous values in the range [-1, 1].<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(1, activation='tanh'))<\/pre>\n<p>The complete example of an MLP with the squared hinge loss function on the two circles binary classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the circles problem with squared hinge loss\r\nfrom sklearn.datasets import make_circles\r\nfrom 
keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# change y from {0,1} to {-1,1}\r\ny[where(y == 0)] = -1\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='tanh'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='squared_hinge', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test datasets.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that for this problem and the chosen model configuration, the squared hinge loss may not be appropriate, resulting in classification accuracy of less than 70% on the train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.682, Test: 0.646<\/pre>\n<p>A figure is also created showing two line plots, the top with the squared hinge loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>The plot of loss shows that the model did converge, but the error surface is not as smooth as with the other loss functions: small changes to the weights cause large changes in loss.<\/p>\n<div id=\"attachment_6923\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6923\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Squared-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png\" alt=\"Line Plots of Squared Hinge Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Squared-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Squared-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Squared-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Squared-Hinge-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Two-Circles-Binary-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Squared Hinge Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem<\/p>\n<\/div>\n<h2>Multi-Class Classification Loss Functions<\/h2>\n<p>Multi-class classification problems are those predictive modeling problems where examples are assigned one of more than two classes.<\/p>\n<p>The problem is often framed as predicting an integer value, where each class is assigned a unique integer value from 0 to (<em>num_classes \u2013 1<\/em>). The problem is often implemented as predicting the probability of the example belonging to each known class.<\/p>\n<p>In this section, we will investigate loss functions that are appropriate for multi-class classification predictive modeling problems.<\/p>\n<p>We will use the blobs problem as the basis for the investigation. The <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> provided by scikit-learn generates examples given a specified number of classes and input features. We will use this function to generate 1,000 examples for a 3-class classification problem with 2 input variables. 
The pseudorandom number generator will be seeded consistently so that the same 1,000 examples are generated each time the code is run.<\/p>\n<pre class=\"crayon-plain-tag\"># generate dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)<\/pre>\n<p>The two input variables can be taken as <em>x<\/em> and <em>y<\/em> coordinates for points on a two-dimensional plane.<\/p>\n<p>The example below creates a scatter plot of the entire dataset, coloring points by their class membership.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of blobs dataset\r\nfrom sklearn.datasets import make_blobs\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\n# generate dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# select indices of points with each class label\r\nfor i in range(3):\r\n\tsamples_ix = where(y == i)\r\n\tpyplot.scatter(X[samples_ix, 0], X[samples_ix, 1])\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot showing the 1,000 examples in the dataset, with examples belonging to classes 0, 1, and 2 colored blue, orange, and green respectively.<\/p>\n<div id=\"attachment_6924\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6924\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Examples-Generated-from-the-Blobs-Multi-Class-Classification-Problem.png\" alt=\"Scatter Plot of Examples Generated from the Blobs Multi-Class Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Examples-Generated-from-the-Blobs-Multi-Class-Classification-Problem.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Examples-Generated-from-the-Blobs-Multi-Class-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Examples-Generated-from-the-Blobs-Multi-Class-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Examples-Generated-from-the-Blobs-Multi-Class-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Examples Generated from the Blobs Multi-Class Classification Problem<\/p>\n<\/div>\n<p>The input features are Gaussian and could benefit from standardization; nevertheless, we will keep the values unscaled in this example for brevity.<\/p>\n<p>The dataset will be split evenly between train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>A small MLP model will be used as the basis for exploring loss functions.<\/p>\n<p>The model expects two input variables, has 50 nodes in the hidden layer and the rectified linear activation function, and an output layer that must be customized based on the selection of the loss function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(..., activation='...'))<\/pre>\n<p>The model is fit using stochastic gradient descent with a sensible default learning rate of 0.01 and a momentum of 0.9.<\/p>\n<pre class=\"crayon-plain-tag\"># compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='...', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The model will be fit 
for 100 epochs on the training dataset and the test dataset will be used as a validation dataset, allowing us to evaluate both loss and classification accuracy on the train and test sets at the end of each training epoch and draw learning curves.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)<\/pre>\n<p>Now that we have the basis of a problem and model, we can evaluate three common loss functions that are appropriate for a multi-class classification predictive modeling problem.<\/p>\n<p>Although an MLP is used in these examples, the same loss functions can be used when training CNN and RNN models for multi-class classification.<\/p>\n<h3>Multi-Class Cross-Entropy Loss<\/h3>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross_entropy\">Cross-entropy<\/a> is the default loss function to use for multi-class classification problems.<\/p>\n<p>In this case, it is intended for use with multi-class classification where the target values are in the set {0, 1, 2, \u2026, n}, where each class is assigned a unique integer value.<\/p>\n<p>Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.<\/p>\n<p>Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. 
The score is minimized and a perfect cross-entropy value is 0.<\/p>\n<p>Cross-entropy can be specified as the loss function in Keras by specifying \u2018<em>categorical_crossentropy<\/em>\u2018 when compiling the model.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The function requires that the output layer is configured with <em>n<\/em> nodes (one for each class), in this case three nodes, and a \u2018<em>softmax<\/em>\u2018 activation in order to predict the probability for each class.<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(3, activation='softmax'))<\/pre>\n<p>In turn, this means that the target variable must be one hot encoded.<\/p>\n<p>This is to ensure that each example has an expected probability of 1.0 for the actual class value and an expected probability of 0.0 for all other class values. This can be achieved using the <a href=\"https:\/\/keras.io\/utils\/#to_categorical\">to_categorical() Keras function<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\"># one hot encode output variable\r\ny = to_categorical(y)<\/pre>\n<p>The complete example of an MLP with cross-entropy loss for the multi-class blobs classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs multi-class classification problem with cross-entropy loss\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = 
Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see the model performed well, achieving a classification accuracy of about 84% on the training dataset and about 82% on the test dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.840, Test: 0.822<\/pre>\n<p>A figure is also created showing two line plots, the top with the cross-entropy loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>In this case, the plot shows the model seems to have converged. The line plots for cross-entropy and accuracy both show good convergence behavior, although they are somewhat bumpy. 
The model may be well configured given no sign of over or under fitting. The learning rate or batch size may be tuned to even out the smoothness of the convergence in this case.<\/p>\n<div id=\"attachment_6925\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6925\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png\" alt=\"Line Plots of Cross Entropy Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Cross Entropy Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem<\/p>\n<\/div>\n<h3>Sparse Multiclass Cross-Entropy Loss<\/h3>\n<p>A possible cause of frustration 
when using cross-entropy with classification problems with a large number of labels is the one hot encoding process.<\/p>\n<p>For example, predicting words in a vocabulary may have tens or hundreds of thousands of categories, one for each label. This can mean that the target element of each training example may require a one hot encoded vector with tens or hundreds of thousands of zero values, requiring significant memory.<\/p>\n<p>Sparse cross-entropy addresses this by performing the same cross-entropy calculation of error, without requiring that the target variable be one hot encoded prior to training.<\/p>\n<p>Sparse cross-entropy can be used in Keras for multi-class classification by using \u2018<em>sparse_categorical_crossentropy<\/em>\u2018 when calling the <em>compile()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The function requires that the output layer is configured with <em>n<\/em> nodes (one for each class), in this case three nodes, and a \u2018<em>softmax<\/em>\u2018 activation in order to predict the probability for each class.<\/p>\n<pre class=\"crayon-plain-tag\">model.add(Dense(3, activation='softmax'))<\/pre>\n<p>No one hot encoding of the target variable is required, a benefit of this loss function.<\/p>\n<p>The complete example of training an MLP with sparse cross-entropy on the blobs multi-class classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs multi-class classification problem with sparse cross-entropy loss\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# split into train and test\r\nn_train = 
500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we can see the model achieves good performance on the problem. 
In fact, if you repeat the experiment many times, the average performance of sparse and non-sparse cross-entropy should be comparable.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.832, Test: 0.818<\/pre>\n<p>A figure is also created showing two line plots, the top with the sparse cross-entropy loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>In this case, the plot shows good convergence of the model over training with regard to loss and classification accuracy.<\/p>\n<div id=\"attachment_6926\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6926\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Sparse-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png\" alt=\"Line Plots of Sparse Cross Entropy Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Sparse-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Sparse-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Sparse-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Sparse-Cross-Entropy-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Sparse Cross Entropy Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem<\/p>\n<\/div>\n<h3>Kullback Leibler Divergence Loss<\/h3>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Kullback%E2%80%93Leibler_divergence\">Kullback Leibler Divergence<\/a>, or KL Divergence for short, is a measure of how one probability distribution differs from a baseline distribution.<\/p>\n<p>A KL divergence loss of 0 means the two distributions are identical. In practice, the behavior of KL Divergence is very similar to cross-entropy. It calculates how much information is lost (in terms of bits) if the predicted probability distribution is used to approximate the desired target probability distribution.<\/p>\n<p>As such, the KL divergence loss function is more commonly used with models that learn to approximate a more complex function than simple multi-class classification, such as an autoencoder that learns a dense feature representation while being required to reconstruct the original input. In such cases, KL divergence loss would be preferred. 
Nevertheless, it can be used for multi-class classification, in which case it is functionally equivalent to multi-class cross-entropy.<\/p>\n<p>KL divergence loss can be used in Keras by specifying \u2018<em>kullback_leibler_divergence<\/em>\u2018 in the <em>compile()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\">model.compile(loss='kullback_leibler_divergence', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>As with cross-entropy, the output layer is configured with <em>n<\/em> nodes (one for each class), in this case three nodes, and a \u2018<em>softmax<\/em>\u2018 activation in order to predict the probability for each class.<\/p>\n<p>Also, as with categorical cross-entropy, we must one hot encode the target variable so that it has an expected probability of 1.0 for the class value and 0.0 for all other class values.<\/p>\n<pre class=\"crayon-plain-tag\"># one hot encode output variable\r\ny = to_categorical(y)<\/pre>\n<p>The complete example of training an MLP with KL divergence loss for the blobs multi-class classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs multi-class classification problem with kl divergence loss\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, 
momentum=0.9)\r\nmodel.compile(loss='kullback_leibler_divergence', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot loss during training\r\npyplot.subplot(211)\r\npyplot.title('Loss')\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\n# plot accuracy during training\r\npyplot.subplot(212)\r\npyplot.title('Accuracy')\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy for the model on the train and test dataset.<\/p>\n<p>Given the stochastic nature of the training algorithm, your specific results may vary. Try running the example a few times.<\/p>\n<p>In this case, we see performance similar to the results seen with cross-entropy loss: about 82% accuracy on the train and test dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.822, Test: 0.822<\/pre>\n<p>A figure is also created showing two line plots, the top with the KL divergence loss over epochs for the train (blue) and test (orange) dataset, and the bottom plot showing classification accuracy over epochs.<\/p>\n<p>In this case, the plot shows good convergence behavior for both loss and classification accuracy. 
It is very likely that an evaluation of cross-entropy would result in nearly identical behavior given the similarities in the measure.<\/p>\n<div id=\"attachment_6927\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6927\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-KL-Divergence-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png\" alt=\"Line Plots of KL Divergence Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-KL-Divergence-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-KL-Divergence-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-KL-Divergence-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-KL-Divergence-Loss-and-Classification-Accuracy-over-Training-Epochs-on-the-Blobs-Multi-Class-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of KL Divergence Loss and Classification Accuracy over Training Epochs on the Blobs Multi-Class Classification Problem<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go 
deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1702.05659\">On Loss Functions for Deep Neural Networks in Classification<\/a>, 2017.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/keras.io\/losses\/\">Keras Loss Functions API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/activations\/\">Keras Activation Functions API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.StandardScaler.html\">sklearn.preprocessing.StandardScaler API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_regression.html\">sklearn.datasets.make_regression API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_circles.html\">sklearn.datasets.make_circles API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">sklearn.datasets.make_blobs API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_squared_error\">Mean squared error, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_absolute_error\">Mean absolute error, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross_entropy\">Cross entropy, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Hinge_loss\">Hinge loss, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Kullback%E2%80%93Leibler_divergence\">Kullback\u2013Leibler divergence, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/isaacchanghau.github.io\/post\/loss_functions\/\">Loss Functions in Neural Networks<\/a>, 2017.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to choose a loss function for your deep learning neural network for a given predictive modeling problem.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to configure a model for mean squared error 
and variants for regression problems.<\/li>\n<li>How to configure a model for cross-entropy and hinge loss functions for binary classification.<\/li>\n<li>How to configure a model for cross-entropy and KL divergence loss functions for multi-class classification.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-choose-loss-functions-when-training-deep-learning-neural-networks\/\">How to Choose Loss Functions When Training Deep Learning Neural Networks<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-choose-loss-functions-when-training-deep-learning-neural-networks\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm. 
As part of the optimization algorithm, the error for [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/29\/how-to-choose-loss-functions-when-training-deep-learning-neural-networks\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1650,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1649"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1649"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1649\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1650"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1649"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1649"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1649"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}