{"id":3750,"date":"2020-08-09T19:00:07","date_gmt":"2020-08-09T19:00:07","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/09\/how-to-use-seaborn-data-visualization-for-machine-learning\/"},"modified":"2020-08-09T19:00:07","modified_gmt":"2020-08-09T19:00:07","slug":"how-to-use-seaborn-data-visualization-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/09\/how-to-use-seaborn-data-visualization-for-machine-learning\/","title":{"rendered":"How to use Seaborn Data Visualization for Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Data visualization provides insight into the distribution and relationships between variables in a dataset.<\/p>\n<p>This insight can be helpful in selecting data preparation techniques to apply prior to modeling and the types of algorithms that may be most suited to the data.<\/p>\n<p>Seaborn is a data visualization library for Python that runs on top of the popular Matplotlib data visualization library, but provides a simpler interface and aesthetically better-looking plots.<\/p>\n<p>In this tutorial, you will discover a gentle introduction to Seaborn data visualization for machine learning.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to summarize the distribution of variables using bar charts, histograms, and box and whisker plots.<\/li>\n<li>How to summarize relationships using line plots and scatter plots.<\/li>\n<li>How to compare the distribution and relationships of variables for different class values on the same plot.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_10415\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10415\" class=\"size-full wp-image-10415\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/How-to-use-Seaborn-Data-Visualization-for-Machine-Learning.jpg\" alt=\"How to use Seaborn Data Visualization for Machine Learning\" width=\"800\" height=\"536\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/How-to-use-Seaborn-Data-Visualization-for-Machine-Learning.jpg 800w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/How-to-use-Seaborn-Data-Visualization-for-Machine-Learning-300x201.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/How-to-use-Seaborn-Data-Visualization-for-Machine-Learning-768x515.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-10415\" class=\"wp-caption-text\">How to use Seaborn Data Visualization for Machine Learning<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/mdpettitt\/2743243609\/\">Martin Pettitt<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into six parts; they are:<\/p>\n<ul>\n<li>Seaborn Data Visualization Library<\/li>\n<li>Line Plots<\/li>\n<li>Bar Chart Plots<\/li>\n<li>Histogram Plots<\/li>\n<li>Box and Whisker Plots<\/li>\n<li>Scatter Plots<\/li>\n<\/ul>\n<h2>Seaborn Data Visualization Library<\/h2>\n<p>The primary plotting library for Python is called <a href=\"https:\/\/matplotlib.org\/\">Matplotlib<\/a>.<\/p>\n<p><a href=\"https:\/\/seaborn.pydata.org\/\">Seaborn<\/a> is a plotting library that offers a simpler interface, sensible defaults for plots needed for machine learning, and most importantly, the plots are aesthetically better looking than those in Matplotlib.<\/p>\n<p>Seaborn requires that Matplotlib is installed first.<\/p>\n<p>You can install Matplotlib directly using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pip_(package_manager)\">pip<\/a>, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">sudo pip install matplotlib<\/pre>\n<p>Once 
installed, you can confirm that the library can be loaded and used by printing the version number, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># matplotlib\r\nimport matplotlib\r\nprint('matplotlib: %s' % matplotlib.__version__)<\/pre>\n<p>Running the example prints the current version of the Matplotlib library.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">matplotlib: 3.1.2<\/pre>\n<p>Next, the Seaborn library can be installed, also using pip:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">sudo pip install seaborn<\/pre>\n<p>Once installed, we can also confirm the library can be loaded and used by printing the version number, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># seaborn\r\nimport seaborn\r\nprint('seaborn: %s' % seaborn.__version__)<\/pre>\n<p>Running the example prints the current version of the Seaborn library.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">seaborn: 0.10.0<\/pre>\n<p>To create Seaborn plots, you must import the Seaborn library and call functions to create the plots.<\/p>\n<p>Importantly, Seaborn plotting functions work most naturally with data provided as <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html\">Pandas DataFrames<\/a>. This means that if you are loading your data from CSV files, you can use Pandas functions like <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv()<\/a> to load your data as a DataFrame. When plotting, columns can then be specified via the column name or column index.<\/p>\n<p>To show the plot, you can call the <a href=\"https:\/\/matplotlib.org\/api\/_as_gen\/matplotlib.pyplot.show.html\">show() function<\/a> from the Matplotlib library.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# display the plot\r\npyplot.show()<\/pre>\n<p>Alternatively, the plots can be saved to file, such as a PNG formatted image file. 
The <a href=\"https:\/\/matplotlib.org\/api\/_as_gen\/matplotlib.pyplot.savefig.html\">savefig() Matplotlib function<\/a> can be used to save images.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# save the plot\r\npyplot.savefig('my_image.png')<\/pre>\n<p>Now that we have Seaborn installed, let&rsquo;s look at some common plots we may need when working with machine learning data.<\/p>\n<h2>Line Plots<\/h2>\n<p>A line plot is generally used to present observations collected at regular intervals.<\/p>\n<p>The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.<\/p>\n<p>A line plot can be created in Seaborn by calling the <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.lineplot.html\">lineplot() function<\/a> and passing the x-axis data for the regular interval, and y-axis for the observations.<\/p>\n<p>We can demonstrate a line plot using a time series dataset of <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/monthly-car-sales.csv\">monthly car sales<\/a>.<\/p>\n<p>The dataset has two columns: &ldquo;<em>Month<\/em>&rdquo; and &ldquo;<em>Sales<\/em>.&rdquo; Month will be used as the x-axis and Sales will be plotted on the y-axis.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create line plot\r\nlineplot(x='Month', y='Sales', data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># line plot of a time series dataset\r\nfrom pandas import read_csv\r\nfrom seaborn import lineplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/monthly-car-sales.csv'\r\ndataset = read_csv(url, header=0)\r\n# create line plot\r\nlineplot(x='Month', y='Sales', data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the time 
series dataset and creates a line plot of the data, clearly showing a trend and seasonality in the sales data.<\/p>\n<div id=\"attachment_10407\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10407\" class=\"size-full wp-image-10407\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Line-Plot-of-a-Time-Series-Dataset.png\" alt=\"Line Plot of a Time Series Dataset\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Line-Plot-of-a-Time-Series-Dataset.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Line-Plot-of-a-Time-Series-Dataset-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Line-Plot-of-a-Time-Series-Dataset-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Line-Plot-of-a-Time-Series-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10407\" class=\"wp-caption-text\">Line Plot of a Time Series Dataset<\/p>\n<\/div>\n<p>For more great examples of line plots with Seaborn, see: <a href=\"https:\/\/seaborn.pydata.org\/tutorial\/relational.html\">Visualizing statistical relationships<\/a>.<\/p>\n<h2>Bar Chart Plots<\/h2>\n<p>A bar chart is generally used to present relative quantities for multiple categories.<\/p>\n<p>The x-axis represents the categories that are spaced evenly. 
The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.<\/p>\n<p>A bar chart can be created in Seaborn by calling the <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.countplot.html\">countplot() function<\/a> and passing the data.<\/p>\n<p>We will demonstrate a bar chart with a variable from the <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer.csv\">breast cancer classification dataset<\/a> that is composed of categorical input variables.<\/p>\n<p>We will just plot one variable, in this case, the first variable, which is the age bracket.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create bar chart plot\r\ncountplot(x=0, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># bar chart plot of a categorical variable\r\nfrom pandas import read_csv\r\nfrom seaborn import countplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer.csv'\r\ndataset = read_csv(url, header=None)\r\n# create bar chart plot\r\ncountplot(x=0, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the breast cancer dataset and creates a bar chart plot of the data, showing each age group and the number of individuals (samples) that fall within each group.<\/p>\n<div id=\"attachment_10408\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10408\" class=\"size-full wp-image-10408\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable.png\" alt=\"Bar Chart Plot of Age Range Categorical Variable\" width=\"1280\" height=\"960\" 
srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10408\" class=\"wp-caption-text\">Bar Chart Plot of Age Range Categorical Variable<\/p>\n<\/div>\n<p>We might also want to plot the counts for each category for a variable, such as the first variable, against the class label.<\/p>\n<p>This can be achieved using the <em>countplot()<\/em> function and specifying the class variable (column index 9) via the &ldquo;<em>hue<\/em>&rdquo; argument, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create bar chart plot\r\ncountplot(x=0, hue=9, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># bar chart plot of a categorical variable against a class variable\r\nfrom pandas import read_csv\r\nfrom seaborn import countplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer.csv'\r\ndataset = read_csv(url, header=None)\r\n# create bar chart plot\r\ncountplot(x=0, hue=9, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the breast cancer dataset and creates a bar chart plot of the data, showing each age group and the number of individuals (samples) that fall within each group separated by the two class labels for the dataset.<\/p>\n<div id=\"attachment_10409\" 
style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10409\" class=\"size-full wp-image-10409\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-by-Class-Label.png\" alt=\"Bar Chart Plot of Age Range Categorical Variable by Class Label\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-by-Class-Label.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-by-Class-Label-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-by-Class-Label-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Bar-Chart-Plot-of-Age-Range-Categorical-Variable-by-Class-Label-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10409\" class=\"wp-caption-text\">Bar Chart Plot of Age Range Categorical Variable by Class Label<\/p>\n<\/div>\n<p>For more great examples of bar chart plots with Seaborn, see: <a href=\"https:\/\/seaborn.pydata.org\/tutorial\/categorical.html\">Plotting with categorical data<\/a>.<\/p>\n<h2>Histogram Plots<\/h2>\n<p>A histogram plot is generally used to summarize the distribution of a numerical data sample.<\/p>\n<p>The x-axis represents discrete bins or intervals for the observations. 
For example, observations with values between 1 and 10 may be split into five bins: the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on.<\/p>\n<p>The y-axis represents the frequency or count of the number of observations in the dataset that belong to each bin.<\/p>\n<p>Essentially, a data sample is transformed into a bar chart where each category on the x-axis represents an interval of observation values.<\/p>\n<p>A histogram can be created in Seaborn by calling the <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.distplot.html\">distplot() function<\/a> and passing the variable.<\/p>\n<p>We will demonstrate a histogram with a numerical variable from the <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">diabetes classification dataset<\/a>. We will just plot one variable, in this case, the first variable, which is the number of times that a patient was pregnant.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create histogram plot\r\ndistplot(dataset[[0]])<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># histogram plot of a numerical variable\r\nfrom pandas import read_csv\r\nfrom seaborn import distplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataset = read_csv(url, header=None)\r\n# create histogram plot\r\ndistplot(dataset[[0]])\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the diabetes dataset and creates a histogram plot of the variable, showing the distribution of the values with a hard cut-off at zero.<\/p>\n<p>The plot shows both the histogram (counts of bins) as well as a smooth estimate of the probability density function.<\/p>\n<div id=\"attachment_10410\" style=\"width: 1290px\" 
class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10410\" class=\"size-full wp-image-10410\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plot-of-Number-of-Times-Pregnant-Numerical-Variable.png\" alt=\"Histogram Plot of Number of Times Pregnant Numerical Variable\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plot-of-Number-of-Times-Pregnant-Numerical-Variable.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10410\" class=\"wp-caption-text\">Histogram Plot of Number of Times Pregnant Numerical Variable<\/p>\n<\/div>\n<p>For more great examples of histogram plots with Seaborn, see: <a href=\"https:\/\/seaborn.pydata.org\/tutorial\/distributions.html\">Visualizing the distribution of a dataset<\/a>.<\/p>\n<h2>Box and Whisker Plots<\/h2>\n<p>A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample.<\/p>\n<p>The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired.<\/p>\n<p>The y-axis represents the observation values. A box is drawn to summarize the middle 50 percent of the dataset starting at the observation at the 25th percentile and ending at the 75th percentile. This is called the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Interquartile_range\">interquartile range<\/a>, or IQR. 
The median, or 50th percentile, is drawn with a line.<\/p>\n<p>Lines called whiskers are drawn extending from both ends of the box, up to a distance of (1.5 * IQR), to demonstrate the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn as individual points.<\/p>\n<p>A boxplot can be created in Seaborn by calling the <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.boxplot.html\">boxplot() function<\/a> and passing the data.<\/p>\n<p>We will demonstrate a boxplot with a numerical variable from the <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">diabetes classification dataset<\/a>. We will just plot one variable, in this case, the first variable, which is the number of times that a patient was pregnant.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create box and whisker plot\r\nboxplot(x=0, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># box and whisker plot of a numerical variable\r\nfrom pandas import read_csv\r\nfrom seaborn import boxplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataset = read_csv(url, header=None)\r\n# create box and whisker plot\r\nboxplot(x=0, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the diabetes dataset and creates a boxplot of the first input variable, showing the distribution of the number of times patients were pregnant.<\/p>\n<p>We can see the median just above 2.5 times and some outliers up around 15 times (wow!).<\/p>\n<div id=\"attachment_10411\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10411\" class=\"size-full wp-image-10411\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable.png\" alt=\"Box and Whisker Plot of Number of Times Pregnant Numerical Variable\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10411\" class=\"wp-caption-text\">Box and Whisker Plot of Number of Times Pregnant Numerical Variable<\/p>\n<\/div>\n<p>We might also want to plot the distribution of the numerical variable for each value of a categorical variable, such as the first variable, against the class label.<\/p>\n<p>This can be achieved by calling the <em>boxplot()<\/em> function and passing the class variable as the x-axis and the numerical variable as the y-axis.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create box and whisker plot\r\nboxplot(x=8, y=0, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># box and whisker plot of a numerical variable vs class label\r\nfrom pandas import read_csv\r\nfrom seaborn import boxplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataset = read_csv(url, header=None)\r\n# create box 
and whisker plot\r\nboxplot(x=8, y=0, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the diabetes dataset and creates a boxplot of the data, showing the distribution of the number of times pregnant as a numerical variable for the two class labels.<\/p>\n<div id=\"attachment_10412\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10412\" class=\"size-full wp-image-10412\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-by-Class-Label.png\" alt=\"Box and Whisker Plot of Number of Times Pregnant Numerical Variable by Class Label\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-by-Class-Label.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-by-Class-Label-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-by-Class-Label-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Number-of-Times-Pregnant-Numerical-Variable-by-Class-Label-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10412\" class=\"wp-caption-text\">Box and Whisker Plot of Number of Times Pregnant Numerical Variable by Class Label<\/p>\n<\/div>\n<h2>Scatter Plots<\/h2>\n<p>A scatter plot, or scatterplot, is generally used to summarize the relationship between two paired data samples.<\/p>\n<p>Paired data samples mean that two measures were recorded for a given observation, such as the weight and height of a person.<\/p>\n<p>The x-axis represents 
observation values for the first sample, and the y-axis represents the observation values for the second sample. Each point on the plot represents a single observation.<\/p>\n<p>A scatterplot can be created in Seaborn by calling the <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.scatterplot.html\">scatterplot() function<\/a> and passing the two numerical variables.<\/p>\n<p>We will demonstrate a scatterplot with two numerical variables from the <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">diabetes classification dataset<\/a>. We will plot the first variable versus the second: the first is the number of times that a patient was pregnant, and the second is the plasma glucose concentration after a two-hour oral glucose tolerance test (<a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.names\">more details of the variables here<\/a>).<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create scatter plot\r\nscatterplot(x=0, y=1, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># scatter plot of two numerical variables\r\nfrom pandas import read_csv\r\nfrom seaborn import scatterplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataset = read_csv(url, header=None)\r\n# create scatter plot\r\nscatterplot(x=0, y=1, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the diabetes dataset and creates a scatter plot of the first two input variables.<\/p>\n<p>We can see a somewhat uniform relationship between the two variables.<\/p>\n<div id=\"attachment_10413\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" 
aria-describedby=\"caption-attachment-10413\" class=\"size-full wp-image-10413\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables.png\" alt=\"Scatter Plot of Number of Times Pregnant vs. Plasma Glucose Numerical Variables\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10413\" class=\"wp-caption-text\">Scatter Plot of Number of Times Pregnant vs. 
Plasma Glucose Numerical Variables<\/p>\n<\/div>\n<p>We might also want to plot the relationship for the pair of numerical variables against the class label.<\/p>\n<p>This can be achieved using the scatterplot() function and specifying the class variable (column index 8) via the &ldquo;<em>hue<\/em>&rdquo; argument, as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# create scatter plot\r\nscatterplot(x=0, y=1, hue=8, data=dataset)<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># scatter plot of two numerical variables vs class label\r\nfrom pandas import read_csv\r\nfrom seaborn import scatterplot\r\nfrom matplotlib import pyplot\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataset = read_csv(url, header=None)\r\n# create scatter plot\r\nscatterplot(x=0, y=1, hue=8, data=dataset)\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example first loads the diabetes dataset and creates a scatter plot of the first two variables vs. class label.<\/p>\n<div id=\"attachment_10414\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10414\" class=\"size-full wp-image-10414\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-by-Class-Label.png\" alt=\"Scatter Plot of Number of Times Pregnant vs. 
Plasma Glucose Numerical Variables by Class Label\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-by-Class-Label.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-by-Class-Label-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-by-Class-Label-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Scatter-Plot-of-Number-of-Times-Pregnant-vs-Plasma-Glucose-Numerical-Variables-by-Class-Label-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10414\" class=\"wp-caption-text\">Scatter Plot of Number of Times Pregnant vs. Plasma Glucose Numerical Variables by Class Label<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/data-visualization-methods-in-python\/\">A Gentle Introduction to Data Visualization Methods in Python<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/seaborn.pydata.org\/index.html\">Seaborn Homepage<\/a>.<\/li>\n<li><a href=\"https:\/\/seaborn.pydata.org\/tutorial.html\">Official seaborn tutorial<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered a gentle introduction to Seaborn data visualization for machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to summarize the distribution of variables using bar charts, histograms, and box and whisker plots.<\/li>\n<li>How to summarize relationships using line plots and scatter plots.<\/li>\n<li>How to compare the distribution and relationships of 
variables for different class values on the same plot.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/seaborn-data-visualization-for-machine-learning\/\">How to use Seaborn Data Visualization for Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/seaborn-data-visualization-for-machine-learning\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Data visualization provides insight into the distribution and relationships between variables in a dataset. This insight can be helpful in selecting data [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/09\/how-to-use-seaborn-data-visualization-for-machine-learning\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3751,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3750"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3750"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3750\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3751"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3750"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}