{"id":3644,"date":"2020-07-09T19:00:07","date_gmt":"2020-07-09T19:00:07","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/09\/6-dimensionality-reduction-algorithms-with-python\/"},"modified":"2020-07-09T19:00:07","modified_gmt":"2020-07-09T19:00:07","slug":"6-dimensionality-reduction-algorithms-with-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/09\/6-dimensionality-reduction-algorithms-with-python\/","title":{"rendered":"6 Dimensionality Reduction Algorithms With Python"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p><strong>Dimensionality reduction<\/strong> is an unsupervised learning technique.<\/p>\n<p>Nevertheless, it can be used as a data transform pre-processing step for machine learning algorithms on classification and regression predictive modeling datasets with supervised learning algorithms.<\/p>\n<p>There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. 
Instead, it is a good idea to explore a range of dimensionality reduction algorithms and different configurations for each algorithm.<\/p>\n<p>In this tutorial, you will discover how to fit and evaluate top dimensionality reduction algorithms in Python.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.<\/li>\n<li>There are many different dimensionality reduction algorithms and no single best method for all datasets.<\/li>\n<li>How to implement, fit, and evaluate top dimensionality reduction algorithms in Python with the scikit-learn machine learning library.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_11005\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11005\" class=\"size-full wp-image-11005\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/07\/Dimensionality-Reduction-Algorithms-With-Python.jpg\" alt=\"Dimensionality Reduction Algorithms With Python\" width=\"800\" height=\"400\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Dimensionality-Reduction-Algorithms-With-Python.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Dimensionality-Reduction-Algorithms-With-Python-300x150.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Dimensionality-Reduction-Algorithms-With-Python-768x384.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p 
id=\"caption-attachment-11005\" class=\"wp-caption-text\">Dimensionality Reduction Algorithms With Python<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/volvob12b\/24673788260\/\">Bernard Spragg. NZ<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Dimensionality Reduction<\/li>\n<li>Dimensionality Reduction Algorithms<\/li>\n<li>Examples of Dimensionality Reduction\n<ol>\n<li>Scikit-Learn Library Installation<\/li>\n<li>Classification Dataset<\/li>\n<li>Principal Component Analysis<\/li>\n<li>Singular Value Decomposition<\/li>\n<li>Linear Discriminant Analysis<\/li>\n<li>Isomap Embedding<\/li>\n<li>Locally Linear Embedding<\/li>\n<li>Modified Locally Linear Embedding<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Dimensionality Reduction<\/h2>\n<p>Dimensionality reduction refers to techniques for reducing the number of input variables in training data.<\/p>\n<blockquote>\n<p>When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the &ldquo;essence&rdquo; of the data. This is called dimensionality reduction.<\/p>\n<\/blockquote>\n<p>&mdash; Page 11, <a href=\"https:\/\/amzn.to\/2ucStHi\">Machine Learning: A Probabilistic Perspective<\/a>, 2012.<\/p>\n<p>High-dimensionality might mean hundreds, thousands, or even millions of input variables.<\/p>\n<p>Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as <a href=\"https:\/\/machinelearningmastery.com\/degrees-of-freedom-in-machine-learning\/\">degrees of freedom<\/a>. A model with too many degrees of freedom is likely to overfit the training dataset and may not perform well on new data.<\/p>\n<p>It is desirable to have simple models that generalize well, and in turn, input data with few input variables. 
This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.<\/p>\n<p>Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.<\/p>\n<blockquote>\n<p>&hellip; dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user&rsquo;s attention on the most relevant variables.<\/p>\n<\/blockquote>\n<p>&mdash; Page 289, <a href=\"https:\/\/amzn.to\/2tlRP9V\">Data Mining: Practical Machine Learning Tools and Techniques<\/a>, 4th edition, 2016.<\/p>\n<p>Because it is part of data preparation, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, a validation dataset, and the data used when making a prediction with the <a href=\"https:\/\/machinelearningmastery.com\/train-final-machine-learning-model\/\">final model<\/a>.<\/p>\n<h2>Dimensionality Reduction Algorithms<\/h2>\n<p>There are many algorithms that can be used for dimensionality reduction.<\/p>\n<p>Two main classes of methods are those drawn from linear algebra and those drawn from manifold learning.<\/p>\n<h3>Linear Algebra Methods<\/h3>\n<p>Matrix factorization methods drawn from the field of linear algebra can be used for dimensionality reduction.<\/p>\n<p>For more on matrix factorization, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/introduction-to-matrix-decompositions-for-machine-learning\/\">A Gentle Introduction to Matrix Factorization for Machine Learning<\/a><\/li>\n<\/ul>\n<p>Some of the more popular methods include:<\/p>\n<ul>\n<li>Principal Component Analysis<\/li>\n<li>Singular Value Decomposition<\/li>\n<li>Non-Negative Matrix Factorization<\/li>\n<\/ul>\n<h3>Manifold Learning Methods<\/h3>\n<p>Manifold learning methods seek a lower-dimensional projection of high dimensional input that captures the salient properties of the input data.<\/p>\n<p>Some of the more popular methods include:<\/p>\n<ul>\n<li>Isomap Embedding<\/li>\n<li>Locally Linear Embedding<\/li>\n<li>Multidimensional Scaling<\/li>\n<li>Spectral Embedding<\/li>\n<li>t-distributed Stochastic Neighbor Embedding<\/li>\n<\/ul>\n<p>Each algorithm offers a different approach to the challenge of discovering natural relationships in data at lower dimensions.<\/p>\n<p>There is no best dimensionality reduction algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.<\/p>\n<p>In this tutorial, we will review how to use a subset of these popular dimensionality reduction 
algorithms from the scikit-learn library.<\/p>\n<p>The examples provide a basis that you can copy and adapt to test the methods on your own data.<\/p>\n<p>We will not dive into the theory behind how the algorithms work or compare them directly. For a good starting point on this topic, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/decomposition.html\">Decomposing signals in components, scikit-learn API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/manifold.html\">Manifold Learning, scikit-learn API<\/a>.<\/li>\n<\/ul>\n<p>Let&rsquo;s dive in.<\/p>\n<h2>Examples of Dimensionality Reduction<\/h2>\n<p>In this section, we will review how to use popular dimensionality reduction algorithms in scikit-learn.<\/p>\n<p>Each example uses the dimensionality reduction technique as a data transform in a modeling pipeline and evaluates a model fit on the transformed data.<\/p>\n<p>The examples are designed for you to copy-paste into your own project and apply the methods to your own data. 
There are some algorithms available in the scikit-learn library that are not demonstrated because they cannot be used as a data transform directly given the nature of the algorithm.<\/p>\n<p>We will use the same synthetic classification dataset in each example.<\/p>\n<h3>Scikit-Learn Library Installation<\/h3>\n<p>First, let&rsquo;s install the library.<\/p>\n<p>Don&rsquo;t skip this step, as the examples require a recent version of the library.<\/p>\n<p>You can install the scikit-learn library using the pip Python installer, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install scikit-learn<\/pre>\n<p>For additional installation instructions specific to your platform, see:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/install.html\">Installing scikit-learn<\/a><\/li>\n<\/ul>\n<p>Next, let&rsquo;s confirm that the library is installed and you are using a modern version.<\/p>\n<p>Run the following script to print the library version number.<\/p>\n<pre class=\"crayon-plain-tag\"># check scikit-learn version\r\nimport sklearn\r\nprint(sklearn.__version__)<\/pre>\n<p>Running the example, you should see the following version number or higher.<\/p>\n<pre class=\"crayon-plain-tag\">0.23.0<\/pre>\n<\/p>\n<h3>Classification Dataset<\/h3>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> to create a test binary classification dataset.<\/p>\n<p>The dataset will have 1,000 examples with 20 input features, 10 of which are informative and 10 of which are redundant. 
This provides an opportunity for each technique to identify and remove redundant input features.<\/p>\n<p>The fixed random seed for the pseudorandom number generator ensures we generate the same synthetic dataset each time the code runs.<\/p>\n<p>An example of creating and summarizing the synthetic classification dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># synthetic classification dataset\r\nfrom sklearn.datasets import make_classification\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# summarize the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example creates the dataset and reports the number of rows and columns matching our expectations.<\/p>\n<pre class=\"crayon-plain-tag\">(1000, 20) (1000,)<\/pre>\n<p>It is a binary classification task and we will evaluate a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.LogisticRegression.html\">LogisticRegression<\/a> model after each dimensionality reduction transform.<\/p>\n<p>The model will be evaluated using the gold standard of <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified 10-fold cross-validation<\/a>. 
The mean and standard deviation classification accuracy across all folds and repeats will be reported.<\/p>\n<p>The example below evaluates the model on the raw dataset as a point of comparison.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate logistic regression model on raw data\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the model\r\nmodel = LogisticRegression()\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the logistic regression on the raw dataset with all 20 columns, achieving a classification accuracy of about 82.4 percent.<\/p>\n<p>A successful dimensionality reduction transform on this data should result in a model that has better accuracy than this baseline, although this may not be possible with all techniques.<\/p>\n<p>Note: we are not trying to &ldquo;<em>solve<\/em>&rdquo; this dataset, just provide working examples that you can use as a starting point.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.824 (0.034)<\/pre>\n<p>Next, we can start looking at examples of dimensionality reduction algorithms applied to this dataset.<\/p>\n<p>I have made some minimal attempts to tune each method to the dataset. 
Each dimensionality reduction method will be configured to reduce the 20 input columns to 10 where possible.<\/p>\n<p>We will use a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the pipeline\r\nsteps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)<\/pre>\n<p>Let&rsquo;s get started.<\/p>\n<p><strong>Can you get a better result for one of the algorithms?<\/strong><br \/>\nLet me know in the comments below.<\/p>\n<h3>Principal Component Analysis<\/h3>\n<p>Principal Component Analysis, or PCA, might be the most popular technique for dimensionality reduction with dense data (few zero values).<\/p>\n<p>For more on how PCA works, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/calculate-principal-component-analysis-scratch-python\/\">How to Calculate Principal Component Analysis (PCA) from Scratch in Python<\/a><\/li>\n<\/ul>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.decomposition.PCA.html\">PCA class<\/a> implementation of Principal Component Analysis that can be used as a dimensionality reduction data transform. 
The &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform.<\/p>\n<p>The complete example of evaluating a model with PCA dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate pca with logistic regression algorithm for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.decomposition import PCA\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, we don&rsquo;t see any lift in model performance in using the PCA transform.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.824 (0.034)<\/pre>\n<\/p>\n<h3>Singular Value Decomposition<\/h3>\n<p>Singular Value Decomposition, or SVD, is one of the most popular techniques for dimensionality reduction for <a href=\"https:\/\/machinelearningmastery.com\/sparse-matrices-for-machine-learning\/\">sparse data<\/a> (data with many zero values).<\/p>\n<p>For more on how SVD works, see the tutorial:<\/p>\n<ul>\n<li><a 
href=\"https:\/\/machinelearningmastery.com\/singular-value-decomposition-for-machine-learning\/\">How to Calculate the SVD from Scratch with Python<\/a><\/li>\n<\/ul>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.decomposition.TruncatedSVD.html\">TruncatedSVD class<\/a> implementation of Singular Value Decomposition that can be used as a dimensionality reduction data transform. The &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform.<\/p>\n<p>The complete example of evaluating a model with SVD dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate svd with logistic regression algorithm for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.decomposition import TruncatedSVD\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, we don&rsquo;t see any lift in model performance in using the SVD transform.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.824 
(0.034)<\/pre>\n<\/p>\n<h3>Linear Discriminant Analysis<\/h3>\n<p>Linear Discriminant Analysis, or LDA, is a multi-class classification algorithm that can be used for dimensionality reduction.<\/p>\n<p>The number of dimensions for the projection is limited to between 1 and C-1, where C is the number of classes. In this case, our dataset is a binary classification problem (two classes), limiting the number of dimensions to 1.<\/p>\n<p>For more on LDA for dimensionality reduction, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/linear-discriminant-analysis-for-dimensionality-reduction-in-python\/\">Linear Discriminant Analysis for Dimensionality Reduction in Python<\/a><\/li>\n<\/ul>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html\">LinearDiscriminantAnalysis class<\/a> implementation of Linear Discriminant Analysis that can be used as a dimensionality reduction data transform. 
The &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform.<\/p>\n<p>The complete example of evaluating a model with LDA dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate lda with logistic regression algorithm for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, we can see a slight lift in performance as compared to the baseline fit on the raw data.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.825 (0.034)<\/pre>\n<\/p>\n<h3>Isomap Embedding<\/h3>\n<p>Isomap Embedding, or Isomap, creates an embedding of the dataset and attempts to preserve the relationships in the dataset.<\/p>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.Isomap.html\">Isomap class<\/a> implementation of Isomap Embedding that can be used as a 
dimensionality reduction data transform. The &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform.<\/p>\n<p>The complete example of evaluating a model with Isomap dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate isomap with logistic regression algorithm for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.manifold import Isomap\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, we can see a lift in performance with the Isomap data transform as compared to the baseline fit on the raw data.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.888 (0.029)<\/pre>\n<\/p>\n<h3>Locally Linear Embedding<\/h3>\n<p>Locally Linear Embedding, or LLE, creates an embedding of the dataset and attempts to preserve the relationships between neighborhoods in the dataset.<\/p>\n<p>The scikit-learn library provides the <a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.LocallyLinearEmbedding.html\">LocallyLinearEmbedding class<\/a> implementation of Locally Linear Embedding that can be used as a dimensionality reduction data transform. The &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform.<\/p>\n<p>The complete example of evaluating a model with LLE dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate lle and logistic regression for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.manifold import LocallyLinearEmbedding\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, we can see a lift in performance with the LLE data transform as compared to the baseline fit on the raw data.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.886 (0.028)<\/pre>\n<\/p>\n<h3>Modified Locally Linear Embedding<\/h3>\n<p>Modified Locally Linear Embedding, or Modified LLE, is an extension of Locally Linear Embedding 
that creates multiple weighting vectors for each neighborhood.<\/p>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.LocallyLinearEmbedding.html\">LocallyLinearEmbedding class<\/a> implementation of Modified Locally Linear Embedding that can be used as a dimensionality reduction data transform. The &ldquo;<em>method<\/em>&rdquo; argument must be set to &lsquo;modified&rsquo; and the &ldquo;<em>n_components<\/em>&rdquo; argument can be set to configure the number of desired dimensions in the output of the transform which must be less than the &ldquo;<em>n_neighbors<\/em>&rdquo; argument.<\/p>\n<p>The complete example of evaluating a model with Modified LLE dimensionality reduction is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate modified lle and logistic regression for classification\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.manifold import LocallyLinearEmbedding\r\nfrom sklearn.linear_model import LogisticRegression\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)\r\n# define the pipeline\r\nsteps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegression())]\r\nmodel = Pipeline(steps=steps)\r\n# evaluate model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.<\/p>\n<p>In this case, 
we can see a lift in performance with the modified LLE data transform as compared to the baseline fit on the raw data.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.846 (0.036)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/dimensionality-reduction-for-machine-learning\/\">Introduction to Dimensionality Reduction for Machine Learning<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/principal-components-analysis-for-dimensionality-reduction-in-python\/\">Principal Component Analysis for Dimensionality Reduction in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/singular-value-decomposition-for-dimensionality-reduction-in-python\/\">Singular Value Decomposition for Dimensionality Reduction in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/linear-discriminant-analysis-for-dimensionality-reduction-in-python\/\">Linear Discriminant Analysis for Dimensionality Reduction in Python<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/decomposition.html\">Decomposing signals in components, scikit-learn API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/manifold.html\">Manifold Learning, scikit-learn API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to fit and evaluate top dimensionality reduction algorithms in Python.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.<\/li>\n<li>There are many different dimensionality reduction algorithms and no single best method for all datasets.<\/li>\n<li>How to implement, fit, and evaluate top dimensionality reduction algorithms in Python with the scikit-learn machine 
learning library.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/dimensionality-reduction-algorithms-with-python\/\">6 Dimensionality Reduction Algorithms With Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/dimensionality-reduction-algorithms-with-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Dimensionality reduction is an unsupervised learning technique. Nevertheless, it can be used as a data transform pre-processing step for machine learning algorithms [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/09\/6-dimensionality-reduction-algorithms-with-python\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3645,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3644"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3644"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3644\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3645"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3644"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3644"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3644"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}