{"id":4616,"date":"2021-05-03T06:34:28","date_gmt":"2021-05-03T06:34:28","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/05\/03\/secret-behind-the-dimensionality-reduction-for-data-scientist\/"},"modified":"2021-05-03T06:34:28","modified_gmt":"2021-05-03T06:34:28","slug":"secret-behind-the-dimensionality-reduction-for-data-scientist","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/05\/03\/secret-behind-the-dimensionality-reduction-for-data-scientist\/","title":{"rendered":"Secret behind the Dimensionality Reduction for Data Scientist"},"content":{"rendered":"<p>Author: Shanthababu P<\/p>\n<div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/78698SrDS.png\" alt=\"Dimensionality Reduction Data Science\" width=\"547\" height=\"346\"><\/p>\n<p>Hello! I would like to share an interesting experience from the time I was working as a junior Data Scientist. I can even say I was a beginner in the data science domain at that time.<\/p>\n<p class=\"\">One of our customers came to us for a machine learning implementation of their problem statement, in both unsupervised and supervised forms. I thought it was going to be the usual mode of execution and process, because in my experience of small-scale implementations and during my training period we used to have 25-30 features to play around with, and we used to predict, classify, or cluster the dataset and share the outcome.<\/p>\n<p class=\"\">But this time they came up with thousands of features. I was a little surprised and scared about the implementation, and my head started spinning. At the same time, my Senior Data Scientist brought everyone from the team into the meeting room.<\/p>\n<div class=\"medium-insert-images\"><\/div>\n<p class=\"\">My\u00a0Senior Data Scientist (Sr. 
DS) introduced some new terms to us:<span>\u00a0<\/span><i><b>Dimensionality Reduction<\/b><\/i><span>\u00a0<\/span>(or<span>\u00a0<\/span><i><b>Dimension Reduction<\/b><\/i>) and the<span>\u00a0<\/span><i><b>Curse of Dimensionality<\/b><\/i>. All of us beginners thought that he was going to explain something from physics, though we vaguely remembered coming across these terms during our training programme. Then he started to sketch on the board (refer to fig-1). Looking at 1-D and 2-D we were quite comfortable, but at 3-D and above our heads started to spin.<\/p>\n<div class=\"medium-insert-images\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/146951D- 2D.png\" alt=\"Dimensionality Reduction 1D - 2D\" width=\"774\" height=\"259\"> 1-D and 2-D<\/div>\n<div class=\"medium-insert-images medium-insert-images-wide\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/719093-D.png\" alt=\"Dimensionality Reduction weather report\" width=\"460\" height=\"278\"> 3-D<\/div>\n<p>Sr. 
DS continued his lecture: all these sample pictures show just a few notable<span>\u00a0<\/span><span id=\"spancomment908\" class=\"comment-highlite\">features and we could play around<\/span><span>\u00a0<\/span>with these, but in a real-world scenario many Machine Learning (ML) problems involve thousands of features. Training those models becomes extremely slow, they do not give good solutions for the business problem, and we cannot freeze the model. This situation is the so-called \u201c<i><b>Curse of Dimensionality<\/b><\/i>\u201d at work. Then we all started asking how we should handle this.<\/p>\n<p>He took a long breath and continued to share his experience in his own style. He started with a simple definition as follows.<\/p>\n<p>\u00a0<\/p>\n<h2><b>What is Dimensionality?\u00a0<\/b><\/h2>\n<p>The number of features in our dataset is referred to as its dimensionality.<\/p>\n<p>\u00a0<\/p>\n<h2>\n<b>What is\u00a0<\/b><b>Dimensionality Reduction?<\/b><b>\u00a0<\/b><br \/>\n<\/h2>\n<p>Dimensionality Reduction is the process of reducing the number of dimensions (features) of a given dataset. 
Let\u2019s say your dataset has a hundred columns\/features and you bring the number of columns down to 20-25. In simple terms, you are converting a<span>\u00a0<\/span><i><b>Cylinder\/Sphere to a Circle<\/b><\/i><span>\u00a0<\/span>or a<span>\u00a0<\/span><i><b>Cube into a Plane<\/b><\/i><span>\u00a0<\/span>in two-dimensional space, as in the figure below.<\/p>\n<div class=\"medium-insert-images medium-insert-images-wide\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/96012DR-Shape.png\" alt=\"Dimensionality Reduction 3D-2D\" width=\"352\" height=\"285\"> Converting 3D to 2D<\/div>\n<p class=\"\">He clearly drew the relationship between<span>\u00a0<\/span><i>Model Performance<\/i><span>\u00a0<\/span>and<span>\u00a0<\/span><i>Number of Features (Dimensions)<\/i>, shown below.\u00a0As the number of features increases, the number of data points needed to cover the feature space also increases, and it does so rapidly. 
Put simply, more features call for more data samples, so that all combinations of features and their values are represented.<\/p>\n<div class=\"medium-insert-images medium-insert-images-wide\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/32524FVsP.png\" alt=\"Dimensionality Reduction Mp\" width=\"322\" height=\"246\"><i>Model Performance<\/i><span>\u00a0<\/span>vs\u00a0<i>Number of Features<\/i>\n<\/div>\n<p>Now everyone in the room had a feel for what the \u201c<span><b>Curse of Dimensionality<\/b><\/span>\u201d is at a very high level.<\/p>\n<p>\u00a0<\/p>\n<h2>Benefits of doing Dimensionality Reduction<\/h2>\n<p class=\"\">Suddenly, one of the team members asked whether he could tell us the benefits of doing\u00a0dimensionality reduction on the given dataset.<\/p>\n<p class=\"\">Our Sr. DS did not stop sharing his extensive knowledge. He continued as below.<\/p>\n<p class=\"\">There are lots of benefits if we go with\u00a0dimensionality reduction.<\/p>\n<ul>\n<li id=\"d859\">It helps to remove redundant features and noise, and it makes the given data set easier to visualize.<\/li>\n<li id=\"d859\">It reduces the memory and storage needed to hold the data.<\/li>\n<li id=\"b2ef\">It improves the performance of the model by keeping the right features and removing the unnecessary ones from the dataset.<\/li>\n<li id=\"ccaf\">A smaller number of dimensions (only the necessary ones) requires less computing power, so the model trains faster, often with improved accuracy.<\/li>\n<li id=\"7ced\">It considerably reduces the complexity of the overall model and the risk of overfitting.<\/li>\n<\/ul>\n<p>Yes! It was an awe-inspiring look at the robustness and dynamics of \u201c<b>Dimensionality Reduction\u201d<\/b><span>. 
Now I can\u00a0<\/span><span>visualize the overall benefits as below. I hope it helps you too.<\/span><\/p>\n<div class=\"medium-insert-images\">\n<p><span>\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/41402DR-benifts.png\" alt=\"Benefits of doing Dimensionality Reduction\" width=\"566\" height=\"317\"><br \/> Benefits of\u00a0Dimensionality Reduction.\n<\/div>\n<p class=\"\">What is next? Of course, we jumped to the next major question: what techniques are available for Dimensionality Reduction?<\/p>\n<p>\u00a0<\/p>\n<h2>Dimensionality Reduction \u2013 Techniques<\/h2>\n<p class=\"\">Our Sr. DS, very much interested, continued his explanation of the\u00a0techniques available in the Data Science domain. They are broadly classified into two approaches: keeping only the best-fit features (and removing the less important ones), or deriving a smaller set of new features from the original ones in the given high-dimensional dataset. 
These high-level techniques are called<span>\u00a0<\/span><b>Feature Selection<\/b><span>\u00a0<\/span>and<span>\u00a0<\/span><b>Feature Extraction<\/b>, and both are part of<span>\u00a0<\/span><b>Feature Engineering<\/b>.<span>\u00a0<\/span>He connected the dots perfectly.<\/p>\n<div class=\"medium-insert-images\">\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/64476DR-FE.png\" alt=\"Locating Dimensionality Reduction in\u00a0Feature Engineering family\" width=\"440\" height=\"388\"> Locating Dimensionality Reduction in the<b>\u00a0Feature Engineering family<\/b>\n<\/div>\n<p class=\"\">He took us deeper into the concepts to understand the big picture of applying \u201c<i><b>Dimensionality Reduction<\/b><\/i>\u201d\u00a0to a high-dimensional dataset. Once we saw the figure below, we were able to relate Feature Engineering and Dimensionality Reduction. Look at this figure: the essence of\u00a0Dimensionality Reduction, as explained by our Sr. DS, is in it!<\/p>\n<div class=\"medium-insert-images\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/26473DR-FS_FE.png\" alt=\"Dimensionality reduction\" width=\"658\" height=\"288\"><\/div>\n<p class=\"\">Everyone was interested to know how to apply all of this using Python libraries with the help of simple code. Our Sr. DS asked me to bring colorful markers and dusters.<\/p>\n<div class=\"medium-insert-images medium-insert-images-grid\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/69343MD.png\" alt=\"marker\" width=\"266\" height=\"146\"><\/div>\n<p class=\"\">Sr. 
DS picked up the new blue marker and started explaining PCA with a simple example as follows. Before that, he explained what PCA does for dimensionality reduction.<\/p>\n<p><b>Principal Component Analysis (PCA):\u00a0<\/b><span>PCA is a technique for reducing the dimensionality of a given dataset, increasing interpretability while keeping information loss small. The number of variables decreases, which makes further analysis simpler.\u00a0<\/span><span>It converts a set of correlated variables into a set of uncorrelated variables.\u00a0<\/span><span>It is often used in machine learning predictive modeling. He also advised us to read up on\u00a0<\/span><strong>Eigenvectors<\/strong> and <strong>Eigenvalues<\/strong>.<\/p>\n<p class=\"\"><span>He took the familiar\u00a0<\/span><span>wine quality dataset (loaded below from winequality.csv) for his quick analysis.<\/span><\/p>\n<div class=\"medium-insert-images medium-insert-images-wide\">\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/55529Wins.png\" alt=\"PCA\" width=\"493\" height=\"102\">\n<\/div>\n<pre># Import all the necessary packages <br>\nimport pandas as pd <br>\nimport numpy as np <br>\nimport matplotlib.pyplot as plt <br>\nimport seaborn as sns <br>\nfrom sklearn.model_selection import train_test_split <br>\nfrom sklearn.linear_model import LinearRegression <br>\nfrom sklearn.metrics import confusion_matrix <br>\nfrom sklearn.metrics import accuracy_score <br>\nfrom sklearn import metrics <br>\n%matplotlib inline <br><br>\nwq_dataset = pd.read_csv('winequality.csv')<\/pre>\n<h5><b>EDA on a given data set<\/b><\/h5>\n<pre>wq_dataset.head(5)<\/pre>\n<div class=\"medium-insert-images medium-insert-images-wide\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/45603wins-head.png\" alt=\"dataset.head\" width=\"1047\" 
height=\"173\"><\/div>\n<pre>wq_dataset.describe()<\/pre>\n<div class=\"medium-insert-images\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/29591Describe.png\" alt=\"describe\" width=\"1100\" height=\"283\"><\/div>\n<pre>wq_dataset.isnull().any()<\/pre>\n<div class=\"medium-insert-images medium-insert-images-grid\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/37310Null check.png\" alt=\"null\" width=\"263\" height=\"245\"><\/div>\n<p class=\"\">There are no null values in the given data set, so we\u2019re lucky.<\/p>\n<div class=\"medium-insert-images medium-insert-images-left\"><\/div>\n<h4>Find correlations of each feature<\/h4>\n<pre>correlations = wq_dataset.corr()['quality'].drop('quality') <br>\nprint(correlations)<\/pre>\n<div class=\"medium-insert-images medium-insert-images-grid\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/20629correlations.png\" alt=\"correlations\" width=\"214\" height=\"173\"><\/div>\n<h4>Correlation Representation using Heatmap<\/h4>\n<pre>sns.heatmap(wq_dataset.corr()) <br>\nplt.show()<\/pre>\n<div class=\"medium-insert-images medium-insert-images-wide\">\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/77232cor-HM.png\" alt=\"correlation dimensionality reduction\" width=\"489\" height=\"371\">\n<\/div>\n<pre>x = wq_dataset[features] <br>\ny = wq_dataset['quality']<\/pre>\n<p class=\"\"><span>Here, features holds the selected feature columns: [\u2018fixed acidity\u2019, \u2018volatile acidity\u2019, \u2018citric acid\u2019, \u2018chlorides\u2019, \u2018total sulfur dioxide\u2019, \u2018density\u2019, \u2018sulphates\u2019, \u2018alcohol\u2019]<\/span><\/p>\n<div><b># Create training and testing sets using train_test_split<\/b><\/div>\n<pre>x_train, x_test, y_train, y_test = 
train_test_split(x, y, random_state=3)<\/pre>\n<p class=\"\"><b>Training and Testing Shape<\/b><\/p>\n<pre>print('Training data shape:', x_train.shape) <br>\nprint('Testing data shape:', x_test.shape)<\/pre>\n<pre>Training data shape: (1199, 8) <br>\nTesting data shape: (400, 8)<\/pre>\n<p class=\"\"><span><b>PCA implementation for Dimensionality reduction (with 2 columns)<\/b><\/span><\/p>\n<pre>from sklearn.decomposition import PCA <br>\npca_wins = PCA(n_components=2) <br>\nprincipalComponents_wins = pca_wins.fit_transform(x)<\/pre>\n<p class=\"\">Naming them\u00a0<i><b>principal component 1, principal component 2<\/b><\/i><\/p>\n<pre>pcs_wins_df = pd.DataFrame(data = principalComponents_wins, columns = ['principal component 1', 'principal component 2'])<\/pre>\n<p class=\"\">New\u00a0<span>principal components and their values.<\/span><\/p>\n<pre>pcs_wins_df.head()<\/pre>\n<div class=\"medium-insert-images medium-insert-images-grid\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/75634pc12.png\" alt=\"principal components and their values\" width=\"362\" height=\"174\"><\/div>\n<p class=\"\"><span>We were all surprised when we looked at the two columns above, with their new column names and values. We asked what had happened to the<i>\u00a0<\/i><\/span><span><i>\u2018fixed acidity\u2019, \u2018volatile acidity\u2019, \u2018citric acid\u2019, \u2018chlorides\u2019, \u2018total sulfur dioxide\u2019, \u2018density\u2019, \u2018sulphates\u2019, \u2018alcohol\u2019<\/i>\u00a0columns. Sr. 
DS said they were all gone, replaced by principal components that each combine the original columns: after applying PCA for dimensionality reduction on the given data we have just two columns, and from here we implement a few models in the normal way.<\/span><\/p>\n<p class=\"\">He mentioned one key term:<span>\u00a0<\/span><b>\u201cvariation per principal component\u201d<\/b>.<\/p>\n<p class=\"\">This is the fraction of<span>\u00a0<\/span><b>variance explained<\/b><span>\u00a0<\/span>by a<span>\u00a0<\/span><b>principal component<\/b>, that is, the ratio between the<span>\u00a0<\/span><b>variance<\/b><span>\u00a0<\/span>of that<span>\u00a0<\/span><b>principal component<\/b><span>\u00a0<\/span>and the total<span>\u00a0<\/span><b>variance<\/b>.<\/p>\n<pre>print('Explained variation per principal component: {}'.format(pca_wins.explained_variance_ratio_))<\/pre>\n<pre><b>Explained variation per principal component: [0.99615166 0.00278501]<\/b><\/pre>\n<p class=\"\">Following this, he demonstrated the following models:<\/p>\n<div>\n<ul>\n<li>Logistic Regression<\/li>\n<li>Random forest<\/li>\n<li>KNN<\/li>\n<li>Naive Bayes<\/li>\n<\/ul>\n<\/div>\n<p class=\"\">Accuracy was better, with little difference among the models, but he mentioned that this is for the PCA implementation. Everyone in the room felt that we had completed an excellent\u00a0roller-coaster ride. He advised us to get hands-on with the other Dimensionality Reduction techniques.<\/p>\n<div class=\"medium-insert-images medium-insert-images-wide\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone align-center\" src=\"https:\/\/editor.analyticsvidhya.com\/uploads\/19502roller coaster.png\" alt=\"Dimensionality Reduction - Techniques\" width=\"420\" height=\"315\"><\/div>\n<p class=\"\">Okay, guys! Thanks for your time. I hope I was able to narrate my learning experience of\u00a0Dimensionality Reduction techniques in the right way here, and I trust it will help you continue the journey of handling complex data sets in machine learning problem statements. 
Cheers!<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1048978\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Shanthababu P Hello! I like to share my interesting experience While I was working as a junior Data Scientist, I can even say I [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/05\/03\/secret-behind-the-dimensionality-reduction-for-data-scientist\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":4617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4616"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4616"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4616\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/4617"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4616"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4616"},{"taxonomy":"post_tag","embeddable":tr
ue,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4616"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}