{"id":5532,"date":"2022-04-03T06:29:10","date_gmt":"2022-04-03T06:29:10","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/03\/a-guide-to-getting-datasets-for-machine-learning-in-python\/"},"modified":"2022-04-03T06:29:10","modified_gmt":"2022-04-03T06:29:10","slug":"a-guide-to-getting-datasets-for-machine-learning-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/03\/a-guide-to-getting-datasets-for-machine-learning-in-python\/","title":{"rendered":"A Guide to Getting Datasets for Machine Learning in Python"},"content":{"rendered":"<p>Author: Adrian Tam<\/p>\n<div>\n<p>Compared to other programming exercises, a machine learning project is a blend of code and data. You need both to achieve the result and do something useful. Over the years, many well-known datasets have been created, and many have become standards or benchmarks. In this tutorial, we are going to see how we can obtain those well-known public datasets easily. 
We will also learn how to make a synthetic dataset if none of the existing datasets fits our needs.<\/p>\n<p>After finishing this tutorial, you will know:<\/p>\n<ul>\n<li>Where to look for freely available datasets for machine learning projects<\/li>\n<li>How to download datasets using libraries in Python<\/li>\n<li>How to generate synthetic datasets using scikit-learn<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_13374\" style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-13374\" class=\"size-full wp-image-13374\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-scaled.jpg\" alt=\"\" width=\"800\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-scaled.jpg 2560w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-1024x683.jpg 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-768x512.jpg 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-1536x1024.jpg 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-2048x1365.jpg 2048w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/pexels-olha-ruskykh-7166023-600x400.jpg 600w\" sizes=\"(max-width: 2560px) 100vw, 2560px\"><\/p>\n<p id=\"caption-attachment-13374\" class=\"wp-caption-text\">A Guide to Getting Datasets for Machine Learning in Python<br \/>Photo by <a href=\"https:\/\/www.pexels.com\/photo\/close-up-shot-of-cassette-tapes-with-small-pieces-of-flowers-7166023\/\">Olha Ruskykh<\/a>. 
Some rights reserved.<\/p>\n<\/div>\n<h2 id=\"Tutorial-Overview\">Tutorial Overview<\/h2>\n<p>This tutorial is divided into four parts; they are:<\/p>\n<ol>\n<li>Dataset repositories<\/li>\n<li>Retrieving dataset in scikit-learn and Seaborn<\/li>\n<li>Retrieving dataset in TensorFlow<\/li>\n<li>Generating dataset in scikit-learn<\/li>\n<\/ol>\n<h2 id=\"Loading-Financial-Data-Using-Pandas-Datareader\">Dataset Repositories<\/h2>\n<p>Machine learning has been developed for decades, and therefore there are some datasets of historical significance. One of the most well-known repositories for these datasets is the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/index.php\" target=\"_blank\" rel=\"noopener\">UCI Machine Learning Repository<\/a>. Most of the datasets over there are small in size because the technology at the time was not advanced enough to handle larger size data. Some famous datasets located in this repository are the iris flower dataset (introduced by Ronald Fisher in 1936) and the 20 newsgroups dataset (textual data usually referred to by information retrieval literature).<\/p>\n<p>Newer datasets are usually larger in size. For example, the ImageNet dataset is over 160 GB. These datasets are commonly found in <a href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noopener\">Kaggle<\/a>, and we can search them by name. If we need to download them, it is recommended to use Kaggle\u2019s command line tool after registering for an account.<\/p>\n<p><a href=\"https:\/\/www.openml.org\/\" target=\"_blank\" rel=\"noopener\">OpenML<\/a> is a newer repository that hosts a lot of datasets. It is convenient because you can search for the dataset by name, but it also has a standardized web API for users to retrieve data. It would be useful if you want to use Weka since it provides files in ARFF format.<\/p>\n<p>But still, many datasets are publicly available but not in these repositories for various reasons. 
You may also want to check out the \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_datasets_for_machine-learning_research\">List of datasets for machine-learning research<\/a>\u201d page on Wikipedia. That page contains a long list of datasets grouped into categories, with links to download them.<\/p>\n<h2>Retrieving Datasets in scikit-learn and Seaborn<\/h2>\n<p>Trivially, you may obtain those datasets by downloading them from the web, either through the browser, from the command line using the <code>wget<\/code> tool, or with network libraries such as <code>requests<\/code> in Python. Since some of those datasets have become a standard or benchmark, many machine learning libraries have created functions to help retrieve them. For practical reasons, the datasets are often not shipped with the libraries but downloaded when you invoke the functions. Therefore, you need to have a steady internet connection to use them.<\/p>\n<p>Scikit-learn is an example where you can download the dataset using its API. The related functions are defined under <code>sklearn.datasets<\/code>, and you may see the list of functions at:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.datasets\">https:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.datasets<\/a><\/li>\n<\/ul>\n<p>For example, you can use the function <code>load_iris()<\/code> to get the iris flower dataset as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import sklearn.datasets\r\n\r\ndata, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)\r\ndata[\"target\"] = target\r\nprint(data)<\/pre>\n<p>The <code>load_iris()<\/code> function returns NumPy arrays (i.e., without column headers) instead of a pandas DataFrame unless the argument <code>as_frame=True<\/code> is specified. 
Also, we pass\u00a0<code>return_X_y=True<\/code> to the function, so only the machine learning features and targets are returned, rather than some metadata such as the description of the dataset. The above code prints the following:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target\r\n0                  5.1               3.5                1.4               0.2       0\r\n1                  4.9               3.0                1.4               0.2       0\r\n2                  4.7               3.2                1.3               0.2       0\r\n3                  4.6               3.1                1.5               0.2       0\r\n4                  5.0               3.6                1.4               0.2       0\r\n..                 ...               ...                ...               ...     ...\r\n145                6.7               3.0                5.2               2.3       2\r\n146                6.3               2.5                5.0               1.9       2\r\n147                6.5               3.0                5.2               2.0       2\r\n148                6.2               3.4                5.4               2.3       2\r\n149                5.9               3.0                5.1               1.8       2\r\n\r\n[150 rows x 5 columns]<\/pre>\n<p>Separating the features and targets is convenient for training a scikit-learn model, but combining them would be helpful for visualization. 
For example, we may combine the DataFrame as above and then visualize the correlogram using Seaborn:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import sklearn.datasets\r\nimport matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n\r\ndata, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)\r\ndata[\"target\"] = target\r\n \r\nsns.pairplot(data, kind=\"scatter\", diag_kind=\"kde\", hue=\"target\",\r\n             palette=\"muted\", plot_kws={'alpha':0.7})\r\nplt.show()<\/pre>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-13375\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-1.png\" alt=\"\" width=\"800\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-1.png 1508w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-1-300x282.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-1-1024x962.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-1-768x721.png 768w\" sizes=\"(max-width: 1508px) 100vw, 1508px\"><\/p>\n<p>From the correlogram, we can see that target 0 is easy to distinguish, but targets 1 and 2 usually have some overlap. Because this dataset is also useful to demonstrate plotting functions, we can find the equivalent data loading function from Seaborn. 
We can rewrite the above into the following:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n \r\ndata = sns.load_dataset(\"iris\")\r\nsns.pairplot(data, kind=\"scatter\", diag_kind=\"kde\", hue=\"species\",\r\n             palette=\"muted\", plot_kws={'alpha':0.7})\r\nplt.show()<\/pre>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-13376\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2.png\" alt=\"\" width=\"800\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2.png 1592w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2-300x266.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2-1024x907.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2-768x680.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-2-1536x1360.png 1536w\" sizes=\"(max-width: 1592px) 100vw, 1592px\"><\/p>\n<p>Seaborn supports a more limited set of datasets. We can see the names of all supported datasets by running:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import seaborn as sns\r\nprint(sns.get_dataset_names())<\/pre>\n<p>The call above prints the names of the datasets available from Seaborn:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes',\r\n'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser',\r\n'iris', 'mpg', 'penguins', 'planets', 'taxis', 'tips', 'titanic']<\/pre>\n<p>There are a handful of similar functions to load the \u201c<a href=\"https:\/\/scikit-learn.org\/stable\/datasets\/toy_dataset.html\" target=\"_blank\" rel=\"noopener\">toy datasets<\/a>\u201d from scikit-learn. 
For example, we have <code>load_wine()<\/code>\u00a0and\u00a0<code>load_diabetes()<\/code>\u00a0defined in similar fashion.<\/p>\n<p>Larger datasets are also similar. We have <code>fetch_california_housing()<\/code>, for example, that needs to download the dataset from the internet (hence the \u201cfetch\u201d in the function name). Scikit-learn documentation calls these the \u201c<a href=\"https:\/\/scikit-learn.org\/stable\/datasets\/real_world.html\" target=\"_blank\" rel=\"noopener\">real-world datasets<\/a>,\u201d but, in fact, the toy datasets are equally real.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import sklearn.datasets\r\n\r\ndata = sklearn.datasets.fetch_california_housing(return_X_y=False, as_frame=True)\r\ndata = data[\"frame\"]\r\nprint(data)<\/pre>\n<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal\r\n0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526\r\n1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585\r\n2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521\r\n3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413\r\n4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422\r\n...       ...       ...       ...        ...         ...       ...       ...        ...          
...\r\n20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48    -121.09        0.781\r\n20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49    -121.21        0.771\r\n20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43    -121.22        0.923\r\n20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43    -121.32        0.847\r\n20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37    -121.24        0.894\r\n\r\n[20640 rows x 9 columns]<\/pre>\n<p>If we need more than these, scikit-learn provides a handy function to read any dataset from OpenML. For example,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import sklearn.datasets\r\n\r\ndata = sklearn.datasets.fetch_openml(\"diabetes\", version=1, as_frame=True, return_X_y=False)\r\ndata = data[\"frame\"]\r\nprint(data)<\/pre>\n<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">preg   plas  pres  skin   insu  mass   pedi   age            class\r\n0     6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive\r\n1     1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative\r\n2     8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive\r\n3     1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative\r\n4     0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive\r\n..    ...    ...   ...   ...    ...   ...    ...   ...              
...\r\n763  10.0  101.0  76.0  48.0  180.0  32.9  0.171  63.0  tested_negative\r\n764   2.0  122.0  70.0  27.0    0.0  36.8  0.340  27.0  tested_negative\r\n765   5.0  121.0  72.0  23.0  112.0  26.2  0.245  30.0  tested_negative\r\n766   1.0  126.0  60.0   0.0    0.0  30.1  0.349  47.0  tested_positive\r\n767   1.0   93.0  70.0  31.0    0.0  30.4  0.315  23.0  tested_negative\r\n\r\n[768 rows x 9 columns]<\/pre>\n<p>Sometimes, we should not use the name to identify a dataset in OpenML as there may be multiple datasets of the same name. We can search for the data ID on OpenML and use it in the function as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import sklearn.datasets\r\n\r\ndata = sklearn.datasets.fetch_openml(data_id=42437, return_X_y=False, as_frame=True)\r\ndata = data[\"frame\"]\r\nprint(data)<\/pre>\n<p>The data ID in the code above refers to the titanic dataset. We can extend the code into the following to show how we can obtain the titanic dataset and then run the logistic regression:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.linear_model import LogisticRegression\r\nfrom sklearn.datasets import fetch_openml\r\n\r\nX, y = fetch_openml(data_id=42437, return_X_y=True, as_frame=False)\r\nclf = LogisticRegression(random_state=0).fit(X, y)\r\nprint(clf.score(X,y)) # accuracy\r\nprint(clf.coef_)      # coefficient in logistic regression<\/pre>\n<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">0.8114478114478114\r\n[[-0.7551392   2.24013347 -0.20761281  0.28073571  0.24416706 -0.36699113\r\n   0.4782924 ]]<\/pre>\n<\/p>\n<h2 id=\"Retrieving-dataset-in-TensorFlow\">Retrieving Datasets in TensorFlow<\/h2>\n<p>Besides scikit-learn, TensorFlow is another tool that we can use for machine learning projects. For similar reasons, there is also a dataset API for TensorFlow that gives you the dataset in a format that works best with TensorFlow. 
Unlike scikit-learn, the API is not part of the standard TensorFlow package. You need to install it using the command:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">pip install tensorflow-datasets<\/pre>\n<p>The list of all datasets is available on the catalog:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.tensorflow.org\/datasets\/catalog\/overview#all_datasets\" target=\"_blank\" rel=\"noopener\">https:\/\/www.tensorflow.org\/datasets\/catalog\/overview#all_datasets<\/a><\/li>\n<\/ul>\n<p>All datasets are identified by a name. The names can be found in the catalog above. You may also get a list of names using the following:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import tensorflow_datasets as tfds\r\nprint(tfds.list_builders())<\/pre>\n<p>which prints more than 1,000 names.<\/p>\n<p>As an example, let\u2019s pick the MNIST handwritten digits dataset. We can download the data as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import tensorflow_datasets as tfds\r\nds = tfds.load(\"mnist\", split=\"train\", shuffle_files=True)\r\nprint(ds)<\/pre>\n<p>This shows us that <code>tfds.load()<\/code> gives us an object of type <code>tensorflow.data.OptionsDataset<\/code>:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">&lt;_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}&gt;<\/pre>\n<p>In particular, this dataset stores each data instance (image) as a tensor of shape (28,28,1), and the targets (labels) are scalars.<\/p>\n<p>With minor polishing, the data is ready for use in the Keras <code>fit()<\/code> function. 
An example is as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import tensorflow as tf\r\nimport tensorflow_datasets as tfds\r\nfrom tensorflow.keras.models import Sequential\r\nfrom tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten\r\nfrom tensorflow.keras.callbacks import EarlyStopping\r\n\r\n# Read data with train-test split\r\nds_train, ds_test = tfds.load(\"mnist\", split=['train', 'test'],\r\n                              shuffle_files=True, as_supervised=True)\r\n\r\n# Set up BatchDataset from the OptionsDataset object\r\nds_train = ds_train.batch(32)\r\nds_test = ds_test.batch(32)\r\n\r\n# Build LeNet5 model and fit\r\nmodel = Sequential([\r\n    Conv2D(6, (5,5), input_shape=(28,28,1), padding=\"same\", activation=\"tanh\"),\r\n    AveragePooling2D((2,2), strides=2),\r\n    Conv2D(16, (5,5), activation=\"tanh\"),\r\n    AveragePooling2D((2,2), strides=2),\r\n    Conv2D(120, (5,5), activation=\"tanh\"),\r\n    Flatten(),\r\n    Dense(84, activation=\"tanh\"),\r\n    Dense(10, activation=\"softmax\")\r\n])\r\nmodel.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"adam\", metrics=[\"sparse_categorical_accuracy\"])\r\nearlystopping = EarlyStopping(monitor=\"val_loss\", patience=2, restore_best_weights=True)\r\nmodel.fit(ds_train, validation_data=ds_test, epochs=100, callbacks=[earlystopping])<\/pre>\n<p>If we provided <code>as_supervised=True<\/code>, the dataset would be records of tuples (features, targets) instead of the dictionary. It is required for Keras. Moreover, to use the dataset in the <code>fit()<\/code> function, we need to create an iterable of batches. This is done by setting up the batch size of the dataset to convert it from <code>OptionsDataset<\/code> object into <code>BatchDataset<\/code> object.<\/p>\n<p>We applied the LeNet5 model for the image classification. 
But since the target in the dataset is a numerical value (0 to 9) rather than a Boolean vector, we ask Keras to convert the softmax output vector into a number before computing accuracy and loss by specifying <code>sparse_categorical_accuracy<\/code>\u00a0and\u00a0<code>sparse_categorical_crossentropy<\/code> in the <code>compile()<\/code>\u00a0function.<\/p>\n<p>The key here is to understand every dataset is in a different shape. When you use it with your TensorFlow model, you need to adapt your model to fit the dataset.<\/p>\n<h2 id=\"Generating-dataset-in-scikit-learn\">Generating Datasets in scikit-learn<\/h2>\n<p>In scikit-learn, there is a set of very useful functions to generate a dataset with particular properties. Because we can control the properties of the synthetic dataset, it is helpful to evaluate the performance of our models in a specific situation that is not commonly seen in other datasets.<\/p>\n<p>Scikit-learn documentation calls these functions the <strong>samples generator<\/strong>. 
They are easy to use; for example:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_circles\r\nimport matplotlib.pyplot as plt\r\n\r\ndata, target = make_circles(n_samples=500, shuffle=True, factor=0.7, noise=0.1)\r\nplt.figure(figsize=(6,6))\r\nplt.scatter(data[:,0], data[:,1], c=target, alpha=0.8, cmap=\"Set1\")\r\nplt.show()<\/pre>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-13377\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-3.png\" alt=\"\" width=\"400\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-3.png 754w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-3-300x279.png 300w\" sizes=\"(max-width: 754px) 100vw, 754px\"><\/p>\n<p>The\u00a0<code>make_circles()<\/code> function generates coordinates of scattered points in a 2D plane such that there are two classes positioned in the form of concentric circles. We can control the size and overlap of the circles with the arguments <code>factor<\/code>\u00a0and\u00a0<code>noise<\/code>. This synthetic dataset is helpful to evaluate classification models such as a support vector machine since there is no linear separator available.<\/p>\n<p>The output from\u00a0<code>make_circles()<\/code> is always in two classes, and the coordinates are always in 2D. But some other functions can generate points of more classes or in higher dimensions, such as <code>make_blobs()<\/code>. 
In the example below, we generate a dataset in 3D with 4 classes:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_blobs\r\nimport matplotlib.pyplot as plt\r\n\r\ndata, target = make_blobs(n_samples=500, n_features=3, centers=4,\r\n                          shuffle=True, random_state=42, cluster_std=2.5)\r\n\r\nfig = plt.figure(figsize=(8,8))\r\nax = fig.add_subplot(projection='3d')\r\nax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap=\"Set1\")\r\nplt.show()<\/pre>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-13379\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-4.png\" alt=\"\" width=\"400\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-4.png 854w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-4-300x282.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-4-768x723.png 768w\" sizes=\"(max-width: 854px) 100vw, 854px\"><\/p>\n<p>There are also some functions to generate a dataset for regression problems. 
For example, <code>make_s_curve()<\/code>\u00a0and\u00a0<code>make_swiss_roll()<\/code>\u00a0will generate coordinates in 3D with targets as continuous values.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_s_curve, make_swiss_roll\r\nimport matplotlib.pyplot as plt\r\n\r\ndata, target = make_s_curve(n_samples=5000, random_state=42)\r\n\r\nfig = plt.figure(figsize=(15,8))\r\nax = fig.add_subplot(121, projection='3d')\r\nax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap=\"viridis\")\r\n\r\ndata, target = make_swiss_roll(n_samples=5000, random_state=42)\r\nax = fig.add_subplot(122, projection='3d')\r\nax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap=\"viridis\")\r\n\r\nplt.show()<\/pre>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-13378\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5.png\" alt=\"\" width=\"1700\" height=\"726\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5.png 1700w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5-300x128.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5-1024x437.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5-768x328.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/getdata-5-1536x656.png 1536w\" sizes=\"(max-width: 1700px) 100vw, 1700px\"><\/p>\n<p>If we prefer not to look at the data from a geometric perspective, there are also <code>make_classification()<\/code>\u00a0and\u00a0<code>make_regression()<\/code>. 
Compared to the other functions, these two provide us more control over the feature sets, such as introducing some redundant or irrelevant features.<\/p>\n<p>Below is an example of using\u00a0<code>make_regression()<\/code>\u00a0to generate a dataset and run linear regression with it:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_regression\r\nfrom sklearn.linear_model import LinearRegression\r\nimport numpy as np\r\n\r\n# Generate 10-dimensional features and 1-dimensional targets\r\nX, y = make_regression(n_samples=500, n_features=10, n_targets=1, n_informative=4,\r\n                       noise=0.5, bias=-2.5, random_state=42)\r\n\r\n# Run linear regression on the data\r\nreg = LinearRegression()\r\nreg.fit(X, y)\r\n\r\n# Print the coefficient and intercept found\r\nwith np.printoptions(precision=5, linewidth=100, suppress=True):\r\n    print(np.array(reg.coef_))\r\n    print(reg.intercept_)<\/pre>\n<p>In the example above, we created 10-dimensional features, but only 4 of them are informative. Hence from the result of the regression, we found only 4 of the coefficients are significantly non-zero.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">[-0.00435 -0.02232 19.0113   0.04391 46.04906 -0.02882 -0.05692 28.61786 -0.01839 16.79397]\r\n-2.5106367126731413<\/pre>\n<p>An example of using\u00a0<code>make_classification()<\/code> similarly is as follows. 
A support vector machine classifier is used in this case:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_classification\r\nfrom sklearn.svm import SVC\r\nimport numpy as np\r\n\r\n# Generate 10-dimensional features and 3-class targets\r\nX, y = make_classification(n_samples=1000, n_features=10, n_classes=3,\r\n                           n_informative=4, n_redundant=2, n_repeated=1,\r\n                           random_state=42)\r\n\r\n# Run SVC on the data\r\nclf = SVC(kernel=\"rbf\")\r\nclf.fit(X, y)\r\n\r\n# Print the accuracy\r\nprint(clf.score(X, y))<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Repositories<\/h3>\n<ul>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/index.php\">UCI machine learning repository<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a><\/li>\n<li><a href=\"https:\/\/www.openml.org\/\">OpenML<\/a><\/li>\n<li>Wikipedia,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_datasets_for_machine-learning_research\" target=\"_blank\" rel=\"noopener\">https:\/\/en.wikipedia.org\/wiki\/List_of_datasets_for_machine-learning_research<\/a>\n<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_datasets_for_machine-learning_research\">List of datasets for machine-learning research<\/a>, Wikipedia<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/datasets\/toy_dataset.html\">scikit-learn toy datasets<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/datasets\/real_world.html\">scikit-learn real-world datasets<\/a><\/li>\n<li><a href=\"https:\/\/www.tensorflow.org\/datasets\/catalog\/overview#all_datasets\">TensorFlow datasets catalog<\/a><\/li>\n<li><a href=\"https:\/\/www.tensorflow.org\/datasets\/keras_example\">Training a neural network on MNIST with Keras using TensorFlow Datasets<\/a><\/li>\n<\/ul>\n<h3 
id=\"Books\">APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/docs\/api\">Kaggle API and tools<\/a><\/li>\n<li><a href=\"https:\/\/www.tensorflow.org\/datasets\">TensorFlow datasets<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.datasets\">scikit-learn datasets<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/classes.html#samples-generator\">scikit-learn samples generator<\/a><\/li>\n<\/ul>\n<h2 id=\"Summary\">Summary<\/h2>\n<p>In this tutorial, you discovered various options for loading a common dataset or generating one in Python.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to use the dataset API in scikit-learn, Seaborn, and TensorFlow to load common machine learning datasets<\/li>\n<li>The small differences in the format of the dataset returned by different APIs and how to use them<\/li>\n<li>How to generate a dataset using scikit-learn<\/li>\n<\/ul>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/a-guide-to-getting-datasets-for-machine-learning-in-python\/\">A Guide to Getting Datasets for Machine Learning in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/a-guide-to-getting-datasets-for-machine-learning-in-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Adrian Tam Compared to other programming exercises, a machine learning project is a blend of code and data. 
You need both to achieve the [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/03\/a-guide-to-getting-datasets-for-machine-learning-in-python\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":5533,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5532"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5532"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5532\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/5533"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5532"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5532"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5532"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}