{"id":4428,"date":"2021-02-23T18:00:49","date_gmt":"2021-02-23T18:00:49","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/23\/sensitivity-analysis-of-dataset-size-vs-model-performance\/"},"modified":"2021-02-23T18:00:49","modified_gmt":"2021-02-23T18:00:49","slug":"sensitivity-analysis-of-dataset-size-vs-model-performance","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/23\/sensitivity-analysis-of-dataset-size-vs-model-performance\/","title":{"rendered":"Sensitivity Analysis of Dataset Size vs. Model Performance"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Machine learning model performance often improves with dataset size for predictive modeling.<\/p>\n<p>This depends on the specific datasets and on the choice of model, although it often means that using more data can result in better performance and that discoveries made using smaller datasets to estimate model performance often scale to using larger datasets.<\/p>\n<p>The problem is the relationship is unknown for a given dataset and model, and may not exist for some datasets and models. Additionally, if such a relationship does exist, there may be a point or points of diminishing returns where adding more data may not improve model performance or where datasets are too small to effectively capture the capability of a model at a larger scale.<\/p>\n<p>These issues can be addressed by performing a <strong>sensitivity analysis<\/strong> to quantify the relationship between dataset size and model performance. Once calculated, we can interpret the results of the analysis and make decisions about how much data is enough, and how small a dataset may be to effectively estimate performance on larger datasets.<\/p>\n<p>In this tutorial, you will discover how to perform a sensitivity analysis of dataset size vs. 
model performance.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Selecting a dataset size for machine learning is a challenging open problem.<\/li>\n<li>Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.<\/li>\n<li>How to perform a sensitivity analysis of dataset size and interpret the results.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_12216\" style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12216\" loading=\"lazy\" class=\"size-full wp-image-12216\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/05\/Sensitivity-Analysis-of-Dataset-Size-vs.-Model-Performance.jpg\" alt=\"Sensitivity Analysis of Dataset Size vs. Model Performance\" width=\"800\" height=\"531\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/05\/Sensitivity-Analysis-of-Dataset-Size-vs.-Model-Performance.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/05\/Sensitivity-Analysis-of-Dataset-Size-vs.-Model-Performance-300x199.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/05\/Sensitivity-Analysis-of-Dataset-Size-vs.-Model-Performance-768x510.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-12216\" class=\"wp-caption-text\">Sensitivity Analysis of Dataset Size vs. 
Model Performance<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/graeme\/10628420113\/\">Graeme Churchard<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Dataset Size Sensitivity Analysis<\/li>\n<li>Synthetic Prediction Task and Baseline Model<\/li>\n<li>Sensitivity Analysis of Dataset Size<\/li>\n<\/ol>\n<h2>Dataset Size Sensitivity Analysis<\/h2>\n<p>The amount of training data required for a machine learning predictive model is an open question.<\/p>\n<p>It depends on your choice of model, on the way you prepare the data, and on the specifics of the data itself.<\/p>\n<p>For more on the challenge of selecting a training dataset size, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/much-training-data-required-machine-learning\/\">How Much Training Data is Required for Machine Learning?<\/a><\/li>\n<\/ul>\n<p>One way to approach this problem is to perform a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sensitivity_analysis\">sensitivity analysis<\/a> and discover how the performance of your model on your dataset varies with more or less data.<\/p>\n<p>This might involve evaluating the same model with different sized datasets and looking for a relationship between dataset size and performance or a point of diminishing returns.<\/p>\n<p>Typically, there is a strong relationship between training dataset size and model performance, especially for nonlinear models. 
The relationship often involves an improvement in performance up to a point and a general reduction in the expected variance of the model as the dataset size is increased.<\/p>\n<p>Knowing this relationship for your model and dataset can be helpful for a number of reasons, such as:<\/p>\n<ul>\n<li>Evaluate more models.<\/li>\n<li>Find a better model.<\/li>\n<li>Decide to gather more data.<\/li>\n<\/ul>\n<p>You can evaluate a large number of models and model configurations quickly on a smaller sample of the dataset with confidence that the performance will likely generalize in a specific way to a larger training dataset.<\/p>\n<p>This may allow you to evaluate many more models and configurations than would otherwise be possible in the time available and, in turn, perhaps discover a better-performing model overall.<\/p>\n<p>You may also be able to extrapolate model performance to much larger datasets and estimate whether it is worth the effort or expense of gathering more training data.<\/p>\n<p>Now that we are familiar with the idea of performing a sensitivity analysis of model performance to dataset size, let\u2019s look at a worked example.<\/p>\n<h2>Synthetic Prediction Task and Baseline Model<\/h2>\n<p>Before we dive into a sensitivity analysis, let\u2019s select a dataset and baseline model for the investigation.<\/p>\n<p>We will use a synthetic binary (two-class) classification dataset in this tutorial. This is ideal as it allows us to scale the number of generated samples for the same problem as needed.<\/p>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() scikit-learn function<\/a> can be used to create a synthetic classification dataset. In this case, we will use 20 input features (columns) and generate 1,000 samples (rows). 
The seed for the pseudo-random number generator is fixed to ensure the same base \u201cproblem\u201d is used each time samples are generated.<\/p>\n<p>The example below generates the synthetic classification dataset and summarizes the shape of the generated data.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># test classification dataset\r\nfrom sklearn.datasets import make_classification\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)\r\n# summarize the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example generates the data and reports the size of the input and output components, confirming the expected shape.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">(1000, 20) (1000,)<\/pre>\n<p>Next, we can evaluate a predictive model on this dataset.<\/p>\n<p>We will use a decision tree (<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">DecisionTreeClassifier<\/a>) as the predictive model. 
It was chosen because it is a nonlinear algorithm and has a high variance, which means that we would expect performance to improve with increases in the size of the training dataset.<\/p>\n<p>We will use a best practice of <a href=\"https:\/\/machinelearningmastery.com\/repeated-k-fold-cross-validation-with-python\/\">repeated stratified k-fold cross-validation<\/a> to evaluate the model on the dataset, with 3 repeats and 10 folds.<\/p>\n<p>The complete example of evaluating the decision tree model on the synthetic classification dataset is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a decision tree model on the synthetic classification dataset\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\n# load dataset\r\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)\r\n# define model evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# evaluate model\r\nscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))<\/pre>\n<p>Running the example creates the dataset then estimates the performance of the model on the problem using the chosen test harness.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. 
Consider running the example a few times and comparing the average outcome.<\/p>\n<p>In this case, we can see that the mean classification accuracy is about 82.7%.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.827 (0.042)<\/pre>\n<p>Next, let\u2019s look at how we might perform a sensitivity analysis of dataset size on model performance.<\/p>\n<h2>Sensitivity Analysis of Dataset Size<\/h2>\n<p>The previous section showed how to evaluate a chosen model on the available dataset.<\/p>\n<p>It raises questions, such as:<\/p>\n<blockquote>\n<p>Will the model perform better on more data?<\/p>\n<\/blockquote>\n<p>More generally, we may have more sophisticated questions, such as:<\/p>\n<blockquote>\n<p>Does the estimated performance hold on smaller or larger samples from the problem domain?<\/p>\n<\/blockquote>\n<p>These are hard questions to answer, but we can approach them by using a sensitivity analysis. Specifically, we can use a sensitivity analysis to learn:<\/p>\n<blockquote>\n<p>How sensitive is model performance to dataset size?<\/p>\n<\/blockquote>\n<p>Or more generally:<\/p>\n<blockquote>\n<p>What is the relationship of dataset size to model performance?<\/p>\n<\/blockquote>\n<p>There are many ways to perform a sensitivity analysis, but perhaps the simplest approach is to define a test harness to evaluate model performance and then evaluate the same model on the same problem with differently sized datasets.<\/p>\n<p>Because the test harness uses k-fold cross-validation, both the train and test portions of the dataset grow with the size of the overall dataset.<\/p>\n<p>To make the code easier to read, we will split it up into functions.<\/p>\n<p>First, we can define a function that will prepare (or load) the dataset of a given size. 
The number of rows in the dataset is specified by an argument to the function.<\/p>\n<p>If you are using this code as a template, this function can be changed to load your dataset from file and select a random sample of a given size.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># load dataset\r\ndef load_dataset(n_samples):\r\n\t# define the dataset\r\n\tX, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)\r\n\treturn X, y<\/pre>\n<p>Next, we need a function to evaluate a model on a loaded dataset.<\/p>\n<p>We will define a function that takes a dataset and returns a summary of the performance of the model evaluated using the test harness on the dataset.<\/p>\n<p>This function is listed below, taking the input and output elements of a dataset and returning the mean and standard deviation of the decision tree model on the dataset.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y):\r\n\t# define model evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define model\r\n\tmodel = DecisionTreeClassifier()\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\t# return summary stats\r\n\treturn [scores.mean(), scores.std()]<\/pre>\n<p>Next, we can define a range of different dataset sizes to evaluate.<\/p>\n<p>The sizes should be chosen proportional to the amount of data you have available and the amount of running time you are willing to expend.<\/p>\n<p>In this case, we will keep the sizes modest to limit running time, from 50 to one million rows on a rough log10 scale.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define number of samples to consider\r\nsizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]<\/pre>\n<p>Next, we can enumerate each dataset size, create the dataset, evaluate a model 
on the dataset, and store the results for later analysis.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# evaluate each number of samples\r\nmeans, stds = list(), list()\r\nfor n_samples in sizes:\r\n\t# get a dataset\r\n\tX, y = load_dataset(n_samples)\r\n\t# evaluate a model on this dataset size\r\n\tmean, std = evaluate_model(X, y)\r\n\t# store\r\n\tmeans.append(mean)\r\n\tstds.append(std)<\/pre>\n<p>Next, we can summarize the relationship between the dataset size and model performance.<\/p>\n<p>In this case, we will simply plot the result with error bars so we can spot any trends visually.<\/p>\n<p>We will use the standard deviation as a measure of uncertainty on the estimated model performance. This can be achieved by multiplying the value by 2 to cover approximately 95% of the expected performance if the performance follows a normal distribution.<\/p>\n<p>This can be shown on the plot as an error bar around the mean expected performance for a dataset size.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define error bar as 2 standard deviations from the mean or 95%\r\nerr = [min(1, s * 2) for s in stds]\r\n# plot dataset size vs mean performance with error bars\r\npyplot.errorbar(sizes, means, yerr=err, fmt='-o')<\/pre>\n<p>To make the plot more readable, we can change the scale of the x-axis to log, given that our dataset sizes are on a rough log10 scale.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# change the scale of the x-axis to log\r\nax = pyplot.gca()\r\nax.set_xscale(\"log\", nonpositive='clip')\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>And that\u2019s it.<\/p>\n<p>We would generally expect mean model performance to increase with dataset size. 
We would also expect the uncertainty in model performance to decrease with dataset size.<\/p>\n<p>Tying this all together, the complete example of performing a sensitivity analysis of dataset size on model performance is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># sensitivity analysis of model performance to dataset size\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom matplotlib import pyplot\r\n\r\n# load dataset\r\ndef load_dataset(n_samples):\r\n\t# define the dataset\r\n\tX, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y):\r\n\t# define model evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define model\r\n\tmodel = DecisionTreeClassifier()\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\t# return summary stats\r\n\treturn [scores.mean(), scores.std()]\r\n\r\n# define number of samples to consider\r\nsizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]\r\n# evaluate each number of samples\r\nmeans, stds = list(), list()\r\nfor n_samples in sizes:\r\n\t# get a dataset\r\n\tX, y = load_dataset(n_samples)\r\n\t# evaluate a model on this dataset size\r\n\tmean, std = evaluate_model(X, y)\r\n\t# store\r\n\tmeans.append(mean)\r\n\tstds.append(std)\r\n\t# summarize performance\r\n\tprint('&gt;%d: %.3f (%.3f)' % (n_samples, mean, std))\r\n# define error bar as 2 standard deviations from the mean or 95%\r\nerr = [min(1, s * 2) for s in stds]\r\n# plot dataset size vs mean performance with error bars\r\npyplot.errorbar(sizes, means, yerr=err, fmt='-o')\r\n# change the scale of the x-axis to 
log\r\nax = pyplot.gca()\r\nax.set_xscale(\"log\", nonpositive='clip')\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example reports the estimated model performance for each dataset size along the way.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.<\/p>\n<p>In this case, we can see the expected trend of increasing mean model performance with dataset size and decreasing model variance, measured using the standard deviation of classification accuracy.<\/p>\n<p>We can see that there may be a point of diminishing returns in estimating model performance at around 10,000 to 50,000 rows.<\/p>\n<p>Specifically, we do see an improvement in performance with more rows, but we can probably capture this relationship with little variance with 10K or 50K rows of data.<\/p>\n<p>We can also see a slight drop-off in estimated performance with 1,000,000 rows of data (0.936 vs. 0.938 at 500,000 rows), suggesting that we are probably maxing out the capability of the model at around 500,000 rows and are instead measuring statistical noise in the estimate.<\/p>\n<p>This might indicate an upper bound on expected performance, and that more data beyond this point is unlikely to improve the specific model and configuration on the chosen test harness.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">&gt;50: 0.673 (0.141)\r\n&gt;100: 0.703 (0.135)\r\n&gt;500: 0.809 (0.055)\r\n&gt;1000: 0.826 (0.044)\r\n&gt;5000: 0.835 (0.016)\r\n&gt;10000: 0.866 (0.011)\r\n&gt;50000: 0.900 (0.005)\r\n&gt;100000: 0.912 (0.003)\r\n&gt;500000: 0.938 (0.001)\r\n&gt;1000000: 0.936 (0.001)<\/pre>\n<p>The plot makes the relationship between dataset size and estimated model performance much clearer.<\/p>\n<p>The relationship is nearly linear 
with the log of the dataset size. The uncertainty, shown by the error bars, also decreases dramatically on the plot, from very large values with 50 or 100 samples to modest values with 5,000 and 10,000 samples, and it practically disappears beyond these sizes.<\/p>\n<p>Given the modest spread with 5,000 and 10,000 samples and the practically log-linear relationship, we could probably get away with using 5K or 10K rows to approximate model performance.<\/p>\n<div id=\"attachment_12215\" style=\"width: 1290px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12215\" loading=\"lazy\" class=\"size-full wp-image-12215\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/02\/Line-Plot-with-Error-Bars-of-Dataset-Size-vs-Model-Performance.png\" alt=\"Line Plot With Error Bars of Dataset Size vs. Model Performance\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/02\/Line-Plot-with-Error-Bars-of-Dataset-Size-vs-Model-Performance.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/02\/Line-Plot-with-Error-Bars-of-Dataset-Size-vs-Model-Performance-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/02\/Line-Plot-with-Error-Bars-of-Dataset-Size-vs-Model-Performance-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/02\/Line-Plot-with-Error-Bars-of-Dataset-Size-vs-Model-Performance-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-12215\" class=\"wp-caption-text\">Line Plot With Error Bars of Dataset Size vs. 
Model Performance<\/p>\n<\/div>\n<p>We could use these findings as the basis for testing additional model configurations and even different model types.<\/p>\n<p>The danger is that different models may perform very differently with more or less data and it may be wise to repeat the sensitivity analysis with a different chosen model to confirm the relationship holds. Alternately, it may be interesting to repeat the analysis with a suite of different model types.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/sensitivity-analysis-history-size-forecast-skill-arima-python\/\">Sensitivity Analysis of History Size to Forecast Skill with ARIMA in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/much-training-data-required-machine-learning\/\">How Much Training Data is Required for Machine Learning?<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">sklearn.datasets.make_classification API<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">sklearn.tree.DecisionTreeClassifier API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Sensitivity_analysis\">Sensitivity analysis, Wikipedia<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/68%E2%80%9395%E2%80%9399.7_rule\">68\u201395\u201399.7 rule, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to perform a sensitivity analysis of dataset size vs. 
model performance.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Selecting a dataset size for machine learning is a challenging open problem.<\/li>\n<li>Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.<\/li>\n<li>How to perform a sensitivity analysis of dataset size and interpret the results.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/sensitivity-analysis-of-dataset-size-vs-model-performance\/\">Sensitivity Analysis of Dataset Size vs. Model Performance<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/sensitivity-analysis-of-dataset-size-vs-model-performance\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Machine learning model performance often improves with dataset size for predictive modeling. 
This depends on the specific datasets and on the choice [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/23\/sensitivity-analysis-of-dataset-size-vs-model-performance\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":4429,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4428"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4428"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4428\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/4429"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}