{"id":1184,"date":"2018-10-18T18:00:13","date_gmt":"2018-10-18T18:00:13","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/18\/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting\/"},"modified":"2018-10-18T18:00:13","modified_gmt":"2018-10-18T18:00:13","slug":"how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/18\/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting\/","title":{"rendered":"How to Develop Machine Learning Models for Multivariate Multi-Step Air Pollution Time Series Forecasting"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.<\/p>\n<p>The EMC Data Science Global Hackathon dataset, or the \u2018Air Quality Prediction\u2019 dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.<\/p>\n<p>Machine learning algorithms can be applied to time series forecasting problems and offer benefits such as the ability to handle multiple input variables with noisy complex dependencies.<\/p>\n<p>In this tutorial, you will discover how to develop machine learning models for multi-step time series forecasting of air pollution data.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to impute missing values and transform time series data so that it can be modeled by supervised learning algorithms.<\/li>\n<li>How to develop and evaluate a suite of linear algorithms for multi-step time series 
forecasting.<\/li>\n<li>How to develop and evaluate a suite of nonlinear algorithms for multi-step time series forecasting.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6337\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6337\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/How-to-Develop-Machine-Learning-Models-for-Multivariate-Multi-Step-Air-Pollution-Time-Series-Forecasting.jpg\" alt=\"How to Develop Machine Learning Models for Multivariate Multi-Step Air Pollution Time Series Forecasting\" width=\"640\" height=\"480\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/How-to-Develop-Machine-Learning-Models-for-Multivariate-Multi-Step-Air-Pollution-Time-Series-Forecasting.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/How-to-Develop-Machine-Learning-Models-for-Multivariate-Multi-Step-Air-Pollution-Time-Series-Forecasting-300x225.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Develop Machine Learning Models for Multivariate Multi-Step Air Pollution Time Series Forecasting<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/akeg\/376488289\/\">Eric Schmuttenmaer<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into eight parts; they are:<\/p>\n<ol>\n<li>Problem Description<\/li>\n<li>Model Evaluation<\/li>\n<li>Machine Learning Modeling<\/li>\n<li>Machine Learning Data Preparation<\/li>\n<li>Model Evaluation Test Harness<\/li>\n<li>Evaluate Linear Algorithms<\/li>\n<li>Evaluate Nonlinear Algorithms<\/li>\n<li>Tune Lag Size<\/li>\n<\/ol>\n<h2>Problem Description<\/h2>\n<p>The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent 
three days.<\/p>\n<p>Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next three days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72-hour forecast period. They are:<\/p>\n<pre class=\"crayon-plain-tag\">+1, +2, +3, +4, +5, +10, +17, +24, +48, +72<\/pre>\n<p>Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.<\/p>\n<p>Not all observations are available at all sites or chunks, and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.<\/p>\n<p>The dataset was used as the basis for a <a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\">short-duration machine learning competition<\/a> (or hackathon) on the Kaggle website in 2012.<\/p>\n<p>Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).<\/p>\n<p>A winning entrant achieved a MAE of 0.21058 on the withheld test set (<a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/leaderboard\">private leaderboard<\/a>) using random forest on lagged observations. 
A writeup of this solution is available in the post:<\/p>\n<ul>\n<li><a href=\"http:\/\/blog.kaggle.com\/2012\/05\/01\/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon\/\">Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon<\/a>, 2012.<\/li>\n<\/ul>\n<p>In this tutorial, we will explore how to develop machine learning models for the problem and determine whether they have skill relative to a naive baseline.<\/p>\n<h2>Model Evaluation<\/h2>\n<p>Before we can evaluate machine learning methods, we must develop a test harness.<\/p>\n<p>This includes at least how the data will be prepared and how forecasts will be evaluated.<\/p>\n<h3>Load Dataset<\/h3>\n<p>The first step 
is to download the dataset and load it into memory.<\/p>\n<p>The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in before you can download the dataset.<\/p>\n<p>Download the entire dataset (e.g. \u201c<em>Download All<\/em>\u201d) to your workstation and unzip the archive in your current working directory into a folder named \u2018<em>AirQualityPrediction<\/em>\u2018.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/data\">EMC Data Science Global Hackathon (Air Quality Prediction) Data<\/a><\/li>\n<\/ul>\n<p>Our focus will be the \u2018<em>TrainingData.csv<\/em>\u2018 file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.<\/p>\n<p>We can load the data file into memory using the Pandas <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.read_csv.html\">read_csv() function<\/a> and specify the header row on line 0.<\/p>\n<pre class=\"crayon-plain-tag\"># load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)<\/pre>\n<p>We can group data by the \u2018chunkID\u2019 variable (column index 1).<\/p>\n<p>First, let\u2019s get a list of the unique chunk identifiers.<\/p>\n<pre class=\"crayon-plain-tag\">chunk_ids = unique(values[:, 1])<\/pre>\n<p>We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.<\/p>\n<pre class=\"crayon-plain-tag\">chunks = dict()\r\n# group rows by chunk id\r\nfor chunk_id in chunk_ids:\r\n\tselection = values[:, 1] == chunk_id\r\n\tchunks[chunk_id] = values[selection, :]<\/pre>\n<p>The <em>to_chunks()<\/em> function below takes a NumPy array of the loaded data and returns a dictionary of <em>chunk_id<\/em> to rows for the chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># split the dataset by 'chunkID', return a dict of id to rows\r\ndef 
to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks<\/pre>\n<p>The complete example that loads the dataset and splits it into chunks is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load data and split into chunks\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n\r\n# split the dataset by 'chunkID', return a dict of id to rows\r\ndef to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks\r\n\r\n# load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)\r\n# group data by chunks\r\nvalues = dataset.values\r\nchunks = to_chunks(values)\r\nprint('Total Chunks: %d' % len(chunks))<\/pre>\n<p>Running the example prints the number of chunks in the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Total Chunks: 208<\/pre>\n<h3>Data Preparation<\/h3>\n<p>Now that we know how to load the data and split it into chunks, we can separate it into train and test datasets.<\/p>\n<p>Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.<\/p>\n<p>We can split each chunk into the first five days of observations for training and the last three for test.<\/p>\n<p>Each observation has a column called \u2018<em>position_within_chunk<\/em>\u2018 that varies from 1 to 192 (8 days * 24 hours). 
We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any rows with values greater than 120 as test data.<\/p>\n<p>Further, any chunks that don\u2019t have any observations in the train or test split can be dropped as not viable.<\/p>\n<p>For modeling, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data so that the train and test data are comprised only of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.<\/p>\n<p>The <em>split_train_test()<\/em> function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.<\/p>\n<pre class=\"crayon-plain-tag\"># split each chunk into train\/test sets\r\ndef split_train_test(chunks, row_in_chunk_ix=2):\r\n\ttrain, test = list(), list()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\t# enumerate chunks\r\n\tfor k,rows in chunks.items():\r\n\t\t# split chunk rows by 'position_within_chunk'\r\n\t\ttrain_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]\r\n\t\ttest_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]\r\n\t\tif len(train_rows) == 0 or len(test_rows) == 0:\r\n\t\t\tprint('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))\r\n\t\t\tcontinue\r\n\t\t# store with chunk id, position in chunk, hour and all targets\r\n\t\tindices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]\r\n\t\ttrain.append(train_rows[:, indices])\r\n\t\ttest.append(test_rows[:, indices])\r\n\treturn train, test<\/pre>\n<p>We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three-day period, specifically the lead times:<\/p>\n<pre class=\"crayon-plain-tag\">+1, +2, +3, +4, +5, +10, +17, +24, +48, +72<\/pre>\n<p>Each lead time is relative 
to the end of the training period.<\/p>\n<p>First, we can put these lead times into a function for easy reference:<\/p>\n<pre class=\"crayon-plain-tag\"># return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]<\/pre>\n<p>Next, we can reduce the test dataset to just the data at the preferred lead times.<\/p>\n<p>We can do that by looking at the \u2018<em>position_within_chunk<\/em>\u2018 column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 + 2, etc.<\/p>\n<p>If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.<\/p>\n<p>The function <em>to_forecasts()<\/em> below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># convert the rows in a test chunk to forecasts\r\ndef to_forecasts(test_chunks, row_in_chunk_ix=1):\r\n\t# get lead times\r\n\tlead_times = get_lead_times()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\tforecasts = list()\r\n\t# enumerate each chunk\r\n\tfor rows in test_chunks:\r\n\t\tchunk_id = rows[0, 0]\r\n\t\t# enumerate each lead time\r\n\t\tfor tau in lead_times:\r\n\t\t\t# determine the row in chunk we want for the lead time\r\n\t\t\toffset = cut_point + tau\r\n\t\t\t# retrieve data for the lead time using row number in chunk\r\n\t\t\trow_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]\r\n\t\t\t# check if we have data\r\n\t\t\tif len(row_for_tau) == 0:\r\n\t\t\t\t# create a mock row [chunk, position, hour] + [nan...]\r\n\t\t\t\trow = [chunk_id, offset, nan] + [nan for _ in range(39)]\r\n\t\t\t\tforecasts.append(row)\r\n\t\t\telse:\r\n\t\t\t\t# store the forecast row\r\n\t\t\t\tforecasts.append(row_for_tau[0])\r\n\treturn array(forecasts)<\/pre>\n<p>We can tie all of this together and split the dataset into train and test sets and save the results to new 
files.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># split data into train and test sets\r\nfrom numpy import unique\r\nfrom numpy import nan\r\nfrom numpy import array\r\nfrom numpy import savetxt\r\nfrom pandas import read_csv\r\n\r\n# split the dataset by 'chunkID', return a dict of id to rows\r\ndef to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks\r\n\r\n# split each chunk into train\/test sets\r\ndef split_train_test(chunks, row_in_chunk_ix=2):\r\n\ttrain, test = list(), list()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\t# enumerate chunks\r\n\tfor k,rows in chunks.items():\r\n\t\t# split chunk rows by 'position_within_chunk'\r\n\t\ttrain_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]\r\n\t\ttest_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]\r\n\t\tif len(train_rows) == 0 or len(test_rows) == 0:\r\n\t\t\tprint('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))\r\n\t\t\tcontinue\r\n\t\t# store with chunk id, position in chunk, hour and all targets\r\n\t\tindices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]\r\n\t\ttrain.append(train_rows[:, indices])\r\n\t\ttest.append(test_rows[:, indices])\r\n\treturn train, test\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2 ,3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# convert the rows in a test chunk to forecasts\r\ndef to_forecasts(test_chunks, row_in_chunk_ix=1):\r\n\t# get lead times\r\n\tlead_times = get_lead_times()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\tforecasts = list()\r\n\t# enumerate each chunk\r\n\tfor rows in test_chunks:\r\n\t\tchunk_id 
= rows[0, 0]\r\n\t\t# enumerate each lead time\r\n\t\tfor tau in lead_times:\r\n\t\t\t# determine the row in chunk we want for the lead time\r\n\t\t\toffset = cut_point + tau\r\n\t\t\t# retrieve data for the lead time using row number in chunk\r\n\t\t\trow_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]\r\n\t\t\t# check if we have data\r\n\t\t\tif len(row_for_tau) == 0:\r\n\t\t\t\t# create a mock row [chunk, position, hour] + [nan...]\r\n\t\t\t\trow = [chunk_id, offset, nan] + [nan for _ in range(39)]\r\n\t\t\t\tforecasts.append(row)\r\n\t\t\telse:\r\n\t\t\t\t# store the forecast row\r\n\t\t\t\tforecasts.append(row_for_tau[0])\r\n\treturn array(forecasts)\r\n\r\n# load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)\r\n# group data by chunks\r\nvalues = dataset.values\r\nchunks = to_chunks(values)\r\n# split into train\/test\r\ntrain, test = split_train_test(chunks)\r\n# flatten training chunks to rows\r\ntrain_rows = array([row for rows in train for row in rows])\r\n# print(train_rows.shape)\r\nprint('Train Rows: %s' % str(train_rows.shape))\r\n# reduce test to forecast lead times only\r\ntest_rows = to_forecasts(test)\r\nprint('Test Rows: %s' % str(test_rows.shape))\r\n# save datasets\r\nsavetxt('AirQualityPrediction\/naive_train.csv', train_rows, delimiter=',')\r\nsavetxt('AirQualityPrediction\/naive_test.csv', test_rows, delimiter=',')<\/pre>\n<p>Running the example first reports that chunk 69 was removed from the dataset for having insufficient data.<\/p>\n<p>We can then see that we have 42 columns in each of the train and test sets: one each for the chunk id, position within chunk, and hour of day, plus the 39 target variables.<\/p>\n<p>We can also see the dramatically smaller version of the test dataset with rows only at the forecast lead times.<\/p>\n<p>The new train and test datasets are saved in the \u2018<em>naive_train.csv<\/em>\u2018 and \u2018<em>naive_test.csv<\/em>\u2018 files respectively.<\/p>\n<pre 
class=\"crayon-plain-tag\">>dropping chunk=69: train=(0, 95), test=(28, 95)\r\nTrain Rows: (23514, 42)\r\nTest Rows: (2070, 42)<\/pre>\n<h3>Forecast Evaluation<\/h3>\n<p>Once forecasts have been made, they need to be evaluated.<\/p>\n<p>It is helpful to have a simpler format when evaluating forecasts. Here, we will use the three-dimensional structure of <em>[chunks][variables][time]<\/em>, where variable is the target variable number from 0 to 38 and time is the lead time index from 0 to 9.<\/p>\n<p>Models are expected to make predictions in this format.<\/p>\n<p>We can also restructure the test dataset into this format for comparison. The <em>prepare_test_forecasts()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)<\/pre>\n<p>We will evaluate a model using the mean absolute error, or MAE. This is the metric that was used in the competition and is a sensible choice given the non-Gaussian distribution of the target variables.<\/p>\n<p>If a lead time contains no data in the test set (e.g. <em>NaN<\/em>), then no error will be calculated for that forecast. If the lead time does have data in the test set but no data in the forecast, then the full magnitude of the observation will be taken as error. 
Finally, if the test set has an observation and a forecast was made, then the absolute difference will be recorded as the error.<\/p>\n<p>The <em>calculate_error()<\/em> function implements these rules and returns the error for a given forecast.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)<\/pre>\n<p>Errors are summed across all chunks and all lead times, then averaged.<\/p>\n<p>The overall MAE will be calculated, but we will also calculate a MAE for each forecast lead time. This can help with model selection, as some models may perform differently at different lead times.<\/p>\n<p>The <em>evaluate_forecasts()<\/em> function below implements this, calculating the MAE and per-lead time MAE for the provided predictions and expected values in <em>[chunk][variable][time]<\/em> format.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# convert to forecasts\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror = calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 
1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae<\/pre>\n<p>Once we have the evaluation of a model, we can present it.<\/p>\n<p>The <em>summarize_error()<\/em> function below first prints a one-line summary of a model\u2019s performance then creates a plot of MAE per forecast lead time.<\/p>\n<pre class=\"crayon-plain-tag\"># summarize scores\r\ndef summarize_error(name, total_mae, times_mae):\r\n\t# print summary\r\n\tlead_times = get_lead_times()\r\n\tformatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]\r\n\ts_scores = ', '.join(formatted)\r\n\tprint('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))\r\n\t# plot summary\r\n\tpyplot.plot([str(x) for x in lead_times], times_mae, marker='.')\r\n\tpyplot.show()<\/pre>\n<p>We are now ready to start exploring the performance of machine learning methods.<\/p>\n<h2>Machine Learning Modeling<\/h2>\n<p>The problem can be modeled with machine learning.<\/p>\n<p>Most machine learning models do not directly support the notion of observations over time. Instead, the lag observations must be treated as input features in order to make predictions.<\/p>\n<p>This points to a benefit of machine learning algorithms for time series forecasting: they are able to support large numbers of input features. These could be lag observations for one or multiple input time series.<\/p>\n<p>Other general benefits of machine learning algorithms for time series forecasting over classical methods include:<\/p>\n<ul>\n<li>Ability to support noisy features and noise in the relationships between variables.<\/li>\n<li>Ability to handle irrelevant features.<\/li>\n<li>Ability to support complex relationships between variables.<\/li>\n<\/ul>\n<p>A challenge with this dataset is the need to make multi-step forecasts. 
There are two main approaches by which machine learning methods can be used to make multi-step forecasts; they are:<\/p>\n<ul>\n<li><strong>Direct<\/strong>. A separate model is developed to forecast each forecast lead time.<\/li>\n<li><strong>Recursive<\/strong>. A single model is developed to make one-step forecasts, and the model is used recursively where prior forecasts are used as input to forecast the subsequent lead time.<\/li>\n<\/ul>\n<p>The recursive approach can make sense when forecasting a short contiguous block of lead times, whereas the direct approach may make more sense when forecasting discontiguous lead times. The direct approach may be more appropriate for the air pollution forecast problem given that we are interested in forecasting a mixture of 10 contiguous and discontiguous lead times over a three-day period.<\/p>\n<p>The dataset has 39 target variables, and we develop one model per target variable, per forecast lead time. That means that we require 39 * 10 = 390 machine learning models.<\/p>\n<p>Key to the use of machine learning algorithms for time series forecasting is the choice of input data. We can think about three main sources of data that can be used as input and mapped to each forecast lead time for a target variable; they are:<\/p>\n<ul>\n<li><strong>Univariate data<\/strong>, e.g. lag observations from the target variable that is being forecasted.<\/li>\n<li><strong>Multivariate data<\/strong>, e.g. lag observations from other variables (weather and targets).<\/li>\n<li><strong>Metadata<\/strong>, e.g. data about the date or time being forecast.<\/li>\n<\/ul>\n<p>Data can be drawn from across all chunks, providing a rich dataset for learning a mapping from inputs to the target forecast lead time.<\/p>\n<p>The 39 target variables are actually comprised of 12 distinct variables measured across 14 sites; not every variable is measured at every site.<\/p>\n<p>Because of the way the data is provided, the default approach to modeling is to treat each variable-site as independent. 
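As a concrete illustration, the direct strategy described above can be sketched on synthetic data. This is a minimal sketch only: the simple least squares learner, the synthetic series, and names such as models and last_obs are illustrative assumptions, not part of the tutorial code.

```python
# Sketch of the direct multi-step strategy: one linear model per lead time.
# Illustrative only; synthetic data and a least squares fit as a stand-in.
from numpy import array, ones, column_stack
from numpy.linalg import lstsq

series = [float(i) for i in range(100)]
n_lag = 2
lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

models = dict()
for lead_time in lead_times:
	# build input/output patterns for this lead time
	X, y = list(), list()
	for i in range(n_lag, len(series) - lead_time + 1):
		X.append(series[i-n_lag:i])
		y.append(series[i + lead_time - 1])
	# fit a separate linear model (with a bias column) for this lead time
	A = column_stack([array(X), ones(len(X))])
	coef, _, _, _ = lstsq(A, array(y), rcond=None)
	models[lead_time] = coef

# forecast all 10 lead times from the last n_lag observations
last_obs = array(series[-n_lag:] + [1.0])
forecast = [float(last_obs.dot(models[t])) for t in lead_times]
```

Note that no forecast is fed back in as an input; each of the 10 lead times is predicted independently from the same lag observations, which is why the discontiguous lead times (+10, +17, +48, and so on) pose no special difficulty.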
It may be possible to collapse data by variable and use the same models for a variable across multiple sites.<\/p>\n<p>Some variables have been purposely mislabeled (e.g. different underlying data given the same variable identifier). Nevertheless, perhaps these mislabeled variables can be identified and excluded from multi-site models.<\/p>\n<h2>Machine Learning Data Preparation<\/h2>\n<p>Before we can explore machine learning models of this dataset, we must prepare the data in such a way that we can fit models.<\/p>\n<p>This requires two data preparation steps:<\/p>\n<ul>\n<li>Handling missing data.<\/li>\n<li>Preparing input-output patterns.<\/li>\n<\/ul>\n<p>For now, we will focus on the 39 target variables and ignore the meteorological data and metadata.<\/p>\n<h3>Handle Missing Data<\/h3>\n<p>Chunks are comprised of five days or fewer of hourly observations for 39 target variables.<\/p>\n<p>Many of the chunks do not have all five days of data, and none of the chunks have data for all 39 target variables.<\/p>\n<p>In those cases where a chunk has no data for a target variable, a forecast is not required.<\/p>\n<p>In those cases where a chunk does have some data for a target variable, but not all five days\u2019 worth, there will be gaps in the series. These gaps may be a few hours to over a day of observations in length, sometimes even longer.<\/p>\n<p>Three candidate strategies for dealing with these gaps are as follows:<\/p>\n<ul>\n<li>Ignore the gaps.<\/li>\n<li>Use data without gaps.<\/li>\n<li>Fill the gaps.<\/li>\n<\/ul>\n<p>We could ignore the gaps. A problem with this would be that the data would not be contiguous when splitting it into inputs and outputs. When training a model, the inputs will not be consistent, but could mean the last <em>n<\/em> hours of data, or data spread across the last <em>n<\/em> days. 
This inconsistency will make learning a mapping from inputs to outputs very noisy and perhaps more difficult for the model than it needs to be.<\/p>\n<p>We could use only the data without gaps. This is a good option. A risk is that we may not have enough data with which to fit a model.<\/p>\n<p>Finally, we could fill the gaps. This is called data imputation and there are many strategies that could be used to fill the gaps. Three methods that may perform well include:<\/p>\n<ul>\n<li>Persist the last observed value forward.<\/li>\n<li>Use the median value for the hour of day within the chunk.<\/li>\n<li>Use the median value for the hour of day across chunks.<\/li>\n<\/ul>\n<p>In this tutorial, we will use the third approach and fill the gaps using the median value for the hour of day across chunks. After a little testing, this method seemed to result in more training samples and better model performance.<\/p>\n<p>For a given variable, there may be missing observations defined by missing rows. Specifically, each observation has a \u2018<em>position_within_chunk<\/em>\u2018. We expect each chunk in the training dataset to have 120 observations, with \u2018<em>position_within_chunk<\/em>\u2018 values from 1 to 120 inclusive.<\/p>\n<p>Therefore, we can create an array of 120 <em>NaN<\/em> values for each variable, mark all observations in the chunk using the \u2018<em>position_within_chunk<\/em>\u2018 values, and anything left will be marked <em>NaN<\/em>. 
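The chosen imputation strategy can be illustrated with a toy example before applying it to the real chunks. The sketch below is a deliberate simplification, with assumed shapes and names: three chunks of one variable, six hourly steps each, and gaps filled by the median value for the hour of day across chunks.

```python
# Toy illustration: fill gaps with the median for the hour of day across
# chunks (simplified shapes; not the tutorial's real data structures).
from numpy import array, isnan, nan, nanmedian

# rows are chunks of the same variable, columns are hours of day 0..5
chunks = array([
	[1.0, 2.0, nan, 4.0, 5.0, 6.0],
	[1.5, nan, 3.5, 4.5, nan, 6.5],
	[0.5, 2.5, 3.0, nan, 5.5, nan]])

imputed = chunks.copy()
for hour in range(chunks.shape[1]):
	# median of the observed values for this hour across all chunks
	value = nanmedian(chunks[:, hour])
	# fill any gaps at this hour with the cross-chunk median
	for i in range(chunks.shape[0]):
		if isnan(imputed[i, hour]):
			imputed[i, hour] = value
```

For example, the gap at hour 2 of the first chunk is filled with the median of 3.5 and 3.0, i.e. 3.25. The impute_missing() function developed later in this tutorial applies the same idea, with the extra bookkeeping needed to look up the hour of day for each missing time step.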
We can then plot each variable and look for gaps.<\/p>\n<p>The <em>variable_to_series()<\/em> function below will take the rows for a chunk and a given column index for the target variable and will return a series of 120 time steps for the variable with all available data marked with the value from the chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data<\/pre>\n<p>We need to calculate a parallel series of the hour of day for each chunk that we can use for imputing hour-specific data for each variable in the chunk.<\/p>\n<p>Given a series of partially filled hours of day, the <em>interpolate_hours()<\/em> function below will fill in the missing hours of day. 
It does this by finding the first marked hour, then counting forward, filling in the hour of day, then performing the same operation backwards.<\/p>\n<pre class=\"crayon-plain-tag\"># interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# cannot interpolate if no hours are known\r\n\tif ix == -1:\r\n\t\treturn\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24<\/pre>\n<p>We can call the same <em>variable_to_series()<\/em> function (above) to create the series of hours with missing values (column index 2), then call <em>interpolate_hours()<\/em> to fill in the gaps.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare sequence of hours for the chunk\r\nhours = variable_to_series(rows, 2)\r\n# interpolate hours\r\ninterpolate_hours(hours)<\/pre>\n<p>We can then pass the hours to any impute function that may make use of them.<\/p>\n<p>We can now try filling in missing values in a chunk with values within the same series with the same hour. Specifically, we will find all rows with the same hour on the series and calculate the median value.<\/p>\n<p>The <em>impute_missing()<\/em> function below takes all of the rows in a chunk, the prepared sequence of hours of day for the chunk, the series with missing values for a variable, and the column index for that variable.<\/p>\n<p>It first checks to see if the series is all missing data and returns immediately if this is the case, as no impute can be performed. 
It then enumerates over the time steps of the series and when it detects a time step with no data, it collects all rows in the series with data for the same hour and calculates the median value.<\/p>\n<pre class=\"crayon-plain-tag\"># impute missing data\r\ndef impute_missing(train_chunks, rows, hours, series, col_ix):\r\n\t# impute missing using the median value for hour in all series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# collect all rows across all chunks for the hour\r\n\t\t\tall_rows = list()\r\n\t\t\tfor rows in train_chunks:\r\n\t\t\t\t[all_rows.append(row) for row in rows[rows[:,2]==hours[i]]]\r\n\t\t\t# calculate the central tendency for target\r\n\t\t\tall_rows = array(all_rows)\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(all_rows[:, col_ix])\r\n\t\t\tif isnan(value):\r\n\t\t\t\tvalue = 0.0\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed<\/pre>\n<\/p>\n<h3>Supervised Representation<\/h3>\n<p>We need to transform the series for each target variable into rows with inputs and outputs so that we can fit supervised machine learning algorithms.<\/p>\n<p>Specifically, we have a series, like:<\/p>\n<pre class=\"crayon-plain-tag\">[1, 2, 3, 4, 5, 6, 7, 8, 9]<\/pre>\n<p>When forecasting the lead time of +1 using 2 lag variables, we would split the series into input (<em>X<\/em>) and output (<em>y<\/em>) patterns as follows:<\/p>\n<pre class=\"crayon-plain-tag\">X,\t\t\ty\r\n1, 2,\t\t3\r\n2, 3,\t\t4\r\n3, 4,\t\t5\r\n4, 5,\t\t6\r\n5, 6,\t\t7\r\n6, 7,\t\t8\r\n7, 8,\t\t9<\/pre>\n<p>This first requires that we choose a number of lag observations to use as input. There is no right answer; instead, it is a good idea to test different numbers and see what works.<\/p>\n<p>We then must perform the splitting of the series into the supervised learning format for each of the 10 forecast lead times. 
For example, forecasting +24 with 2 lag observations might look like:<\/p>\n<pre class=\"crayon-plain-tag\">X,\t\t\ty\r\n1, 2,\t\t24<\/pre>\n<p>This process is then repeated for each of the 39 target variables.<\/p>\n<p>The patterns prepared for each lead time for each target variable can then be aggregated across chunks to provide a training dataset for a model.<\/p>\n<p>We must also prepare a test dataset. That is, input data (<em>X<\/em>) for each target variable for each chunk so that we can use it as input to forecast the lead times in the test dataset. If we chose a lag of 2, then the test dataset would be comprised of the last two observations for each target variable for each chunk. Pretty straightforward.<\/p>\n<p>We can start off by defining a function that will create input-output patterns for a given complete (imputed) series.<\/p>\n<p>The <em>supervised_for_lead_time()<\/em> function below will take a series, a number of lag observations to use as input, and a forecast lead time to predict, then will return a list of input\/output rows drawn from the series.<\/p>\n<pre class=\"crayon-plain-tag\"># create input\/output patterns from a sequence\r\ndef supervised_for_lead_time(series, n_lag, lead_time):\r\n\tsamples = list()\r\n\t# enumerate observations and create input\/output patterns\r\n\tfor i in range(n_lag, len(series)):\r\n\t\tend_ix = i + (lead_time - 1)\r\n\t\t# check if can create a pattern\r\n\t\tif end_ix >= len(series):\r\n\t\t\tbreak\r\n\t\t# retrieve input and output\r\n\t\tstart_ix = i - n_lag\r\n\t\trow = series[start_ix:i] + [series[end_ix]]\r\n\t\tsamples.append(row)\r\n\treturn samples<\/pre>\n<p>It is important to understand this piece.<\/p>\n<p>We can test this function and explore different numbers of lag variables and forecast lead times on a small test dataset.<\/p>\n<p>Below is a complete example that generates a series of 20 integers and transforms it into a supervised format with two input lags and a forecast lead time of +6.<\/p>\n<pre 
class=\"crayon-plain-tag\"># test supervised to input\/output patterns\r\nfrom numpy import array\r\n\r\n# created input\/output patterns from a sequence\r\ndef supervised_for_lead_time(series, n_lag, lead_time):\r\n\tdata = list()\r\n\t# enumerate observations and create input\/output patterns\r\n\tfor i in range(n_lag, len(series)):\r\n\t\tend_ix = i + (lead_time - 1)\r\n\t\t# check if can create a pattern\r\n\t\tif end_ix >= len(series):\r\n\t\t\tbreak\r\n\t\t# retrieve input and output\r\n\t\tstart_ix = i - n_lag\r\n\t\trow = series[start_ix:i] + [series[end_ix]]\r\n\t\tdata.append(row)\r\n\treturn array(data)\r\n\r\n# define test dataset\r\ndata = [x for x in range(20)]\r\n# convert to supervised format\r\nresult = supervised_for_lead_time(data, 2, 6)\r\n# display result\r\nprint(result)<\/pre>\n<p>Running the example prints the resulting patterns showing lag observations and their associated forecast lead time.<\/p>\n<p>Experiment with this example to get comfortable with this data transform as it is key to modeling time series using machine learning algorithms.<\/p>\n<pre class=\"crayon-plain-tag\">[[ 0  1  7]\r\n [ 1  2  8]\r\n [ 2  3  9]\r\n [ 3  4 10]\r\n [ 4  5 11]\r\n [ 5  6 12]\r\n [ 6  7 13]\r\n [ 7  8 14]\r\n [ 8  9 15]\r\n [ 9 10 16]\r\n [10 11 17]\r\n [11 12 18]\r\n [12 13 19]]<\/pre>\n<p>We can now call <em>supervised_for_lead_time()<\/em> for each forecast lead time for a given target variable series.<\/p>\n<p>The <em>target_to_supervised()<\/em> function below implements this. First the target variable is converted into a series and imputed using the functions developed in the previous section. Then training samples are created for each target lead time. 
A test sample for the target variable is also created.<\/p>\n<p>Both the training data for each forecast lead time and the test input data are then returned for this target variable.<\/p>\n<pre class=\"crayon-plain-tag\"># create supervised learning data for each lead time for this target\r\ndef target_to_supervised(chunks, rows, hours, col_ix, n_lag):\r\n\ttrain_lead_times = list()\r\n\t# get series\r\n\tseries = variable_to_series(rows, col_ix)\r\n\tif not has_data(series):\r\n\t\treturn None, [nan for _ in range(n_lag)]\r\n\t# impute\r\n\timputed = impute_missing(chunks, rows, hours, series, col_ix)\r\n\t# prepare test sample for chunk-variable\r\n\ttest_sample = array(imputed[-n_lag:])\r\n\t# enumerate lead times\r\n\tlead_times = get_lead_times()\r\n\tfor lead_time in lead_times:\r\n\t\t# make input\/output data from series\r\n\t\ttrain_samples = supervised_for_lead_time(imputed, n_lag, lead_time)\r\n\t\ttrain_lead_times.append(train_samples)\r\n\treturn train_lead_times, test_sample<\/pre>\n<p>We have the pieces; we now need to define the function to drive the data preparation process.<\/p>\n<p>This function builds up the train and test datasets.<\/p>\n<p>The approach is to enumerate each target variable and gather the training data for each lead time from across all of the chunks. At the same time, we collect the samples required as input when making a prediction for the test dataset.<\/p>\n<p>The result is a training dataset that has the dimensions <em>[var][lead time][sample]<\/em> where the final dimension is the rows of training samples for a forecast lead time for a target variable. 
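<\/p>
<p>To make this indexing concrete, here is a tiny standalone mock-up of the structure (the numbers are invented, not from the dataset), showing how the rows for one variable and lead time are later split into inputs and outputs:<\/p>

```python
import numpy as np

# mock training structure: 2 variables, 3 lead times, each cell holding
# rows of [lag1, lag2, target] as built by the data preparation step
train_data = [[np.array([[1, 2, 3], [2, 3, 4]]) for _ in range(3)]
              for _ in range(2)]

# rows for variable 0 at the second lead time
data = train_data[0][1]
# the last column is the target; the preceding columns are lag inputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)  # (2, 2) (2,)
```

<p>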
The function also returns the test dataset with the dimensions <em>[chunk][var][sample]<\/em> where the final dimension is the input data for making a prediction for a target variable for a chunk.<\/p>\n<p>The <em>data_prep()<\/em> function below implements this behavior and takes the data in chunk format and a specified number of lag observations to use as input.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare training [var][lead time][sample] and test [chunk][var][sample]\r\ndef data_prep(chunks, n_lag, n_vars=39):\r\n\tlead_times = get_lead_times()\r\n\ttrain_data = [[list() for _ in range(len(lead_times))] for _ in range(n_vars)]\r\n\ttest_data = [[list() for _ in range(n_vars)] for _ in range(len(chunks))]\r\n\t# enumerate targets for chunk\r\n\tfor var in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + var\r\n\t\t# enumerate chunks to forecast\r\n\t\tfor c_id in range(len(chunks)):\r\n\t\t\trows = chunks[c_id]\r\n\t\t\t# prepare sequence of hours for the chunk\r\n\t\t\thours = variable_to_series(rows, 2)\r\n\t\t\t# interpolate hours\r\n\t\t\tinterpolate_hours(hours)\r\n\t\t\t# check for no data\r\n\t\t\tif not has_data(rows[:, col_ix]):\r\n\t\t\t\tcontinue\r\n\t\t\t# convert series into training data for each lead time\r\n\t\t\ttrain, test_sample = target_to_supervised(chunks, rows, hours, col_ix, n_lag)\r\n\t\t\t# store test sample for this var-chunk\r\n\t\t\ttest_data[c_id][var] = test_sample\r\n\t\t\tif train is not None:\r\n\t\t\t\t# store samples per lead time\r\n\t\t\t\tfor lead_time in range(len(lead_times)):\r\n\t\t\t\t\t# add all rows to the existing list of rows\r\n\t\t\t\t\ttrain_data[var][lead_time].extend(train[lead_time])\r\n\t\t# convert all rows for each var-lead time to a numpy array\r\n\t\tfor lead_time in range(len(lead_times)):\r\n\t\t\ttrain_data[var][lead_time] = array(train_data[var][lead_time])\r\n\treturn array(train_data), array(test_data)<\/pre>\n<\/p>\n<h3>Complete Example<\/h3>\n<p>We can tie 
everything together and prepare a train and test dataset with a supervised learning format for machine learning algorithms.<\/p>\n<p>We will use the prior 12 hours of lag observations as input when predicting each forecast lead time.<\/p>\n<p>The resulting train and test datasets are then saved as binary NumPy arrays.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare data\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import array\r\nfrom numpy import nanmedian\r\nfrom numpy import save\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# return true if the array has any non-nan values\r\ndef has_data(data):\r\n\treturn count_nonzero(isnan(data)) < len(data)\r\n\r\n# impute missing data\r\ndef impute_missing(train_chunks, rows, hours, 
series, col_ix):\r\n\t# impute missing using the median value for hour in all series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# collect all rows across all chunks for the hour\r\n\t\t\tall_rows = list()\r\n\t\t\tfor rows in train_chunks:\r\n\t\t\t\t[all_rows.append(row) for row in rows[rows[:,2]==hours[i]]]\r\n\t\t\t# calculate the central tendency for target\r\n\t\t\tall_rows = array(all_rows)\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(all_rows[:, col_ix])\r\n\t\t\tif isnan(value):\r\n\t\t\t\tvalue = 0.0\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed\r\n\r\n# layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# created input\/output patterns from a sequence\r\ndef supervised_for_lead_time(series, n_lag, lead_time):\r\n\tsamples = list()\r\n\t# enumerate observations and create input\/output patterns\r\n\tfor i in range(n_lag, len(series)):\r\n\t\tend_ix = i + (lead_time - 1)\r\n\t\t# check if can create a pattern\r\n\t\tif end_ix >= len(series):\r\n\t\t\tbreak\r\n\t\t# retrieve input and output\r\n\t\tstart_ix = i - n_lag\r\n\t\trow = series[start_ix:i] + [series[end_ix]]\r\n\t\tsamples.append(row)\r\n\treturn samples\r\n\r\n# create supervised learning data for each lead time for this target\r\ndef target_to_supervised(chunks, rows, hours, col_ix, n_lag):\r\n\ttrain_lead_times = list()\r\n\t# get series\r\n\tseries = variable_to_series(rows, col_ix)\r\n\tif not has_data(series):\r\n\t\treturn None, [nan for _ in range(n_lag)]\r\n\t# impute\r\n\timputed = 
impute_missing(chunks, rows, hours, series, col_ix)\r\n\t# prepare test sample for chunk-variable\r\n\ttest_sample = array(imputed[-n_lag:])\r\n\t# enumerate lead times\r\n\tlead_times = get_lead_times()\r\n\tfor lead_time in lead_times:\r\n\t\t# make input\/output data from series\r\n\t\ttrain_samples = supervised_for_lead_time(imputed, n_lag, lead_time)\r\n\t\ttrain_lead_times.append(train_samples)\r\n\treturn train_lead_times, test_sample\r\n\r\n# prepare training [var][lead time][sample] and test [chunk][var][sample]\r\ndef data_prep(chunks, n_lag, n_vars=39):\r\n\tlead_times = get_lead_times()\r\n\ttrain_data = [[list() for _ in range(len(lead_times))] for _ in range(n_vars)]\r\n\ttest_data = [[list() for _ in range(n_vars)] for _ in range(len(chunks))]\r\n\t# enumerate targets for chunk\r\n\tfor var in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + var\r\n\t\t# enumerate chunks to forecast\r\n\t\tfor c_id in range(len(chunks)):\r\n\t\t\trows = chunks[c_id]\r\n\t\t\t# prepare sequence of hours for the chunk\r\n\t\t\thours = variable_to_series(rows, 2)\r\n\t\t\t# interpolate hours\r\n\t\t\tinterpolate_hours(hours)\r\n\t\t\t# check for no data\r\n\t\t\tif not has_data(rows[:, col_ix]):\r\n\t\t\t\tcontinue\r\n\t\t\t# convert series into training data for each lead time\r\n\t\t\ttrain, test_sample = target_to_supervised(chunks, rows, hours, col_ix, n_lag)\r\n\t\t\t# store test sample for this var-chunk\r\n\t\t\ttest_data[c_id][var] = test_sample\r\n\t\t\tif train is not None:\r\n\t\t\t\t# store samples per lead time\r\n\t\t\t\tfor lead_time in range(len(lead_times)):\r\n\t\t\t\t\t# add all rows to the existing list of rows\r\n\t\t\t\t\ttrain_data[var][lead_time].extend(train[lead_time])\r\n\t\t# convert all rows for each var-lead time to a numpy array\r\n\t\tfor lead_time in range(len(lead_times)):\r\n\t\t\ttrain_data[var][lead_time] = array(train_data[var][lead_time])\r\n\treturn array(train_data), array(test_data)\r\n\r\n# 
load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\ntest = loadtxt('AirQualityPrediction\/naive_test.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\ntest_chunks = to_chunks(test)\r\n# convert training data into supervised learning data\r\nn_lag = 12\r\ntrain_data, test_data = data_prep(train_chunks, n_lag)\r\nprint(train_data.shape, test_data.shape)\r\n# save train and test sets to file\r\nsave('AirQualityPrediction\/supervised_train.npy', train_data)\r\nsave('AirQualityPrediction\/supervised_test.npy', test_data)<\/pre>\n<p>Running the example may take a minute.<\/p>\n<p>The result is two binary files containing the train and test datasets that we can load in the following sections for training and evaluating machine learning algorithms on the problem.<\/p>\n<h2>Model Evaluation Test Harness<\/h2>\n<p>Before we can start evaluating algorithms, we need some more elements of the test harness.<\/p>\n<p>First, we need to be able to fit a scikit-learn model on training data. The <em>fit_model()<\/em> function below will make a clone of the model configuration and fit it on the provided training data. We will need to fit many versions of each configured model (one for each target variable and forecast lead time), so this function will be called a lot.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a single model\r\ndef fit_model(model, X, y):\r\n\t# clone the model configuration\r\n\tlocal_model = clone(model)\r\n\t# fit the model\r\n\tlocal_model.fit(X, y)\r\n\treturn local_model<\/pre>\n<p>Next, we need to fit a model for each variable and forecast lead time combination.<\/p>\n<p>We can do this by enumerating the training dataset first by the variables and then by the lead times. 
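<\/p>
<p>As a quick standalone check of the clone-and-fit pattern used by <em>fit_model()<\/em>, the helper can be exercised on toy data (the data below is invented); note that cloning means the template model object itself is never fit:<\/p>

```python
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# fit a copy of the model configuration, leaving the template untouched
def fit_model(model, X, y):
    local_model = clone(model)
    local_model.fit(X, y)
    return local_model

# toy data following y = 2 * x
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
template = LinearRegression()
fit = fit_model(template, X, y)
print(round(float(fit.predict([[5]])[0])))  # 10
print(hasattr(template, 'coef_'))  # False, the template was never fit
```

<p>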
We can then fit a model and store it in a list of lists with the same structure, specifically: <em>[var][time][model]<\/em>.<\/p>\n<p>The <em>fit_models()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># fit one model for each variable and each forecast lead time [var][time][model]\r\ndef fit_models(model, train):\r\n\t# prepare structure for saving models\r\n\tmodels = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]\r\n\t# enumerate vars\r\n\tfor i in range(train.shape[0]):\r\n\t\t# enumerate lead times\r\n\t\tfor j in range(train.shape[1]):\r\n\t\t\t# get data\r\n\t\t\tdata = train[i, j]\r\n\t\t\tX, y = data[:, :-1], data[:, -1]\r\n\t\t\t# fit model\r\n\t\t\tlocal_model = fit_model(model, X, y)\r\n\t\t\tmodels[i][j].append(local_model)\r\n\treturn models<\/pre>\n<p>Fitting models is the slow part and could benefit from being parallelized, such as with the Joblib library. This is left as an extension.<\/p>\n<p>Once the models are fit, they can be used to make predictions for the test dataset.<\/p>\n<p>The prepared test dataset is organized first by chunk, and then by target variable. Making predictions is fast and involves first checking that a prediction can be made (we have input data) and if so, using the appropriate models for the target variable. 
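<\/p>
<p>Returning briefly to the parallelization note above: a minimal sketch of fitting models in parallel with Joblib might look like the following (this assumes the joblib library is installed, and uses toy data in place of the real per-variable, per-lead-time training structure):<\/p>

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# fit a copy of the model configuration on one dataset
def fit_model(model, X, y):
    local_model = clone(model)
    local_model.fit(X, y)
    return local_model

# toy (X, y) datasets standing in for the per-variable, per-lead-time data
datasets = [([[1], [2], [3]], [2, 4, 6]),
            ([[1], [2], [3]], [3, 6, 9])]
model = LinearRegression()
# dispatch one fit per dataset across all available CPU cores
fits = Parallel(n_jobs=-1)(delayed(fit_model)(model, X, y) for X, y in datasets)
print(len(fits))  # 2
```

<p>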
Each of the 10 forecast lead times for the variable will then be predicted with each of the direct models for those lead times.<\/p>\n<p>The <em>make_predictions()<\/em> function below implements this, taking the list of lists of models and the loaded test dataset as arguments and returning an array of forecasts with the structure <em>[chunks][var][time]<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\"># return forecasts as [chunks][var][time]\r\ndef make_predictions(models, test):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks\r\n\tfor i in range(test.shape[0]):\r\n\t\t# enumerate variables\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(test.shape[1]):\r\n\t\t\t# get the input pattern for this chunk and target\r\n\t\t\tpattern = test[i,j]\r\n\t\t\t# assume a nan forecast\r\n\t\t\tforecasts = array([nan for _ in range(len(lead_times))])\r\n\t\t\t# check we can make a forecast\r\n\t\t\tif has_data(pattern):\r\n\t\t\t\tpattern = pattern.reshape((1, len(pattern)))\r\n\t\t\t\t# forecast each lead time\r\n\t\t\t\tforecasts = list()\r\n\t\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t\tyhat = models[j][k][0].predict(pattern)\r\n\t\t\t\t\tforecasts.append(yhat[0])\r\n\t\t\t\tforecasts = array(forecasts)\r\n\t\t\t# save forecasts for each lead time for this variable\r\n\t\t\tchunk_predictions.append(forecasts)\r\n\t\t# save forecasts for this chunk\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)<\/pre>\n<p>We need a list of models to evaluate.<\/p>\n<p>We can define a generic <em>get_models()<\/em> function that is responsible for defining a dictionary of model names mapped to configured scikit-learn model objects.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare a list of ml models\r\ndef get_models(models=dict()):\r\n\t# ...\r\n\treturn models<\/pre>\n<p>Finally, we need a function to drive the model evaluation process.<\/p>\n<p>Given the 
dictionary of models, we enumerate each model: fit the matrix of models on the training data, make predictions on the test dataset, evaluate the predictions, and summarize the results.<\/p>\n<p>The <em>evaluate_models()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a suite of models\r\ndef evaluate_models(models, train, test, actual):\r\n\tfor name, model in models.items():\r\n\t\t# fit models\r\n\t\tfits = fit_models(model, train)\r\n\t\t# make predictions\r\n\t\tpredictions = make_predictions(fits, test)\r\n\t\t# evaluate forecast\r\n\t\ttotal_mae, _ = evaluate_forecasts(predictions, actual)\r\n\t\t# summarize forecast\r\n\t\tsummarize_error(name, total_mae)<\/pre>\n<p>We now have everything we need to evaluate machine learning models.<\/p>\n<h2>Evaluate Linear Algorithms<\/h2>\n<p>In this section, we will spot check a suite of linear machine learning algorithms.<\/p>\n<p>Linear algorithms are those that assume that the output is a linear function of the input variables. This is much like the assumptions of classical time series forecasting models like ARIMA.<\/p>\n<p>Spot checking means evaluating a suite of models in order to get a rough idea of what works. 
We are interested in any models that outperform a simple autoregression model AR(2) that achieves a MAE error of about 0.487.<\/p>\n<p>We will test eight linear machine learning algorithms with their default configuration; specifically:<\/p>\n<ul>\n<li>Linear Regression<\/li>\n<li>Lasso Linear Regression<\/li>\n<li>Ridge Regression<\/li>\n<li>Elastic Net Regression<\/li>\n<li>Huber Regression<\/li>\n<li>Lasso Lars Linear Regression<\/li>\n<li>Passive Aggressive Regression<\/li>\n<li>Stochastic Gradient Descent Regression<\/li>\n<\/ul>\n<p>We can define these models in the <em>get_models()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare a list of ml models\r\ndef get_models(models=dict()):\r\n\t# linear models\r\n\tmodels['lr'] = LinearRegression()\r\n\tmodels['lasso'] = Lasso()\r\n\tmodels['ridge'] = Ridge()\r\n\tmodels['en'] = ElasticNet()\r\n\tmodels['huber'] = HuberRegressor()\r\n\tmodels['llars'] = LassoLars()\r\n\tmodels['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)\r\n\tmodels['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)\r\n\tprint('Defined %d models' % len(models))\r\n\treturn models<\/pre>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate linear algorithms\r\nfrom numpy import load\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import array\r\nfrom sklearn.base import clone\r\nfrom sklearn.linear_model import LinearRegression\r\nfrom sklearn.linear_model import Lasso\r\nfrom sklearn.linear_model import Ridge\r\nfrom sklearn.linear_model import ElasticNet\r\nfrom sklearn.linear_model import HuberRegressor\r\nfrom sklearn.linear_model import LassoLars\r\nfrom sklearn.linear_model import PassiveAggressiveRegressor\r\nfrom sklearn.linear_model import SGDRegressor\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks 
= list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# return true if the array has any non-nan values\r\ndef has_data(data):\r\n\treturn count_nonzero(isnan(data)) < len(data)\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# fit a single model\r\ndef fit_model(model, X, y):\r\n\t# clone the model configuration\r\n\tlocal_model = clone(model)\r\n\t# fit the model\r\n\tlocal_model.fit(X, y)\r\n\treturn local_model\r\n\r\n# fit one model for each variable and each forecast lead time [var][time][model]\r\ndef fit_models(model, train):\r\n\t# prepare structure for saving models\r\n\tmodels = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]\r\n\t# enumerate vars\r\n\tfor i in range(train.shape[0]):\r\n\t\t# enumerate lead times\r\n\t\tfor j in range(train.shape[1]):\r\n\t\t\t# get data\r\n\t\t\tdata = train[i, j]\r\n\t\t\tX, y = data[:, :-1], data[:, -1]\r\n\t\t\t# fit model\r\n\t\t\tlocal_model = fit_model(model, X, y)\r\n\t\t\tmodels[i][j].append(local_model)\r\n\treturn models\r\n\r\n# return forecasts as [chunks][var][time]\r\ndef make_predictions(models, test):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks\r\n\tfor i in range(test.shape[0]):\r\n\t\t# enumerate variables\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(test.shape[1]):\r\n\t\t\t# get the input pattern for this chunk and target\r\n\t\t\tpattern = test[i,j]\r\n\t\t\t# assume a nan forecast\r\n\t\t\tforecasts = array([nan for _ in range(len(lead_times))])\r\n\t\t\t# check we can make a forecast\r\n\t\t\tif has_data(pattern):\r\n\t\t\t\tpattern = pattern.reshape((1, len(pattern)))\r\n\t\t\t\t# forecast each lead time\r\n\t\t\t\tforecasts 
= list()\r\n\t\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t\tyhat = models[j][k][0].predict(pattern)\r\n\t\t\t\t\tforecasts.append(yhat[0])\r\n\t\t\t\tforecasts = array(forecasts)\r\n\t\t\t# save forecasts for each lead time for this variable\r\n\t\t\tchunk_predictions.append(forecasts)\r\n\t\t# save forecasts for this chunk\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)\r\n\r\n# evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# convert to forecasts\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror 
= calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae\r\n\r\n# summarize scores\r\ndef summarize_error(name, total_mae):\r\n\tprint('%s: %.3f MAE' % (name, total_mae))\r\n\r\n# prepare a list of ml models\r\ndef get_models(models=dict()):\r\n\t# linear models\r\n\tmodels['lr'] = LinearRegression()\r\n\tmodels['lasso'] = Lasso()\r\n\tmodels['ridge'] = Ridge()\r\n\tmodels['en'] = ElasticNet()\r\n\tmodels['huber'] = HuberRegressor()\r\n\tmodels['llars'] = LassoLars()\r\n\tmodels['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)\r\n\tmodels['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)\r\n\tprint('Defined %d models' % len(models))\r\n\treturn models\r\n\r\n# evaluate a suite of models\r\ndef evaluate_models(models, train, test, actual):\r\n\tfor name, model in models.items():\r\n\t\t# fit models\r\n\t\tfits = fit_models(model, train)\r\n\t\t# make predictions\r\n\t\tpredictions = make_predictions(fits, test)\r\n\t\t# evaluate forecast\r\n\t\ttotal_mae, _ = evaluate_forecasts(predictions, actual)\r\n\t\t# summarize forecast\r\n\t\tsummarize_error(name, total_mae)\r\n\r\n# load supervised datasets\r\ntrain = load('AirQualityPrediction\/supervised_train.npy')\r\ntest = load('AirQualityPrediction\/supervised_test.npy')\r\nprint(train.shape, test.shape)\r\n# load test chunks for validation\r\ntestset = loadtxt('AirQualityPrediction\/naive_test.csv', delimiter=',')\r\ntest_chunks = to_chunks(testset)\r\nactual = prepare_test_forecasts(test_chunks)\r\n# prepare list of models\r\nmodels = get_models()\r\n# evaluate models\r\nevaluate_models(models, train, test, actual)<\/pre>\n<p>Running the example prints the MAE for each of the evaluated 
algorithms.<\/p>\n<p>We can see that many of the algorithms show skill compared to a simple AR model, achieving a MAE below 0.487.<\/p>\n<p>Huber regression seems to perform the best (with default configuration), achieving a MAE of 0.434.<\/p>\n<p>This is interesting as <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.HuberRegressor.html\">Huber regression<\/a>, or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robust_regression\">robust regression<\/a> with Huber loss, is a method that is designed to be robust to outliers in the training dataset. This may suggest that the other methods would perform better with a little more data preparation, such as standardization and\/or outlier removal.<\/p>\n<pre class=\"crayon-plain-tag\">lr: 0.454 MAE\r\nlasso: 0.624 MAE\r\nridge: 0.454 MAE\r\nen: 0.595 MAE\r\nhuber: 0.434 MAE\r\nllars: 0.631 MAE\r\npa: 0.833 MAE\r\nsgd: 0.457 MAE<\/pre>\n<h2>Evaluate Nonlinear Algorithms<\/h2>\n<p>We can use the same framework to evaluate the performance of a suite of nonlinear and ensemble machine learning algorithms.<\/p>\n<p>Specifically:<\/p>\n<p><strong>Nonlinear Algorithms<\/strong><\/p>\n<ul>\n<li>k-Nearest Neighbors<\/li>\n<li>Classification and Regression Trees<\/li>\n<li>Extra Tree<\/li>\n<li>Support Vector Regression<\/li>\n<\/ul>\n<p><strong>Ensemble Algorithms<\/strong><\/p>\n<ul>\n<li>AdaBoost<\/li>\n<li>Bagged Decision Trees<\/li>\n<li>Random Forest<\/li>\n<li>Extra Trees<\/li>\n<li>Gradient Boosting Machines<\/li>\n<\/ul>\n<p>The <em>get_models()<\/em> function below defines these nine models.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare a list of ml models\r\ndef get_models(models=dict()):\r\n\t# non-linear models\r\n\tmodels['knn'] = KNeighborsRegressor(n_neighbors=7)\r\n\tmodels['cart'] = DecisionTreeRegressor()\r\n\tmodels['extra'] = ExtraTreeRegressor()\r\n\tmodels['svmr'] = SVR()\r\n\t# ensemble models\r\n\tn_trees = 100\r\n\tmodels['ada'] = 
AdaBoostRegressor(n_estimators=n_trees)\r\n\tmodels['bag'] = BaggingRegressor(n_estimators=n_trees)\r\n\tmodels['rf'] = RandomForestRegressor(n_estimators=n_trees)\r\n\tmodels['et'] = ExtraTreesRegressor(n_estimators=n_trees)\r\n\tmodels['gbm'] = GradientBoostingRegressor(n_estimators=n_trees)\r\n\tprint('Defined %d models' % len(models))\r\n\treturn models<\/pre>\n<p>The complete code listing is provided below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check nonlinear algorithms\r\nfrom numpy import load\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import array\r\nfrom sklearn.base import clone\r\nfrom sklearn.neighbors import KNeighborsRegressor\r\nfrom sklearn.tree import DecisionTreeRegressor\r\nfrom sklearn.tree import ExtraTreeRegressor\r\nfrom sklearn.svm import SVR\r\nfrom sklearn.ensemble import AdaBoostRegressor\r\nfrom sklearn.ensemble import BaggingRegressor\r\nfrom sklearn.ensemble import RandomForestRegressor\r\nfrom sklearn.ensemble import ExtraTreesRegressor\r\nfrom sklearn.ensemble import GradientBoostingRegressor\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# return true if the array has any non-nan values\r\ndef has_data(data):\r\n\treturn count_nonzero(isnan(data)) < len(data)\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# fit a single model\r\ndef fit_model(model, X, y):\r\n\t# clone the model configuration\r\n\tlocal_model = clone(model)\r\n\t# fit the model\r\n\tlocal_model.fit(X, y)\r\n\treturn 
local_model\r\n\r\n# fit one model for each variable and each forecast lead time [var][time][model]\r\ndef fit_models(model, train):\r\n\t# prepare structure for saving models\r\n\tmodels = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]\r\n\t# enumerate vars\r\n\tfor i in range(train.shape[0]):\r\n\t\t# enumerate lead times\r\n\t\tfor j in range(train.shape[1]):\r\n\t\t\t# get data\r\n\t\t\tdata = train[i, j]\r\n\t\t\tX, y = data[:, :-1], data[:, -1]\r\n\t\t\t# fit model\r\n\t\t\tlocal_model = fit_model(model, X, y)\r\n\t\t\tmodels[i][j].append(local_model)\r\n\treturn models\r\n\r\n# return forecasts as [chunks][var][time]\r\ndef make_predictions(models, test):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks\r\n\tfor i in range(test.shape[0]):\r\n\t\t# enumerate variables\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(test.shape[1]):\r\n\t\t\t# get the input pattern for this chunk and target\r\n\t\t\tpattern = test[i,j]\r\n\t\t\t# assume a nan forecast\r\n\t\t\tforecasts = array([nan for _ in range(len(lead_times))])\r\n\t\t\t# check we can make a forecast\r\n\t\t\tif has_data(pattern):\r\n\t\t\t\tpattern = pattern.reshape((1, len(pattern)))\r\n\t\t\t\t# forecast each lead time\r\n\t\t\t\tforecasts = list()\r\n\t\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t\tyhat = models[j][k][0].predict(pattern)\r\n\t\t\t\t\tforecasts.append(yhat[0])\r\n\t\t\t\tforecasts = array(forecasts)\r\n\t\t\t# save forecasts for each lead time for this variable\r\n\t\t\tchunk_predictions.append(forecasts)\r\n\t\t# save forecasts for this chunk\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for 
chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)\r\n\r\n# evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# convert to forecasts\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror = calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae\r\n\r\n# summarize scores\r\ndef summarize_error(name, total_mae):\r\n\tprint('%s: %.3f MAE' % (name, total_mae))\r\n\r\n# prepare a list of ml models\r\ndef get_models(models=dict()):\r\n\t# non-linear models\r\n\tmodels['knn'] = KNeighborsRegressor(n_neighbors=7)\r\n\tmodels['cart'] = DecisionTreeRegressor()\r\n\tmodels['extra'] = 
ExtraTreeRegressor()\r\n\tmodels['svmr'] = SVR()\r\n\t# ensemble models\r\n\tn_trees = 100\r\n\tmodels['ada'] = AdaBoostRegressor(n_estimators=n_trees)\r\n\tmodels['bag'] = BaggingRegressor(n_estimators=n_trees)\r\n\tmodels['rf'] = RandomForestRegressor(n_estimators=n_trees)\r\n\tmodels['et'] = ExtraTreesRegressor(n_estimators=n_trees)\r\n\tmodels['gbm'] = GradientBoostingRegressor(n_estimators=n_trees)\r\n\tprint('Defined %d models' % len(models))\r\n\treturn models\r\n\r\n# evaluate a suite of models\r\ndef evaluate_models(models, train, test, actual):\r\n\tfor name, model in models.items():\r\n\t\t# fit models\r\n\t\tfits = fit_models(model, train)\r\n\t\t# make predictions\r\n\t\tpredictions = make_predictions(fits, test)\r\n\t\t# evaluate forecast\r\n\t\ttotal_mae, _ = evaluate_forecasts(predictions, actual)\r\n\t\t# summarize forecast\r\n\t\tsummarize_error(name, total_mae)\r\n\r\n# load supervised datasets\r\ntrain = load('AirQualityPrediction\/supervised_train.npy')\r\ntest = load('AirQualityPrediction\/supervised_test.npy')\r\nprint(train.shape, test.shape)\r\n# load test chunks for validation\r\ntestset = loadtxt('AirQualityPrediction\/naive_test.csv', delimiter=',')\r\ntest_chunks = to_chunks(testset)\r\nactual = prepare_test_forecasts(test_chunks)\r\n# prepare list of models\r\nmodels = get_models()\r\n# evaluate models\r\nevaluate_models(models, train, test, actual)<\/pre>\n<p>Running the example, we can see that many algorithms performed well compared to the baseline of an autoregression algorithm, although none performed as well as Huber regression in the previous section.<\/p>\n<p>Both support vector regression and gradient boosting machines may be worth further investigation, having achieved MAEs of 0.437 and 0.450 respectively.<\/p>\n<pre class=\"crayon-plain-tag\">knn: 0.484 MAE\r\ncart: 0.631 MAE\r\nextra: 0.630 MAE\r\nsvmr: 0.437 MAE\r\nada: 0.717 MAE\r\nbag: 0.471 MAE\r\nrf: 0.470 MAE\r\net: 0.469 MAE\r\ngbm: 0.450 
MAE<\/pre>\n<h2>Tune Lag Size<\/h2>\n<p>In the previous spot check experiments, the number of lag observations was arbitrarily fixed at 12.<\/p>\n<p>We can vary the number of lag observations and evaluate the effect on MAE. Some algorithms may require more or fewer prior observations, but general trends may hold across algorithms.<\/p>\n<p>To do this, we can prepare the supervised learning dataset with a range of different numbers of lag observations, then fit and evaluate the HuberRegressor on each.<\/p>\n<p>I experimented with the following numbers of lag observations:<\/p>\n<pre class=\"crayon-plain-tag\">[1, 3, 6, 12, 24, 36, 48]<\/pre>\n<p>The results were as follows:<\/p>\n<pre class=\"crayon-plain-tag\">1:\t0.451\r\n3:\t0.445\r\n6:\t0.441\r\n12:\t0.434\r\n24:\t0.423\r\n36:\t0.422\r\n48:\t0.439<\/pre>\n<p>A plot of these results is provided below.<\/p>\n<div id=\"attachment_6336\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6336\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/Line-Plot-of-Number-of-Lag-Observations-vs-MAE-for-Huber-Regression.png\" alt=\"Line Plot of Number of Lag Observations vs MAE for Huber Regression\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plot-of-Number-of-Lag-Observations-vs-MAE-for-Huber-Regression.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plot-of-Number-of-Lag-Observations-vs-MAE-for-Huber-Regression-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plot-of-Number-of-Lag-Observations-vs-MAE-for-Huber-Regression-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plot-of-Number-of-Lag-Observations-vs-MAE-for-Huber-Regression-1024x768.png 1024w\" 
sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Number of Lag Observations vs MAE for Huber Regression<\/p>\n<\/div>\n<p>We can see a general trend of decreasing overall MAE with the increase in the number of lag observations, at least to a point after which error begins to rise again.<\/p>\n<p>The results suggest, at least for the HuberRegressor algorithm, that 36 lag observations may be a good configuration achieving a MAE of 0.422.<\/p>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li><strong>Data Preparation<\/strong>. Explore whether simple data preparation such as standardization or statistical outlier removal can improve model performance.<\/li>\n<li><strong>Engineered Features<\/strong>. Explore whether engineered features such as median value for forecasted hour of day can improve model performance<\/li>\n<li><strong>Meteorological Variables<\/strong>. Explore whether adding lag meteorological variables to the models can improve performance.<\/li>\n<li><strong>Cross-Site Models<\/strong>. Explore whether combining target variables of the same type and re-using the models across sites results in a performance improvement.<\/li>\n<li><strong>Algorithm Tuning<\/strong>. 
Explore whether tuning the hyperparameters of some of the better performing algorithms can result in performance improvements.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/data\">EMC Data Science Global Hackathon (Air Quality Prediction)<\/a><\/li>\n<li><a href=\"http:\/\/blog.kaggle.com\/2012\/05\/01\/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon\/\">Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/benhamner\/Air-Quality-Prediction-Hackathon-Winning-Model\">Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/discussion\/1821\">General approaches to partitioning the models?<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop machine learning models for multi-step time series forecasting of air pollution data.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to impute missing values and transform time series data so that it can be modeled by supervised learning algorithms.<\/li>\n<li>How to develop and evaluate a suite of linear algorithms for multi-step time series forecasting.<\/li>\n<li>How to develop and evaluate a suite of nonlinear algorithms for multi-step time series forecasting.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting\/\">How to Develop Machine Learning Models for 
Multivariate Multi-Step Air Pollution Time Series Forecasting<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/18\/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1185,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1184"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1184"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1184\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php
\/wp-json\/wp\/v2\/media\/1185"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1184"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}