{"id":1170,"date":"2018-10-16T18:00:01","date_gmt":"2018-10-16T18:00:01","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/16\/how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting\/"},"modified":"2018-10-16T18:00:01","modified_gmt":"2018-10-16T18:00:01","slug":"how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/16\/how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting\/","title":{"rendered":"How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.<\/p>\n<p>The EMC Data Science Global Hackathon dataset, or the \u2018Air Quality Prediction\u2019 dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.<\/p>\n<p>Before diving into sophisticated machine learning and deep learning methods for time series forecasting, it is important to find the limits of classical methods, such as developing autoregressive models using the AR or ARIMA method.<\/p>\n<p>In this tutorial, you will discover how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to analyze and impute missing values for time series data.<\/li>\n<li>How to develop and evaluate an autoregressive model for multi-step time series 
forecasting.<\/li>\n<li>How to improve an autoregressive model using alternate data imputation methods.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6747\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6747\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/Impact-of-Dataset-Size-on-Deep-Learning-Model-Skill-And-Performance-Estimates.jpg\" alt=\"Impact of Dataset Size on Deep Learning Model Skill And Performance Estimates\" width=\"640\" height=\"425\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/Impact-of-Dataset-Size-on-Deep-Learning-Model-Skill-And-Performance-Estimates.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/Impact-of-Dataset-Size-on-Deep-Learning-Model-Skill-And-Performance-Estimates-300x199.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">Impact of Dataset Size on Deep Learning Model Skill And Performance Estimates<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/eneas\/13632855754\/\">Eneas De Troya<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Problem Description<\/li>\n<li>Model Evaluation<\/li>\n<li>Data Analysis<\/li>\n<li>Develop an Autoregressive Model<\/li>\n<li>Autoregressive Model with Global Impute Strategy<\/li>\n<\/ol>\n<h2>Problem Description<\/h2>\n<p>The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.<\/p>\n<p>Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. 
The objective is to predict air quality measurements for the next 3 days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72 hour forecast period. They are:<\/p>\n<pre class=\"crayon-plain-tag\">+1, +2, +3, +4, +5, +10, +17, +24, +48, +72<\/pre>\n<p>Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.<\/p>\n<p>Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.<\/p>\n<p>The dataset was used as the basis for a <a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\">short duration machine learning competition<\/a> (or hackathon) on the Kaggle website in 2012.<\/p>\n<p>Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).<\/p>\n<p>A winning entrant achieved a MAE of 0.21058 on the withheld test set (<a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/leaderboard\">private leaderboard<\/a>) using random forest on lagged observations. 
A writeup of this solution is available in the post:<\/p>\n<ul>\n<li><a href=\"http:\/\/blog.kaggle.com\/2012\/05\/01\/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon\/\">Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon<\/a>, 2012.<\/li>\n<\/ul>\n<p>In this tutorial, we will explore how to develop autoregressive models for the problem that can be compared against a naive baseline to determine whether a model has skill on the problem or not.<\/p>\n<h2>Model Evaluation<\/h2>\n<p>Before we can evaluate forecasting methods, we must develop a test harness.<\/p>\n<p>This includes at least how the data will be prepared and how forecasts will be evaluated.<\/p>\n<h3>Load Dataset<\/h3>\n<p>The first step 
is to download the dataset and load it into memory.<\/p>\n<p>The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.<\/p>\n<p>Download the entire dataset, e.g. \u201c<em>Download All<\/em>\u201d to your workstation and unzip the archive in your current working directory with the folder named \u2018<em>AirQualityPrediction<\/em>\u2018.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/data\">EMC Data Science Global Hackathon (Air Quality Prediction) Data<\/a><\/li>\n<\/ul>\n<p>Our focus will be the \u2018<em>TrainingData.csv<\/em>\u2018 file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.<\/p>\n<p>We can load the data file into memory using the Pandas <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.read_csv.html\">read_csv() function<\/a> and specify the header row on line 0.<\/p>\n<pre class=\"crayon-plain-tag\"># load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)<\/pre>\n<p>We can group data by the \u2018chunkID\u2019 variable (column index 1).<\/p>\n<p>First, let\u2019s get a list of the unique chunk identifiers.<\/p>\n<pre class=\"crayon-plain-tag\">chunk_ids = unique(values[:, 1])<\/pre>\n<p>We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.<\/p>\n<pre class=\"crayon-plain-tag\">chunks = dict()\r\n# sort rows by chunk id\r\nfor chunk_id in chunk_ids:\r\n\tselection = values[:, chunk_ix] == chunk_id\r\n\tchunks[chunk_id] = values[selection, :]<\/pre>\n<p>Below defines a function named <em>to_chunks()<\/em> that takes a NumPy array of the loaded data and returns a dictionary of <em>chunk_id<\/em> to rows for the chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># split the dataset by 'chunkID', return a dict of id to rows\r\ndef 
to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks<\/pre>\n<p>The complete example that loads the dataset and splits it into chunks is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load data and split into chunks\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n\r\n# split the dataset by 'chunkID', return a dict of id to rows\r\ndef to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks\r\n\r\n# load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)\r\n# group data by chunks\r\nvalues = dataset.values\r\nchunks = to_chunks(values)\r\nprint('Total Chunks: %d' % len(chunks))<\/pre>\n<p>Running the example prints the number of chunks in the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Total Chunks: 208<\/pre>\n<h3>Data Preparation<\/h3>\n<p>Now that we know how to load the data and split it into chunks, we can separate it into train and test datasets.<\/p>\n<p>Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.<\/p>\n<p>We can split each chunk into the first five days of observations for training and the last three for test.<\/p>\n<p>Each observation has a column called \u2018<em>position_within_chunk<\/em>\u2018 that varies from 1 to 192 (8 days * 24 hours). 
We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any values more than 120 as test data.<\/p>\n<p>Further, any chunks that don\u2019t have any observations in the train or test split can be dropped as not viable.<\/p>\n<p>When working with the naive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data and have the train and test data only comprised of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.<\/p>\n<p>The <em>split_train_test()<\/em> function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.<\/p>\n<pre class=\"crayon-plain-tag\"># split each chunk into train\/test sets\r\ndef split_train_test(chunks, row_in_chunk_ix=2):\r\n\ttrain, test = list(), list()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\t# enumerate chunks\r\n\tfor k,rows in chunks.items():\r\n\t\t# split chunk rows by 'position_within_chunk'\r\n\t\ttrain_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]\r\n\t\ttest_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]\r\n\t\tif len(train_rows) == 0 or len(test_rows) == 0:\r\n\t\t\tprint('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))\r\n\t\t\tcontinue\r\n\t\t# store with chunk id, position in chunk, hour and all targets\r\n\t\tindices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]\r\n\t\ttrain.append(train_rows[:, indices])\r\n\t\ttest.append(test_rows[:, indices])\r\n\treturn train, test<\/pre>\n<p>We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three day period, specifically the lead times:<\/p>\n<pre class=\"crayon-plain-tag\">+1, +2, +3, +4, +5, +10, +17, +24, +48, +72<\/pre>\n<p>Where, each lead time is relative 
to the end of the training period.<\/p>\n<p>First, we can put these lead times into a function for easy reference:<\/p>\n<pre class=\"crayon-plain-tag\"># return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2 ,3, 4, 5, 10, 17, 24, 48, 72]<\/pre>\n<p>Next, we can reduce the test dataset down to just the data at the preferred lead times.<\/p>\n<p>We can do that by looking at the \u2018<em>position_within_chunk<\/em>\u2018 column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 +2, etc.<\/p>\n<p>If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.<\/p>\n<p>The function <em>to_forecasts()<\/em> below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># convert the rows in a test chunk to forecasts\r\ndef to_forecasts(test_chunks, row_in_chunk_ix=1):\r\n\t# get lead times\r\n\tlead_times = get_lead_times()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\tforecasts = list()\r\n\t# enumerate each chunk\r\n\tfor rows in test_chunks:\r\n\t\tchunk_id = rows[0, 0]\r\n\t\t# enumerate each lead time\r\n\t\tfor tau in lead_times:\r\n\t\t\t# determine the row in chunk we want for the lead time\r\n\t\t\toffset = cut_point + tau\r\n\t\t\t# retrieve data for the lead time using row number in chunk\r\n\t\t\trow_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]\r\n\t\t\t# check if we have data\r\n\t\t\tif len(row_for_tau) == 0:\r\n\t\t\t\t# create a mock row [chunk, position, hour] + [nan...]\r\n\t\t\t\trow = [chunk_id, offset, nan] + [nan for _ in range(39)]\r\n\t\t\t\tforecasts.append(row)\r\n\t\t\telse:\r\n\t\t\t\t# store the forecast row\r\n\t\t\t\tforecasts.append(row_for_tau[0])\r\n\treturn array(forecasts)<\/pre>\n<p>We can tie all of this together and split the dataset into train and test sets and save the results to new 
files.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># split data into train and test sets\r\nfrom numpy import unique\r\nfrom numpy import nan\r\nfrom numpy import array\r\nfrom numpy import savetxt\r\nfrom pandas import read_csv\r\n\r\n# split the dataset by 'chunkID', return a dict of id to rows\r\ndef to_chunks(values, chunk_ix=1):\r\n\tchunks = dict()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks[chunk_id] = values[selection, :]\r\n\treturn chunks\r\n\r\n# split each chunk into train\/test sets\r\ndef split_train_test(chunks, row_in_chunk_ix=2):\r\n\ttrain, test = list(), list()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\t# enumerate chunks\r\n\tfor k,rows in chunks.items():\r\n\t\t# split chunk rows by 'position_within_chunk'\r\n\t\ttrain_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]\r\n\t\ttest_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]\r\n\t\tif len(train_rows) == 0 or len(test_rows) == 0:\r\n\t\t\tprint('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))\r\n\t\t\tcontinue\r\n\t\t# store with chunk id, position in chunk, hour and all targets\r\n\t\tindices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]\r\n\t\ttrain.append(train_rows[:, indices])\r\n\t\ttest.append(test_rows[:, indices])\r\n\treturn train, test\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2 ,3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# convert the rows in a test chunk to forecasts\r\ndef to_forecasts(test_chunks, row_in_chunk_ix=1):\r\n\t# get lead times\r\n\tlead_times = get_lead_times()\r\n\t# first 5 days of hourly observations for train\r\n\tcut_point = 5 * 24\r\n\tforecasts = list()\r\n\t# enumerate each chunk\r\n\tfor rows in test_chunks:\r\n\t\tchunk_id 
= rows[0, 0]\r\n\t\t# enumerate each lead time\r\n\t\tfor tau in lead_times:\r\n\t\t\t# determine the row in chunk we want for the lead time\r\n\t\t\toffset = cut_point + tau\r\n\t\t\t# retrieve data for the lead time using row number in chunk\r\n\t\t\trow_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]\r\n\t\t\t# check if we have data\r\n\t\t\tif len(row_for_tau) == 0:\r\n\t\t\t\t# create a mock row [chunk, position, hour] + [nan...]\r\n\t\t\t\trow = [chunk_id, offset, nan] + [nan for _ in range(39)]\r\n\t\t\t\tforecasts.append(row)\r\n\t\t\telse:\r\n\t\t\t\t# store the forecast row\r\n\t\t\t\tforecasts.append(row_for_tau[0])\r\n\treturn array(forecasts)\r\n\r\n# load dataset\r\ndataset = read_csv('AirQualityPrediction\/TrainingData.csv', header=0)\r\n# group data by chunks\r\nvalues = dataset.values\r\nchunks = to_chunks(values)\r\n# split into train\/test\r\ntrain, test = split_train_test(chunks)\r\n# flatten training chunks to rows\r\ntrain_rows = array([row for rows in train for row in rows])\r\n# print(train_rows.shape)\r\nprint('Train Rows: %s' % str(train_rows.shape))\r\n# reduce test set to forecast lead times only\r\ntest_rows = to_forecasts(test)\r\nprint('Test Rows: %s' % str(test_rows.shape))\r\n# save datasets\r\nsavetxt('AirQualityPrediction\/naive_train.csv', train_rows, delimiter=',')\r\nsavetxt('AirQualityPrediction\/naive_test.csv', test_rows, delimiter=',')<\/pre>\n<p>Running the example first reports that chunk 69 is removed from the dataset for having insufficient data.<\/p>\n<p>We can then see that we have 42 columns in each of the train and test sets, one for the chunk id, position within chunk, hour of day, and the 39 target variables.<\/p>\n<p>We can also see the dramatically smaller version of the test dataset with rows only at the forecast lead times.<\/p>\n<p>The new train and test datasets are saved in the \u2018<em>naive_train.csv<\/em>\u2018 and \u2018<em>naive_test.csv<\/em>\u2018 files respectively.<\/p>\n<pre class=\"crayon-plain-tag\">>dropping chunk=69: train=(0, 95), test=(28, 95)\r\nTrain Rows: (23514, 42)\r\nTest Rows: (2070, 42)<\/pre>\n<h3>Forecast Evaluation<\/h3>\n<p>Once forecasts have been made, they need to be evaluated.<\/p>\n<p>It is helpful to have a simpler format when evaluating forecasts. For example, we will use the three-dimensional structure of <em>[chunks][variables][time]<\/em>, where variable is the target variable number from 0 to 38 and time is the lead time index from 0 to 9.<\/p>\n<p>Models are expected to make predictions in this format.<\/p>\n<p>We can also restructure the test dataset into this format for comparison. The <em>prepare_test_forecasts()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)<\/pre>\n<p>We will evaluate a model using the mean absolute error, or MAE. This is the metric that was used in the competition and is a sensible choice given the non-Gaussian distribution of the target variables.<\/p>\n<p>If a lead time contains no data in the test set (e.g. <em>NaN<\/em>), then no error will be calculated for that forecast. If the lead time does have data in the test set but no data in the forecast, then the full magnitude of the observation will be taken as error. 
Finally, if the test set has an observation and a forecast was made, then the absolute difference will be recorded as the error.<\/p>\n<p>The <em>calculate_error()<\/em> function implements these rules and returns the error for a given forecast.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)<\/pre>\n<p>Errors are summed across all chunks and all lead times, then averaged.<\/p>\n<p>The overall MAE will be calculated, but we will also calculate a MAE for each forecast lead time. This can help with model selection generally as some models may perform differently at different lead times.<\/p>\n<p>The <em>evaluate_forecasts()<\/em> function below implements this, calculating the MAE and per-lead time MAE for the provided predictions and expected values in <em>[chunk][variable][time]<\/em> format.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# convert to forecasts\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror = calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 
1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae<\/pre>\n<p>Once we have the evaluation of a model, we can present it.<\/p>\n<p>The <em>summarize_error()<\/em> function below first prints a one-line summary of a model\u2019s performance, then creates a plot of MAE per forecast lead time.<\/p>\n<pre class=\"crayon-plain-tag\"># summarize scores\r\ndef summarize_error(name, total_mae, times_mae):\r\n\t# print summary\r\n\tlead_times = get_lead_times()\r\n\tformatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]\r\n\ts_scores = ', '.join(formatted)\r\n\tprint('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))\r\n\t# plot summary\r\n\tpyplot.plot([str(x) for x in lead_times], times_mae, marker='.')\r\n\tpyplot.show()<\/pre>\n<p>We are now ready to start exploring the performance of classical forecasting methods on this problem.<\/p>\n<h2>Data Analysis<\/h2>\n<p>The first step in fitting classical time series models to this data is to take a closer look at the data.<\/p>\n<p>There are 208 (actually 207) usable chunks of data, and each chunk has 39 time series to fit; that is a total of 8,073 separate models that would need to be fit on the data. 
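This count is simple arithmetic and can be sanity checked directly; the short snippet below is purely illustrative, using the 207 usable chunks and 39 target series described above:

```python
# sanity check the number of models that would need to be fit
n_chunks = 207   # 208 chunks, minus the one chunk dropped for insufficient data
n_series = 39    # target time series per chunk
n_models = n_chunks * n_series
print(n_models)  # 8073
```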
That is a lot of models, but the models are trained on a relatively small amount of data, at most (5 * 24), or 120, observations, and the models are linear, so they will find a fit quickly.<\/p>\n<p>We have choices about how to configure models to the data; for example:<\/p>\n<ul>\n<li>One model configuration for all time series (simplest).<\/li>\n<li>One model configuration for all variables across chunks (reasonable).<\/li>\n<li>One model configuration per variable per chunk (most complex).<\/li>\n<\/ul>\n<p>We will investigate the simplest approach of one model configuration for all series, but you may want to explore one or more of the other approaches.<\/p>\n<p>This section is divided into three parts; they are:<\/p>\n<ol>\n<li>Missing Data<\/li>\n<li>Impute Missing Data<\/li>\n<li>Autocorrelation Plots<\/li>\n<\/ol>\n<h3>Missing Data<\/h3>\n<p>Classical time series methods require that the time series be complete, e.g. that there are no missing values.<\/p>\n<p>Therefore, the first step is to investigate how complete or incomplete the target variables are.<\/p>\n<p>For a given variable, there may be missing observations defined by missing rows. Specifically, each observation has a \u2018<em>position_within_chunk<\/em>\u2018. We expect each chunk in the training dataset to have 120 observations, with \u2018<em>position_within_chunk<\/em>\u2018 values from 1 to 120 inclusively.<\/p>\n<p>Therefore, we can create an array of 120 nan values for each variable, mark all observations in the chunk using the \u2018<em>position_within_chunk<\/em>\u2018 values, and anything left will be marked <em>NaN<\/em>. 
We can then plot each variable and look for gaps.<\/p>\n<p>The <em>variable_to_series()<\/em> function below will take the rows for a chunk and a given column index for the target variable and will return a series of 120 time steps for the variable with all available data marked with the value from the chunk.<\/p>\n<pre class=\"crayon-plain-tag\"># layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data<\/pre>\n<p>We can then call this function for each target variable in one chunk and create a line plot.<\/p>\n<p>The function below named <em>plot_variables()<\/em> will implement this and create a figure with 39 line plots stacked vertically.<\/p>\n<pre class=\"crayon-plain-tag\"># plot variables horizontally with gaps for missing data\r\ndef plot_variables(chunk_train, n_vars=39):\r\n\tpyplot.figure()\r\n\tfor i in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + i\r\n\t\t# mark missing obs for variable\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\t# plot\r\n\t\tax = pyplot.subplot(n_vars, 1, i+1)\r\n\t\tax.set_xticklabels([])\r\n\t\tax.set_yticklabels([])\r\n\t\tpyplot.plot(series)\r\n\t# show plot\r\n\tpyplot.show()<\/pre>\n<p>Tying this together, the complete example is listed below. 
A plot of all variables in the first chunk is created.<\/p>\n<pre class=\"crayon-plain-tag\"># plot missing\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import unique\r\nfrom matplotlib import pyplot\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# plot variables horizontally with gaps for missing data\r\ndef plot_variables(chunk_train, n_vars=39):\r\n\tpyplot.figure()\r\n\tfor i in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + i\r\n\t\t# mark missing obs for variable\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\t# plot\r\n\t\tax = pyplot.subplot(n_vars, 1, i+1)\r\n\t\tax.set_xticklabels([])\r\n\t\tax.set_yticklabels([])\r\n\t\tpyplot.plot(series)\r\n\t# show plot\r\n\tpyplot.show()\r\n\r\n# load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\n# pick one chunk\r\nrows = train_chunks[0]\r\n# plot variables\r\nplot_variables(rows)<\/pre>\n<p>Running the example creates a figure with 39 line plots, one for each target variable in the first chunk.<\/p>\n<p>We can see a seasonal structure in many of the variables. 
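One simple way to remove such a daily cycle prior to modeling is a 24-hour seasonal difference. The sketch below is illustrative only; the <em>seasonal_difference()</em> helper is our own name, and it assumes a series laid out as a list of hourly values with NaN marking missing data, as produced by <em>variable_to_series()</em>:

```python
from numpy import nan, isnan

# subtract the observation 24 hours earlier from each value (a sketch;
# positions without a valid observation 24 hours back are left as NaN)
def seasonal_difference(series, n_seasonal=24):
	differenced = [nan for _ in range(len(series))]
	for i in range(n_seasonal, len(series)):
		if not isnan(series[i]) and not isnan(series[i - n_seasonal]):
			differenced[i] = series[i] - series[i - n_seasonal]
	return differenced

# mock two days of hourly data: a repeating daily cycle plus +1 per day
series = [float(10 + (h % 24) + h // 24) for h in range(48)]
diffed = seasonal_difference(series)
print(diffed[24:28])  # [1.0, 1.0, 1.0, 1.0] - only the day-to-day change remains
```

Note that differencing shortens the usable series by one seasonal period, which matters here given only 120 training observations per chunk.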
This suggests it may be beneficial to perform a 24-hour seasonal differencing of each series prior to modeling.<\/p>\n<p>The plots are small, and you may need to increase the size of the figure to clearly see the data.<\/p>\n<p>We can see that there are variables for which we have no data. These can be detected and ignored as we cannot model or forecast them.<\/p>\n<p>We can see gaps in many of the series, but the gaps are short, lasting for a few hours at most. These could be imputed either with persistence of previous values or values at the same hours within the same series.<\/p>\n<div id=\"attachment_6324\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6324\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Missing-Values-Marked.png\" alt=\"Line Plots for All Targets in Chunk 1 With Missing Values Marked\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Missing-Values-Marked.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Missing-Values-Marked-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Missing-Values-Marked-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Missing-Values-Marked-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots for All Targets in Chunk 1 With Missing Values Marked<\/p>\n<\/div>\n<p>Looking at a few other chunks randomly, many result in plots with much the same observations.<\/p>\n<p>This is not always the 
case though.<\/p>\n<p>Update the example to plot the 4th chunk in the dataset (index 3).<\/p>\n<pre class=\"crayon-plain-tag\"># pick one chunk\r\nrows = train_chunks[3]<\/pre>\n<p>The result is a figure that tells a very different story.<\/p>\n<p>We see gaps in the data that last for many hours, perhaps up to a day or more.<\/p>\n<p>These series will require dramatic repair before they can be used to fit a classical model.<\/p>\n<p>Imputing the missing data using persistence or observations within the series with the same hour will likely not be sufficient. They may have to be filled with average values taken across the entire training dataset.<\/p>\n<div id=\"attachment_6325\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6325\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Missing-Values-Marked.png\" alt=\"Line Plots for All Targets in Chunk 4 With Missing Values Marked\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Missing-Values-Marked.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Missing-Values-Marked-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Missing-Values-Marked-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Missing-Values-Marked-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots for All Targets in Chunk 4 With Missing Values Marked<\/p>\n<\/div>\n<h3>Impute Missing Data<\/h3>\n<p>There are many ways to 
impute the missing data, and we cannot know which is best a priori.<\/p>\n<p>One approach would be to prepare the data using multiple different imputation methods and use the skill of the models fit on the data to help guide the best approach.<\/p>\n<p>Some imputation approaches already suggested include:<\/p>\n<ul>\n<li>Persist the last observation in the series, also called forward filling or persistence.<\/li>\n<li>Fill with values or average values within the series with the same hour of day.<\/li>\n<li>Fill with values or average values with the same hour of day across the training dataset.<\/li>\n<\/ul>\n<p>It may also be useful to use combinations, e.g. persist or fill from the series for small gaps and draw from the whole dataset for large gaps.<\/p>\n<p>We can also investigate the effect of imputation methods by filling in the missing data and looking at plots to see if the series looks reasonable. It\u2019s crude, effective, and fast.<\/p>\n<p>First, we need to calculate a parallel series of the hour of day for each chunk that we can use for imputing hour-specific data for each variable in the chunk.<\/p>\n<p>Given a series of partially filled hours of day, the <em>interpolate_hours()<\/em> function below will fill in the missing hours of day. 
It does this by finding the first marked hour, then counting forward, filling in the hour of day, then performing the same operation backwards.<\/p>\n<pre class=\"crayon-plain-tag\"># interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24<\/pre>\n<p>I\u2019m sure there is a more Pythonic way to write this function, but I wanted to lay it all out to make it obvious what was going on.<\/p>\n<p>We can test this out on a mock list of hours with missing data. 
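<\/p>\n<p>On the more Pythonic point: a compact alternative (a sketch of my own, not part of the original tutorial; the name <em>interpolate_hours_compact()<\/em> is made up) is to anchor on the first known hour and derive every missing position from its offset to that anchor, which collapses the forward and backward passes into one loop:<\/p>

```python
# A compact, hypothetical alternative to interpolate_hours().
# Each missing position is filled from its offset to the first
# known hour, modulo 24, which covers both directions at once.
from numpy import nan, isnan

def interpolate_hours_compact(hours):
	# index of the first non-missing hour
	first = next(i for i, h in enumerate(hours) if not isnan(h))
	for i in range(len(hours)):
		if isnan(hours[i]):
			hours[i] = (hours[first] + (i - first)) % 24

data = [nan, nan, 5, nan, nan]
interpolate_hours_compact(data)
print(data)  # [3, 4, 5, 6, 7]
```

<p>Both versions produce the same filled hours; the long-hand version simply makes the two fill directions explicit. 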
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># interpolate hours\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# define hours with missing data\r\ndata = [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0, nan, 2, nan, nan, nan, nan, nan, nan, 9, 10, 11, 12, 13, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]\r\nprint(data)\r\n# fill in missing hours\r\ninterpolate_hours(data)\r\nprint(data)<\/pre>\n<p>Running the example first prints the hour data with missing values, then the same sequence with all of the hours filled in correctly.<\/p>\n<pre class=\"crayon-plain-tag\">[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0, nan, 2, nan, nan, nan, nan, nan, nan, 9, 10, 11, 12, 13, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]\r\n[11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1]<\/pre>\n<p>We can use this function to prepare a series of hours for a chunk that can be used to fill in missing values for a chunk using hour-specific information.<\/p>\n<p>We can call the same <em>variable_to_series()<\/em> function from the previous section to create the series of hours with missing values (column index 2), then call <em>interpolate_hours()<\/em> to 
fill in the gaps.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare sequence of hours for the chunk\r\nhours = variable_to_series(rows, 2)\r\n# interpolate hours\r\ninterpolate_hours(hours)<\/pre>\n<p>We can then pass the hours to any impute function that may make use of them.<\/p>\n<p>Let\u2019s try filling in missing values in a chunk with values within the same series with the same hour. Specifically, we will find all rows with the same hour in the series and calculate the median value.<\/p>\n<p>The <em>impute_missing()<\/em> function below takes all of the rows in a chunk, the prepared sequence of hours of day for the chunk, the series with missing values for a variable, and the column index for that variable.<\/p>\n<p>It first checks to see if the series is all missing data and returns immediately if this is the case, as no impute can be performed. It then enumerates over the time steps of the series and, when it detects a time step with no data, it collects all rows in the series with data for the same hour and calculates the median value.<\/p>\n<pre class=\"crayon-plain-tag\"># impute missing data\r\ndef impute_missing(rows, hours, series, col_ix):\r\n\t# count missing observations\r\n\tn_missing = count_nonzero(isnan(series))\r\n\t# calculate ratio of missing\r\n\tratio = n_missing \/ float(len(series)) * 100\r\n\t# check for no data\r\n\tif ratio == 100.0:\r\n\t\treturn series\r\n\t# impute missing using the median value for hour in the series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# get all rows with the same hour\r\n\t\t\tmatches = rows[rows[:,2]==hours[i]]\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(matches[:, col_ix])\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed<\/pre>\n<p>To see the impact of this impute strategy, we can update the <em>plot_variables()<\/em> function from the previous section to first plot the imputed series then plot the 
original series with missing values.<\/p>\n<p>This will allow the imputed values to shine through in the gaps of the original series and we can see if the results look reasonable.<\/p>\n<p>The updated version of the <em>plot_variables()<\/em> function is listed below with this change, calling the <em>impute_missing()<\/em> function to create the imputed version of the series and taking the hours series as an argument.<\/p>\n<pre class=\"crayon-plain-tag\"># plot variables horizontally with gaps for missing data\r\ndef plot_variables(chunk_train, hours, n_vars=39):\r\n\tpyplot.figure()\r\n\tfor i in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + i\r\n\t\t# mark missing obs for variable\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\tax = pyplot.subplot(n_vars, 1, i+1)\r\n\t\tax.set_xticklabels([])\r\n\t\tax.set_yticklabels([])\r\n\t\t# imputed\r\n\t\timputed = impute_missing(chunk_train, hours, series, col_ix)\r\n\t\t# plot imputed\r\n\t\tpyplot.plot(imputed)\r\n\t\t# plot with missing\r\n\t\tpyplot.plot(series)\r\n\t# show plot\r\n\tpyplot.show()<\/pre>\n<p>Tying all of this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># impute missing\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import nanmedian\r\nfrom matplotlib import pyplot\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# impute missing data\r\ndef impute_missing(rows, hours, series, col_ix):\r\n\t# count missing observations\r\n\tn_missing = count_nonzero(isnan(series))\r\n\t# 
calculate ratio of missing\r\n\tratio = n_missing \/ float(len(series)) * 100\r\n\t# check for no data\r\n\tif ratio == 100.0:\r\n\t\treturn series\r\n\t# impute missing using the median value for hour in the series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# get all rows with the same hour\r\n\t\t\tmatches = rows[rows[:,2]==hours[i]]\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(matches[:, col_ix])\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# plot variables horizontally with gaps for missing data\r\ndef plot_variables(chunk_train, hours, n_vars=39):\r\n\tpyplot.figure()\r\n\tfor i in range(n_vars):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + i\r\n\t\t# mark missing obs for variable\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\tax = 
pyplot.subplot(n_vars, 1, i+1)\r\n\t\tax.set_xticklabels([])\r\n\t\tax.set_yticklabels([])\r\n\t\t# imputed\r\n\t\timputed = impute_missing(chunk_train, hours, series, col_ix)\r\n\t\t# plot imputed\r\n\t\tpyplot.plot(imputed)\r\n\t\t# plot with missing\r\n\t\tpyplot.plot(series)\r\n\t# show plot\r\n\tpyplot.show()\r\n\r\n# load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\n# pick one chunk\r\nrows = train_chunks[0]\r\n# prepare sequence of hours for the chunk\r\nhours = variable_to_series(rows, 2)\r\n# interpolate hours\r\ninterpolate_hours(hours)\r\n# plot variables\r\nplot_variables(rows, hours)<\/pre>\n<p>Running the example creates a single figure with 39 line plots: one for each target variable in the first chunk in the training dataset.<\/p>\n<p>The imputed series is plotted first in blue and the original series with missing values is plotted over it in orange, so the blue imputed values show through only in the gaps.<\/p>\n<p>The blue segments seem reasonable.<\/p>\n<div id=\"attachment_6326\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6326\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Imputed-Missing-Values.png\" alt=\"Line Plots for All Targets in Chunk 1 With Imputed Missing Values\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Imputed-Missing-Values.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Imputed-Missing-Values-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Imputed-Missing-Values-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-1-With-Imputed-Missing-Values-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots for All Targets in Chunk 1 With Imputed Missing Values<\/p>\n<\/div>\n<p>We can try the same approach on the 4th chunk in the dataset (index 3) that has a lot more missing data.<\/p>\n<pre class=\"crayon-plain-tag\"># pick one chunk\r\nrows = train_chunks[3]<\/pre>\n<p>Running the example creates the same kind of figure, but here we can see the large missing segments filled in with imputed values.<\/p>\n<p>Again, the sequences seem reasonable, even showing daily seasonal cycle structure where appropriate.<\/p>\n<div id=\"attachment_6327\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6327\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Imputed-Missing-Values.png\" alt=\"Line Plots for All Targets in Chunk 4 With Imputed Missing Values\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Imputed-Missing-Values.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Imputed-Missing-Values-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Imputed-Missing-Values-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/Line-Plots-for-All-Targets-in-Chunk-4-With-Imputed-Missing-Values-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots for 
All Targets in Chunk 4 With Imputed Missing Values<\/p>\n<\/div>\n<p>This looks like a good start; you can explore other imputation strategies and see how they compare either in terms of line plots or on the resulting model skill.<\/p>\n<h3>Autocorrelation Plots<\/h3>\n<p>Now that we know how to fill in the missing values, we can take a look at autocorrelation plots for the series data.<\/p>\n<p>Autocorrelation plots summarize the relationship of each observation with observations at prior time steps. Together with partial autocorrelation plots, they can be used to determine the configuration for an ARMA model.<\/p>\n<p>The statsmodels library provides the <a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.graphics.tsaplots.plot_acf.html\">plot_acf()<\/a> and <a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.graphics.tsaplots.plot_pacf.html\">plot_pacf()<\/a> functions that can be used to plot ACF and PACF plots respectively.<\/p>\n<p>We can update the <em>plot_variables()<\/em> to create these plots, one of each type for each of the 39 series. That is a lot of plots.<\/p>\n<p>We will stack all ACF plots on the left vertically and all PACF plots on the right vertically. That is two columns of 39 plots. 
We will limit the lags considered by the plot to 24 time steps (hours) and ignore the correlation of each variable with itself as it is redundant.<\/p>\n<p>The updated <em>plot_variables()<\/em> function for plotting ACF and PACF plots is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># plot acf and pacf plots for each imputed variable series\r\ndef plot_variables(chunk_train, hours, n_vars=39):\r\n\tpyplot.figure()\r\n\tn_plots = n_vars * 2\r\n\tj = 0\r\n\tlags = 24\r\n\tfor i in range(1, n_plots, 2):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + j\r\n\t\tj += 1\r\n\t\t# get series\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\timputed = impute_missing(chunk_train, hours, series, col_ix)\r\n\t\t# acf\r\n\t\taxis = pyplot.subplot(n_vars, 2, i)\r\n\t\tplot_acf(imputed, ax=axis, lags=lags, zero=False)\r\n\t\taxis.set_title('')\r\n\t\taxis.set_xticklabels([])\r\n\t\taxis.set_yticklabels([])\r\n\t\t# pacf\r\n\t\taxis = pyplot.subplot(n_vars, 2, i+1)\r\n\t\tplot_pacf(imputed, ax=axis, lags=lags, zero=False)\r\n\t\taxis.set_title('')\r\n\t\taxis.set_xticklabels([])\r\n\t\taxis.set_yticklabels([])\r\n\t# show plot\r\n\tpyplot.show()<\/pre>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># acf and pacf plots\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import nanmedian\r\nfrom matplotlib import pyplot\r\nfrom statsmodels.graphics.tsaplots import plot_acf\r\nfrom statsmodels.graphics.tsaplots import plot_pacf\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# 
impute missing data\r\ndef impute_missing(rows, hours, series, col_ix):\r\n\t# count missing observations\r\n\tn_missing = count_nonzero(isnan(series))\r\n\t# calculate ratio of missing\r\n\tratio = n_missing \/ float(len(series)) * 100\r\n\t# check for no data\r\n\tif ratio == 100.0:\r\n\t\treturn series\r\n\t# impute missing using the median value for hour in the series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# get all rows with the same hour\r\n\t\t\tmatches = rows[rows[:,2]==hours[i]]\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(matches[:, col_ix])\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# plot acf and pacf plots for each imputed variable series\r\ndef plot_variables(chunk_train, hours, n_vars=39):\r\n\tpyplot.figure()\r\n\tn_plots = n_vars * 2\r\n\tj = 0\r\n\tlags = 
24\r\n\tfor i in range(1, n_plots, 2):\r\n\t\t# convert target number into column number\r\n\t\tcol_ix = 3 + j\r\n\t\tj += 1\r\n\t\t# get series\r\n\t\tseries = variable_to_series(chunk_train, col_ix)\r\n\t\timputed = impute_missing(chunk_train, hours, series, col_ix)\r\n\t\t# acf\r\n\t\taxis = pyplot.subplot(n_vars, 2, i)\r\n\t\tplot_acf(imputed, ax=axis, lags=lags, zero=False)\r\n\t\taxis.set_title('')\r\n\t\taxis.set_xticklabels([])\r\n\t\taxis.set_yticklabels([])\r\n\t\t# pacf\r\n\t\taxis = pyplot.subplot(n_vars, 2, i+1)\r\n\t\tplot_pacf(imputed, ax=axis, lags=lags, zero=False)\r\n\t\taxis.set_title('')\r\n\t\taxis.set_xticklabels([])\r\n\t\taxis.set_yticklabels([])\r\n\t# show plot\r\n\tpyplot.show()\r\n\r\n# load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\n# pick one chunk\r\nrows = train_chunks[0]\r\n# prepare sequence of hours for the chunk\r\nhours = variable_to_series(rows, 2)\r\n# interpolate hours\r\ninterpolate_hours(hours)\r\n# plot variables\r\nplot_variables(rows, hours)<\/pre>\n<p>Running the example creates a figure with a lot of plots for the target variables in the first chunk of the training dataset.<\/p>\n<p>You may need to increase the size of the plot window to better see the details of each plot.<\/p>\n<p>We can see on the left that most ACF plots show significant correlations (dots above the significance region) at lags 1-2 steps, maybe lags 1-3 steps in some cases, with a slow, steady decrease over the lags.<\/p>\n<p>Similarly, on the right, we can see significant lags in the PACF plots at 1-2 time steps with a steep fall-off.<\/p>\n<p>This strongly suggests an autoregressive process with an order of perhaps 1, 2, or 3, e.g. AR(3).<\/p>\n<p>In the ACF plots on the left we can also see a daily cycle in the correlations. 
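<\/p>\n<p>As a quick numeric check of this daily cycle (a sketch of my own, not part of the original code; the <em>lag_autocorr()<\/em> helper and the synthetic series are assumptions for illustration), we can correlate an hourly series with a copy of itself shifted by 24 steps:<\/p>

```python
# Estimate the autocorrelation of a series at a given lag by
# correlating the series with itself shifted by `lag` steps.
from numpy import arange, array, corrcoef, pi, sin

def lag_autocorr(series, lag=24):
	x = array(series, dtype=float)
	return corrcoef(x[:-lag], x[lag:])[0, 1]

# synthetic 5-day hourly series with a pure 24-hour cycle
demo = sin(2.0 * pi * arange(5 * 24) / 24.0)
print(lag_autocorr(demo, lag=24))  # close to 1.0
```

<p>A lag-24 autocorrelation well above zero on an imputed target series supports the daily cycle seen in the plots. 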
This may suggest some benefit in a seasonal differencing of the data prior to modeling or the use of an AR model capable of seasonal differencing.<\/p>\n<div id=\"attachment_6328\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6328\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/ACF-and-PACF-Plots-for-Target-Variables-in-Chunk-1.png\" alt=\"ACF and PACF Plots for Target Variables in Chunk 1\" width=\"1280\" height=\"2018\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/ACF-and-PACF-Plots-for-Target-Variables-in-Chunk-1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/ACF-and-PACF-Plots-for-Target-Variables-in-Chunk-1-190x300.png 190w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/ACF-and-PACF-Plots-for-Target-Variables-in-Chunk-1-768x1211.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/ACF-and-PACF-Plots-for-Target-Variables-in-Chunk-1-650x1024.png 650w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">ACF and PACF Plots for Target Variables in Chunk 1<\/p>\n<\/div>\n<p>We can repeat this analysis of the target variables for other chunks and we see much the same picture.<\/p>\n<p>It suggests we may be able to get away with a general AR model configuration for all series across all chunks.<\/p>\n<h2>Develop an Autoregressive Model<\/h2>\n<p>In this section, we will develop an autoregressive model for the imputed target series data.<\/p>\n<p>The first step is to implement a general function for making a forecast for each chunk.<\/p>\n<p>The function takes the training dataset and the input columns (chunk id, position in chunk, and hour) for the test set and returns forecasts for all chunks with the expected 3D format of 
<em>[chunk][variable][time]<\/em>.<\/p>\n<p>The function enumerates the chunks in the forecast, then enumerates the 39 target columns, calling another new function named <em>forecast_variable()<\/em> in order to make a prediction for each lead time for a given target variable.<\/p>\n<p>The complete function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># forecast for each chunk, returns [chunk][variable][time]\r\ndef forecast_chunks(train_chunks, test_input):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor i in range(len(train_chunks)):\r\n\t\t# prepare sequence of hours for the chunk\r\n\t\thours = variable_to_series(train_chunks[i], 2)\r\n\t\t# interpolate hours\r\n\t\tinterpolate_hours(hours)\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(39):\r\n\t\t\tyhat = forecast_variable(hours, train_chunks[i], test_input[i], lead_times, j)\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)<\/pre>\n<p>We can now implement a version of the <em>forecast_variable()<\/em>.<\/p>\n<p>For each variable, we first check if there is no data (e.g. 
all NaNs) and if so, we return a forecast that is a NaN for each forecast lead time.<\/p>\n<p>We then create a series for the variable using the <em>variable_to_series()<\/em> function and impute the missing values using the within-series hourly median by calling <em>impute_missing()<\/em>, both of which were developed in the previous section.<\/p>\n<p>Finally, we call a new function named <em>fit_and_forecast()<\/em> that fits a model and predicts the 10 forecast lead times.<\/p>\n<pre class=\"crayon-plain-tag\"># forecast all lead times for one variable\r\ndef forecast_variable(hours, chunk_train, chunk_test, lead_times, target_ix):\r\n\t# convert target number into column number\r\n\tcol_ix = 3 + target_ix\r\n\t# check for no data\r\n\tif not has_data(chunk_train[:, col_ix]):\r\n\t\tforecast = [nan for _ in range(len(lead_times))]\r\n\t\treturn forecast\r\n\t# get series\r\n\tseries = variable_to_series(chunk_train, col_ix)\r\n\t# impute\r\n\timputed = impute_missing(chunk_train, hours, series, col_ix)\r\n\t# fit AR model and forecast\r\n\tforecast = fit_and_forecast(imputed)\r\n\treturn forecast<\/pre>\n<p>We will fit an AR model to a given imputed series. To do this, we will use the <a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.tsa.arima_model.ARIMA.html\">statsmodels ARIMA class<\/a>. We will use ARIMA instead of AR to offer some flexibility if you would like to explore any of the family of ARIMA models.<\/p>\n<p>First, we must define the model, including the order of the autoregressive process, such as AR(1).<\/p>\n<pre class=\"crayon-plain-tag\"># define the model\r\nmodel = ARIMA(series, order=(1,0,0))<\/pre>\n<p>Next, the model is fit on the imputed series. 
We turn off the verbose information during the fit by setting <em>disp<\/em> to <em>False<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\"># fit the model\r\nmodel_fit = model.fit(disp=False)<\/pre>\n<p>The fit model is then used to forecast the next 72 hours beyond the end of the series.<\/p>\n<pre class=\"crayon-plain-tag\"># forecast 72 hours\r\nyhat = model_fit.predict(len(series), len(series)+72)<\/pre>\n<p>We are only interested in specific lead times, so we prepare an array of those lead times, subtract 1 to turn them into array indices, then use them to select the values at the 10 forecast lead times in which we are interested.<\/p>\n<pre class=\"crayon-plain-tag\"># extract lead times\r\nlead_times = array(get_lead_times())\r\nindices = lead_times - 1\r\nreturn yhat[indices]<\/pre>\n<p>The statsmodels ARIMA models use linear algebra libraries to fit the model under the covers, and sometimes the fit process can be unstable on some data. As such, it can throw an exception or report a lot of warnings.<\/p>\n<p>We will trap exceptions and return a <em>NaN<\/em> forecast, and ignore all warnings during the fit and evaluation.<\/p>\n<p>The <em>fit_and_forecast()<\/em> function below ties all of this together.<\/p>\n<pre class=\"crayon-plain-tag\"># fit AR model and generate a forecast\r\ndef fit_and_forecast(series):\r\n\t# define the model\r\n\tmodel = ARIMA(series, order=(1,0,0))\r\n\t# return a nan forecast in case of exception\r\n\ttry:\r\n\t\t# ignore statsmodels warnings\r\n\t\twith catch_warnings():\r\n\t\t\tfilterwarnings(\"ignore\")\r\n\t\t\t# fit the model\r\n\t\t\tmodel_fit = model.fit(disp=False)\r\n\t\t\t# forecast 72 hours\r\n\t\t\tyhat = model_fit.predict(len(series), len(series)+72)\r\n\t\t\t# extract lead times\r\n\t\t\tlead_times = array(get_lead_times())\r\n\t\t\tindices = lead_times - 1\r\n\t\t\treturn yhat[indices]\r\n\texcept:\r\n\t\treturn [nan for _ in range(len(get_lead_times()))]<\/pre>\n<p>We are now ready to evaluate an autoregressive 
process for each of the 39 series in each of the 207 training chunks.<\/p>\n<p>We will start off by testing an AR(1) process.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># autoregression forecast\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import array\r\nfrom numpy import nanmedian\r\nfrom statsmodels.tsa.arima_model import ARIMA\r\nfrom matplotlib import pyplot\r\nfrom warnings import catch_warnings\r\nfrom warnings import filterwarnings\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# return true if the array has any non-nan values\r\ndef has_data(data):\r\n\treturn count_nonzero(isnan(data)) < len(data)\r\n\r\n# impute missing data\r\ndef impute_missing(rows, hours, series, col_ix):\r\n\t# impute missing using the median 
value for hour in the series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# get all rows with the same hour\r\n\t\t\tmatches = rows[rows[:,2]==hours[i]]\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(matches[:, col_ix])\r\n\t\t\tif isnan(value):\r\n\t\t\t\tvalue = 0.0\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed\r\n\r\n# layout a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# fit AR model and generate a forecast\r\ndef fit_and_forecast(series):\r\n\t# define the model\r\n\tmodel = ARIMA(series, order=(1,0,0))\r\n\t# return a nan forecast in case of exception\r\n\ttry:\r\n\t\t# ignore statsmodels warnings\r\n\t\twith catch_warnings():\r\n\t\t\tfilterwarnings(\"ignore\")\r\n\t\t\t# fit the model\r\n\t\t\tmodel_fit = model.fit(disp=False)\r\n\t\t\t# forecast 72 hours\r\n\t\t\tyhat = model_fit.predict(len(series), len(series)+72)\r\n\t\t\t# extract lead times\r\n\t\t\tlead_times = array(get_lead_times())\r\n\t\t\tindices = lead_times - 1\r\n\t\t\treturn yhat[indices]\r\n\texcept:\r\n\t\treturn [nan for _ in range(len(get_lead_times()))]\r\n\r\n# forecast all lead times for one variable\r\ndef forecast_variable(hours, chunk_train, chunk_test, lead_times, target_ix):\r\n\t# convert target number into column number\r\n\tcol_ix = 3 + target_ix\r\n\t# check for no data\r\n\tif not has_data(chunk_train[:, col_ix]):\r\n\t\tforecast = [nan for _ in range(len(lead_times))]\r\n\t\treturn forecast\r\n\t# get series\r\n\tseries = variable_to_series(chunk_train, col_ix)\r\n\t# impute\r\n\timputed 
= impute_missing(chunk_train, hours, series, col_ix)\r\n\t# fit AR model and forecast\r\n\tforecast = fit_and_forecast(imputed)\r\n\treturn forecast\r\n\r\n# forecast for each chunk, returns [chunk][variable][time]\r\ndef forecast_chunks(train_chunks, test_input):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor i in range(len(train_chunks)):\r\n\t\t# prepare sequence of hours for the chunk\r\n\t\thours = variable_to_series(train_chunks[i], 2)\r\n\t\t# interpolate hours\r\n\t\tinterpolate_hours(hours)\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(39):\r\n\t\t\tyhat = forecast_variable(hours, train_chunks[i], test_input[i], lead_times, j)\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)\r\n\r\n# evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in 
range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# get actual and predicted values for the chunk\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror = calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae\r\n\r\n# summarize scores\r\ndef summarize_error(name, total_mae, times_mae):\r\n\t# print summary\r\n\tlead_times = get_lead_times()\r\n\tformatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]\r\n\ts_scores = ', '.join(formatted)\r\n\tprint('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))\r\n\t# plot summary\r\n\tpyplot.plot([str(x) for x in lead_times], times_mae, marker='.')\r\n\tpyplot.show()\r\n\r\n# load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\ntest = loadtxt('AirQualityPrediction\/naive_test.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\ntest_chunks = to_chunks(test)\r\n# forecast\r\ntest_input = [rows[:, :3] for rows in test_chunks]\r\nforecast = forecast_chunks(train_chunks, test_input)\r\n# evaluate forecast\r\nactual = prepare_test_forecasts(test_chunks)\r\ntotal_mae, times_mae = evaluate_forecasts(forecast, actual)\r\n# summarize forecast\r\nsummarize_error('AR', total_mae, times_mae)<\/pre>\n<p>Running the example first reports the overall MAE for the test set, followed by the MAE for each forecast lead 
time.<\/p>\n<p>We can see that the model achieves a MAE of about 0.492, which is less than the MAE of 0.520 achieved by a naive persistence model. This shows that the approach does indeed have some skill.<\/p>\n<pre class=\"crayon-plain-tag\">AR: [0.492 MAE] +1 0.225, +2 0.342, +3 0.410, +4 0.475, +5 0.512, +10 0.593, +17 0.586, +24 0.588, +48 0.588, +72 0.604<\/pre>\n<p>A line plot of MAE per forecast lead time is created, showing the general increase in forecast error with forecast lead time.<\/p>\n<div id=\"attachment_6329\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6329\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR1.png\" alt=\"MAE vs Forecast Lead Time for AR(1)\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR1-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR1-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR1-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\">\n<p class=\"wp-caption-text\">MAE vs Forecast Lead Time for AR(1)<\/p>\n<\/div>\n<p>We can change the code to test other AR models. 
Specifically, we can vary the order of the ARIMA model used in the <em>fit_and_forecast()<\/em> function.<\/p>\n<p>An AR(2) model can be defined as:<\/p>\n<pre class=\"crayon-plain-tag\">model = ARIMA(series, order=(2,0,0))<\/pre>\n<p>Running the code with this update shows a further drop in error to an overall MAE of about 0.490.<\/p>\n<pre class=\"crayon-plain-tag\">AR: [0.490 MAE] +1 0.229, +2 0.342, +3 0.412, +4 0.470, +5 0.503, +10 0.563, +17 0.576, +24 0.605, +48 0.597, +72 0.608<\/pre>\n<p>We can also try an AR(3):<\/p>\n<pre class=\"crayon-plain-tag\">model = ARIMA(series, order=(3,0,0))<\/pre>\n<p>Re-running the example with this update shows an increase in the overall MAE compared to the AR(2).<\/p>\n<p>An AR(2) might be a good global-level configuration to use, although it is expected that models tailored to each variable or each series may perform better overall.<\/p>\n<pre class=\"crayon-plain-tag\">AR: [0.491 MAE] +1 0.232, +2 0.345, +3 0.412, +4 0.472, +5 0.504, +10 0.556, +17 0.575, +24 0.607, +48 0.599, +72 0.611<\/pre>\n<h2>Autoregressive Model with Global Impute Strategy<\/h2>\n<p>We can evaluate the AR(2) model with an alternate imputation strategy.<\/p>\n<p>Instead of calculating the median value for a given hour from the series within a single chunk, we can calculate the median for that hour from the same variable across all chunks.<\/p>\n<p>We can update the <em>impute_missing()<\/em> function to take all training chunks as an argument, then collect rows from all chunks for a given hour in order to calculate the median value used for imputation. 
The updated version of the function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># impute missing data\r\ndef impute_missing(train_chunks, rows, hours, series, col_ix):\r\n\t# impute missing using the median value for hour in all series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# collect all rows across all chunks for the hour\r\n\t\t\tall_rows = list()\r\n\t\t\tfor chunk in train_chunks:\r\n\t\t\t\tfor row in chunk[chunk[:,2]==hours[i]]:\r\n\t\t\t\t\tall_rows.append(row)\r\n\t\t\t# calculate the central tendency for target\r\n\t\t\tall_rows = array(all_rows)\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(all_rows[:, col_ix])\r\n\t\t\tif isnan(value):\r\n\t\t\t\tvalue = 0.0\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed<\/pre>\n<p>In order to pass <em>train_chunks<\/em> to the <em>impute_missing()<\/em> function, we must update the <em>forecast_variable()<\/em> function to also take <em>train_chunks<\/em> as an argument and pass it along, and in turn update the <em>forecast_chunks()<\/em> function to pass <em>train_chunks<\/em>.<\/p>\n<p>The complete example using a global imputation strategy is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># autoregression forecast with global impute strategy\r\nfrom numpy import loadtxt\r\nfrom numpy import nan\r\nfrom numpy import isnan\r\nfrom numpy import count_nonzero\r\nfrom numpy import unique\r\nfrom numpy import array\r\nfrom numpy import nanmedian\r\nfrom statsmodels.tsa.arima_model import ARIMA\r\nfrom matplotlib import pyplot\r\nfrom warnings import catch_warnings\r\nfrom warnings import filterwarnings\r\n\r\n# split the dataset by 'chunkID', return a list of chunks\r\ndef to_chunks(values, chunk_ix=0):\r\n\tchunks = list()\r\n\t# get the unique chunk ids\r\n\tchunk_ids = unique(values[:, chunk_ix])\r\n\t# group rows by chunk id\r\n\tfor chunk_id in chunk_ids:\r\n\t\tselection = values[:, chunk_ix] == 
chunk_id\r\n\t\tchunks.append(values[selection, :])\r\n\treturn chunks\r\n\r\n# return a list of relative forecast lead times\r\ndef get_lead_times():\r\n\treturn [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]\r\n\r\n# interpolate series of hours (in place) in 24 hour time\r\ndef interpolate_hours(hours):\r\n\t# find the first hour\r\n\tix = -1\r\n\tfor i in range(len(hours)):\r\n\t\tif not isnan(hours[i]):\r\n\t\t\tix = i\r\n\t\t\tbreak\r\n\t# fill-forward\r\n\thour = hours[ix]\r\n\tfor i in range(ix+1, len(hours)):\r\n\t\t# increment hour\r\n\t\thour += 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\t# fill-backward\r\n\thour = hours[ix]\r\n\tfor i in range(ix-1, -1, -1):\r\n\t\t# decrement hour\r\n\t\thour -= 1\r\n\t\t# check for a fill\r\n\t\tif isnan(hours[i]):\r\n\t\t\thours[i] = hour % 24\r\n\r\n# return true if the array has any non-nan values\r\ndef has_data(data):\r\n\treturn count_nonzero(isnan(data)) < len(data)\r\n\r\n# impute missing data\r\ndef impute_missing(train_chunks, rows, hours, series, col_ix):\r\n\t# impute missing using the median value for hour in all series\r\n\timputed = list()\r\n\tfor i in range(len(series)):\r\n\t\tif isnan(series[i]):\r\n\t\t\t# collect all rows across all chunks for the hour\r\n\t\t\tall_rows = list()\r\n\t\t\tfor chunk in train_chunks:\r\n\t\t\t\tfor row in chunk[chunk[:,2]==hours[i]]:\r\n\t\t\t\t\tall_rows.append(row)\r\n\t\t\t# calculate the central tendency for target\r\n\t\t\tall_rows = array(all_rows)\r\n\t\t\t# fill with median value\r\n\t\t\tvalue = nanmedian(all_rows[:, col_ix])\r\n\t\t\tif isnan(value):\r\n\t\t\t\tvalue = 0.0\r\n\t\t\timputed.append(value)\r\n\t\telse:\r\n\t\t\timputed.append(series[i])\r\n\treturn imputed\r\n\r\n# lay out a variable with breaks in the data for missing positions\r\ndef variable_to_series(chunk_train, col_ix, n_steps=5*24):\r\n\t# lay out whole series\r\n\tdata = [nan for _ in range(n_steps)]\r\n\t# mark all available data\r\n\tfor i in 
range(len(chunk_train)):\r\n\t\t# get position in chunk\r\n\t\tposition = int(chunk_train[i, 1] - 1)\r\n\t\t# store data\r\n\t\tdata[position] = chunk_train[i, col_ix]\r\n\treturn data\r\n\r\n# fit AR model and generate a forecast\r\ndef fit_and_forecast(series):\r\n\t# define the model\r\n\tmodel = ARIMA(series, order=(2,0,0))\r\n\t# return a nan forecast in case of exception\r\n\ttry:\r\n\t\t# ignore statsmodels warnings\r\n\t\twith catch_warnings():\r\n\t\t\tfilterwarnings(\"ignore\")\r\n\t\t\t# fit the model\r\n\t\t\tmodel_fit = model.fit(disp=False)\r\n\t\t\t# forecast 72 hours\r\n\t\t\tyhat = model_fit.predict(len(series), len(series)+72)\r\n\t\t\t# extract lead times\r\n\t\t\tlead_times = array(get_lead_times())\r\n\t\t\tindices = lead_times - 1\r\n\t\t\treturn yhat[indices]\r\n\texcept Exception:\r\n\t\treturn [nan for _ in range(len(get_lead_times()))]\r\n\r\n# forecast all lead times for one variable\r\ndef forecast_variable(hours, train_chunks, chunk_train, chunk_test, lead_times, target_ix):\r\n\t# convert target number into column number\r\n\tcol_ix = 3 + target_ix\r\n\t# check for no data\r\n\tif not has_data(chunk_train[:, col_ix]):\r\n\t\tforecast = [nan for _ in range(len(lead_times))]\r\n\t\treturn forecast\r\n\t# get series\r\n\tseries = variable_to_series(chunk_train, col_ix)\r\n\t# impute\r\n\timputed = impute_missing(train_chunks, chunk_train, hours, series, col_ix)\r\n\t# fit AR model and forecast\r\n\tforecast = fit_and_forecast(imputed)\r\n\treturn forecast\r\n\r\n# forecast for each chunk, returns [chunk][variable][time]\r\ndef forecast_chunks(train_chunks, test_input):\r\n\tlead_times = get_lead_times()\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor i in range(len(train_chunks)):\r\n\t\t# prepare sequence of hours for the chunk\r\n\t\thours = variable_to_series(train_chunks[i], 2)\r\n\t\t# interpolate hours\r\n\t\tinterpolate_hours(hours)\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in 
range(39):\r\n\t\t\tyhat = forecast_variable(hours, train_chunks, train_chunks[i], test_input[i], lead_times, j)\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# convert the test dataset in chunks to [chunk][variable][time] format\r\ndef prepare_test_forecasts(test_chunks):\r\n\tpredictions = list()\r\n\t# enumerate chunks to forecast\r\n\tfor rows in test_chunks:\r\n\t\t# enumerate targets for chunk\r\n\t\tchunk_predictions = list()\r\n\t\tfor j in range(3, rows.shape[1]):\r\n\t\t\tyhat = rows[:, j]\r\n\t\t\tchunk_predictions.append(yhat)\r\n\t\tchunk_predictions = array(chunk_predictions)\r\n\t\tpredictions.append(chunk_predictions)\r\n\treturn array(predictions)\r\n\r\n# calculate the error between an actual and predicted value\r\ndef calculate_error(actual, predicted):\r\n\t# give the full actual value if predicted is nan\r\n\tif isnan(predicted):\r\n\t\treturn abs(actual)\r\n\t# calculate abs difference\r\n\treturn abs(actual - predicted)\r\n\r\n# evaluate a forecast in the format [chunk][variable][time]\r\ndef evaluate_forecasts(predictions, testset):\r\n\tlead_times = get_lead_times()\r\n\ttotal_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]\r\n\ttotal_c, times_c = 0, [0 for _ in range(len(lead_times))]\r\n\t# enumerate test chunks\r\n\tfor i in range(len(testset)):\r\n\t\t# get actual and predicted values for the chunk\r\n\t\tactual = testset[i]\r\n\t\tpredicted = predictions[i]\r\n\t\t# enumerate target variables\r\n\t\tfor j in range(predicted.shape[0]):\r\n\t\t\t# enumerate lead times\r\n\t\t\tfor k in range(len(lead_times)):\r\n\t\t\t\t# skip if actual is nan\r\n\t\t\t\tif isnan(actual[j, k]):\r\n\t\t\t\t\tcontinue\r\n\t\t\t\t# calculate error\r\n\t\t\t\terror = calculate_error(actual[j, k], predicted[j, k])\r\n\t\t\t\t# update statistics\r\n\t\t\t\ttotal_mae += error\r\n\t\t\t\ttimes_mae[k] += error\r\n\t\t\t\ttotal_c += 
1\r\n\t\t\t\ttimes_c[k] += 1\r\n\t# normalize summed absolute errors\r\n\ttotal_mae \/= total_c\r\n\ttimes_mae = [times_mae[i]\/times_c[i] for i in range(len(times_mae))]\r\n\treturn total_mae, times_mae\r\n\r\n# summarize scores\r\ndef summarize_error(name, total_mae, times_mae):\r\n\t# print summary\r\n\tlead_times = get_lead_times()\r\n\tformatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]\r\n\ts_scores = ', '.join(formatted)\r\n\tprint('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))\r\n\t# plot summary\r\n\tpyplot.plot([str(x) for x in lead_times], times_mae, marker='.')\r\n\tpyplot.show()\r\n\r\n# load dataset\r\ntrain = loadtxt('AirQualityPrediction\/naive_train.csv', delimiter=',')\r\ntest = loadtxt('AirQualityPrediction\/naive_test.csv', delimiter=',')\r\n# group data by chunks\r\ntrain_chunks = to_chunks(train)\r\ntest_chunks = to_chunks(test)\r\n# forecast\r\ntest_input = [rows[:, :3] for rows in test_chunks]\r\nforecast = forecast_chunks(train_chunks, test_input)\r\n# evaluate forecast\r\nactual = prepare_test_forecasts(test_chunks)\r\ntotal_mae, times_mae = evaluate_forecasts(forecast, actual)\r\n# summarize forecast\r\nsummarize_error('AR', total_mae, times_mae)<\/pre>\n<p>Running the example shows a further drop in the overall MAE to about 0.487.<\/p>\n<p>It may be interesting to explore imputation strategies that vary the fill method based on how much data is missing from a series or on the size of the gap being filled.<\/p>\n<pre class=\"crayon-plain-tag\">AR: [0.487 MAE] +1 0.228, +2 0.339, +3 0.409, +4 0.469, +5 0.499, +10 0.560, +17 0.573, +24 0.600, +48 0.595, +72 0.606<\/pre>\n<p>A line plot of MAE vs. 
forecast lead time is also created.<\/p>\n<div id=\"attachment_6330\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6330\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR2-Impute-With-Global-Strategy.png\" alt=\"MAE vs Forecast Lead Time for AR(2) Impute With Global Strategy\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR2-Impute-With-Global-Strategy.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR2-Impute-With-Global-Strategy-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR2-Impute-With-Global-Strategy-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/07\/MAE-vs-Forecast-Lead-Time-for-AR2-Impute-With-Global-Strategy-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\">\n<p class=\"wp-caption-text\">MAE vs Forecast Lead Time for AR(2) Impute With Global Strategy<\/p>\n<\/div>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li><strong>Imputation Strategies<\/strong>. Develop and evaluate one additional alternate imputation strategy for the missing data in each series.<\/li>\n<li><strong>Data Preparation<\/strong>. Explore whether data preparation techniques applied to each series, such as standardization, normalization, and power transforms, can improve model skill.<\/li>\n<li><strong>Differencing<\/strong>. 
Explore whether differencing, such as 1-step or 24-step (seasonal differencing), can make each series stationary, and in turn result in better forecasts.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Posts<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/standard-multivariate-multi-step-multi-site-time-series-forecasting-problem\/\">A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-autocorrelation-partial-autocorrelation\/\">A Gentle Introduction to Autocorrelation and Partial Autocorrelation<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/grid-search-arima-hyperparameters-with-python\/\">How to Grid Search ARIMA Model Hyperparameters with Python<\/a><\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.graphics.tsaplots.plot_acf.html\">statsmodels.graphics.tsaplots.plot_acf API<\/a><\/li>\n<li><a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.graphics.tsaplots.plot_pacf.html\">statsmodels.graphics.tsaplots.plot_pacf API<\/a><\/li>\n<li><a href=\"http:\/\/www.statsmodels.org\/dev\/generated\/statsmodels.tsa.arima_model.ARIMA.html\">statsmodels.tsa.arima_model.ARIMA API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/data\">EMC Data Science Global Hackathon (Air Quality Prediction)<\/a><\/li>\n<li><a href=\"http:\/\/blog.kaggle.com\/2012\/05\/01\/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon\/\">Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon<\/a><\/li>\n<li><a 
href=\"https:\/\/github.com\/benhamner\/Air-Quality-Prediction-Hackathon-Winning-Model\">Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/dsg-hackathon\/discussion\/1821\">General approaches to partitioning the models?<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to analyze and impute missing values for time series data.<\/li>\n<li>How to develop and evaluate an autoregressive model for multi-step time series forecasting.<\/li>\n<li>How to improve an autoregressive model using alternate data imputation methods.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting\/\">How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/10\/16\/how-to-develop-autoregressive-forecasting-models-for-multi-step-air-pollution-time-series-forecasting\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1171,"comment_status":"registered_only","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1170"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1170"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1170\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1171"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1170"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1170"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1170"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}