16  Time series: modeling and forecasting

Prediction is difficult, especially about the future.

— Niels Bohr (apocryphally)

16.1 Goals:

  • Use current tools for handling and visualizing time series
  • Calculate auto- and cross-correlations of time series
  • Decompose time series into components
  • Use linear regression methods for fitting and forecasting

16.2 Time series format and plotting

A time series is a special data set where each observation has an associated time measurement. There is a special R structure for storing and operating on time series, called ts, as illustrated here:
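
A minimal sketch, assuming the monthly birth counts are in a plain text file; the URL below points to a copy in Rob Hyndman's Time Series Data Library and is an assumption, not part of the original:

```r
# read a plain vector of monthly birth counts (assumed source)
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
# monthly data (frequency = 12) starting in January 1946
birthstimeseries <- ts(births, frequency = 12, start = c(1946, 1))
birthstimeseries
```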

This reads in a data set recording the number of births per month (in thousands) in New York City from 1946 to 1958. To create the time series, we had to give the function the frequency, that is, the number of time points in a year, and the starting value as a vector assigned to start = c(1946, 1), where the first element is the year and the second the month.

There are several other specialized data structures for handling time series. We will use one called a tsibble, from a set of R packages known as tidyverts (https://tsibble.tidyverts.org/). The tsibbledata package contains several time series data sets, and we will load two below: olympic_running, containing the Olympic winning times in different running events, and pelt, containing the classic records of the numbers of Lynx and Hare pelts sold in Canada in the 19th and early 20th centuries:
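
A minimal sketch of loading these data sets:

```r
library(tsibble)      # the tsibble data structure
library(tsibbledata)  # example time series data sets
olympic_running       # Olympic winning times, indexed by Year
pelt                  # annual Hare and Lynx pelt trading records
```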

You can see that tsibbles contain multiple variables, one of which is the time variable, called the index, while one or more others may be designated as keys: variables that identify the observations but do not change in time. For example, in the Olympic times data set, Sex and Length are the keys, because they are not measurements that we are plotting over time. In the pelt data set there is no key variable (though you could make a longer tsibble with species as the key variable if you wanted).

16.2.1 Visualizing the data

The most straightforward way of visualizing time series is using a time plot, which can be created using autoplot:
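
For instance, a sketch using the autoplot() method for ts objects supplied by the forecast package, applied to the births series created above:

```r
library(forecast)  # provides autoplot() methods for ts objects
library(ggplot2)
autoplot(birthstimeseries) +
  labs(x = "Year", y = "Births (thousands)")
```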

autoplot works best with time series (ts) objects, so to use it with a tsibble you can convert the tsibble to a ts first, like this:
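
A sketch of this conversion for the pelt data (the series starts in 1845):

```r
# build a multivariate ts from the pelt columns by hand
pelt_ts <- ts(cbind(Hare = pelt$Hare, Lynx = pelt$Lynx),
              start = 1845, frequency = 1)
autoplot(pelt_ts)
```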

One can also use ggplot directly, for example to plot a faceted graph of the Olympic winning times for the different events:
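
A sketch of such a plot, faceting on the event length and coloring by sex:

```r
olympic_running |>
  ggplot(aes(x = Year, y = Time, colour = Sex)) +
  geom_line() +
  geom_point(size = 1) +
  facet_wrap(~ Length, scales = "free_y") +
  labs(y = "Winning time (seconds)")
```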

16.3 Decomposition of time series

As you can see from the plots above, time series are often a combination of different time-dependent processes and it is often useful to think about them separately. One can distinguish three types of time variation in time series:

  • Seasonality A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency.

  • Trend A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as “changing direction”, when it might go from an increasing trend to a decreasing trend.

  • Cyclic A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. In economics, these fluctuations may be due to economic conditions and are often related to the “business cycle”.

16.3.1 Additive and multiplicative decomposition

There are two main types of decompositions of time series: additive and multiplicative. Let us call \(X_t\) the time series data, \(T_t\) the trend (non-periodic component), \(S_t\) the seasonal part (periodic component), and \(R_t\) the remainder.

\[ X_t = T_t + S_t + R_t \quad \text{(additive)} \] \[ X_t = T_t \times S_t \times R_t \quad \text{(multiplicative)} \]

One simple way of removing seasonality and estimating the trend is using the moving average, that is, using the \(k\) points before and the \(k\) points after each point to calculate the trend: \[ \hat T_t = \frac{1}{m} \sum_{i=-k}^k X_{t+i} \]

Here \(m\) is called the order of the moving average and is defined as \(m = 2k+1\). The function ma() from the forecast package calculates these averages and allows them to be plotted, as sketched below.
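
A minimal sketch, applied to the births series from above:

```r
library(forecast)
# moving average of order m = 5 (k = 2 points on each side)
births_ma5 <- ma(birthstimeseries, order = 5)
autoplot(birthstimeseries) +
  autolayer(births_ma5, series = "5-MA")
```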

Exercise: Change the moving average window and see if you can make seasonality (periodicity) vanish!

A moving average with an even order has an asymmetric averaging window, so to create a symmetric average for an even seasonal period, one can apply a moving average of order 2 to the already-averaged data, giving the so-called 2×m-MA.
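
For monthly data (m = 12) this can be done with the centre argument of ma(); a sketch:

```r
# 2x12-MA: order = 12 with centre = TRUE applies the extra 2-MA
births_2x12 <- ma(birthstimeseries, order = 12, centre = TRUE)
autoplot(birthstimeseries) +
  autolayer(births_2x12, series = "2x12-MA")
```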

16.3.2 Classical decomposition:

Additive decomposition [1]:

  1. If m, the seasonal period, is an even number, compute the trend-cycle component \(\hat T_t\) using a 2×m-MA. If m is an odd number, compute the trend-cycle component \(\hat T_t\) using an m-MA.
  2. Calculate the detrended series: \(X_t - \hat T_t\)
  3. To estimate the seasonal component for each season, average the detrended values for that season. For example, with monthly data, the seasonal component for March is the average of all the detrended March values in the data. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(\hat S_t\).
  4. The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat R_t = X_t - \hat T_t - \hat S_t\)
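
A minimal sketch of this procedure, using classical_decomposition() from the feasts package on the births series (converted to a tsibble first):

```r
library(feasts)
births_tsbl <- as_tsibble(birthstimeseries)  # columns: index, value
births_tsbl |>
  model(classical_decomposition(value, type = "additive")) |>
  components() |>
  autoplot()
```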

This simple classical decomposition has numerous flaws, so better, more modern methods are preferred. In particular, it assumes a seasonal component that repeats identically from year to year, it tends to over-estimate the variation in the trend, it provides no trend estimate for the first few and last few data points, and it can be sensitive to outliers.

16.3.3 STL decomposition

A more robust method is called the STL decomposition (Seasonal and Trend decomposition using Loess). To summarize its advantages [1]:

  • STL can handle any type of seasonality, not only monthly and quarterly data.
  • The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
  • The smoothness of the trend-cycle can also be controlled by the user.
  • It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
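
A sketch of an STL fit to the births tsibble created above; the window parameters here are illustrative choices, not prescribed values:

```r
births_tsbl |>
  model(STL(value ~ trend(window = 21) + season(window = "periodic"))) |>
  components() |>
  autoplot()
```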

Exercise: Apply the two decomposition methods to the pelt time series data.

16.4 Relationships within and between time series

16.4.1 Visualizing correlation between different variables

The following data set contains the number of visitors (visitor nights) on a quarterly basis for five regions of New South Wales, Australia:
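
A sketch, assuming the visnights data set from the fpp2 package, whose first five columns are the New South Wales regions:

```r
library(fpp2)
head(visnights[, 1:5])  # quarterly visitor nights, five NSW regions
```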

A natural question is whether different variables are related to each other. One simple approach is to calculate the Pearson correlation between different time series, called the cross-correlation (where \(\bar X\) stands for the mean of \(X\) and \(\text{Var}(X)\) stands for the variance of \(X\)):

\[ \text{Cor}(X,Y) = \frac{\sum_t (X_t - \bar X)(Y_t - \bar Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \]

In a data set with multiple variables it can be handy to examine the cross-correlations between all pairs of them. Here is a convenient way to both calculate and visualize them for multiple variables, in this case the hotel visitor nights for the five different regions of NSW. The plot below shows the scatter plots of all pairs of time series against each other, along with the respective correlation coefficients:
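
One such function is ggpairs() from the GGally package (an assumption; any pairwise scatter plot matrix would do):

```r
library(GGally)
ggpairs(as.data.frame(visnights[, 1:5]))
```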

However, the straightforward correlation may sometimes miss or underestimate a relationship that is obviously there but delayed in time. Here are the Lynx and Hare variables from the pelt data set correlated with each other:
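
A sketch of computing and plotting this correlation:

```r
with(pelt, cor(Hare, Lynx))  # plain Pearson correlation
ggplot(pelt, aes(x = Hare, y = Lynx)) +
  geom_point()
```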

The correlation coefficient is about 0.3, but the plots suggest that the relationship should be stronger than that. The function CCF from the feasts package calculates correlations for a pair of variables that are shifted (lagged) against each other; here is the resulting plot:
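
A sketch of the CCF calculation:

```r
library(feasts)
pelt |>
  CCF(Lynx, Hare, lag_max = 10) |>
  autoplot()
```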

You can see the unshifted correlation of about 0.3 at lag 0, but the correlation is about twice as high when the series are shifted by 1 or 2 years, as suggested by the time plots of the series.

16.4.2 Autocorrelation

Just as we saw for cross-correlations, a time series can be correlated against itself shifted in time by some set amount, also called lagged. We can plot a series against its lagged values, here for one of the visitor nights series:
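
A sketch using gg_lag() from feasts on one of the NSW series (converted to a tsibble first):

```r
nsw_metro <- as_tsibble(visnights[, "NSWMetro"])
nsw_metro |>
  gg_lag(value, lags = 1:9, geom = "path")
```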

Here the colors indicate the quarter of the variable on the vertical axis, compared with the shifted (lagged) variable on the horizontal axis, and the lines connect points in chronological order. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data.

This suggests that there is a strong similarity between the time series and itself, shifted by certain time values. This is described by the autocorrelation, which is defined as a function of the lag \(k\):

\[ r(k) = \frac{\sum_{t=k+1}^T (X_t - \bar X)(X_{t-k} - \bar X)}{\text{Var}(X)} \]

The autocorrelation can be calculated and plotted for our example of the visitation nights in New South Wales:
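
A sketch using ACF() from feasts on the series prepared above:

```r
nsw_metro |>
  ACF(value, lag_max = 20) |>
  autoplot()
```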

In both cases, the lag plots are closest to the diagonal identity line for lags of 9 or 10. This should be reflected in the autocorrelation plot:

Notice the periodicity in the autocorrelation, which indicates periodicity in the time series.

Autocorrelation measures the memory of a signal: for example, pure white noise is uncorrelated with itself even a moment later, and thus has no memory. This makes autocorrelation very useful as a measure of a trend in the data: a slowly decaying, positive autocorrelation indicates a pronounced trend, while periodicity in the autocorrelation indicates seasonality in the data.

Exercise: Use the lag and autocorrelation analysis to describe the patterns in the time series of births in NYC.

16.4.3 Perennial warning: correlation is not causation!

These two data sets, on Australian air passengers and rice production in Guinea, have a very strong positive correlation:
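
A sketch of this regression, assuming the ausair and guinearice series from the fpp2 package:

```r
library(fpp2)
aussies <- window(ausair, end = 2011)  # align the two time spans
fit <- tslm(aussies ~ guinearice)
summary(fit)         # large R^2 despite no causal link
checkresiduals(fit)  # residuals show a strong trend
```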

However, notice that the residuals indicate a strong trend, which violates the assumptions of linear regression.

There are a number of fun examples of spurious time series correlations in reference [5].

16.5 Modeling and forecasting

16.5.1 Forecasting using smoothing methods

The package fable is part of the tidyverts bundle and can be used to easily produce forecasts using several standard methods, such as SNAIVE (seasonal naive), ETS (exponential smoothing), and ARIMA (autoregressive integrated moving average). The code below compares all three forecast models for the births in NYC data:
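
A minimal sketch, assuming the births tsibble (births_tsbl) created earlier:

```r
library(fable)
births_tsbl |>
  model(
    snaive = SNAIVE(value),
    ets    = ETS(value),
    arima  = ARIMA(value)
  ) |>
  forecast(h = "2 years") |>
  autoplot(births_tsbl)
```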

16.6 References and further reading:

  1. Rob J. Hyndman and George Athanasopoulos. Forecasting: Principles and Practice.
  2. Jonathan Cryer and Kung-Sik Chan. Time Series Analysis with Applications in R.
  3. Cross-validation in forecasting
  4. Time series nested cross-validation
  5. Spurious correlations