
Front Psychol

## Time series analysis for psychological research: examining and forecasting change

## Andrew T. Jebb

1 Department of Psychological Sciences, Purdue University, West Lafayette, IN, USA

2 Department of Psychology, University of Central Florida, Orlando, FL, USA

## Qiming Huang

3 Department of Statistics, Purdue University, West Lafayette, IN, USA

## Abstract

Psychological research has increasingly recognized the importance of integrating temporal dynamics into its theories, and innovations in longitudinal designs and analyses have allowed such theories to be formalized and tested. However, psychological researchers may be relatively unequipped to analyze such data, given its many characteristics and the general complexities involved in longitudinal modeling. The current paper introduces time series analysis to psychological research, an analytic domain that has been essential for understanding and predicting the behavior of variables across many diverse fields. First, the characteristics of time series data are discussed. Second, different time series modeling techniques are surveyed that can address various topics of interest to psychological researchers, including describing the pattern of change in a variable, modeling seasonal effects, assessing the immediate and long-term impact of a salient event, and forecasting future values. To illustrate these methods, a running example based on online job search behavior is used throughout the paper, and a software tutorial in R for these analyses is provided in the Supplementary Materials.

Although time series analysis has been frequently used in many disciplines, it has not been well-integrated within psychological research. In part, constraints in data collection have often limited longitudinal research to only a few time points. However, these practical limitations do not eliminate the theoretical need for understanding patterns of change over long periods of time or over many occasions. Psychological processes are inherently time-bound, and it can be argued that no theory is truly time-independent (Zaheer et al., 1999). Further, the prolific use of time series analysis in economics, engineering, and the natural sciences may be an indicator of its potential in our field, and recent technological growth has already initiated shifts in data collection that proliferate time series designs. For instance, online behaviors can now be quantified and tracked in real time, leading to an accessible and rich source of time series data (see Stanton and Rogelberg, 2001). As a leading example, Ginsberg et al. (2009) developed methods of influenza tracking based on Google queries whose efficiency surpassed conventional systems, such as those provided by the Centers for Disease Control and Prevention. Importantly, this work was based on prior research showing how search engine queries correlated with virological and mortality data over multiple years (Polgreen et al., 2008).

Furthermore, although experience sampling methods have been used for decades (Larson and Csikszentmihalyi, 1983), nascent technologies such as smartphones have made this technique increasingly feasible and less intrusive to respondents, resulting in a proliferation of time series data. As an example, Killingsworth and Gilbert (2010) presented an iPhone (Apple Incorporated, Cupertino, California) application that tracks various behaviors, cognitions, and affect over time. At the time their study was published, their database contained almost a quarter of a million psychological measurements from individuals in 83 countries. Finally, due to the growing synthesis between psychology and neuroscience (e.g., affective neuroscience, social-cognitive neuroscience), the ability to analyze neuroimaging data, which is strongly linked to time series methods (e.g., Friston et al., 1995, 2000), is a powerful methodological asset. Due to these overarching trends, we expect that time series data will become increasingly prevalent and spur the development of more time-sensitive psychological theory. Mindful of the growing need to contribute to the methodological toolkit of psychological researchers, the present article introduces the use of time series analysis in order to describe and understand the dynamics of psychological change over time.

Against the backdrop of these trends, we conducted a survey of the existing psychological literature in order to quantify the extent to which time series methods have already been used in psychological science. Using the PsycINFO database, we searched the publication histories of 15 prominent journals in psychology 1 for the term "time series" in the abstract, keywords, and subject terms. This search yielded a small sample of 36 empirical papers that utilized time series modeling. Further investigation revealed two general analytic goals: relating a time series to other substantive variables (17 papers) and examining the effects of a critical event or intervention (9 papers; the remaining papers pursued other goals). Thus, this review demonstrates not only the relative scarcity of time series methods in psychological research but also that scholars have primarily used descriptive or causal explanatory models for time series data analysis (Shmueli, 2010).

The prevalence of these types of models is typical of social science, but in fields where time series analysis is most commonly found (e.g., econometrics, finance, the atmospheric sciences), forecasting is often the primary goal because it bears on important practical decisions. As a result, the statistical time series literature is dominated by models that are aimed toward prediction, not explanation (Shmueli, 2010), and almost every book on applied time series analysis is exclusively devoted to forecasting methods (McCleary et al., 1980, p. 205). Although there are many well-written texts on time series modeling for economic and financial applications (e.g., Rothman, 1999; Mills and Markellos, 2008), there is a lack of formal introductions geared toward psychological issues (see West and Hepworth, 1991, for an exception). Thus, a psychologist looking to use these methodologies may find that the available resources focus on entirely different goals. The current paper attempts to amend this by providing an introduction to time series methodologies that is oriented toward issues within psychological research. This is accomplished by first introducing the basic characteristics of time series data: the four components of variation (trend, seasonality, cycles, and irregular variation), autocorrelation, and stationarity. Then, various time series regression models are explicated that can be used to achieve a wide range of goals, such as describing the process of change through time, estimating seasonal effects, and examining the effect of an intervention or critical event. Not to overlook the potential importance of forecasting for psychological research, the second half of the paper discusses methods for modeling autocorrelation and generating accurate predictions—viz., autoregressive integrated moving average (ARIMA) modeling.
The final section briefly describes how regression techniques and ARIMA models can be combined in a dynamic regression model that can simultaneously explain and forecast a time series variable. Thus, the current paper seeks to provide an integrative resource for psychological researchers interested in analyzing time series data which, given the trends described above, are poised to become increasingly prevalent.

## The current illustrative application

In order to better demonstrate how time series analysis can accomplish the goals of psychological research, a running practical example is presented throughout the current paper. For this particular illustration, we focused on online job search behaviors using data from Google Trends, which compiles the frequency of online searches on Google over time. We were particularly interested in the frequency of online job searches in the United States 2 and the impact of the 2008 economic crisis on these rates. Our primary research hypothesis was that this critical event resulted in a sharp increase in the series that persisted over time. The monthly frequencies of these searches from January 2004 to June 2011 were recorded, constituting a data set of 90 total observations. Figure 1 displays a plot of this original time series that will be referenced throughout the current paper. Importantly, the values of the series do not represent the raw number of Google searches, but have been normalized (0–100) in order to yield a more tractable data set; each monthly value represents its percentage relative to the maximum observed value 3 .
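As a quick sketch of this normalization (written in Python for illustration; the raw counts below are invented, since Google Trends does not expose raw search volumes):

```python
# Each monthly value is rescaled to a percentage of the series maximum,
# so the largest observation becomes 100. The raw counts are hypothetical.
raw_counts = [120, 150, 300, 240]
normalized = [round(100 * v / max(raw_counts)) for v in raw_counts]
# normalized -> [40, 50, 100, 80]
```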

Figure 1. A plot of the original Google job search time series and the series after seasonal adjustment.

## A note on software implementation

Conceptual expositions of new analytical methods can often be undermined by the practical issue of software implementation (Sharpe, 2013). To preempt this obstacle, for each analysis we provide accompanying R code in the Supplementary Material, along with an intuitive explanation of the meanings and rationale behind the various commands and arguments. On account of its versatility, the open-source statistical package R (R Development Core Team, 2011) remains the software platform of choice for performing time series analyses, and a number of introductory texts are oriented solely toward this program, such as Introductory Time Series with R (Cowpertwait and Metcalfe, 2009), Time Series Analysis with Applications in R (Cryer and Chan, 2008), and Time Series Analysis and Its Applications with R Examples (Shumway and Stoffer, 2006). In recent years, R has become increasingly recognized within the psychological sciences as well (Muenchen, 2013). We believe that psychological researchers with even a minimal amount of experience with R will find this tutorial both informative and accessible.

## An introduction to time series data

Before introducing how time series analyses can be used in psychological research, it is necessary to first explicate the features that characterize time series data. At its simplest, a time series is a set of time-ordered observations of a process in which the intervals between observations remain constant (e.g., weeks, months, or years; minor deviations in the intervals are acceptable; McCleary et al., 1980, p. 21; Cowpertwait and Metcalfe, 2009). Time series data are often distinguished from other types of longitudinal data by the number and source of the observations: a univariate time series contains many observations originating from a single source (e.g., an individual, a price index), while other forms of longitudinal data often consist of several observations from many sources (e.g., a group of individuals). The length of a time series can vary, but series are generally at least 20 observations long, and many models require at least 50 observations for accurate estimation (McCleary et al., 1980, p. 20). More data are always preferable, but at the very least, a time series should be long enough to capture the phenomena of interest.

Due to its unique structure, a time series exhibits characteristics that are either absent or less prominent in the kinds of cross-sectional and longitudinal data typically collected in psychological research. In the next sections, we review these features, which include autocorrelation and stationarity. First, however, we delineate the types of patterns that may be present within a time series. The variation or movement in a series can be partitioned into four parts: the trend, seasonal, cyclical, and irregular components (Persons, 1919).

## The four components of time series

## Trend

Trend refers to any systematic change in the level of a series—i.e., its long-term direction (McCleary et al., 1980, p. 31; Hyndman and Athanasopoulos, 2014). Both the direction and slope (rate of change) of a trend may remain constant or change throughout the course of the series. Globally, the illustrative time series shown in Figure 1 exhibits a positive trend: The level of the series at the end is systematically higher than at its beginning. However, there are sections in this particular series that do not exhibit the same rate of increase. The beginning of the series displays a slight negative trend; starting at approximately 2006, the series rises sharply until 2009, after which a small downward trend may even be present.

Because a trend in the data represents a significant source of variability, it must be accounted for when performing any time series analysis. That is, it must be either (a) modeled explicitly or (b) removed through mathematical transformations (i.e., detrending ; McCleary et al., 1980 , p. 32). The former approach is taken when the trend is theoretically interesting—either on its own or in relation to other variables. Conversely, removing the trend (through methods discussed later) is performed when this component is not pertinent to the goals of the analysis (e.g., strict forecasting). The decision of whether to model or remove systematic components like a trend represents an important aspect of time series analysis. The various characteristics of time series data are either of theoretical interest—in which case they should be modeled—or not, in which case they should be removed so that the aspects that are of interest can be more easily analyzed. Thus, it is incumbent upon the analyst to establish the goals of the analysis and determine which components of a time series are of interest and treat them accordingly. This topic will be revisited throughout the forthcoming sections.

## Seasonality

Unlike the trend component, the seasonal component of a series is a repeating pattern of increase and decrease that occurs consistently throughout its duration. More specifically, it can be defined as a cyclical or repeating pattern of movement within a period of 1 year or less that is attributed to "seasonal" factors—i.e., those related to an aspect of the calendar (e.g., the months or quarters of a year or the days of a week; Cowpertwait and Metcalfe, 2009, p. 6; Hyndman and Athanasopoulos, 2014). For instance, restaurant attendance may exhibit a weekly seasonal pattern in which weekends routinely display the highest levels within each week (i.e., the period) and the first several weekdays are consistently the lowest. Retail sales often display a monthly seasonal pattern, where each month consistently exhibits the same relative position to the others across yearly periods: viz., a spike in the series during the holiday months and a marked decrease in the following months. Importantly, the pattern represented by a seasonal effect remains constant and occurs over the same duration on each occasion (Hyndman and Athanasopoulos, 2014).

Although its underlying pattern remains fixed, the magnitude of a seasonal effect may vary across periods. Seasonal effects can also be embedded within overarching trends. Along with a marked trend, the series in Figure 1 exhibits noticeable seasonal fluctuations as well; at the beginning of each year (i.e., after the holiday months), online job searches spike and then fall significantly in February. After February, they continue to rise until about July or August, after which the series drops significantly for the remainder of the year, reflecting the effects of seasonal employment. Notice the consistency of both the form (i.e., pattern of increase and decrease) and magnitude of this seasonal effect. The fact that online job search behavior exhibits seasonal patterns supports the idea that this behavior (and this example in particular) is representative of job search behavior in general. In the United States, thousands of individuals engage in seasonal work, which results in higher unemployment rates at the beginning of each year and in the late summer months (e.g., July and August; The United States Department of Labor, Bureau of Labor Statistics, 2014), manifesting in a similar seasonal pattern of job search behavior.

One may be interested in the presence of seasonal effects, but once identified, this source of variation is often removed from the time series through a procedure known as seasonal adjustment (Cowpertwait and Metcalfe, 2009, p. 21). This is in keeping with the aforementioned theme: Once a systematic component has been identified, it must either be modeled or removed. The popularity of seasonal adjustment is due to the characteristics of seasonal effects delineated above: Unlike other more dynamic components of a time series, seasonal patterns remain consistent across periods and are generally similar in magnitude (Hyndman and Athanasopoulos, 2014). Their effects may also obscure other important features of a time series—e.g., a previously unnoticed trend or the cycles described in the following section. Put simply, "seasonal adjustment is done to simplify data so that they may be more easily interpreted…without a significant loss of information" (Bell and Hillmer, 1984, p. 301). Unemployment rates are often seasonally adjusted to remove the fluctuations due to the effects of weather, harvests, and school schedules that remain more or less constant across years. In our data, the seasonal effects of job search behavior are not of direct theoretical interest relative to other features of the data, such as the underlying trend and the impact of the 2008 economic crisis. Thus, we may prefer to work with the simpler seasonally adjusted series. The lower panel of Figure 1 displays the original Google time series after seasonal adjustment, and the Supplementary Material contains a description of how to implement this procedure in R. It can be seen that the trend is made notably clearer after removing the seasonal effects. Despite the spike at the very end, the suspected downward trend in the later part of the series is much more evident. This insight will prove to be important when selecting an appropriate time series model in the upcoming sections.
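The paper performs seasonal adjustment in R (see the Supplementary Material); as a minimal language-agnostic sketch, the Python fragment below applies a naive additive adjustment to a synthetic monthly series, estimating each calendar month's effect as its average deviation from the overall mean. The data and estimator are illustrative assumptions, not the paper's exact procedure.

```python
# Naive additive seasonal adjustment: estimate each calendar month's
# average deviation from the overall mean, then subtract it out.
def seasonally_adjust(series, period=12):
    overall = sum(series) / len(series)
    seasonal = []
    for m in range(period):
        vals = series[m::period]  # all observations at calendar position m
        seasonal.append(sum(vals) / len(vals) - overall)
    return [x - seasonal[i % period] for i, x in enumerate(series)]

# Two years of synthetic monthly data with a fixed January spike
raw = [50 + (10 if i % 12 == 0 else 0) for i in range(24)]
adjusted = seasonally_adjust(raw)
# After adjustment the January spike is gone and the series is flat
```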

## Cycles

A cyclical component in a time series is conceptually similar to a seasonal component: It is a pattern of fluctuation (i.e., increase or decrease) that recurs across periods of time. However, unlike seasonal effects, whose durations are fixed across occurrences and which are associated with some aspect of the calendar (e.g., days, months), cyclical effects are not of fixed duration (i.e., their length often varies from cycle to cycle) and are not attributable to any naturally occurring time periods (Hyndman and Athanasopoulos, 2014). Put simply, cycles are any non-seasonal component that varies in a recognizable pattern (e.g., business cycles; Hyndman and Athanasopoulos, 2014). In contrast to seasonal effects, cycles generally occur over a period lasting longer than 2 years (although they may be shorter), and the magnitude of cyclical effects is generally more variable than that of seasonal effects (Hyndman and Athanasopoulos, 2014). Furthermore, just as the previous two components—trend and seasonality—can each be present with or without the other, a cyclical component may be present with any combination of the other two. For instance, a trend with an intrinsic seasonal effect can be embedded within a greater cyclical pattern that occurs over a period of several years. Alternatively, a cyclical effect may be present without either of these two systematic components.

In the 7 years that constitute the time series of Figure 1, there do not appear to be any cyclical effects. This is expected, as there are no strong theoretical reasons to believe that online job search behavior is significantly influenced by factors that consistently manifest across periods of more than 1 year. We have significant a priori reasons to believe that causal factors related to seasonality exist (e.g., searching for work after seasonal employment), but the same does not hold true for long-term cycles, and the time series is sufficiently long to capture any potential cyclical behavior.

## Irregular variation (randomness)

While the previous three components represented three systematic types of time series variability (i.e., signal; Hyndman and Athanasopoulos, 2014), the irregular component represents statistical noise and is analogous to the error terms included in various types of statistical models (e.g., the random component in generalized linear modeling). It constitutes any remaining variation in a time series after these three systematic components have been partitioned out. In time series parlance, when this component is completely random (i.e., not autocorrelated), it is referred to as white noise, which plays an important role in both the theory and practice of time series modeling. Time series are assumed to be in part driven by a white noise process (explicated in a later section), and white noise is vital for judging the adequacy of a time series model. After a model has been fit to the data, the residuals form a time series of their own, called the residual error series. If the statistical model has been successful in accounting for all the patterns in the data (e.g., systematic components such as trend and seasonality), the residual error series should be nothing more than unrelated white noise error terms with a mean of zero and some constant variance. In other words, the model should be successful in extracting all the signal present in the data with only randomness left over (Cowpertwait and Metcalfe, 2009, p. 68). This is analogous to evaluating the residuals of linear regression, which should be normally distributed around a mean of zero.
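As an illustrative sketch (in Python; the simulated residuals and thresholds are assumptions for demonstration, not a formal test), a residual error series can be screened for white-noise behavior by checking that its mean is near zero and its lag-1 autocorrelation is negligible:

```python
import random

# Lag-1 autocorrelation: Pearson-style correlation of the series with
# itself shifted by one observation.
def lag1_autocorr(x):
    mean = sum(x) / len(x)
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

random.seed(1)
residuals = [random.gauss(0, 1) for _ in range(500)]  # simulated white noise

r1 = lag1_autocorr(residuals)
# For genuine white noise, |r1| should fall within roughly 2/sqrt(n) of
# zero; a residual series violating this suggests the model missed signal.
```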

## Time series decomposition

To visually examine a series in an exploratory fashion, time series are often formally partitioned into each of these components through a procedure referred to as time series decomposition. Figure 2 displays the original Google time series (top panel) decomposed into its constituent parts. This figure depicts what is referred to as classical decomposition, in which a time series is conceived of as comprising three components: a trend-cycle, a seasonal, and a random component. (Here, the trend and cycle are combined because the duration of each cycle is unknown; Hyndman and Athanasopoulos, 2014.) The classic additive decomposition model (Cowpertwait and Metcalfe, 2009, p. 19) describes each value of the time series as the sum of these three components:

x_t = m_t + s_t + z_t,

where x_t is the observed value at time t, m_t is the trend-cycle component, s_t is the seasonal component, and z_t is the random (irregular) component.

Figure 2. The original time series decomposed into its trend, seasonal, and irregular (i.e., random) components. Cyclical effects are not present within this series.

The additive decomposition model is most appropriate when the magnitudes of the trend-cycle and seasonal components remain constant over the course of the series. However, when the magnitude of these components varies but still appears proportional over time (i.e., it changes by a multiplicative factor), the series may be better represented by the multiplicative decomposition model, where each observation is the product of the trend-cycle, seasonal, and random components:

x_t = m_t · s_t · z_t.

In either decomposition model, each component is sequentially estimated and then removed until only the stochastic error component remains (the bottom panel of Figure 2). The primary purpose of time series decomposition is to provide the analyst with a better understanding of the underlying behavior and patterns of the time series, which can be valuable in determining the goals of the analysis. Decomposition models can also be used to generate forecasts by adding or multiplying future estimates of the seasonal and trend-cycle components (Hyndman and Athanasopoulos, 2014). However, such models are beyond the scope of the present paper, and the ARIMA forecasting models discussed later are generally superior 4 .
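As a concrete sketch of classical additive decomposition (written in Python on synthetic data; the paper's own tutorial uses R), the trend-cycle can be estimated with a centered 12-month moving average, the seasonal effects as the average detrended deviation for each calendar month, and the remainder as whatever is left over:

```python
# Classical additive decomposition sketch: x_t = trend + seasonal + remainder.
def centered_ma(x, period=12):
    # 2x12 centered moving average: half weight on the two edge months
    half = period // 2
    trend = [None] * len(x)
    for t in range(half, len(x) - half):
        w = x[t - half:t + half + 1]
        trend[t] = (w[0] / 2 + sum(w[1:-1]) + w[-1] / 2) / period
    return trend

def decompose_additive(x, period=12):
    trend = centered_ma(x, period)
    offset = period // 2  # index of the first defined trend estimate
    detrended = [x[t] - trend[t] for t in range(offset, len(x) - offset)]
    # Average detrended deviation at each calendar position, centered so
    # the seasonal effects sum to zero over one period
    means = []
    for m in range(period):
        vals = [detrended[i] for i in range(len(detrended))
                if (i + offset) % period == m]
        means.append(sum(vals) / len(vals))
    grand = sum(means) / period
    seasonal = [s - grand for s in means]
    remainder = [x[t] - trend[t] - seasonal[t % period]
                 for t in range(offset, len(x) - offset)]
    return trend, seasonal, remainder

# Synthetic monthly series: linear trend plus a fixed January bump
raw = [t * 0.5 + (5 if t % 12 == 0 else 0) for t in range(48)]
trend, seasonal, remainder = decompose_additive(raw)
# The remainder is (numerically) zero because this synthetic series
# contains no irregular component.
```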

## Autocorrelation

In psychological research, the current state of a variable may partially depend on prior states. That is, many psychological variables exhibit autocorrelation: when a variable is correlated with itself across different time points (also referred to as serial dependence). Time series designs capture the effect of previous states and incorporate this potentially significant source of variance within their corresponding statistical models. Although the main features of many time series are its systematic components such as trend and seasonality, a large portion of time series methodology is aimed at explaining the autocorrelation in the data (Dettling, 2013, p. 2).

The importance of accounting for autocorrelation should not be overlooked; it is ubiquitous in social science phenomena (Kerlinger, 1973; Jones et al., 1977; Hartmann et al., 1980; Hays, 1981). In a review of 44 behavioral research studies with a total of 248 independent sets of repeated measures data, Busk and Marascuilo (1988) found that 80% of the calculated autocorrelations ranged from 0.1 to 0.49, and 40% exceeded 0.25. More specific to the psychological sciences, it has been proposed that state-related constructs at the individual level, such as emotions and arousal, are often contingent on prior states (Wood and Brown, 1994). Using autocorrelation analysis, Fairbairn and Sayette (2013) found that alcohol use reduces emotional inertia, the extent to which prior affective states determine current emotions. Through this, they were able to marshal support for the theory of alcohol myopia, the intuitive but largely untested idea that alcohol allows a greater enjoyment of the present, and thus formally uncovered an affective motivation for alcohol use (and misuse). Further, using time series methods, Fuller et al. (2003) found that job stress in the present day was negatively related to the degree of stress in the preceding day. Accounting for autocorrelation can therefore reveal new information on the phenomenon of interest, as the Fuller et al. (2003) analysis led to the counterintuitive finding that lower stress was observed after prior levels had been high.

Statistically, autocorrelation simply represents the Pearson correlation for a variable with itself at a previous time period, referred to as the lag of the autocorrelation. For instance, the lag-1 autocorrelation of a time series is the correlation of each value with the immediately preceding observation; a lag-2 autocorrelation is the correlation with the value that occurred two observations before. The autocorrelation with respect to any lag can be computed (e.g., a lag-20 autocorrelation), and intuitively, the strength of the autocorrelation generally diminishes as the length of the lag increases (i.e., as the values become further removed in time).
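To make this concrete, the short Python sketch below (an illustration on a synthetic triangle-wave series, not the paper's data) computes the lag-k autocorrelation and shows it weakening as the lag grows:

```python
# Lag-k autocorrelation: correlation of the series with itself k steps back.
def autocorr(x, k):
    mean = sum(x) / len(x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# A "persistent" synthetic series with runs above and below its mean
series = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1] * 5

r1 = autocorr(series, 1)  # strong positive autocorrelation
r2 = autocorr(series, 2)  # weaker: values two steps apart are less similar
```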

Strong positive autocorrelation in a time series manifests graphically by "runs" of values that are either above or below the average value of the time series. Such time series are sometimes called "persistent" because when the series is above (or below) the mean value it tends to remain that way for several periods. Conversely, negative autocorrelation is characterized by the absence of runs—i.e., positive values tend to follow negative values (and vice versa). Figure 3 contains two plots of time series intended to give the reader an intuitive understanding of autocorrelation: The series in the top panel exhibits positive autocorrelation, while the center panel illustrates negative autocorrelation. It is important to note that the autocorrelation in these series is not obscured by other components; in real time series, visual analysis alone may not be sufficient to detect autocorrelation.

Figure 3. Two example time series displaying exaggerated positive (top panel) and negative (center panel) autocorrelation. The bottom panel depicts the ACF of the Google job search time series after seasonal adjustment.

In time series analysis, the set of autocorrelation coefficients across many lags is called the autocorrelation function (ACF) and plays a significant role in model selection and evaluation (as discussed later). A plot of the ACF of the Google job search time series after seasonal adjustment is presented in the bottom panel of Figure 3. In an ACF plot, the y-axis displays the strength of the autocorrelation (ranging from positive to negative 1), and the x-axis represents the length of the lags: from lag-0 (which will always be 1) to much higher lags (here, lag-19). The dotted horizontal line indicates the p < 0.05 criterion for statistical significance.

## Stationarity

Definition and purpose.

A complication with time series data is that the mean, variance, or autocorrelation structure of a series can vary over time. A time series is said to be stationary when these properties remain constant (Cryer and Chan, 2008, p. 16). Thus, there are many ways in which a series can be non-stationary (e.g., an increasing variance over time), but only one way in which it can be stationary (viz., when none of these features changes).

Stationarity is a pivotal concept in time series analysis because descriptive statistics of a series (e.g., its mean and variance) are accurate population estimates only if these properties remain constant throughout the series (Cowpertwait and Metcalfe, 2009, pp. 31–32). With a stationary series, it will not matter when the variable is observed: "The properties of one section of the data are much like those of any other" (Chatfield, 2004, p. 13). As a result, a stationary series is easy to predict: Its future values will be similar to those in the past (Nau, 2014). Stationarity is therefore the most important assumption when making predictions based on past observations (Cryer and Chan, 2008, p. 16), and many time series models assume that the series either is stationary or can be transformed to stationarity (e.g., the broad class of ARIMA models discussed later).

In general, a stationary time series will have no predictable patterns in the long term; plots will show the series to be roughly horizontal with some constant variance (Hyndman and Athanasopoulos, 2014). A stationary time series is illustrated in Figure 4, which depicts a stationary white noise series (i.e., a series of uncorrelated terms). The series hovers around the same general region (i.e., its mean) with a consistent variance around this value. Despite the observations having a constant mean, variance, and autocorrelation, notice how such a process can still generate outliers (e.g., the low extreme value after t = 60), as well as runs of values that are above or below the mean. Thus, stationarity does not preclude these temporary and fluctuating behaviors of the series, although any systematic patterns would.

Figure 4. An example of a stationary time series (specifically, a series of uncorrelated white noise terms). The mean, variance, and autocorrelation are all constant over time, and the series displays no systematic patterns, such as trends or cycles.

However, many real-life time series are dominated by trends and seasonal effects that preclude stationarity. A series with a trend cannot be stationary because, by definition, a trend is a change in the mean level of the series over time. Seasonal effects also preclude stationarity, as they are recurring patterns of change in the mean of the series within a fixed time period (e.g., a year). Thus, trend and seasonality are the two time series components that must be addressed in order to achieve stationarity.

## Transforming a series to stationarity

When a time series is not stationary, it can be made so by accounting for these systematic components within the model or through mathematical transformations. The procedure of seasonal adjustment described above, for example, removes the systematic seasonal effects on the mean level of the series.

The most important method of stationarizing the mean of a series is differencing, which can be used to remove any trend in the series that is not of interest. In the simplest case of a linear trend, the slope (i.e., the change from one period to the next) remains relatively constant over time. In such a case, the differences between each time period and its preceding one (referred to as the first differences) are approximately equal. Thus, one can effectively “detrend” the series by transforming the original series into a series of first differences (Meko, 2013; Hyndman and Athanasopoulos, 2014). The underlying logic is that forecasting the change in a series from one period to the next is just as useful in practice as forecasting the original series values.

However, when the time series exhibits a trend that itself changes (i.e., a non-constant slope), then even transforming a series into a series of its first differences may not render it completely stationary. This is because when the slope itself is changing (e.g., an exponential trend), the differences between periods will be unequal. In such cases, taking the first differences of the already differenced series (referred to as the second differences) will often stationarize the series. This is because each successive differencing has the effect of reducing the overall variance of the series (Anderson, 1976), as deviations from the mean level are increasingly reduced through this subtractive process. The second differences (i.e., the first differences of the already differenced series) will therefore further stabilize the mean. There are general guidelines on how many orders of differencing are necessary to stationarize a series. For instance, the first or second differences will nearly always stationarize the mean, and in practice it is almost never necessary to go beyond second differencing (Cryer and Chan, 2008; Hyndman and Athanasopoulos, 2014). However, for series that exhibit higher-degree polynomial trends, the order of differencing required to stationarize the series is typically equal to that degree (e.g., two orders of differencing for an approximately quadratic trend, three orders for a cubic trend; Cowpertwait and Metcalfe, 2009, p. 93).
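To make the logic of differencing concrete, the operation can be sketched in a few lines of Python (the paper's software tutorial is in R; this stdlib sketch is only illustrative, and the `diff` helper is our own, not a library function):

```python
def diff(series):
    """Return the first differences: series[t] - series[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

# A series with a constant linear trend (slope = 2 per period):
linear = [3 + 2 * t for t in range(8)]      # 3, 5, 7, ..., 17
print(diff(linear))                         # constant: [2, 2, 2, 2, 2, 2, 2]

# A quadratic trend needs two orders of differencing:
quadratic = [t ** 2 for t in range(8)]      # 0, 1, 4, 9, ...
print(diff(diff(quadratic)))                # constant: [2, 2, 2, 2, 2, 2]
```

As the output shows, one order of differencing flattens a linear trend, and a second order flattens a quadratic one, mirroring the degree-to-order correspondence described above.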

A common mistake in time series modeling is to “overdifference” the series, performing more orders of differencing than are required to achieve stationarity. This can complicate the process of building an adequate and parsimonious model (see McCleary et al., 1980, p. 97). Fortunately, overdifferencing is relatively easy to identify; differencing a series with a trend will have the effect of reducing the variance of the series, but an unnecessary degree of differencing will increase its variance (Anderson, 1976). Thus, the optimal order of differencing is that which results in the lowest variance of the series.
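The variance heuristic for choosing the order of differencing can likewise be sketched (Python stdlib; `best_diff_order` is a hypothetical helper of our own implementing the rule above, not a standard function):

```python
from statistics import pvariance

def diff(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]

def best_diff_order(series, max_order=2):
    """Pick the differencing order yielding the lowest series variance,
    the heuristic described above (Anderson, 1976)."""
    best_order, best_var = 0, pvariance(series)
    current = series
    for order in range(1, max_order + 1):
        current = diff(current)
        v = pvariance(current)
        if v < best_var:
            best_order, best_var = order, v
    return best_order

# A trended series: one difference lowers the variance, but a
# second (overdifferencing) would raise it again.
trended = [2 * t + (1 if t % 2 else -1) for t in range(20)]
print(best_diff_order(trended))   # 1
```

For an already stationary series, the same function returns 0, since any differencing only inflates the variance.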

If the variance of a time series is not constant over time, a common method of making the variance stationary is a logarithmic transformation of the series (Cowpertwait and Metcalfe, 2009, pp. 109–112; Hyndman and Athanasopoulos, 2014). Taking the logarithm compresses larger values more than smaller ones: the larger the value, the more it is reduced. Thus, this transformation stabilizes the differences across values (i.e., the variance), which is also why it is frequently used to mitigate the effect of outliers (e.g., Aguinis et al., 2013). It is important to remember that if one applies a transformation, any forecasts generated by the selected model will be in these transformed units. However, once the model is fitted and the parameters estimated, one can reverse the transformation to obtain forecasts in the original metric.
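A small Python sketch illustrates both the variance-stabilizing effect of the logarithm and the back-transformation of a forecast (illustrative only; the naive drift forecast here is just a stand-in for a fitted model):

```python
import math

series = [10, 100, 1000, 10000]   # spread grows with the level

logged = [math.log(y) for y in series]
# On the log scale the successive differences are equal:
print([round(b - a, 3) for a, b in zip(logged, logged[1:])])
# [2.303, 2.303, 2.303]

# A forecast made on the log scale is returned to the original
# metric with the inverse transformation, exp():
log_forecast = logged[-1] + (logged[-1] - logged[-2])  # naive drift step
print(round(math.exp(log_forecast)))                   # 100000
```

The last line shows the reverse transformation described above: the model forecasts in log units, and exponentiation restores the original metric.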

Finally, there are also formal statistical tests for stationarity, termed unit root tests. A very popular procedure is the augmented Dickey–Fuller test (ADF; Said and Dickey, 1984), which tests the null hypothesis that the series is non-stationary. Thus, rejection of the null provides evidence for a stationary series. Table 1 below contains information regarding the ADF test, as well as descriptions of various other statistical tests frequently used in time series analysis that will be discussed in the remainder of the paper. By using the ADF test in conjunction with the transformations described above (or the modeling procedures delineated below), an analyst can ensure that a series conforms to stationarity.

Common tests in time series analysis.

| Test | Null hypothesis | Role in analysis |
| --- | --- | --- |
| Augmented Dickey–Fuller (ADF) | The series is non-stationary; rejection implies a stationary series. | A series must be stationary before any AR or MA terms are added to account for its autocorrelation. The ADF test identifies whether a series needs to be made stationary through differencing or, after an order of differencing has been applied, whether the series has indeed become stationary. |
| Durbin–Watson | The residuals from a regression model do not have a lag-1 autocorrelation; rejection implies lag-1 autocorrelated errors. | A Durbin–Watson test can assess whether the residuals of a regression model are autocorrelated. When this is the case, including ARIMA terms or using generalized least squares estimation can account for this autocorrelation. |
| Ljung–Box | The errors are uncorrelated; rejection implies correlated errors. | After fitting an ARIMA or dynamic regression model to a series, the Ljung–Box test identifies whether the model has been successful in extracting all the autocorrelation. |

There are other tests for stationarity, such as the Phillips–Perron and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests, which can sometimes yield contradictory results. The ADF test was chosen as the focus of this paper due to its popularity and reliability. For information regarding the others, see Cowpertwait and Metcalfe (2009, pp. 214–215) and Hyndman and Athanasopoulos (2014).

## Time series modeling: regression methods

The statistical time series literature is dominated by methodologies aimed at forecasting the behavior of a time series (Shmueli, 2010). Yet, as the survey in the introduction illustrated, psychological researchers are primarily interested in other applications, such as describing and accounting for an underlying trend, linking explanatory variables to the criterion of interest, and assessing the impact of critical events. Thus, psychological researchers will primarily use descriptive or explanatory models, as opposed to predictive models aimed solely at generating accurate forecasts. In time series analysis, each of the aforementioned goals can be accomplished through regression methods in a manner very similar to the analysis of cross-sectional data. Having explicated the basic properties of time series data, we now discuss the specific modeling approaches that can fulfill these purposes. The next four sections each provide an overview of one type of regression model, how psychological research stands to gain from its use, and the corresponding statistical model. We include mathematical treatments, but also provide conceptual explanations so that they may be understood in an accessible and intuitive manner. Additionally, Figure 5 presents a flowchart depicting different time series models and which approaches are best for addressing the various goals of psychological research. As the paper continues, the reader will come to understand the meaning and structure of these models and their relation to substantive research questions.

A flowchart depicting various time series modeling approaches and how they are suited to address various goals in psychological research.

It is important to keep in mind that time series often exhibit strong autocorrelation, which frequently manifests as correlated residuals after a regression model has been fit. This violates the standard assumption of independent (i.e., uncorrelated) errors. In the section that follows these regression approaches, we describe how the remaining autocorrelation can be included in the model by building a dynamic regression model that includes ARIMA terms 5 . That is, a regression model can first be fit to the data for explanatory or descriptive modeling, and ARIMA terms can then be fit to the residuals in order to account for any remaining autocorrelation and improve forecasts (Hyndman and Athanasopoulos, 2014). However, we begin by introducing regression methods separate from ARIMA modeling, temporarily setting aside the issue of autocorrelation. This is done in order to better focus on the implementation of these models, but also because violating this assumption has minimal effects on the substance of the analysis: The parameter estimates remain unbiased and can still be used for prediction. The forecasts will not be “wrong,” but inefficient, ignoring information represented by the autocorrelation that could be used to obtain better predictions (Hyndman and Athanasopoulos, 2014). Additionally, generalized least squares estimation (as opposed to ordinary least squares) takes into account the effects of autocorrelation, which otherwise leads to underestimated standard errors (Cowpertwait and Metcalfe, 2009, p. 98). This estimation procedure was used for each of the regression models below. For further information on regression methods for time series, the reader is directed to Hyndman and Athanasopoulos (2014, chaps. 4, 5) and McCleary et al. (1980), which are very accessible introductions to the topic, as well as Cowpertwait and Metcalfe (2009, chap. 5) and Cryer and Chan (2008, chaps. 3, 11) for more mathematically-oriented treatments.
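As a concrete illustration of the residual diagnostics discussed here, the Durbin–Watson statistic mentioned above (and in Table 1) has a simple closed form. The following stdlib Python sketch is only illustrative, with `durbin_watson` as our own helper rather than a library call; in practice one would use a statistics package or the R tutorial in the Supplementary Materials:

```python
def durbin_watson(residuals):
    """DW = (sum of squared successive differences) / (sum of squared
    residuals). Values near 2 suggest no lag-1 autocorrelation; values
    near 0 suggest positive, and values near 4 negative, autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# A run of high then low residuals (positive autocorrelation) gives
# a DW well below 2:
runs = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]
print(round(durbin_watson(runs), 2))          # 0.34

# Alternating residuals (negative autocorrelation) push DW toward 4:
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(round(durbin_watson(alternating), 2))   # 3.33
```

A DW far from 2 in either direction is the signal, described above, that ARIMA terms or generalized least squares estimation should be considered.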

## Modeling trends through regression

Modeling an observed trend in a time series through regression is appropriate when the trend is deterministic—i.e., the trend is due to the constant, deterministic effects of a few causal forces (McCleary et al., 1980, p. 34). As a result, a deterministic trend is generally stable across time. Expecting any trend to continue indefinitely is often unrealistic, but for a deterministic trend, linear extrapolation can provide accurate forecasts for several periods ahead, as forecasting generally assumes that trends will continue and change relatively slowly (Cowpertwait and Metcalfe, 2009, p. 6). Thus, when the trend is deterministic, it is desirable to use a regression model that includes the hypothesized causal factors as predictors (Cowpertwait and Metcalfe, 2009, p. 91; McCleary et al., 1980, p. 34).

Deterministic trends stand in contrast to stochastic trends, those that arise simply from the random movement of the variable over time (long runs of similar values due to autocorrelation; Cowpertwait and Metcalfe, 2009, p. 91). As a result, stochastic trends often exhibit frequent and inexplicable changes in both slope and direction. When the trend is deemed to be stochastic, it is often removed through differencing. There are also methods for forecasting using stochastic trends (e.g., random walk and exponential smoothing models), discussed in Cowpertwait and Metcalfe (2009, chaps. 3, 4) and Hyndman and Athanasopoulos (2014, chap. 7). However, the reader should be aware that these are predictive models only, as there is nothing about a stochastic trend that can be explained through external, theoretically interesting factors (i.e., it is a trend attributable to randomness). Therefore, attempting to model it deterministically as a function of time or other substantive variables via regression can lead to spurious relationships (Kuljanin et al., 2011) and inaccurate forecasts, as the trend is unlikely to remain stable over time.

Returning to the example Google time series of Figure 1, the evident trend in the seasonally adjusted series might appear to be stochastic: It is not constant but changes at several points within the series. However, we have strong theoretical reasons for modeling it deterministically, as the 2008 economic crisis is one causal factor that likely had a profound impact on the series. Thus, this theoretical rationale implies that the otherwise inexplicable changes in its trend are due to systematic forces that can be appropriately modeled within an explanatory approach (i.e., as a deterministic function of predictors).

## The linear regression model

As noted in the literature review, psychological researchers are often directly interested in describing an underlying trend. For example, Fuller et al. (2003) examined the strain of university employees using a time series design. They found that each self-report item displayed the same deterministic trend: Globally, strain increased over time even though the perceived severity of the stressful events did not increase. Levels of strain also decreased at spring break and after finals week, during which mood and job satisfaction also exhibited rising levels. This finding cohered with prior theory on the accumulating nature of stress and the importance of regular strain relief (e.g., Bolger et al., 1989; Carayon, 1995). Furthermore, Wagner et al. (1988) examined the trend in employee productivity after the implementation of an incentive-based wage system. In addition to discovering an immediate increase in productivity, it was found that productivity increased over time as well (i.e., a continuing deterministic trend). This trend gradually diminished over time, but was still present at the end of the study period—nearly 6 years after the intervention first occurred.

By visually examining a time series, an analyst can describe how a trend changes as a function of time. However, one can formally assess the behavior of a trend by regressing the series on a variable that represents time (e.g., 1–50 for 50 equally spaced observations). In the simplest case, the trend can be modeled as a linear function of time, which is conceptually identical to a regression model for cross-sectional data using a single predictor:

Y_t = b_0 + b_1*t + ε_t    (3)

where the coefficient b 1 estimates the amount of change in the time series associated with a one-unit increase in time, t is the time variable, and ε t is random error. The constant, b 0 , estimates the level of the series when t = 0.
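The OLS estimates for this two-parameter model have a familiar closed form, sketched below in Python (illustrative only; `linear_trend_fit` is our own helper, and in practice one would use regression software such as the R tutorial in the Supplementary Materials):

```python
def linear_trend_fit(series):
    """Closed-form OLS estimates of b0 and b1 in Y_t = b0 + b1*t,
    with t coded 1..n (illustrative; no error handling)."""
    n = len(series)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(series) / n
    # Slope: covariance of (t, y) over variance of t.
    b1 = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, series))
          / sum((ti - t_bar) ** 2 for ti in t))
    b0 = y_bar - b1 * t_bar
    return b0, b1

# A noiseless trend Y_t = 5 + 0.5*t is recovered exactly:
b0, b1 = linear_trend_fit([5 + 0.5 * t for t in range(1, 11)])
print(round(b0, 2), round(b1, 2))   # 5.0 0.5
```

Here b0 is the estimated level at t = 0 and b1 the per-period change, exactly the interpretation given above.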

If a deterministic trend is fully accounted for by a linear regression model, the residual error series (i.e., the collection of residuals, which themselves form a time series) will not contain any remaining trend component; that is, this non-stationary behavior of the series will have been accounted for (Cowpertwait and Metcalfe, 2009, p. 121). Returning to our empirical example, the linear regression model displayed in Equation (3) was fit to the seasonally adjusted Google job search data. This is displayed in the top left panel of Figure 6. The regression line of best fit is superimposed, and the residual error series is shown in the panel directly to the right. Here, time is a significant predictor (b 1 = 0.32, p < 0.001), and the model accounts for 67% of the seasonally adjusted series variance (R 2 = 0.67, p < 0.001). However, the residual error series displays a notable amount of remaining trend that has been left unaccounted for; the first half of the error series has a striking downward trend that begins to rise at around 2007. This is because the regression line is constrained to linearity and therefore systematically underestimates and overestimates the values of the series when the trend exhibits runs of high and low values, respectively. Importantly, the forecasts from the simple linear model will most likely be very poor as well. Although there is a spike at the end of the series, the linear model predicts that values further ahead in time will be even higher. By contrast, we actually expect these values to decrease, similar to the decreasing trend in 2008 right after the first spike. Thus, despite accounting for a considerable amount of variance and serving as a general approximation of the series trend, the linear model is insufficient in several systematic ways, manifesting in inaccurate forecasts and a significant remaining trend in the residual error series.
A method for improving this model is to add a higher-order polynomial term; modeling the trend as a quadratic, cubic, or even higher-order function may lead to a better-fitting model, but the analyst must be vigilant against overfitting the series—i.e., including so many parameters that the statistical noise itself becomes modeled. Thus, striking a balance between parsimony and explanatory capability should always be a consideration when modeling time series (and in statistical modeling in general). Although a simple linear regression on time is often adequate to approximate a trend (Cowpertwait and Metcalfe, 2009, p. 5), in this particular instance a higher-order term may provide a better fit to the complex deterministic trend seen within this series.

Three different regression models with time as the regressor and their associated residual error series.

## Polynomial regression models

When describing the trend in the Google data earlier, it was noted that the series began to display a rising trend approximately a third of the way into the series, implying that a quadratic regression model (i.e., a single bend) may yield a good fit to the data. Furthermore, our initial hypothesis was that job search behavior proceeded at a generally constant rate and then spiked once the economic crisis began—also implying a quadratic trend. When the trend in a time series is non-linear, the predictor terms can be specified to reflect such higher-order functions (quadratic, cubic, etc.). Just as when modeling cross-sectional data, non-linear terms can be incorporated into the statistical model by squaring the predictor (here, time) 6 :

Y_t = b_0 + b_1*t + b_2*t^2 + ε_t    (4)

The center panels in Figure 6 show the quadratic model and its residual error series. In line with the initial hypothesis, both the quadratic term (b 2 = 0.003, p < 0.001) and linear term (b 1 = 0.32, p < 0.001) were statistically significant. Thus, modeling the trend as a quadratic function of time explained an additional 4% of the series variance relative to the more parsimonious linear model (R 2 = 0.71, p < 0.001). However, examination of this series and its residuals shows that it is not as different from the linear model as was expected; although the first half of the residual error series has a more stable mean level, there are still noticeable trends there, and the forecasts implied by this model are even higher than those of the linear model. Therefore, a cubic trend may provide an even better fit, as there are two apparent bends in the series:

Y_t = b_0 + b_1*t + b_2*t^2 + b_3*t^3 + ε_t    (5)

After fitting this model to the Google data, 87% of the series variance is accounted for (R 2 = 0.87, p < 0.001), and all three coefficients are statistically significant: b 1 = 0.69, p < 0.001, b 2 = 0.003, p = 0.05, and b 3 = −0.0003, p < 0.001. Furthermore, the forecasts implied by the model are much more realistic. Ultimately, it is unlikely that this model will provide accurate forecasts many periods into the future (as is often the case for regression models; Cowpertwait and Metcalfe, 2009, p. 6; Hyndman and Athanasopoulos, 2014). It is more likely that either (a) a negative trend will return the series back to more moderate levels or (b) the series will simply continue at a generally high level. Furthermore, relative to the linear model, the residual error series of this model appears much closer to stationarity (e.g., Figure 4), as the initial downward trend of the time series is captured. Therefore, modeling the series as a cubic function of time is the most successful in terms of accounting for the trend; an even higher-order polynomial term would have little remaining variance to explain (<15%) and would likely lead to an overfitted model. Thus, relative to the two previous models, the cubic model strikes a balance between relative parsimony and descriptive capability. However, any forecasts from this model could be improved by removing the remaining trend and including terms that account for any autocorrelation in the data, topics discussed in an upcoming section on ARIMA modeling.
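For completeness, a polynomial trend fit of this kind can be computed directly from the normal equations. The following Python sketch (our own `polyfit_ls` helper, for illustration only; a real analysis would use a statistics package) recovers the coefficients of a noiseless cubic trend:

```python
def polyfit_ls(series, degree):
    """Least-squares fit of y_t = b0 + b1*t + ... + bd*t^d for t = 1..n,
    solved via the normal equations (X'X)b = X'y with Gaussian
    elimination. Illustrative only; not numerically robust for
    high degrees or long series."""
    n, k = len(series), degree + 1
    X = [[t ** d for d in range(k)] for t in range(1, n + 1)]
    # Build the normal equations: A = X'X, rhs = X'y.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         for r in range(k)]
    rhs = [sum(X[i][r] * series[i] for i in range(n)) for r in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    coefs = [0.0] * k
    for r in range(k - 1, -1, -1):
        coefs[r] = (rhs[r] - sum(A[r][c] * coefs[c]
                                 for c in range(r + 1, k))) / A[r][r]
    return coefs

# A noiseless cubic trend is recovered to within rounding error:
series = [1 + 2 * t - 0.5 * t ** 2 + 0.1 * t ** 3 for t in range(1, 13)]
print([round(c, 3) for c in polyfit_ls(series, 3)])   # [1.0, 2.0, -0.5, 0.1]
```

With noisy data the same machinery yields the best-fitting polynomial of the chosen degree, which is where the overfitting trade-off discussed above comes into play.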

## Interrupted time series analysis

Although we are interested in describing the underlying trend within the Google time series as a function of time, we are also interested in the effect of a critical event, represented by the following question: “Did the 2008 economic crisis result in elevated rates of job search behavior?” In psychological science, many research questions center on the impact of an event, whether it be a relationship change, job transition, or major stressor or uplift (Kanner et al., 1981; Dalal et al., 2014). In the survey of how time series analysis has previously been used in psychological research, examining the impact of an event was one of its most common uses. In time series methodology, questions regarding the impact of events can be analyzed through interrupted time series analysis (or intervention analysis; Glass et al., 1975), in which the time series observations are “interrupted” by an intervention, treatment, or incident occurring at a known point in time (Cook and Campbell, 1979).

In both academic and applied settings, psychological researchers are often constrained to correlational, cross-sectional data. As a result, researchers rarely have the ability to implement control groups within their study designs and are less capable of drawing conclusions regarding causality. In the majority of cases, it is the theory itself that provides the rationale for drawing causal inferences (Shmueli, 2010, p. 290). In contrast, an interrupted time series is the strongest quasi-experimental design for evaluating the longitudinal impact of an event (Wagner et al., 2002, p. 299). In a review of previous research on the efficacy of interventions, Beer and Walton (1987) stated, “much of the research overlooks time and is not sufficiently longitudinal. By assessing the events and their impact at only one nearly contemporaneous moment, the research cannot discuss how permanent the changes are” (p. 343). Interrupted time series analysis ameliorates this problem by taking multiple measurements both before and after the event, thereby allowing the analyst to examine the pre- and post-event trend.

Collecting data at multiple time points also offers advantages relative to cross-sectional comparisons based on pre- and post-event means. A longitudinal interrupted time series design allows the analyst to control for the trend prior to the event, which may turn out to be the cause of any alleged intervention effect. For instance, in the field of industrial/organizational psychology, Pearce et al. (1985) found a positive trend in four measures of organizational performance over the course of the 4 years under study. However, after incorporating the effects of the pre-event trend in the analysis, neither the implementation of the policy nor the first year of merit-based rewards yielded any additional effects. That is, the post-event trends were almost totally attributable to the pre-event behavior of the series. Thus, a time series design and analysis yielded an entirely different and more parsimonious conclusion than might otherwise have been drawn. In contrast, Wagner et al. (1988) were able to show that for non-managerial employees, an incentive-based wage system substantially increased employee productivity in both its baseline level and post-intervention slope (the baseline level jumped over 100%). Thus, interrupted time series analysis is an ideal method for examining the impacts of such events and can be generalized to other criteria of interest.

## Modeling an interrupted time series

Statistical modeling of an interrupted time series can be accomplished through segmented regression analysis (Wagner et al., 2002, p. 300). Here, the time series is partitioned into two parts: the pre- and post-event segments, whose levels (intercepts) and trends (slopes) are both estimated. A change in these parameters represents an effect of the event: A significant change in the level of the series indicates an immediate change, and a change in trend reflects a more gradual change in the outcome (and of course, both are possible; Wagner et al., 2002, p. 300). The formal model reflects these four parameters of interest:

Y_t = b_0 + b_1*t + b_2*event_t + b_3*t_after_event + ε_t    (6)

Here, b 0 represents the pre-event baseline level, t is the predictor time (in our example, coded 1–90), and its coefficient, b 1 , estimates the trend prior to the event (Wagner et al., 2002, p. 301). The dummy variable event t codes for whether each time point occurred before or after the event (0 for all points prior to the event; 1 for all points after). Its coefficient, b 2 , estimates the change in the baseline level (intercept) following the event. The variable t after event represents how many units after the event the observation took place (0 for all points prior to the event; 1, 2, 3 … for subsequent time points), and its coefficient, b 3 , estimates the change in trend across the two segments. Therefore, the sum of the pre-event trend (b 1 ) and its estimated change (b 3 ) yields the post-event slope (Wagner et al., 2002, p. 301).
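The predictor coding just described can be sketched as follows (Python stdlib, illustrative; `segmented_predictors` is our own helper, with the event dummy and time-after-event variables coded exactly as in the text):

```python
def segmented_predictors(n, event_time):
    """Predictors for segmented regression over t = 1..n:
    time, an event dummy (0 pre-event, 1 from event_time on),
    and time-after-event (0 pre-event; 1, 2, 3, ... after)."""
    t = list(range(1, n + 1))
    event = [0 if ti < event_time else 1 for ti in t]
    t_after = [0 if ti < event_time else ti - event_time + 1 for ti in t]
    return t, event, t_after

# Ten periods with the event taking effect at t = 7:
t, event, t_after = segmented_predictors(10, 7)
print(event)     # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(t_after)   # [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
```

Regressing the series on these three columns then yields the four parameters of interest: the pre-event level and trend, the level change, and the trend change.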

Importantly, this analysis requires that the time of event occurrence be specified a priori; otherwise, a researcher may search the series in an “exploratory” fashion and discover a time point that yields a notable effect, resulting in potentially spurious results (McCleary et al., 1980, p. 143). In our example, the event of interest was the economic crisis of 2008. However, as is often the case when analyzing large-scale social phenomena, it was not a discrete, singular incident, but rather unfolded over time. Thus, no exact point in time can perfectly represent its moment of occurrence. In other topics of psychological research, the event of interest is a discrete incident, and a unique post-event time may be identified. Although interrupted time series analysis requires that events be treated as discrete, this conceptual problem can be easily managed in practice; selecting a point of demarcation that generally reflects when the event occurred will still allow the statistical model to assess the impact of the event on the level and trend of the series. Therefore, due to prior theory and for simplicity, we specified the pre- and post-crisis segments to be separated at January 2008, representing the beginning of the economic crisis and acknowledging that this demarcation was imperfect, but one that would still allow the substantive research question of interest to be answered.

Although not utilized in our analysis, when analyzing an interrupted time series using segmented regression, one has the option of specifying the post-event segment to begin some time after the actual event occurred. The rationale is to accommodate the time it takes for the causal effect of the event to manifest in the time series—the equilibration period (see Mitchell and James, 2001, p. 539; Wagner et al., 2002, p. 300). Although an equilibration period is likely a component of all causal phenomena (i.e., causal effects probably never fully manifest at once), two prior reviews have illustrated that researchers account for it only infrequently, both theoretically and empirically (Kelly and McGrath, 1988; Mitchell and James, 2001). Statistically, this is accomplished through the segmented regression model above, but with the event simply coded as occurring later in the series. Comparing models with different post-event start times can also allow competitive tests of the equilibration period.

## Empirical example

For our working example, a segmented regression model was fit to the seasonally adjusted Google time series: A linear trend estimated the first segment, and a quadratic trend was fit to the second due to the noted curvilinear form of the second half of the series. Thus, a new variable and coefficient were added to the formal model to account for this non-linearity: t after event 2 and b 4 , respectively. The results of the analysis indicated that there was a practically significant effect of the crisis: The parameter representing an immediate change in the post-event level was b 2 = 8.66, p < 0.001. Although the level (i.e., intercept) differed across segments, the post-crisis trend appears to be the most notable change in the series. That is, the real effect of the crisis unfolded over time rather than having an immediately abrupt impact. This is reflected in the other coefficients of the model: The pre-crisis trend was estimated to be near zero (b 1 = −0.03, p = 0.44), and the post-crisis trend terms were b 3 = 0.70, p < 0.001 for the linear component and b 4 = −0.02, p < 0.001 for the quadratic term, indicating that there was a marked change in trend, but also that it was concave (i.e., on the whole, slowly decreasing over time). Graphically, the model seems to capture the underlying trend of both segments exceptionally well (R 2 = 0.87, p < 0.001), as the residual error series has almost reached stationarity (ADF = −3.38, p = 0.06). Both are shown in Figure 7 below.

A segmented regression model used to assess the effect of the 2008 economic crisis on the time series and its associated residual error series.

## Estimating seasonal effects

Up until now, we have chosen to remove any seasonal effects by working with the seasonally adjusted time series in order to more fully investigate a trend of substantive interest. This was consistent with the following adage of time series modeling: When a systematic trend or seasonal pattern is present, it must either be modeled or removed. However, psychological researchers may also be interested in the presence and nature of a seasonal effect, and seasonal adjustment would only serve to remove this component of interest. Seasonality was defined earlier as any regular pattern of fluctuation (i.e., movement up or down in the level of the series) associated with some aspect of the calendar. For instance, although online job searches exhibited an underlying trend across years in our data, they also displayed the same pattern of movement within each year (i.e., across months; see Figure 1). Following the call for more time-based theory and empirical research, seasonal effects are increasingly recognized as significant for psychological science. In a recent conceptual review, Dalal et al. (2014) noted that “mood cycles… are likely to occur simultaneously over the course of a day (relatively short term) and over the course of a year (long term)” (p. 1401). Relatedly, Larsen and Kasimatis (1990) used time series methods to examine the stability of mood fluctuations across individuals. They uncovered a regular weekly fluctuation that was stronger for introverted individuals than for extraverts (due to the latter's sensation-seeking behavior, which resulted in greater mood variability).

Furthermore, many systems of interest exhibit rhythmicity. This can be readily observed across a broad spectrum of phenomena that are of interest to psychological researchers. At the individual level, there is a long history in biopsychology of exploring the cyclical pattern of human behavior as a function of biological processes. Prior research has consistently shown that humans possess many common physiological and behavioral cycles, ranging from 90 min to 365 days (Aschoff, 1984; Almagor and Ehrlich, 1990), that may affect important psychological outcomes. For instance, circadian rhythms are particularly well-known and are associated with physical, mental, and behavioral changes within a 24-h period (McGrath and Rotchford, 1983). It has been suggested that peak motivation levels may occur at specific points in the day (George and Jones, 2000), and longer cyclical fluctuations of emotion, sensitivity, intelligence, and physical characteristics over days and weeks have been identified (for reviews, see Conroy and Mills, 1970; Luce, 1970; Almagor and Ehrlich, 1990). Such cycles have been found to affect intelligence test performance and other physical and cognitive tasks (e.g., Latman, 1977; Kumari and Corr, 1996).

## Regression with seasonal indicators

As previously stated, when seasonal effects are theoretically important, seasonal adjustment is undesirable because it removes the time series component pertinent to the research question at large. An alternative is to qualitatively describe the seasonal pattern or formally specify a regression model that includes a variable which estimates the effect of each season. If a simple linear approximation is used for the trend, the formal model can be expressed as:

y_t = b_0·t + b_{s[t]} + ε_t,    s[t] ∈ {1, 2, …, S}

where b_0 is now the estimate of the linear relationship between the dependent variable and time, and the coefficients b_{1:S} are estimates of the S seasonal effects, b_{s[t]} being the effect of the season observed at time t (e.g., S = 12 for yearly data; Cowpertwait and Metcalfe, 2009, p. 100). Put more intuitively, this model can still be conceived of as a linear model, but with a different estimated intercept for each season that represents its effect (notice that the b_{1:S} parameters are not slope coefficients but constants).

As an example, the model above was fit to the original, non-seasonally adjusted Google data. Although modeling the series as a linear function of time was earlier found to produce inaccurate forecasts, it can be used when estimating seasonal effects because the trend component of the model does not affect the estimates of the seasonal effects. For our data, the estimates of each monthly effect were: b_1 = 67.51, b_2 = 59.43, b_3 = 60.11, b_4 = 60.66, b_5 = 63.59, b_6 = 66.77, b_7 = 63.70, b_8 = 62.38, b_9 = 60.49, b_10 = 56.88, b_11 = 52.13, b_12 = 45.66 (each effect was statistically significant at p < 0.001). The pattern of these intercepts mirrors the pattern of movement qualitatively described in the discussion of the seasonal component: Online job search behavior begins at its highest level in January (b_1 = 67.51), likely due to the end of holiday employment, and then drops sharply in February (b_2 = 59.43). Its level then rises during the next 4 months until June (b_6 = 66.77), after which the series decreases each successive month until reaching its lowest point in December (b_12 = 45.66).
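To make the seasonal-indicator regression concrete without reproducing the Google data, the sketch below fits the model by least squares to simulated monthly data with made-up seasonal effects. Python and numpy stand in here for the annotated R code provided in the Supplementary Material; the trend slope and seasonal intercepts are hypothetical, chosen only to resemble the pattern discussed above. The design matrix contains the time index plus one indicator column per month, so each b_i is directly that month's estimated intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 12                               # 12 monthly seasons in a yearly period
n_years = 8
t = np.arange(1, S * n_years + 1)    # time index 1..96
season = (t - 1) % S                 # 0 = January, ..., 11 = December

# Simulate a series with a linear trend plus known (hypothetical) seasonal intercepts
true_effects = np.array([67, 59, 60, 61, 64, 67, 64, 62, 60, 57, 52, 46], float)
y = 0.05 * t + true_effects[season] + rng.normal(0, 1, t.size)

# Design matrix: one column for time, one indicator column per season.
# No global intercept is included, so each b_i is that season's own intercept.
X = np.column_stack([t] + [(season == i).astype(float) for i in range(S)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, seasonal = b[0], b[1:]           # trend slope and 12 seasonal effects
```

Because the indicator columns sum to a column of ones, omitting the global intercept is what makes each seasonal constant identifiable and interpretable as a season-specific intercept.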

## Harmonic seasonal models

Another approach to modeling seasonal effects is to fit a harmonic seasonal model that uses sine and cosine functions to describe the pattern of fluctuations seen across periods. Seasonal effects often vary in a smooth, continuous fashion, and instead of estimating a discrete intercept for each season, this approach can provide a more realistic model of seasonal change (see Cowpertwait and Metcalfe, 2009 , pp. 101–108). Formally, the model is:

y_t = m_t + Σ_{i=1}^{S/2} [ s_i·sin(2π·i·t/S) + c_i·cos(2π·i·t/S) ] + ε_t

where m_t is the estimate of the trend at time t (approximated as a linear or polynomial function of time), s_i and c_i are the unknown parameters of interest, S is the number of seasons within the time period (e.g., 12 months for a yearly period), i is an index that ranges from 1 to S/2, and t is a variable coded to represent time (e.g., 1:90 for 90 equally-spaced observations). Although this model is complex, it can be conceived of as including a predictor for each season that contains a sine and/or cosine term. For yearly data, this means that six s and six c coefficients estimate the seasonal pattern (S/2 coefficients for each parameter type). Importantly, after this initial model is estimated, the coefficients that are not statistically significant can be dropped, which often results in fewer parameters relative to the seasonal indicator model introduced first (Cowpertwait and Metcalfe, 2009, p. 104). For our data, the above model was fit using a linear approximation for the trend, and five of the original twelve seasonal coefficients were statistically significant and thus retained: c_1 = −5.08, p < 0.001; s_2 = 2.85, p = 0.005; s_3 = 2.68, p = 0.009; c_3 = −2.25, p = 0.03; c_5 = −2.97, p = 0.004. This model also explained a substantial amount of the series variance (R2 = 0.75, p < 0.001). Pre-made and annotated R code for this analysis can be found in the Supplementary Material.

## Time series forecasting: ARIMA ( p, d, q ) modeling

In the preceding section, a number of descriptive and explanatory regression models were introduced that addressed various topics relevant to psychological research. First, we sought to determine how the trend in the series could best be described as a function of time. Three models were fit to the data, and modeling the trend as a cubic function provided the best fit: It was the most parsimonious model that explained a very large amount of variation in the series, it did not systematically over- or underestimate many successive observations, and its forecasts were clearly superior to those of the simpler linear and quadratic models. In the subsequent section, a segmented regression analysis was conducted in order to examine the impact of the 2008 economic crisis on job search behavior. It was found that there was both a significant immediate increase in the baseline level of the series (intercept) and a concomitant increase in its trend (i.e., slope) that gradually decreased over time. Finally, the seasonal effects of online search behavior were estimated and mirrored the pattern of job employment rates described in a prior section.

From these analyses, it can be seen that the main features of many time series are the trend and seasonal components, which must either be modeled as deterministic functions of predictors or removed from the series. However, as previously described, another critical feature of time series data is its autocorrelation, and a large portion of time series methodology is aimed at explaining this component (Dettling, 2013, p. 2). Primarily, accounting for autocorrelation entails fitting an ARIMA model to the original series, or adding ARIMA terms to a previously fit regression model; ARIMA models are the most general class of models that seek to explain the autocorrelation frequently found in time series data (Hyndman and Athanasopoulos, 2014). Without these terms, a regression model will ignore the pattern of autocorrelation among the residuals and produce less accurate forecasts (Hyndman and Athanasopoulos, 2014). Therefore, ARIMA models are predictive forecasting models. Time series models that include both regression and ARIMA terms are referred to as dynamic models and may become a primary type of time series model used by psychological researchers.

Although not strongly emphasized within psychological science, forecasting is an important aspect of scientific verification (Popper, 1968 ). Standard cross-sectional and longitudinal models are generally used in an explanatory fashion (e.g., estimating the relationships among constructs and testing null hypotheses), but they are quite capable of prediction as well. Because of the ostensible movement to more time-based empirical research and theory, predicting future values will likely become a more important aspect of statistical modeling, as it can validate psychological theory (Weiss and Cropanzano, 1996 ) and computational models (Tobias, 2009 ) that specify effects over time.

At the outset, it is helpful to note that the regression and ARIMA modeling approaches are not substantially different: They both formalize the variation in the time series variable as a function of predictors and some stochastic noise (i.e., the error term). The only practical difference is that while regression models are generally built from prior research or theory, ARIMA models are developed empirically from the data (as will be seen presently; McCleary et al., 1980 , p. 20). In describing ARIMA modeling, the following sections take the form of those discussing regression methods: Conceptual and mathematical treatments are provided in complement in order to provide the reader with a more holistic understanding of these methodologies.

## Introduction

The first step in ARIMA modeling is to visually examine a plot of the series' ACF (autocorrelation function) to see if there is any autocorrelation present that can be used to improve the regression model—or else the analyst may end up adding unnecessary terms. The ACF for the Google data is shown in Figure 3. Again, we will work with the seasonally adjusted series for simplicity. More formally, if a regression model has been fit, the Durbin–Watson test can be used to assess whether there is autocorrelation among the residuals and whether ARIMA terms can be included to improve its forecasts. The Durbin–Watson test evaluates the null hypothesis that there is no lag-1 autocorrelation present in the residuals. Thus, a rejection of the null means that ARIMA terms can be included (the Ljung–Box test described below can also be used; Hyndman and Athanasopoulos, 2014).
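The Durbin–Watson statistic is simple enough to compute directly: it is approximately 2(1 − r_1), where r_1 is the lag-1 residual autocorrelation, so values near 2 indicate uncorrelated residuals and values near 0 indicate strong positive autocorrelation. The sketch below (Python with numpy; the residual series here are simulated, not the Google model residuals) contrasts white-noise residuals with strongly autocorrelated ones.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: ~2 for uncorrelated residuals, ~0 under
    strong positive lag-1 autocorrelation, ~4 under strong negative."""
    e = np.asarray(e, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
white = rng.normal(size=500)         # uncorrelated residuals

# Build a strongly autocorrelated (AR(1)-like) residual series for contrast
ar = np.empty(500)
ar[0] = white[0]
for i in range(1, 500):
    ar[i] = 0.9 * ar[i - 1] + white[i]

dw_white = durbin_watson(white)      # expected to be near 2
dw_ar = durbin_watson(ar)            # expected to be well below 2
```

Note that the formal test compares the statistic against tabulated bounds that depend on the sample size and number of regressors; statistical software reports the accompanying p-value.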

Although the modeling techniques described in the present and following sections can be applied to any one of these models, due to space constraints we continue the tutorial on time series modeling using the cubic model of the first section. A model with only one predictor (viz., time) will allow more focus on the additional model terms that will be added to account for the autocorrelation in the data.

## I( d ): integrated

ARIMA is an acronym formed by the three constituent parts of these models. The AR(p) and MA(q) components are predictors that explain the autocorrelation. In contrast, the integrated (I[d]) portion of ARIMA models does not add predictors to the forecasting equation. Rather, it indicates the order of differencing that has been applied to the time series in order to remove any trend in the data and render it stationary. Before any AR or MA terms can be included, the series must be stationary. Thus, ARIMA models allow non-stationary series to be modeled thanks to this "integrated" component (an advantage over simpler ARMA models that do not include such terms; Cowpertwait and Metcalfe, 2009, p. 137). A time series that has been made stationary by taking the dth differences of the original series is notated as I(d). For instance, an I(1) model indicates that the series has been made stationary by taking its first differences; an I(2) model, by the second differences (i.e., the first differences of the first differences); and so on. Thus, the order of integrated terms in an ARIMA model merely specifies how many iterations of differencing were performed in order to make the series stationary so that AR and MA terms may be included.

## Identifying the order of differencing

Identifying the appropriate order of differencing to stationarize the series is the first and perhaps most important step in selecting an ARIMA model (Nau, 2014). It is also relatively straightforward. As stated previously, the order of differencing rarely needs to be greater than two in order to stationarize the series. Therefore, in practice the choice comes down to whether the series is transformed into its first or second differences, the optimal choice being the order of differencing that results in the lowest series variance (and does not produce the increase in variance that characterizes overdifferencing).
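The lowest-variance rule is easy to apply in code. The sketch below (Python, simulated data with an arbitrary linear trend) computes the variance of the original series and of its first and second differences; for a series with a roughly linear trend, the first differences should show a large drop in variance, while second-differencing inflates the variance again, the signature of overdifferencing.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(200)
y = 0.5 * t + rng.normal(0, 2, t.size)   # linear trend plus noise

d1 = np.diff(y, n=1)                     # first differences
d2 = np.diff(y, n=2)                     # second differences

# Compare variances: choose the order with the lowest variance that
# does not show the variance inflation characteristic of overdifferencing
v0, v1, v2 = y.var(), d1.var(), d2.var()
```

Here the first differences remove the trend and sharply reduce the variance, while the second differences increase it again, so d = 1 would be selected.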

## AR( p ): autoregressive terms

The first part of an ARIMA model is the AR( p ) component, which stands for autoregressive . As correlation is to regression, autocorrelation is to autoregression. That is, in regression, variables that are correlated with the criterion can be used for prediction, and the model specifies the criterion as a function of the predictors. Similarly, with a variable that is autocorrelated (i.e., correlated with itself across time periods), past values can serve as predictors, and the values of the time series are modeled as a function of previous values (thus, autoregression ). In other words, an ARIMA ( p, d, q ) model with p AR terms is simply a linear regression of the time series values against the preceding p observations. Thus, an ARIMA(1, d, q ) model includes one predictor, the observation immediately preceding the current value, and an ARIMA(2, d, q ) model includes two predictors, the first and second preceding observations. The number of these autoregressive terms is called the order of the AR component of the ARIMA model. The following equation uses one AR term (an AR[1] model) in which the preceding value in the time series is used as a regressor:

y_t = ϕ·y_{t−1} + ε_t

where ϕ is the autoregressive coefficient (interpretable as a regression coefficient), y_{t−1} is the immediately preceding observation, and ε_t is the error term. More generally, a model with AR(p) terms is expressed as:

y_t = ϕ_1·y_{t−1} + ϕ_2·y_{t−2} + … + ϕ_p·y_{t−p} + ε_t

## Selecting the number of autoregressive terms

The number of autoregressive terms required depends on how many lagged observations explain a significant amount of unique autocorrelation in the time series. Again, an analogy can be made to multiple linear regression: Each predictor should account for a significant amount of variance after controlling for the others. However, a significant autocorrelation at higher lags may be attributable to an autocorrelation at a lower lag. For instance, if a strong autocorrelation exists at lag-1, then a significant lag-3 autocorrelation (i.e., a correlation of time t with t−3) may be a result of t being correlated with t−1, t−1 with t−2, and t−2 with t−3 (and so forth). That is, a strong autocorrelation at an early lag can "persist" throughout the time series, inducing significant autocorrelations at higher lags. Therefore, instead of inspecting the ACF, which displays zero-order autocorrelations, a plot of the partial autocorrelation function (PACF) across different lags is the primary method for determining which prior observations explain a significant amount of unique autocorrelation and, accordingly, how many AR terms (i.e., lagged observations as predictors) should be included. Put simply, the PACF displays the autocorrelation of each lag after controlling for the autocorrelation due to all preceding lags (McCleary et al., 1980, p. 75). A conventional rule is that if there is a sharp drop in the PACF after p lags, then the preceding p observations are responsible for the autocorrelation in the series, and the model should include p autoregressive terms (the partial autocorrelation coefficient typically being the value of the autoregressive coefficient, ϕ; Cowpertwait and Metcalfe, 2009, p. 81). Additionally, the ACF of such a series will gradually decay (i.e., reduce) toward zero as the lag increases.
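The "controlling for preceding lags" idea can be made concrete: the partial autocorrelation at lag k is the coefficient on y_{t−k} in a regression of y_t on its previous k values. The sketch below (Python with numpy; the AR(1) series and its coefficient of 0.7 are simulated, not taken from the paper) computes the PACF this way and shows the expected sharp drop after lag 1.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Partial autocorrelations: for each lag k, regress x_t on its
    previous k values and keep the coefficient on x_{t-k}."""
    x = np.asarray(x, float) - np.mean(x)
    out = []
    for k in range(1, max_lag + 1):
        Y = x[k:]
        X = np.column_stack([x[k - j: len(x) - j] for j in range(1, k + 1)])
        b, *_ = np.linalg.lstsq(X, Y, rcond=None)
        out.append(b[-1])              # coefficient on the lag-k value
    return np.array(out)

# Simulated AR(1) series: only the lag-1 partial autocorrelation should be large
rng = np.random.default_rng(5)
e = rng.normal(size=1000)
x = np.empty(1000)
x[0] = e[0]
for i in range(1, 1000):
    x[i] = 0.7 * x[i - 1] + e[i]

p = pacf_ols(x, 5)                     # p[0] near 0.7, later lags near zero
```

This is the same quantity plotted by standard PACF routines, which additionally draw the ±2/√n significance bounds used to judge the drop-off.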

Applying this knowledge to the empirical example, Figure 3 depicted the ACF of the seasonally adjusted Google time series, and Figure 8 displays its PACF. Here, only one lagged partial autocorrelation is statistically significant (lag-6), despite over a dozen autocorrelations in the ACF reaching significance. Thus, it is probable that early lags—and the lag-6 in particular—are responsible for the chain of autocorrelation that persists throughout the series. Although the series is considerably non-stationary (i.e., there is a marked trend and seasonal component), if the series were already stationary, then a model with a single AR term (an AR[1] model) would likely provide the best fit, given a single significant partial autocorrelation at lag-6. The ACF in Figure 3 also displays the characteristics of an AR(1) series: It has many significant autocorrelations that gradually reduce toward zero. This coheres with the notion that one AR term is often sufficient for a residual time series (Cowpertwait and Metcalfe, 2009, p. 121). However, if the pattern of autocorrelation is more complex, then additional AR terms may be required. Importantly, if a particular number of AR terms has been successful in explaining the autocorrelation of a stationary series, the residual error series should appear as entirely random white noise (as in Figure 4).

A plot of the partial autocorrelation function (PACF) of the seasonally adjusted time series of Google job searches .

## MA( q ): moving average terms

In the preceding section, it was shown that one can account for the autocorrelation in the data by regressing on prior values in the series (AR terms). However, sometimes the autocorrelation is more easily explained by the inclusion of MA terms; the use of MA terms to explain the autocorrelation—either on their own or in combination with AR components—can result in greater parameter parsimony (i.e., fewer parameters), relative to relying solely on AR terms (Cowpertwait and Metcalfe, 2009 , p. 127). As noted above, ARIMA models assume that any systematic components have either been modeled or removed and that the time series is stationary—i.e., a stochastic process. In time series theory, the values of stochastic processes are determined by two forces: prior values, described in the preceding section, and random shocks (i.e., errors; McCleary et al., 1980 , pp. 18–19). Random shocks are the myriad variables that vary across time and interact with such complexity that their behavior is ostensibly random (e.g., white noise; McCleary et al., 1980 , p. 40). Each shock can be conceived of as an unobserved value at each point in time that influences each observed value of the time series. Thus, autocorrelation in the data may be explained by the persistence of prior values (or outputs , as in AR terms) or, alternatively, the lingering effects of prior unobserved shocks (i.e., the inputs , in MA terms). Therefore, if prior random shocks are related to the value of the series, then these can be included in the prediction equation to explain the autocorrelation and improve the efficiency of the forecasts generated by the model. In other words, just as AR terms can be conceived as a linear regression on previous time series values, MA terms are conceptually a linear regression of the current value of the series against prior random shocks. For instance, an MA(1) model can be expressed as:

y_t = ε_t + θ·ε_{t−1}

where ε_t is the value of the random shock at time t, ε_{t−1} is the value of the previous random shock, and θ is its coefficient (again, interpretable as a regression coefficient). More generally, the order of MA terms is conventionally denoted as q, and an MA(q) model can be expressed as:

y_t = ε_t + θ_1·ε_{t−1} + θ_2·ε_{t−2} + … + θ_q·ε_{t−q}

## Selecting the number of MA terms

Selecting the number of MA terms in the model is conceptually similar to the process of identifying the number of AR terms: One examines plots of the autocorrelation (ACF) and partial autocorrelation functions (PACF) and then specifies an appropriate model. However, while the number of AR terms could be identified by the PACF of the series (more specifically, the point at which the PACF dropped), the number of appropriate MA terms is usually identified by the ACF. Specifically, if the ACF is non-zero for the first q lags and then drops toward zero, then q MA terms should be included in the model (McCleary et al., 1980 , p. 79). All successive lags of the ACF are expected to be zero, and the PACF of such a series will be gradually decaying (McCleary et al., 1980 , p. 79). Thus, relative to AR terms, the roles of the ACF and PACF are essentially reversed when determining the number of MA terms. Furthermore, in practice most social processes can be sufficiently modeled by a single MA term; models of order q = 2 are less common, and higher-order models are extremely rare (McCleary et al., 1980 , p. 63).
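The ACF cutoff that identifies an MA(q) process can be verified by simulation. The sketch below (Python with numpy; the MA(1) series and its θ of 0.8 are invented for illustration) computes sample autocorrelations directly and shows the MA(1) signature: a clearly non-zero lag-1 autocorrelation followed by near-zero values at all later lags.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[:-k] * x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# Simulated MA(1) process: y_t = e_t + 0.8 * e_{t-1}
rng = np.random.default_rng(6)
e = rng.normal(size=2000)
y = e[1:] + 0.8 * e[:-1]

r = acf(y, 5)   # r[0] clearly non-zero; r[1:] should hover near zero
```

The theoretical lag-1 autocorrelation of an MA(1) process is θ/(1 + θ²), about 0.49 here, which is the value the first sample autocorrelation should approach.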

## Model building and further notes on ARIMA ( p, d, q ) models

The components of ARIMA models—autoregressive, integrated, and moving average—are aimed at explaining the autocorrelation in a series that is either stationary or can be made so through differencing (i.e., I[d] integrated terms). Though already stated, the importance of the following point warrants reiteration: After a successful ARIMA(p, d, q) model has been fit to the autocorrelated data, the residual error series should be white noise. That is, after a good-fitting model has been specified, the residual error series should display no significant autocorrelations and should have a mean of zero and a constant variance; i.e., there should be no remaining signal that can be used to improve the model's forecasts. Thus, after specifying a particular model, visual inspection of the ACF and PACF of the error series is critical in order to assess model adequacy (McCleary et al., 1980, p. 93). All autocorrelations are expected to be zero, with only about 5% reaching statistical significance due to sampling error.

Furthermore, just as there are formal methods to test that a series is stationary before fitting an ARIMA model, there are also statistical tests for the presence of autocorrelation after the model has been fit. The Ljung–Box test (Ljung and Box, 1978 ) is one commonly-applied method in which the null hypothesis is that the errors are uncorrelated across many lags (Cryer and Chan, 2008 , p. 184; Hyndman and Athanasopoulos, 2014 ). Thus, failing to reject the null provides evidence that the model has succeeded in explaining the remaining autocorrelation in the data.
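The Ljung–Box statistic itself is a simple function of the residual autocorrelations: Q = n(n + 2) Σ r_k²/(n − k) over the first m lags, compared against a chi-square distribution (with degrees of freedom reduced by the number of fitted ARMA parameters). The sketch below computes Q from scratch in Python on simulated residuals; the 5% chi-square critical value for 10 degrees of freedom is about 18.31, so white-noise residuals should typically fall below it while autocorrelated residuals exceed it by a wide margin.

```python
import numpy as np

def ljung_box_q(resid, m=10):
    """Ljung-Box Q over lags 1..m. Under white noise, Q is approximately
    chi-square with m df (fewer if ARMA parameters were estimated)."""
    resid = np.asarray(resid, float) - np.mean(resid)
    n = resid.size
    denom = np.sum(resid ** 2)
    r = np.array([np.sum(resid[:-k] * resid[k:]) / denom
                  for k in range(1, m + 1)])
    return n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, m + 1)))

rng = np.random.default_rng(7)
q_white = ljung_box_q(rng.normal(size=500))   # should be modest

# Autocorrelated "residuals" from a misspecified model, for contrast
noise = rng.normal(size=500)
ar = np.empty(500)
ar[0] = noise[0]
for i in range(1, 500):
    ar[i] = 0.7 * ar[i - 1] + noise[i]
q_ar = ljung_box_q(ar)                        # should be very large
```

In practice one would use a packaged implementation that also returns the p-value; the point here is only that failing to reject (a small Q) is the desired outcome after fitting.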

If both formal and informal methods indicate that the residual error series is not a series of white noise terms (i.e., there is remaining autocorrelation), then the analyst must reassess the pattern of autocorrelation and re-specify a new model. Thus, in contrast to regression approaches, ARIMA modeling is an exploratory, iterative process in which the data is examined, models are specified, checked for adequacy, and then re-specified as needed. However, selecting the most appropriate order of AR, I, and MA terms can prove to be a difficult process (Hyndman and Athanasopoulos, 2014 ). Fortunately, model comparison can be easily performed by comparing the Akaike information criterion (AIC) across models (Akaike, 1974 ) 7 . This statistic is based on the fit of a model and its number of parameters, and models with lower values should be selected. Generally, models within two AIC values are considered comparable, a difference of 4–7 points indicates considerable support for the better-fitting model, and a difference of 10 points or greater signifies full support of that model (Burnham and Anderson, 2004 , p. 271). Additionally, the “forecast” R package (Hyndman, 2014 ) contains a function to automatically derive the best-fitting ARIMA model based on the AIC or other fit criteria (see Hyndman and Khandakar, 2008 ). This procedure is discussed in the Supplementary Material.

Furthermore, a particular pattern of autocorrelation can often be explained by either AR or MA terms. Generally, AR terms are preferable to MA terms because the interpretation of their parameters is more straightforward (e.g., a regression coefficient associated with a previous time series value rather than a coefficient associated with an unobserved random shock). However, a more central concern is parameter parsimony; if a model using MA terms (or a combination of AR and MA terms) can explain the autocorrelation with fewer parameters than one that relies solely on AR terms, then that model is generally preferable.

Finally, although a mixed ARIMA model containing both AR and MA terms can result in greater parameter parsimony (Cowpertwait and Metcalfe, 2009, p. 127), in practice, non-mixed models (i.e., those with either AR or MA terms alone) should always be ruled out prior to fitting these more complex models (McCleary et al., 1980, p. 66). Unnecessary model complexity (i.e., redundant parameters) may not become evident at all during the process of model checking, while the inadequacy of simpler models is often easily identified (e.g., noticeable remaining autocorrelation in the ACF plot).

## Fitting a dynamic regression model with ARIMA terms

In this final section, we illustrate how a predictive ARIMA approach to time series modeling can be combined with regression methods through specification of a dynamic regression model. These models can be fit to the data in order to generate accurate forecasts, as well as explain or examine an underlying trend or seasonal effect (as opposed to their removal). We then analyze the predictions from this model and discuss methods of assessing forecast accuracy. For simplicity, we continue with the regression model that modeled the series as a cubic function of time.

## Preliminaries

When the predictor is time, one should begin specification of a dynamic regression model by examining the residual error series after the regression model has been fit. This is done in order to detect whether there is any autocorrelation in the model residuals that would warrant the inclusion of ARIMA terms. The residual error series is of interest because a dynamic regression model can be thought of as a hybrid model that includes a correction for autocorrelated errors. That is, whatever the regression model does not account for (trend, autocorrelation, etc.) can be supplemented by ARIMA modeling. Analytically, this is performed by re-specifying the initial regression model as an ARIMA model with regressors (sometimes called an "ARIMAX" model, the "X" denoting external predictors) and selecting the appropriate order of ARIMA terms that fit the autocorrelation structure in the residuals (Nau, 2014).

## Identifying the order of differencing: I( d ) terms

As noted previously, the residual error series of the cubic regression model exhibited a remaining trend and autocorrelation (see Figure 6). A significant Durbin–Watson test formally confirms that this is the case (i.e., the error terms are not uncorrelated; DW = 0.47, p < 0.001). Thus, ARIMA terms are necessary to (a) stationarize the series (I[d] terms) and (b) generate more accurate forecasts (AR[p] and/or MA[q] terms). As stated above, the conventional first step when formulating an ARIMA model is determining the number of I(d) terms (i.e., the order of differencing) required to remove any remaining trend and render the series stationary. We note that, in this case, the systematic seasonal effects have already been removed through seasonal adjustment. It was previously noted that in practice, removing a trend is almost always accomplished by taking either the first or second differences—whichever transformation results in the lowest variance and avoids overdifferencing (i.e., an increase in the series variance). Because the residual trend does not have a markedly changing slope, it is likely that only one order of differencing will be required. The results indicate that this is indeed the case: After first differencing, the series variance is reduced from 13.56 to 6.45, and an augmented Dickey–Fuller test rejects the null hypothesis of a non-stationary series (ADF = −4.50, p < 0.01). Taking the second differences also results in stationarity (i.e., the trend is removed), but leads to an overdifferenced series with a variance that is inflated to a level higher than that of the original error series (s2 = 14.90).

## Identification of AR( p ) and MA( q ) terms

After the order of I(d) terms has been identified (here, 1), the next step is to determine whether the pattern of autocorrelation can be better explained by AR terms, MA terms, or a combination of both. As noted, AR terms are often preferred to MA terms because their interpretation is more straightforward, and simpler models with either AR or MA terms alone are preferable to mixed models. We therefore begin by examining plots of the ACF and PACF for the residual error series, shown in Figure 9, in order to see if they display either an AR or MA "signature" (e.g., drop-offs or slow decays).

ACF and PACF of the cubic model residuals used to determine the number of AR and MA terms in an ARIMA model .

From Figure 9, we can see that there are many high autocorrelations in the ACF plot that slowly decay, indicating that AR terms are probably most suitable (a sharp drop in the ACF would instead suggest that the autocorrelation is better explained by MA terms). As stated earlier, the PACF gives the autocorrelation for a lag after controlling for all earlier lags; a significant drop in the PACF at a particular lag indicates that this lagged value is largely responsible for the large zero-order autocorrelations in the ACF. Based on this PACF, the number of terms to include is less clear; aside from the lag-0 autocorrelation, there is no perceptible drop-off in the PACF, and there are no strong partial autocorrelations to which the persistence of the autocorrelation seen in the ACF can be attributed. However, we know that there is autocorrelation in the model residuals, and that either one or two AR terms are typically sufficient to account for any autocorrelation (Cowpertwait and Metcalfe, 2009, p. 121). Therefore, we suspect that a single AR term can account for it. After fitting an ARIMA(1, 1, 0) model, a failure to reject the null hypothesis in a Ljung–Box test indicated that the model residuals were indistinguishable from a random white noise series (χ2 = 0.005, p = 0.94), and less than 5% of the autocorrelations in the ACF were statistically significant (the AIC of this model was 419.80). For illustrative purposes, several other models were fit to the data that included additional AR or MA terms, or a combination of both. Their relative fit was analyzed, and the results are shown in Table 2. As can be seen, the ARIMA(1, 1, 0) model provided a level of fit that exceeded that of all the other models (i.e., the smallest AIC difference among models was 4, showing considerable support). Thus, this model parsimoniously accounted for the systematic trend through a combination of regression modeling and first differencing and successfully extracted all the autocorrelation (i.e., signal) from the data in order to achieve more efficient forecasts.

Comparison of different ARIMA models.

| Model | Ljung–Box test | AIC |
| --- | --- | --- |
| ARIMA(1, 1, 0): one AR term | χ2 = 0.005, p = 0.94 | 419.80 |
| ARIMA(0, 1, 1): one MA term | χ2 = 0.01, p = 0.92 | 423.84 |
| ARIMA(1, 1, 1): a mixed model | χ2 = 0.02, p = 0.89 | 425.84 |
| ARIMA(2, 1, 0): two AR terms | χ2 = 0.61, p = 0.43 | 448.79 |
| ARIMA(0, 1, 2): two MA terms | χ2 = 0.02, p = 0.89 | 425.84 |

ACF plots for all models showed that <5% of autocorrelations reached statistical significance.

## Forecasting methods and diagnostics

Because forecasts into the future cannot be directly assessed for accuracy until the actual values are observed, it is important that the analyst establish the adequacy of the model prior to forecasting. To do this, the analyst can partition the data into two parts: the estimation period , comprising about 80% of the initial observations and used to estimate the model parameters, and the validation period , usually about 20% of the data and used to ensure that the model predictions are accurate. These percentages may shift depending on the length of the series (see Nua, 2014 ), but the size of the validation period should at least equal the number of periods ahead the analyst wishes to forecast (Hyndman and Athanasopoulos, 2014 ). The predictions generated by the model are then compared to the observed data in the validation period to assess their accuracy. Evaluating forecast accuracy is accomplished by examining the residuals for any systematic patterns of misspecification. Forecasts should ideally be located within the 95% confidence limits, and formal statistics can be calculated from the model residuals in order to evaluate its adequacy. A popular and intuitive statistic is the mean absolute error (MAE): the average absolute deviation from the predicted values. However, this value cannot be used to compare models, as it is scale-dependent (e.g., a residual with an absolute value of 10 is much less egregious when forecasting from a series whose mean is 10,000 relative to a series with a mean of 10). Another statistic, the mean absolute percentage error (MAPE) is useful for comparing across models and is defined as the average percentage that the forecasts deviated from the observed values. Other methods and statistics, such as the root mean squared error (RMSE) and the mean absolute scaled error (MASE) can aid model evaluation and selection and are accessibly discussed by Hyndman and Athanasopoulos ( 2014 , chap. 2). 
Once a forecasting model has been deemed sufficiently accurate through these methods, forecasts into the future can then be calculated with relative confidence.
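The estimation/validation split and the two accuracy statistics described above can be sketched as follows. The paper's Supplementary tutorial uses R; this is a minimal, self-contained Python illustration with made-up numbers (not the paper's data), and the prediction values are purely hypothetical.

```python
# Sketch of forecast-accuracy checks on a held-out validation period,
# using illustrative numbers (not the paper's data).

def mae(actual, predicted):
    """Mean absolute error: average absolute deviation (scale-dependent)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error: scale-free, comparable across models."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

series = [72, 74, 75, 73, 76, 78, 77, 79, 80, 78]
split = int(len(series) * 0.8)                    # 80% estimation period
estimation, validation = series[:split], series[split:]

# Hypothetical one-step-ahead predictions for the validation period
predictions = [79.5, 79.0]

print(mae(validation, predictions))               # in the units of the series
print(mape(validation, predictions))              # as a percentage
```

The MAE here is interpretable only relative to the scale of this series, whereas the MAPE could be compared across series measured in different units.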

Because we have the benefit of hindsight in our example, all observations were used for estimation, and six forecasts were generated for the remainder of 2011 and compared to the actual observed values. The point forecasts (blue line) and the 80% and 95% confidence limits are displayed in Figure 10, juxtaposed against the actual values in red. As can be seen, this forecasting model is generally successful: each observed value lies within the 80% limits, and the residuals have a low mean absolute error (MAE = 2.03) relative to the series mean (M = 75.47), as well as a low mean absolute percentage error (MAPE = 2.33%). Additional statistics verified the accuracy of these predictions, and the full results of the analysis can be obtained from the first author.

Forecasts from the dynamic regression model compared to the observed values. The blue line represents the forecasts, and the red dotted line indicates the observed values. The darker gray region denotes the 80% confidence region; the lighter gray, the 95%.

As a final note on ARIMA modeling, if the sole goal of the analysis is to produce accurate forecasts, then the seasonal and trend components represent a priori barriers to this goal and should be removed through seasonal adjustment and the I(d) terms of an appropriate ARIMA model, respectively. Such purely predictive models are often easier to implement, as there are no systematic components of interest to describe or estimate; they are simply removed through transformations in order to achieve a stationary series. Finally, we close this section with two tables. The first, Table 3, compiles the general steps involved in ARIMA time series modeling described above, from selecting the optimal order of ARIMA terms to assessing forecast accuracy. The second, Table 4, provides a reference for the various time series terms introduced in the current paper.

Steps for specifying an ARIMA forecasting model .

Step 1. Confirm the presence of autocorrelation. | If there is autocorrelation in the data, then an ARIMA model can be used for forecasting, or ARIMA terms can be included within an existing regression model to improve its forecast accuracy (i.e., a dynamic regression/ARIMAX model). | • Examine a plot of the ACF for any large autocorrelations across different lags. In a white noise series, 5% of autocorrelations are expected to reach statistical significance by chance, so one must consider the strength of the autocorrelation in addition to its statistical significance for the best diagnosis. • If a regression model has been fit to the data, one can formally test for a lag-1 autocorrelation in the residuals by conducting a Durbin–Watson or Ljung–Box test (see Table 1). |

Step 2. Determine if the series is stationary. | Before AR or MA terms can be included in the model to account for the autocorrelation, the series must be stationary (i.e., a constant mean, variance, and autocorrelation). | • Examine a plot of the series for systematic changes in its mean level (i.e., trend or seasonal effects) and variance. • Conduct an ADF test to formally test for stationarity. |

Step 3. Transform the series to stationarity. | AR and MA terms assume a stationary series, and this assumption must be met before modeling the autocorrelation. | • If the variance is not constant over time, taking the natural logarithm of the series can stabilize it. • Seasonal effects can be removed through seasonal adjustment. • A trend component can be removed through differencing and nearly always through either its first or second differences (I[1] or I[2] terms in an ARIMA model, respectively). Each successive order of differencing should further remove the trend and reduce the overall series variance. (But be careful to avoid overdifferencing the series, indicated by an increase in its variance.) • Confirm the series is stationary by performing an ADF test. |

Step 4. Partition the data into estimation and validation periods. | Before a forecasting model is used, its accuracy should be assessed. This entails reserving some data in the latter portion of the series (the validation period) to compare to the predictions generated by the model. However, the majority of the data should still be used for parameter estimation. | • As a general rule, the first 80% of the series can be used to estimate the parameters and the remaining 20% to assess the accuracy of the model predictions. • For longer series, a larger percentage can be used for the validation period, and its size should be at least as large as the number of periods forecasted ahead. |

Step 5. Examine the ACF and PACF, and fit a parsimonious ARIMA model. | Examining the ACF and PACF of a series can indicate how many AR and MA terms will be required to explain the series' autocorrelation. | • A pattern of autocorrelation that is best explained by AR terms has a steadily decaying ACF and a PACF that cuts off after a few lags. If this is the case, then AR terms will generally be required. • If the autocorrelation displays an MA signature (a drop-off in the ACF after a few lags and a gradually decaying PACF), then a model with MA terms will likely provide the best fit to the data. • Ordinarily, only one or two AR or MA terms are required to explain a series' autocorrelation. |

Step 6. Examine model sufficiency. | A successful model will have extracted all of the autocorrelation from the data after being fit. Noticeable remaining autocorrelation indicates that the model can be improved. | • Examine a plot of the model residuals, which should appear as random white noise. • Conduct a Ljung–Box test on the residuals to formally assess whether their autocorrelations are significantly different from those expected of a white noise series. |

Step 7. Re-specify the model if necessary and use the AIC to compare models. | An initial model may not successfully explain all the autocorrelation present in the data. Alternatively, a model may successfully account for the autocorrelation but be needlessly complex (i.e., more AR or MA terms than are necessary). Thus, ARIMA modeling is an iterative, exploratory process in which multiple models are specified and then compared. | • Sometimes a mixed model can explain the autocorrelation using fewer parameters. Alternatively, a simpler model may also fit the data well. These models can be specified and checked for adequacy (Step 6). • Among the fitted models, compare the AIC, which evaluates the fit of the model and includes penalties for model complexity; models with smaller AIC values indicate a superior relative fit. • As a rule of thumb, models within two AIC points of the best model have substantial support, a difference of 4–7 points indicates considerably less support, and a difference of 10 points or greater signifies essentially no support (Burnham and Anderson, 2004). |

Step 8. Generate predictions and compare to observations in the validation period. | Once a model has been chosen, comparing the model predictions to the observations within the validation period allows the analyst to determine whether the model produces accurate forecasts during time periods that have been observed. This provides evidence that it will also produce accurate forecasts of future values, whose accuracy cannot be immediately evaluated. | • After estimating model parameters from the first portion of the data, compare the remaining observations to the predicted values given by the model. • Observed values should ideally be located within the 95% confidence limits of the forecasts. • Calculate statistics that quantify forecast accuracy, such as the MAE and MAPE. |

Step 9. Generate forecasts into the future. | After a good-fitting model has been selected and checked for forecasting accuracy, it can be used to generate forecasts into the future. | • Determine how many periods ahead into the future to forecast. • ARIMA models can provide accurate forecasts several periods into the future, but long-term forecasting is inherently more uncertain. |

ACF, Autocorrelation function; PACF, Partial autocorrelation function; ADF, augmented Dickey–Fuller; AIC, Akaike information criterion .
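The core of this workflow (difference to stationarity, fit a parsimonious autoregressive model, forecast) can be sketched compactly in Python; the paper's own tutorial uses R. The toy series and the least-squares AR(1) fit below are illustrative stand-ins, not the paper's data or its estimation method (ARIMA software typically uses maximum likelihood).

```python
# Compact sketch of the ARIMA workflow on a toy trending series:
# difference once to remove the trend (an I(1) term), fit an AR(1)
# to the differenced series by ordinary least squares, then forecast
# one step ahead by undoing the differencing.

def difference(series):
    """First differences: removes a (roughly linear) trend."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(series):
    """Least-squares AR(1): x_t = c + phi * x_{t-1} + error."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    return c, phi

series = [50, 52, 55, 55, 59, 62, 63, 66, 70, 71, 75, 78]
diffed = difference(series)              # transform to stationarity
c, phi = fit_ar1(diffed)                 # fit a parsimonious model

# One-step-ahead forecast: predict the next difference, then undifference
next_diff = c + phi * diffed[-1]
forecast = series[-1] + next_diff
print(round(forecast, 2))
```

In practice the residuals of such a fit would then be checked for remaining autocorrelation (Ljung–Box test) and competing specifications compared via the AIC, as the table describes.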

Glossary of time series terms .

Trend | The overarching long-term change in the mean level of a time series. | Trends often represent time series effects that are theoretically interesting, such as the result of a critical event or the effect of other variables. Importantly, trends may be either deterministic or stochastic. Deterministic trends are those due to the constant effects of a few causal forces. As a result, they are generally stable across time and are suitable to be modeled through regression. In contrast, stochastic trends arise simply by chance and are consequently not suitably modeled through regression methods. |

Seasonality | A pattern of rises and falls in the mean level of a series that consistently occurs across time periods. | Seasonal effects may be substantively interesting (in which case they should be estimated) or they may obscure other more important components, such as a trend (in which case they should be removed). |

Cycles | Any repeating pattern in the mean level of a series whose duration is not fixed or known and generally occurs over a period of 2 or more years. | Cycles may also represent patterns of interest. However, cycles are more difficult to identify and generally require longer series to be adequately captured. |

Autocorrelation | When current observations exhibit a dependence upon prior states, manifesting statistically as a correlation between lagged observations. | The presence of autocorrelation means that there is signal in the data that can be modeled by AR or MA terms to generate more accurate forecasts. |

Stationarity | When the mean, variance, and autocorrelation of a series are constant across time. | Descriptive statistics of a time series are only meaningful when it is stationary. Furthermore, before a time series can be modeled by AR or MA terms it must be made stationary. |

Seasonal adjustment | A process of estimating the seasonal effects and removing them from the series. | Seasonal adjustment can remove a source of variation that is not interesting from a theoretical perspective so that the elements of a time series that are of interest can be more clearly analyzed (e.g., a trend). |

Differencing | The process of transforming the values of a series into a series of the differences between observations adjacent in time. | Differencing removes the trend from a time series and thus helps to make the mean of a time series stationary. |

Autocorrelation function (ACF) | A measure of linear association (correlation) between the current time series values with its past series values. | The ACF allows the analyst to see if there is any autocorrelation in the data and at what lags it manifests. It is essential in identifying the appropriate number of AR and MA terms to explain the pattern of the residuals. It is also valuable for determining if there is any remaining autocorrelation after an ARIMA model has been fit (i.e., model diagnostics). |

Partial autocorrelation function (PACF) | A measure of linear association (correlation) between the current time series values with its past series values after controlling for the intervening observations. | The PACF is useful for identifying the number of AR or MA terms that will explain the autocorrelation in the data. |

Integrated (I) | In an ARIMA model, the number of times the series has been differenced in order to make it stationary. | Stationarity is an assumption that must be met before any AR or MA terms can be included in a model. In an ARIMA model, the Integrated component allows the inclusion of series that are non-stationary in the mean. |

Autoregressive (AR) | When a variable is regressed on its prior values in order to account for autocorrelation. | AR terms are able to account for autocorrelation in the data to improve forecasts. |

Moving average (MA) | When a variable is regressed on past random shocks (error terms) in order to account for autocorrelation. | MA terms are able to account for autocorrelation in the data to improve forecasts. |

Dynamic regression (ARIMAX) | A time series model that includes both regression and ARIMA terms. | A model that includes both explanatory variables and AR or MA terms can be used to simultaneously model an underlying trend and generate accurate forecasts. |
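The sample autocorrelation function defined in the glossary above can be computed directly. The sketch below is a minimal Python illustration (the paper's tutorial uses R, where `acf()` does this) using the standard estimator with a common mean; the alternating example series is invented.

```python
# Minimal sketch of the sample autocorrelation function (ACF):
# the correlation between a series and its own values k lags back.

def acf(series, lag):
    """Sample autocorrelation at the given lag (common-mean estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

# A strongly alternating series shows a large negative lag-1 autocorrelation
series = [1, -1, 1, -1, 1, -1, 1, -1]
print(acf(series, 1))
```

Values can be judged against approximate white noise bounds of about ±1.96/√n; autocorrelations well outside those bounds suggest signal that AR or MA terms could exploit.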

## Addendum: further time series techniques and resources

Finally, because time series analysis contains a wide range of analytic techniques, there was not room to cover them all here (or in any introductory article, for that matter). For a discussion of computing correlations between time series (i.e., the cross-correlation function), the reader is directed to McCleary et al. (1980). For a general introduction to regression modeling of time series, Cowpertwait and Metcalfe (2009) and Ostrom (1990) have excellent discussions, the latter describing the process of identifying lagged effects. For a highly accessible exposition of identifying cycles or seasonal effects within the data through periodogram and spectral analysis, the reader should consult Warner (1998), a text written for social scientists that also describes cross-spectral analysis, a method for assessing how well cycles within two series align. For regression modeling using other time series as substantive predictors, the analyst can use transfer function or dynamic regression modeling and is referred to Pankratz (1991) and Shumway and Stoffer (2006) for further reading. For additional information on forecasting with ARIMA models and other methods, we refer the reader to Hyndman and Athanasopoulos (2014) and McCleary et al. (1980). Finally, multivariate time series analysis can model reciprocal causal relations among time series through a technique called vector ARMA modeling, and for discussions we recommend Liu (1986), Wei (2006), and the introduction in Pankratz (1991, chap. 10). Future work should attempt to incorporate these analytic frameworks within psychological research, as the analysis of time series brings in a host of complex issues (e.g., detecting cycles, guarding against spurious regression and correlation) that must be handled appropriately for proper data analysis and the development of psychological theory.

Time series analysis has proved to be integral for many disciplines over many decades. As time series data becomes more accessible to psychologists, these methods will be increasingly central to addressing substantive research questions in psychology as well. Indeed, we believe that such shifts have already started and that an introduction to time series analysis is substantively important. By integrating time-series methodologies within psychological research, scholars will be impelled to think about how variables at various psychological levels may exhibit trends, cyclical or seasonal patterns, or a dependence on prior states (i.e., autocorrelation). Furthermore, when examining the influence of salient events or “shocks,” essential questions, such as “What was the pre-event trend?” and “How long did its effects endure, and what was its trajectory?” will become natural extensions. In other words, researchers will think in an increasingly longitudinal manner and will possess the necessary statistical knowledge to answer any resulting research questions—the importance of which was demonstrated above.

The ultimate goal of this introductory paper is to foster such fruitful lines of conceptualizing research. The more proximal goal is to provide an accessible yet comprehensive exposition of a number of time series modeling techniques fit for addressing a wide range of research questions. These models were based in descriptive, explanatory, and predictive frameworks—all three of which are necessary to accommodate the complex, dynamic nature of psychological theory and its data.

## Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

1 These journals were: Psychological Review, Psychological Bulletin, Journal of Personality and Social Psychology, Journal of Abnormal Psychology, Cognition, American Psychologist, Journal of Applied Psychology, Psychological Science, Perspectives on Psychological Science, Current Directions in Psychological Science, Journal of Experimental Psychology: General, Cognitive Psychology, Trends in Cognitive Sciences, Personnel Psychology, and Frontiers in Psychology .

2 The specific search term was, “jobs – Steve Jobs” which excluded the popular search phrase “Steve Jobs” that would have otherwise unduly influenced the data.

3 Thus, the highest value in the series must be set at 100—i.e., 100% of itself. Furthermore, although measuring a variable in terms of percentages can be misleading when assessing practical significance (e.g., a change from 1 to 4 is a 300% increase, but may not be a large change in practice), the presumably large raw numbers of searches that include the term “jobs” entail that even a single point increase or decrease in the data is notable.

4 In addition to the two classical models (additive and multiplicative) described above, there are further techniques for time series decomposition that lie beyond the scope of this introduction (e.g., STL or X-12-ARIMA decomposition). These overcome the known shortcomings of classical decomposition (e.g., the first and last several estimates of the trend component are not calculated; Hyndman and Athanasopoulos, 2014), although classical decomposition remains the most commonly used method. For information regarding these alternative methods the reader is directed to Cowpertwait and Metcalfe (2009, pp. 19–22) and Hyndman and Athanasopoulos (2014, chap. 6).

5 Importantly, the current paper discusses dynamic models that specify time as the regressor (either as a linear or polynomial function). For modeling substantive predictors, more sophisticated techniques are necessary, and the reader is directed to Pankratz ( 1991 ) for a description of this method.

6 Just like in traditional regression, the parent term t is centered before creating the polynomial term in order to mitigate collinearity.

7 The use of additional fit indices, such as the AIC c (a variant of the AIC for small samples) and Bayesian information criterion (BIC) is also recommended, but we focus on the AIC here for simplicity.

## Supplementary material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.00727/abstract

- Aguinis H., Gottfredson R. K., Joo H. (2013). Best-practice recommendations for defining, identifying, and handling outliers . Organ. Res. Methods 16 , 270–301. 10.1177/1094428112470848 [ CrossRef ] [ Google Scholar ]
- Akaike H. (1974). A new look at the statistical model identification . IEEE Trans. Automat. Contr . 19 , 716–723. 10.1109/TAC.1974.1100705 [ CrossRef ] [ Google Scholar ]
- Almagor M., Ehrlich S. (1990). Personality correlates and cyclicity in positive and negative affect . Psychol. Rep . 66 , 1159–1169. [ PubMed ] [ Google Scholar ]
- Anderson O. (1976). Time Series Analysis and Forecasting: The Box-Jenkins Approach . London: Butterworths. [ Google Scholar ]
- Aschoff J. (1984). A survey of biological rhythms , in Handbook of Behavioral Neurobiology, Vol. 4, Biological Rhythms , ed Aschoff J. (New York, NY: Plenum; ), 3–10. [ Google Scholar ]
- Beer M., Walton A. E. (1987). Organization change and development . Annu. Rev. Psychol . 38 , 339–367. [ Google Scholar ]
- Bell W. R., Hillmer S. C. (1984). Issues involved with the seasonal adjustment of time series . J. Bus. Econ. Stat . 2 , 291–320. 10.2307/1391266 [ CrossRef ] [ Google Scholar ]
- Bolger N., DeLongis A., Kessler R. C., Schilling E. A. (1989). Effects of daily stress on negative mood . J. Pers. Soc. Psychol . 57 , 808–818. [ PubMed ] [ Google Scholar ]
- Burnham K., Anderson D. (2004). Multimodel inference: understanding AIC and BIC in model selection . Sociol. Methods Res . 33 , 261–304. 10.1177/0049124104268644 [ CrossRef ] [ Google Scholar ]
- Busk P. L., Marascuilo L. A. (1988). Autocorrelation in single-subject research: a counterargument to the myth of no autocorrelation . Behav. Assess . 10 , 229–242. [ Google Scholar ]
- Carayon P. (1995). Chronic effect of job control, supervisor social support, and work pressure on office worker stress , in Organizational Risk Factors for Job Stress , eds Sauter S. L., Murphy L. R. (Washington, DC: American Psychological Association; ), 357–370. [ Google Scholar ]
- Chatfield C. (2004). The Analysis of Time Series: An Introduction, 6th Edn . New York, NY: Chapman and Hall/CRC. [ Google Scholar ]
- Conroy R. T., Mills W. L. (1970). Human Circadian Rhythms . Baltimore, MD: The Williams and Wilkins Company. [ Google Scholar ]
- Cook T. D., Campbell D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings . Boston, MA: Houghton Mifflin. [ Google Scholar ]
- Cowpertwait P. S., Metcalfe A. (2009). Introductory Time Series with R . New York, NY: Springer-Verlag. [ Google Scholar ]
- Cryer J. D., Chan K.-S. (2008). Time Series Analysis: With Applications in R, 2nd Edn . New York, NY: Springer. [ Google Scholar ]
- Dalal R. S., Bhave D. P., Fiset J. (2014). Within-person variability in job performance: A theoretical review and research agenda . J. Manag . 40 , 1396–1436. 10.1177/0149206314532691 [ CrossRef ] [ Google Scholar ]
- Dettling M. (2013). Applied Time Series Analysis [PDF Document] . Available online at: http://stat.ethz.ch/education/semesters/ss2012/atsa/ATSA-Scriptum-SS2012-120521.pdf
- Fairbairn C. E., Sayette M. A. (2013). The effect of alcohol on emotional inertia: a test of alcohol myopia . J. Abnorm. Psychol . 122 , 770–781. 10.1037/a0032980 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Friston K. J., Holmes A. P., Poline J. B., Grasby P. J., Williams S. C. R., Frackowiak R. S. J., et al.. (1995). Analysis of fMRI time series revisited . Neuroimage 2 , 45–53. [ PubMed ] [ Google Scholar ]
- Friston K. J., Josephs O., Zarahn E., Holmes A. P., Rouquette S., Poline J-B. (2000). To smooth or not to smooth? Bias and efficiency in fMRI time series analysis . Neuroimage 12 , 196–208. 10.1006/nimg.2000.0609 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Fuller J. A., Stanton J. M., Fisher G. G., Spitzmuller C., Russell S. S. (2003). A lengthy look at the daily grind: time series analysis of events, mood, stress, and satisfaction . J. Appl. Psychol . 88 , 1019–1033. 10.1037/0021-9010.88.6.1019 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- George J. M., Jones G. R. (2000). The role of time in theory and theory building . J. Manage . 26 , 657–684. 10.1177/014920630002600404 [ CrossRef ] [ Google Scholar ]
- Ginsberg J., Mohebbi M. H., Patel R. S., Brammer L., Smolinski M. S., Brilliant L. (2009). Detecting influenza epidemics using search engine query data . Nature 457 , 1012–1014. 10.1038/nature07634 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Glass G. V., Willson V. L., Gottman J. M. (1975). Design and Analysis of Time Series Experiments . Boulder, CO: Colorado Associated University Press. [ Google Scholar ]
- Hartmann D. P., Gottman J. M., Jones R. R., Gardner W., Kazdin A. E., Vaught R. S. (1980). Interrupted time-series analysis and its application to behavioral data . J. Appl. Behav. Anal . 13 , 543–559. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- Hays W. L. (1981). Statistics, 2nd Edn . New York, NY: Holt, Rinehart, and Winston. [ Google Scholar ]
- Hyndman R. J. (2014). Forecast: Forecasting Functions for Time Series and Linear Models . R Package Version 5.4. Available online at: http://CRAN.R-project.org/package=forecast
- Hyndman R. J., Athanasopoulos G. (2014). Forecasting: Principles and Practice . OTexts. Available online at: http://otexts.org/fpp/
- Hyndman R. J., Khandakar Y. (2008). Automatic time series forecasting: the forecast package for R . J. Stat. Softw . 26 , 1–22. [ Google Scholar ]
- Jones R. R., Vaught R. S., Weinrott M. (1977). Time series analysis in operant research . J. Appl. Behav. Anal . 10 , 151–166. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- Kanner A. D., Coyne J. C., Schaefer C., Lazarus R. S. (1981). Comparisons of two modes of stress measurement: daily hassles and uplifts versus major life events . J. Behav. Med . 4 , 1–39. [ PubMed ] [ Google Scholar ]
- Kelly J. R., McGrath J. E. (1988). On Time and Method . Newbury Park, CA: Sage. [ Google Scholar ]
- Kerlinger F. N. (1973). Foundations of Behavioral Research, 2nd Edn . New York, NY: Holt, Rinehart. [ Google Scholar ]
- Killingsworth M. A., Gilbert D. T. (2010). A wandering mind is an unhappy mind . Science 330 , 932–933. 10.1126/science.1192439 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Kuljanin G., Braun M. T., DeShon R. P. (2011). A cautionary note on applying growth models to longitudinal data . Psychol. Methods 16 , 249–264. 10.1037/a0023348 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Kumari V., Corr P. J. (1996). Menstrual cycle, arousal-induction, and intelligence test performance . Psychol. Rep . 78 , 51–58. [ PubMed ] [ Google Scholar ]
- Larsen R. J., Kasimatis M. (1990). Individual differences in entrainment of mood to the weekly calendar . J. Pers. Soc. Psychol . 58 , 164–171. [ PubMed ] [ Google Scholar ]
- Larson R., Csikszentmihalyi M. (1983). The experience sampling method . New Dir. Methodol. Soc. Behav. Sci . 15 , 41–56. [ Google Scholar ]
- Latman N. (1977). Human sensitivity, intelligence and physical cycles and motor vehicle accidents . Accid. Anal. Prev . 9 , 109–112. [ Google Scholar ]
- Liu L.-M. (1986). Multivariate Time Series Analysis using Vector ARMA Models . Lisle, IL: Scientific Computing Associates. [ Google Scholar ]
- Ljung G. M., Box G. E. P. (1978). On a measure of lack of fit in time series models . Biometrika 65 , 297–303. [ Google Scholar ]
- Luce G. G. (1970). Biological Rhythms in Psychiatry and Medicine . Public Health Service Publication (Public Health Service Publication No. 2088). U.S. Institute of Mental Health.
- McCleary R., Hay R. A., Meidinger E. E., McDowall D. (1980). Applied Time Series Analysis for the Social Sciences . Beverly Hills, CA: Sage. [ Google Scholar ]
- McGrath J. E., Rotchford N. L. (1983). Time and behavior in organizations . Res. Organ. Behav . 5 , 57–101. [ Google Scholar ]
- Meko D. M. (2013). Applied Time Series Analysis [PDF Documents]. Available online at: http://www.ltrr.arizona.edu/~dmeko/geos585a.html#chandout
- Mills T. C., Markellos R. N. (2008). The Econometric Modeling of Financial Time Series, 3rd Edn . Cambridge: Cambridge University Press. [ Google Scholar ]
- Mitchell T. R., James L. R. (2001). Building better theory: time and the specification of when things happen . Acad. Manage. Rev . 26 , 530–547. 10.5465/AMR.2001.5393889 [ CrossRef ] [ Google Scholar ]
- Muenchen R. A. (2013). The Popularity of Data Analysis Software . Available online at: http://r4stats.com/articles/popularity/
- Nau R. (2014). Statistical Forecasting , [Online Lecture Notes]. Available online at: http://people.duke.edu/~rnau/411home.htm
- Ostrom C. W. (1990). Time Series Analysis: Regression Techniques . Newbury Park, CA: Sage. [ Google Scholar ]
- Pankratz A. (1991). Forecasting with Dynamic Regression Models . New York, NY: Wiley. [ Google Scholar ]
- Pearce J. L., Stevenson W. B., Perry J. L. (1985). Managerial compensation based on organizational performance: a time series analysis of the effects of merit pay . Acad. Manage. J . 28 , 261–278. [ Google Scholar ]
- Persons W. M. (1919). Indices of business conditions . Rev. Econ. Stat . 1 , 5–107. [ Google Scholar ]
- Pinheiro J., Bates D., DebRoy S., Sarkar D., R Core Team. (2014). NLME: Linear and Nonlinear Mixed Effects Models . R Package Version 3.1-117. Available online at: http://CRAN.R-project.org/package=nlme
- Polgreen P. M., Chen Y., Pennock D. M., Forrest N. D. (2008). Using internet searches for influenza surveillance . Clin. Infect. Dis . 47 , 1443–1448. [ PubMed ] [ Google Scholar ]
- Popper K. R. (1968). The Logic of Scientific Discovery . New York, NY: Harper and Row. [ Google Scholar ]
- R Development Core Team. (2011). R: A Language and Environment for Statistical Computing . Vienna: R Foundation for Statistical Computing; Available online at: http://www.R-project.org/ [ Google Scholar ]
- Rothman P. (eds.). (1999). Nonlinear Time Series Analysis of Economic and Financial Data . Dordrecht: Kluwer Academic Publishers. [ Google Scholar ]
- Said S. E., Dickey D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order . Biometrika 71 , 599–607. [ Google Scholar ]
- Sharpe D. (2013). Why the resistance to statistical innovations? Bridging the communication gap . Psychol. Methods 18 , 572–582. 10.1037/a0034177 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Shmueli G. (2010). To explain or to predict? Stat. Sci . 25 , 289–310. 10.1214/10-STS330 [ CrossRef ] [ Google Scholar ]
- Shumway R. H., Stoffer D. S. (2006). Time Series Analysis and Its Applications with R Examples, 2nd Edn . New York, NY: Springer. [ Google Scholar ]
- Stanton J. M., Rogelberg S. G. (2001). Using Internet/Intranet web pages to collect psychological research data . Organ. Res. Methods 4 , 200–217. 10.1177/109442810143002 [ CrossRef ] [ Google Scholar ]
- Tobias R. (2009). Changing behavior by memory aids: a social–psychological model of prospective memory and habit development tested with dynamic field data . Psychol. Rev . 116 , 408-438. 10.1037/a0015512 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Trapletti A., Hornik K. (2013). Tseries: Time Series Analysis and Computational Finance . R Package Version 0.10-32.
- United States Department of Labor, Bureau of Labor Statistics. (2014). Labor Force Statistics from the Current Population Survey [Data Set]. Available online at: http://data.bls.gov/timeseries/LNU04000000
- Wagner A. K., Soumerai S. B., Zhang F., Ross-Degnan D. (2002). Segmented regression analysis of interrupted time series studies in medication use research . J. Clin. Pharm. Ther . 27 , 299–309. 10.1046/j.1365-2710.2002.00430.x [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Wagner J. A., Rubin P. A., Callahan T. J. (1988). Incentive payment and nonmanagerial productivity: an interrupted time series analysis of magnitude and trend . Organ. Behav. Hum. Decis. Process . 42 , 47–74. [ Google Scholar ]
- Warner R. M. (1998). Spectral Analysis of Time-Series Data . New York, NY: Guilford Press. [ Google Scholar ]
- Wei W. S. (2006). Time Series Analysis: Univariate and Multivariate Methods, 2nd Edn . London: Pearson. [ Google Scholar ]
- Weiss H. M., Cropanzano R. (1996). Affective events theory: a theoretical discussion of the structure, causes and consequences of affective experiences at work , in Research in Organizational Behavior: An Annual Series of Analytical Essays and Critical Reviews , eds Staw B. M., Cummings L. L. (Greenwich, CT: JAI Press; ), 1–74. [ Google Scholar ]
- West S. G., Hepworth J. T. (1991). Statistical issues in the study of temporal data: daily experience . J. Pers . 59 , 609–662. [ PubMed ] [ Google Scholar ]
- Wood P., Brown D. (1994). The study of intraindividual differences by means of dynamic factor models: rationale, implementation, and interpretation . Psychol. Bull . 116 , 166–186. [ Google Scholar ]
- Zaheer S., Albert S., Zaheer A. (1999). Time-scale and psychological theory . Acad. Manage. Rev . 24 , 725–741. [ Google Scholar ]
- Zeileis A., Hothorn T. (2002). Diagnostic checking in regression relationships . R News 3 , 7–10. [ Google Scholar ]

Although time series analysis has been frequently used in many disciplines, it has not been well-integrated within psychological research. In part, constraints in data collection have often limited longitudinal research to only a few time points. However, these practical limitations do not eliminate the theoretical need for understanding patterns of change over long periods of time or over many occasions. Psychological processes are inherently time-bound, and it can be argued that no theory is truly time-independent ( Zaheer et al., 1999 ). Further, the prolific use of time series analysis in economics, engineering, and the natural sciences may be an indicator of its potential in our field, and recent technological growth has already initiated shifts in data collection that make time series designs increasingly common. For instance, online behaviors can now be quantified and tracked in real-time, leading to an accessible and rich source of time series data (see Stanton and Rogelberg, 2001 ). As a leading example, Ginsberg et al. (2009) developed methods of influenza tracking based on Google queries whose efficiency surpassed conventional systems, such as those provided by the Centers for Disease Control and Prevention. Importantly, this work was based on prior research showing how search engine queries correlated with virological and mortality data over multiple years ( Polgreen et al., 2008 ).

Furthermore, although experience sampling methods have been used for decades ( Larson and Csikszentmihalyi, 1983 ), nascent technologies such as smartphones have made this technique increasingly feasible and less intrusive to respondents, resulting in a proliferation of time series data. As an example, Killingsworth and Gilbert (2010) presented an iPhone (Apple Inc., Cupertino, CA) application that tracks various behaviors, cognitions, and affect over time. At the time their study was published, their database contained almost a quarter of a million psychological measurements from individuals in 83 countries. Finally, due to the growing synthesis between psychology and neuroscience (e.g., affective neuroscience, social-cognitive neuroscience), the ability to analyze neuroimaging data, which is strongly linked to time series methods (e.g., Friston et al., 1995 , 2000 ), is a powerful methodological asset. Due to these overarching trends, we expect that time series data will become increasingly prevalent and spur the development of more time-sensitive psychological theory. Mindful of the growing need to contribute to the methodological toolkit of psychological researchers, the present article introduces the use of time series analysis in order to describe and understand the dynamics of psychological change over time.

In contrast to these current trends, we conducted a survey of the existing psychological literature in order to quantify the extent to which time series methods have already been used in psychological science. Using the PsycINFO database, we searched the publication histories of 15 prominent journals in psychology 1 for the term “time series” in the abstract, keywords, and subject terms. This search yielded a small sample of 36 empirical papers that utilized time series modeling. Further investigation revealed the presence of two general analytic goals: relating a time series to other substantive variables (17 papers) and examining the effects of a critical event or intervention (9 papers; the remaining papers consisted of other goals). Thus, this review not only demonstrates the relative scarcity of time series methods in psychological research, but also that scholars have primarily used descriptive or causal explanatory models for time series data analysis ( Shmueli, 2010 ).

The prevalence of these types of models is typical of social science, but in fields where time series analysis is most commonly found (e.g., econometrics, finance, the atmospheric sciences), forecasting is often the primary goal because it bears on important practical decisions. As a result, the statistical time series literature is dominated by models that are aimed toward prediction , not explanation ( Shmueli, 2010 ), and almost every book on applied time series analysis is exclusively devoted to forecasting methods ( McCleary et al., 1980 , p. 205). Although there are many well-written texts on time series modeling for economic and financial applications (e.g., Rothman, 1999 ; Mills and Markellos, 2008 ), there is a lack of formal introductions geared toward psychological issues (see West and Hepworth, 1991 for an exception). Thus, a psychologist looking to use these methodologies may find themselves with resources that focus on entirely different goals. The current paper attempts to remedy this by providing an introduction to time series methodologies that is oriented toward issues within psychological research. This is accomplished by first introducing the basic characteristics of time series data: the four components of variation (trend, seasonality, cycles, and irregular variation), autocorrelation, and stationarity. Then, various time series regression models are explicated that can be used to achieve a wide range of goals, such as describing the process of change through time, estimating seasonal effects, and examining the effect of an intervention or critical event. Not to overlook the potential importance of forecasting for psychological research, the second half of the paper discusses methods for modeling autocorrelation and generating accurate predictions—viz., autoregressive integrated moving average (ARIMA) modeling.
The final section briefly describes how regression techniques and ARIMA models can be combined in a dynamic regression model that can simultaneously explain and forecast a time series variable. Thus, the current paper seeks to provide an integrative resource for psychological researchers interested in analyzing time series data which, given the trends described above, are poised to become increasingly prevalent.

## The Current Illustrative Application

In order to better demonstrate how time series analysis can accomplish the goals of psychological research, a running practical example is presented throughout the current paper. For this particular illustration, we focused on online job search behaviors using data from Google Trends, which compiles the frequency of online searches on Google over time. We were particularly interested in the frequency of online job searches in the United States 2 and the impact of the 2008 economic crisis on these rates. Our primary research hypothesis was that this critical event resulted in a sharp increase in the series that persisted over time. The monthly frequencies of these searches from January 2004 to June 2011 were recorded, constituting a data set of 90 total observations. Figure 1 displays a plot of this original time series that will be referenced throughout the current paper. Importantly, the values of the series do not represent the raw number of Google searches, but have been normalized (0–100) in order to yield a more tractable data set; each monthly value represents its percentage relative to the maximum observed value 3 .
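Google Trends' exact normalization procedure is proprietary, but the scaling described above, in which each monthly value is expressed relative to the maximum observed value, can be sketched in a few lines. The paper's tutorial uses R; the Python function below (name ours) is a simplified illustration only.

```python
def normalize_to_max(counts):
    """Rescale a series so that its largest value becomes 100.

    Each observation is expressed as a percentage of the series maximum,
    mirroring the 0-100 scaling of the Google Trends data used here.
    """
    peak = max(counts)
    return [round(100 * c / peak) for c in counts]

# A toy example: the peak month becomes 100; the others are relative to it.
normalize_to_max([50, 200, 100])  # [25, 100, 50]
```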

Figure 1. A plot of the original Google job search time series and the series after seasonal adjustment .

## A Note on Software Implementation

Conceptual expositions of new analytical methods can often be undermined by the practical issue of software implementation ( Sharpe, 2013 ). To preempt this obstacle, for each analysis we provide accompanying R code in the Supplementary Material, along with an intuitive explanation of the meanings and rationale behind the various commands and arguments. On account of its versatility, the open-source statistical package R ( R Development Core Team, 2011 ) remains the software platform of choice for performing time series analyses, and a number of introductory texts are oriented solely toward this program, such as Introductory Time Series with R ( Cowpertwait and Metcalfe, 2009 ), Time Series Analysis with Applications in R ( Cryer and Chan, 2008 ), and Time Series Analysis and Its Applications with R Examples ( Shumway and Stoffer, 2006 ). In recent years, R has become increasingly recognized within the psychological sciences as well ( Muenchen, 2013 ). We believe that psychological researchers with even a minimal amount of experience with R will find this tutorial both informative and accessible.

## An Introduction to Time Series Data

Before introducing how time series analyses can be used in psychological research, it is necessary to first explicate the features that characterize time series data. At its simplest, a time series is a set of time-ordered observations of a process where the intervals between observations remain constant (e.g., weeks, months, years, and minor deviations in the intervals are acceptable; McCleary et al., 1980 , p. 21; Cowpertwait and Metcalfe, 2009 ). Time series data is often distinguished from other types of longitudinal data by the number and source of the observations; a univariate time series contains many observations originating from a single source (e.g., an individual, a price index), while other forms of longitudinal data often consist of several observations from many sources (e.g., a group of individuals). The length of a time series can vary, but series are generally at least 20 observations long, and many models require at least 50 observations for accurate estimation ( McCleary et al., 1980 , p. 20). More data is always preferable, but at the very least, a time series should be long enough to capture the phenomena of interest.

Due to its unique structure, a time series exhibits characteristics that are either absent or less prominent in the kinds of cross-sectional and longitudinal data typically collected in psychological research. In the next sections, we review these features that include autocorrelation and stationarity . However, we begin by delineating the types of patterns that may be present within a time series. That is, the variation or movement in a series can be partitioned into four parts: the trend, seasonal, cyclical , and irregular components ( Persons, 1919 ).

## The Four Components of Time Series

Trend refers to any systematic change in the level of a series—i.e., its long-term direction ( McCleary et al., 1980 , p. 31; Hyndman and Athanasopoulos, 2014 ). Both the direction and slope (rate of change) of a trend may remain constant or change throughout the course of the series. Globally, the illustrative time series shown in Figure 1 exhibits a positive trend: The level of the series at the end is systematically higher than at its beginning. However, there are sections in this particular series that do not exhibit the same rate of increase. The beginning of the series displays a slight negative trend, and starting approximately at 2006, the series significantly rises until 2009, after which a small downward trend may even be present.

Because a trend in the data represents a significant source of variability, it must be accounted for when performing any time series analysis. That is, it must be either (a) modeled explicitly or (b) removed through mathematical transformations (i.e., detrending ; McCleary et al., 1980 , p. 32). The former approach is taken when the trend is theoretically interesting—either on its own or in relation to other variables. Conversely, removing the trend (through methods discussed later) is performed when this component is not pertinent to the goals of the analysis (e.g., strict forecasting). The decision of whether to model or remove systematic components like a trend represents an important aspect of time series analysis. The various characteristics of time series data are either of theoretical interest—in which case they should be modeled—or not, in which case they should be removed so that the aspects that are of interest can be more easily analyzed. Thus, it is incumbent upon the analyst to establish the goals of the analysis and determine which components of a time series are of interest and treat them accordingly. This topic will be revisited throughout the forthcoming sections.

## Seasonality

Unlike the trend component, the seasonal component of a series is a repeating pattern of increase and decrease in the series that occurs consistently throughout its duration. More specifically, it can be defined as a cyclical or repeating pattern of movement within a period of 1 year or less that is attributed to “seasonal” factors—i.e., those related to an aspect of the calendar (e.g., the months or quarters of a year or the days of a week; Cowpertwait and Metcalfe, 2009 , p. 6; Hyndman and Athanasopoulos, 2014 ). For instance, restaurant attendance may exhibit a weekly seasonal pattern such that the weekends routinely display the highest levels within the series across weeks (i.e., the time period), and the first several weekdays are consistently the lowest. Retail sales often display a monthly seasonal pattern, where each month across yearly periods consistently exhibits the same relative position to the others: viz., a spike in the series during the holiday months and a marked decrease in the following months. Importantly, the pattern represented by a seasonal effect remains constant and occurs over the same duration on each occasion ( Hyndman and Athanasopoulos, 2014 ).

Although its underlying pattern remains fixed, the magnitude of a seasonal effect may vary across periods. Seasonal effects can also be embedded within overarching trends. Along with a marked trend, the series in Figure 1 exhibits noticeable seasonal fluctuations as well; at the beginning of each year (i.e., after the holiday months), online job searches spike and then fall significantly in February. After February, they continue to rise until about July or August, after which the series significantly drops for the remainder of the year, representing the effects of seasonal employment. Notice the consistency of both the form (i.e., pattern of increase and decrease) and magnitude of this seasonal effect. The fact that online job search behavior exhibits seasonal patterns supports the idea that this behavior (and this example in particular) is representative of job search behavior in general. In the United States, thousands of individuals engage in seasonal work which results in higher unemployment rates in the beginning of each year and in the later summer months (e.g., July and August; The United States Department of Labor, Bureau of Labor Statistics, 2014 ), manifesting in a similar seasonal pattern of job search behavior.

One may be interested in the presence of seasonal effects, but once identified, this source of variation is often removed from the time series through a procedure known as seasonal adjustment ( Cowpertwait and Metcalfe, 2009 , p. 21). This is in keeping with the aforementioned theme: Once a systematic component has been identified, it must either be modeled or removed. The popularity of seasonal adjustment is due to the characteristics of seasonal effects delineated above: Unlike other more dynamic components of a time series, seasonal patterns remain consistent across periods and are generally similar in magnitude ( Hyndman and Athanasopoulos, 2014 ). Their effects may also obscure other important features of time series—e.g., a previously unnoticed trend or cycles described in the following section. Put simply, “seasonal adjustment is done to simplify data so that they may be more easily interpreted…without a significant loss of information” ( Bell and Hillmer, 1984 , p. 301). Unemployment rates are often seasonally adjusted to remove the fluctuations due to the effects of weather, harvests, and school schedules that remain more or less constant across years. In our data, the seasonal effects of job search behavior are not of direct theoretical interest relative to other features of the data, such as the underlying trend and the impact of the 2008 economic crisis. Thus, we may prefer to work with the simpler seasonally adjusted series. The lower panel of Figure 1 displays the original Google time series after seasonal adjustment, and the Supplementary Material contains a description of how to implement this procedure in R. It can be seen that the trend is made notably clearer after removing the seasonal effects. Despite the spike at the very end, the suspected downward trend in the later part of the series is much more evident. This insight will prove to be important when selecting an appropriate time series model in the upcoming sections.
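The Supplementary Material shows how to seasonally adjust a series in R. As a conceptual illustration only, a bare-bones additive adjustment estimates each season's average effect and subtracts it from every observation. Real procedures (e.g., R's decompose() or stl()) are more sophisticated, in particular removing the trend before estimating seasonal effects, so the Python sketch below (function name ours) should be treated as a simplified sketch.

```python
from statistics import mean

def seasonally_adjust(series, period=12):
    """Remove a fixed additive seasonal pattern from a series.

    Each season's effect is estimated as the mean of all observations at
    that position within the period minus the overall mean; subtracting
    these effects yields the seasonally adjusted series.
    """
    overall = mean(series)
    effects = [mean(series[pos::period]) - overall for pos in range(period)]
    return [x - effects[t % period] for t, x in enumerate(series)]
```

Applied to a purely seasonal series, this adjustment returns a flat line at the overall mean, since all the variation was seasonal.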

## Cyclical Effects

A cyclical component in a time series is conceptually similar to a seasonal component: It is a pattern of fluctuation (i.e., increase or decrease) that recurs across periods of time. However, unlike seasonal effects whose duration is fixed across occurrences and are associated with some aspect of the calendar (e.g., days, months), the patterns represented by cyclical effects are not of fixed duration (i.e., their length often varies from cycle to cycle) and are not attributable to any naturally-occurring time periods ( Hyndman and Athanasopoulos, 2014 ). Put simply, cycles are any non-seasonal component that varies in a recognizable pattern (e.g., business cycles; Hyndman and Athanasopoulos, 2014 ). In contrast to seasonal effects, cycles generally occur over a period lasting longer than 2 years (although they may be shorter), and the magnitude of cyclical effects is generally more variable than that of seasonal effects ( Hyndman and Athanasopoulos, 2014 ). Furthermore, just as the previous two components—trend and seasonality—can be present with or without the other, a cyclical component may be present with any combination of the other two. For instance, a trend with an intrinsic seasonal effect can be embedded within a greater cyclical pattern that occurs over a period of several years. Alternatively, a cyclical effect may be present without either of these two systematic components.

In the 7 years that constitute the time series of Figure 1 , there do not appear to be any cyclical effects. This is expected, as there are no strong theoretical reasons to believe that online or job search behavior is significantly influenced by factors that consistently manifest across a period of over one year. We have significant a priori reasons to believe that causal factors related to seasonality exist (e.g., searching for work after seasonal employment), but the same does not hold true for long-term cycles, and the time series is sufficiently long to capture any potential cyclical behavior.

## Irregular Variation (Randomness)

While the previous three components represent systematic types of time series variability (i.e., signal ; Hyndman and Athanasopoulos, 2014 ), the irregular component represents statistical noise and is analogous to the error terms included in various types of statistical models (e.g., the random component in generalized linear modeling). It constitutes any remaining variation in a time series after these three systematic components have been partitioned out. In time series parlance, when this component is completely random (i.e., not autocorrelated), it is referred to as white noise , which plays an important role in both the theory and practice of time series modeling. Time series are assumed to be in part driven by a white noise process (explicated in a future section), and white noise is vital for judging the adequacy of a time series model. After a model has been fit to the data, the residuals form a time series of their own, called the residual error series . If the statistical model has been successful in accounting for all the patterns in the data (e.g., systematic components such as trend and seasonality), the residual error series should be nothing more than unrelated white noise error terms with a mean of zero and some constant variance. In other words, the model should be successful in extracting all the signal present in the data with only randomness left over ( Cowpertwait and Metcalfe, 2009 , p. 68). This is analogous to evaluating the residuals of linear regression, which should be normally distributed around a mean of zero.
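This residual diagnostic can be automated: after fitting a model, confirm that the residual error series has a mean near zero and negligible lag-1 autocorrelation. The Python sketch below (function names ours) fits a simple least-squares line, standing in for any time series model, and extracts the quantities one would check.

```python
from statistics import mean

def ols_residuals(series):
    """Fit y = a + b*t by least squares and return the residual error series."""
    n = len(series)
    t = list(range(n))
    tbar, ybar = mean(t), mean(series)
    b = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, series))
         / sum((ti - tbar) ** 2 for ti in t))
    a = ybar - b * tbar
    return [yi - (a + b * ti) for ti, yi in zip(t, series)]

def lag1_autocorr(series):
    """Lag-1 autocorrelation, used to check that residuals resemble white noise."""
    xbar = mean(series)
    num = sum((series[i] - xbar) * (series[i - 1] - xbar)
              for i in range(1, len(series)))
    den = sum((x - xbar) ** 2 for x in series)
    return num / den
```

Because the fitted line includes an intercept, the residuals always average to exactly zero; the substantive check is whether their lag-1 autocorrelation is also close to zero, indicating that only noise remains.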

## Time Series Decomposition

To visually examine a series in an exploratory fashion, time series are often formally partitioned into each of these components through a procedure referred to as time series decomposition . Figure 2 displays the original Google time series (top panel) decomposed into its constituent parts. This figure depicts what is referred to as classical decomposition , in which a time series is conceived of as comprising three components: a trend-cycle, seasonal, and random component. (Here, the trend and cycle are combined because the duration of each cycle is unknown; Hyndman and Athanasopoulos, 2014 ). The classic additive decomposition model ( Cowpertwait and Metcalfe, 2009 , p. 19) describes each value of the time series as the sum of these three components:

x_t = m_t + s_t + z_t,

where m_t is the trend-cycle component, s_t is the seasonal component, and z_t is the irregular (random) component at time t.

Figure 2. The original time series decomposed into its trend, seasonal, and irregular (i.e., random) components . Cyclical effects are not present within this series.

The additive decomposition model is most appropriate when the magnitude of the trend-cycle and seasonal components remain constant over the course of the series. However, when the magnitude of these components varies but still appears proportional over time (i.e., it changes by a multiplicative factor), the series may be better represented by the multiplicative decomposition model, where each observation is the product of the trend-cycle, seasonal, and random components:

x_t = m_t × s_t × z_t.

In either decomposition model, each component is sequentially estimated and then removed until only the stochastic error component remains (the bottom panel of Figure 2 ). The primary purpose of time series decomposition is to provide the analyst with a better understanding of the underlying behavior and patterns of the time series which can be valuable in determining the goals of the analysis. Decomposition models can be used to generate forecasts by adding or multiplying future estimates of the seasonal and trend-cycle components ( Hyndman and Athanasopoulos, 2014 ). However, such models are beyond the scope of this present paper, and the ARIMA forecasting models discussed later are generally superior 4 .
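For readers curious how classical additive decomposition proceeds mechanically, the sketch below mirrors the usual algorithm: estimate the trend-cycle with a centered moving average, average the detrended values at each seasonal position, and take what is left over as the irregular component. R's decompose() performs these same steps; this Python version (ours) is illustrative only.

```python
from statistics import mean

def decompose_additive(series, period):
    """Classical additive decomposition: series = trend-cycle + seasonal + remainder."""
    n, half = len(series), period // 2
    # 1. Trend-cycle: centered moving average (undefined at the series edges).
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:  # even periods need half-weights on the endpoints
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
        else:
            trend[t] = mean(window)
    # 2. Seasonal effects: average the detrended values at each seasonal
    #    position, then center the effects so they sum to zero.
    raw = [mean(series[t] - trend[t] for t in range(n)
                if trend[t] is not None and t % period == pos)
           for pos in range(period)]
    seasonal = [r - mean(raw) for r in raw]
    # 3. Remainder: whatever the trend-cycle and seasonal parts leave unexplained.
    remainder = [series[t] - trend[t] - seasonal[t % period]
                 if trend[t] is not None else None for t in range(n)]
    return trend, seasonal, remainder
```

On a series built from a linear trend plus a fixed seasonal pattern, this procedure recovers both components essentially exactly, leaving a remainder of zero at all interior points.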

## Autocorrelation

In psychological research, the current state of a variable may partially depend on prior states. That is, many psychological variables exhibit autocorrelation : when a variable is correlated with itself across different time points (also referred to as serial dependence ). Time series designs capture the effect of previous states and incorporate this potentially significant source of variance within their corresponding statistical models. Although the main features of many time series are its systematic components such as trend and seasonality, a large portion of time series methodology is aimed at explaining the autocorrelation in the data ( Dettling, 2013 , p. 2).

The importance of accounting for autocorrelation should not be overlooked; it is ubiquitous in social science phenomena ( Kerlinger, 1973 ; Jones et al., 1977 ; Hartmann et al., 1980 ; Hays, 1981 ). In a review of 44 behavioral research studies with a total of 248 independent sets of repeated measures data, Busk and Marascuilo (1988) found that 80% of the calculated autocorrelations ranged from 0.1 to 0.49, and 40% exceeded 0.25. More specific to the psychological sciences, it has been proposed that state-related constructs at the individual-level, such as emotions and arousal, are often contingent on prior states ( Wood and Brown, 1994 ). Using autocorrelation analysis, Fairbairn and Sayette (2013) found that alcohol use reduces emotional inertia, the extent to which prior affective states determine current emotions. Through this, they were able to marshal support for the theory of alcohol myopia , the intuitive but largely untested idea that alcohol allows a greater enjoyment of the present, and thus formally uncovered an affective motivation for alcohol use (and misuse). Further, using time series methods, Fuller et al. (2003) found that job stress in the present day was negatively related to the degree of stress in the preceding day. Accounting for autocorrelation can therefore reveal new information on the phenomenon of interest, as the Fuller et al. (2003) analysis led to the counterintuitive finding that lower stress was observed after prior levels had been high.

Statistically, autocorrelation simply represents the Pearson correlation for a variable with itself at a previous time period, referred to as the lag of the autocorrelation. For instance, the lag-1 autocorrelation of a time series is the correlation of each value with the immediately preceding observation; a lag-2 autocorrelation is the correlation with the value that occurred two observations before. The autocorrelation with respect to any lag can be computed (e.g., a lag-20 autocorrelation), and intuitively, the strength of the autocorrelation generally diminishes as the length of the lag increases (i.e., as the values become further removed in time).
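As a concrete illustration, the lag-k autocorrelation can be computed directly. The estimator below divides by the overall sum of squares, as is conventional in time series work; this differs slightly from a textbook Pearson correlation on the two overlapping segments but is nearly identical for long series. The function name is ours.

```python
from statistics import mean

def autocorr(series, lag):
    """Autocorrelation of a series with itself `lag` observations back."""
    xbar = mean(series)
    num = sum((series[t] - xbar) * (series[t - lag] - xbar)
              for t in range(lag, len(series)))
    den = sum((x - xbar) ** 2 for x in series)
    return num / den

# A strictly alternating series is strongly negatively autocorrelated at
# lag 1 (each value follows its opposite) and positively at lag 2.
alternating = [1, -1] * 10
autocorr(alternating, 1)  # -0.95
autocorr(alternating, 2)  # 0.90
```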

Strong positive autocorrelation in a time series manifests graphically by “runs” of values that are either above or below the average value of the time series. Such time series are sometimes called “persistent” because when the series is above (or below) the mean value it tends to remain that way for several periods. Conversely, negative autocorrelation is characterized by the absence of runs—i.e., when positive values tend to follow negative values (and vice versa). Figure 3 contains two plots of time series intended to give the reader an intuitive understanding of the presence of autocorrelation: The series in the top panel exhibits positive autocorrelation, while the center panel illustrates negative autocorrelation. It is important to note that the autocorrelation in these series is not obscured by other components and that in real time series, visual analysis alone may not be sufficient to detect autocorrelation.

Figure 3. Two example time series displaying exaggerated positive (top panel) and negative (center panel) autocorrelation . The bottom panel depicts the ACF of the Google job search time series after seasonal adjustment.

In time series analysis, the autocorrelation coefficient across many lags is called the autocorrelation function (ACF) and plays a significant role in model selection and evaluation (as discussed later). A plot of the ACF of the Google job search time series after seasonal adjustment is presented in the bottom panel of Figure 3 . In an ACF plot, the y-axis displays the strength of the autocorrelation (ranging from positive to negative 1), and the x-axis represents the length of the lags: from lag-0 (which will always be 1) to much higher lags (here, lag-19). The dotted horizontal line indicates the p < 0.05 criterion for statistical significance.
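The significance bounds drawn on an ACF plot are conventionally placed at ±1.96/√n, the approximate 95% limits for the sample autocorrelations of a white noise series of length n. A minimal ACF computation with those bounds might look as follows (function names ours; R's acf() produces the same quantities along with the plot):

```python
from math import sqrt
from statistics import mean

def acf(series, max_lag):
    """Return the autocorrelation function from lag 0 through max_lag."""
    xbar = mean(series)
    den = sum((x - xbar) ** 2 for x in series)
    return [sum((series[t] - xbar) * (series[t - k] - xbar)
                for t in range(k, len(series))) / den
            for k in range(max_lag + 1)]

def significant_lags(series, max_lag):
    """Lags whose autocorrelation exceeds the approximate 95% white noise bounds."""
    bound = 1.96 / sqrt(len(series))
    return [k for k, r in enumerate(acf(series, max_lag)) if k > 0 and abs(r) > bound]
```

The lag-0 autocorrelation is always exactly 1 (a series is perfectly correlated with itself), which is why it is typically ignored when reading an ACF plot.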

## Stationarity

Definition and purpose.

A complication with time series data is that its mean, variance, or autocorrelation structure can vary over time. A time series is said to be stationary when these properties remain constant ( Cryer and Chan, 2008 , p. 16). Thus, there are many ways in which a series can be non-stationary (e.g., an increasing variance over time), but it can only be stationary in one way (viz., when all of these features remain constant).

Stationarity is a pivotal concept in time series analysis because descriptive statistics of a series (e.g., its mean and variance) are only accurate population estimates if they remain constant throughout the series ( Cowpertwait and Metcalfe, 2009 , pp. 31–32). With a stationary series, it will not matter when the variable is observed: “The properties of one section of the data are much like those of any other” ( Chatfield, 2004 , p. 13). As a result, a stationary series is easy to predict: Its future values will be similar to those in the past ( Nau, 2014 ). Stationarity is therefore the most important assumption when making predictions based on past observations ( Cryer and Chan, 2008 , p. 16), and many time series models assume that the series is already stationary or can be transformed to stationarity (e.g., the broad class of ARIMA models discussed later).

In general, a stationary time series will have no predictable patterns in the long-term; plots will show the series to be roughly horizontal with some constant variance ( Hyndman and Athanasopoulos, 2014 ). A stationary time series is illustrated in Figure 4 , which is a stationary white noise series (i.e., a series of uncorrelated terms). The series hovers around the same general region (i.e., its mean) with a consistent variance around this value. Despite the observations having a constant mean, variance, and autocorrelation, notice how such a process can generate outliers (e.g., the low extreme value after t = 60), as well as runs of values that are both above or below the mean. Thus, stationarity does not preclude these temporary and fluctuating behaviors of the series, although any systematic patterns would.

Figure 4. An example of a stationary time series (specifically, a series of uncorrelated white noise terms) . The mean, variance, and autocorrelation are all constant over time, and the series displays no systematic patterns, such as trends or cycles.

However, many time series in real life are dominated by trends and seasonal effects that preclude stationarity. A series with a trend cannot be stationary because, by definition, a trend is when the mean level of the series changes over time. Seasonal effects also preclude stationarity, as they are reoccurring patterns of change in the mean of the series within a fixed time period (e.g., a year). Thus, trend and seasonality are the two time series components that must be addressed in order to achieve stationarity.

## Transforming a Series to Stationarity

When a time series is not stationary, it can be made so after accounting for these systematic components within the model or through mathematical transformations. The procedure of seasonal adjustment described above is a method that removes the systematic seasonal effects on the mean level of the series.

The most important method of stationarizing the mean of a series is through a process called differencing , which can be used to remove any trend in the series which is not of interest. In the simplest case of a linear trend, the slope (i.e., the change from one period to the next) remains relatively constant over time. In such a case, the difference between each time period and its preceding one (referred to as the first differences ) are approximately equal. Thus, one can effectively “detrend” the series by transforming the original series into a series of first differences ( Meko, 2013 ; Hyndman and Athanasopoulos, 2014 ). The underlying logic is that forecasting the change in a series from one period to the next is just as useful in practice as predicting the original series values.

However, when the time series exhibits a trend that itself changes (i.e., a non-constant slope), then even transforming a series into a series of its first differences may not render it completely stationary. This is because when the slope itself is changing (e.g., an exponential trend), the difference between periods will be unequal. In such cases, taking the first differences of the already differenced series (referred to as the second differences ) will often stationarize the series. This is because each successive differencing has the effect of reducing the overall variance of the series ( Anderson, 1976 ), as deviations from the mean level are increasingly reduced through this subtractive process. The second differences (i.e., the first differences of the already differenced series) will therefore further stabilize the mean. There are general guidelines on how many orders of differencing are necessary to stationarize a series. For instance, the first or second differences will nearly always stationarize the mean, and in practice it is almost never necessary to go beyond second differencing ( Cryer and Chan, 2008 ; Hyndman and Athanasopoulos, 2014 ). However, for series that exhibit higher-degree polynomial trends, the order of differencing required to stationarize the series is typically equal to that degree (e.g., two orders of differencing for an approximately quadratic trend, three orders for a cubic trend; Cowpertwait and Metcalfe, 2009 , p. 93).

A common mistake in time series modeling is to “overdifference” the series, i.e., to perform more orders of differencing than are required to achieve stationarity. This can complicate the process of building an adequate and parsimonious model (see McCleary et al., 1980 , p. 97). Fortunately, overdifferencing is relatively easy to identify; differencing a series with a trend will have the effect of reducing the variance of the series, but an unnecessary degree of differencing will increase its variance ( Anderson, 1976 ). Thus, the optimal order of differencing is that which results in the lowest variance of the series.
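To make the differencing and variance rules concrete, the following sketch (in Python, rather than the R used in the paper's Supplementary Material) applies them to a hypothetical series with a linear trend: first differencing removes the trend and shrinks the variance, while an unnecessary second difference inflates it again.

```python
# Choosing a differencing order by the variance rule: difference the
# series until the sample variance stops decreasing.

def difference(series, order=1):
    """Apply first differencing (y_t - y_{t-1}) the given number of times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def variance(series):
    m = sum(series) / len(series)
    return sum((x - m) ** 2 for x in series) / (len(series) - 1)

# Hypothetical series: linear trend (slope 2) plus a small alternating wobble.
y = [2 * t + (1 if t % 2 else -1) for t in range(1, 101)]

v0 = variance(y)                   # original series: large variance from the trend
v1 = variance(difference(y, 1))    # first differences: trend removed
v2 = variance(difference(y, 2))    # second differences: overdifferenced

print(v0 > v1)   # differencing removes the trend and shrinks the variance
print(v2 > v1)   # an unnecessary second difference inflates it again
```

Under the variance rule, one order of differencing is optimal here, matching the linear trend in the simulated series.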

If the variance of a time series is not constant over time, a common method of making the variance stationary is through a logarithmic transformation of the series ( Cowpertwait and Metcalfe, 2009 , pp. 109–112; Hyndman and Athanasopoulos, 2014 ). Taking the logarithm compresses larger values proportionally more than smaller ones; that is, the larger the value, the more it is reduced. Thus, this transformation stabilizes the differences across values (i.e., the variance), which is also why it is frequently used to mitigate the effect of outliers (e.g., Aguinis et al., 2013 ). It is important to remember that if one applies a transformation, any forecasts generated by the selected model will be in these transformed units. However, once the model is fitted and the parameters estimated, one can reverse the transformation to obtain forecasts in the original metric.
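A short Python sketch with hypothetical values illustrates both halves of this point: on the log scale the exponentially growing series has equal gaps, and a forecast made on that scale can be back-transformed with the exponential function.

```python
import math

# Hypothetical series whose variability grows with its level:
y = [10, 100, 1000, 10000]

log_y = [math.log(x) for x in y]                 # log scale stabilizes the gaps
gaps = [b - a for a, b in zip(log_y, log_y[1:])]

# Suppose a model of the logged series forecasts the next logged value
# by extending the (now constant) gap:
forecast_log = log_y[-1] + gaps[-1]
forecast_original = math.exp(forecast_log)       # reverse the transform

print([round(g, 3) for g in gaps])   # equal gaps on the log scale
print(round(forecast_original))      # forecast in the original metric
```

The back-transformed forecast (100,000) continues the series' multiplicative pattern, which a model fit to the raw values would struggle to capture.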

Finally, there are also formal statistical tests for stationarity, termed unit root tests. A very popular procedure is the augmented Dickey–Fuller test (ADF; Said and Dickey, 1984 ), which tests the null hypothesis that the series is non-stationary. Thus, rejection of the null provides evidence for a stationary series. Table 1 below contains information regarding the ADF test, as well as descriptions of various other statistical tests frequently used in time series analysis that will be discussed in the remainder of the paper. By using the ADF test in conjunction with the transformations described above (or the modeling procedures delineated below), an analyst can ensure that a series conforms to stationarity.
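The logic of the unit root test can be sketched from the definition of the simpler, non-augmented Dickey–Fuller procedure: regress the first differences on the lagged levels and examine the t-statistic of the slope. The Python sketch below omits the augmentation lags and exact critical values of the full ADF test, so real analyses should use a vetted implementation (e.g., in R or statsmodels); it only illustrates why a stationary series yields a large negative statistic.

```python
import random

def dickey_fuller_t(y):
    """t-statistic for phi in the regression diff(y)_t = alpha + phi * y_{t-1} + e_t.
    Large negative values are evidence against a unit root (i.e., for stationarity)."""
    x = y[:-1]                               # lagged levels y_{t-1}
    dy = [b - a for a, b in zip(y, y[1:])]   # first differences
    n = len(x)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    phi = sum((xi - mx) * (di - md) for xi, di in zip(x, dy)) / sxx
    alpha = md - phi * mx
    resid = [di - alpha - phi * xi for xi, di in zip(x, dy)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)   # residual variance
    return phi / (s2 / sxx) ** 0.5

# A stationary (white noise) series should give a statistic far below the
# usual 5% critical value of roughly -2.86 for this test with a constant.
random.seed(42)
white_noise = [random.gauss(0, 1) for _ in range(200)]
print(dickey_fuller_t(white_noise) < -2.86)
```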

Table 1. Common tests in time series analysis .

## Time Series Modeling: Regression Methods

The statistical time series literature is dominated by methodologies aimed at forecasting the behavior of a time series ( Shmueli, 2010 ). Yet, as the survey in the introduction illustrated, psychological researchers are primarily interested in other applications, such as describing and accounting for an underlying trend, linking explanatory variables to the criterion of interest, and assessing the impact of critical events. Thus, psychological researchers will primarily use descriptive or explanatory models, as opposed to predictive models aimed solely at generating accurate forecasts. In time series analysis, each of the aforementioned goals can be accomplished through the use of regression methods in a manner very similar to the analysis of cross-sectional data. Having explicated the basic properties of time series data, we now discuss the specific modeling approaches that are able to fulfill these purposes. The next four sections each provide an overview of one type of regression model, how psychological research stands to gain from its use, and the corresponding statistical model. We include mathematical treatments, but also provide conceptual explanations so that they may be understood in an accessible and intuitive manner. Additionally, Figure 5 presents a flowchart depicting different time series models and which approaches are best for addressing the various goals of psychological research. As the current paper continues, the reader will come to understand the meaning and structure of these models and their relation to substantive research questions.

Figure 5. A flowchart depicting various time series modeling approaches and how they are suited to address various goals in psychological research .

It is important to keep in mind that time series often exhibit strong autocorrelation which often manifests in correlated residuals after a regression model has been fit. This violates the standard assumption of independent (i.e., uncorrelated) errors. In the section that follows these regression approaches, we describe how the remaining autocorrelation can be included in the model by building a dynamic regression model that includes ARIMA terms 5 . That is, a regression model can be first fit to the data for explanatory or descriptive modeling, and ARIMA terms can be fit to the residuals in order to account for any remaining autocorrelation and improve forecasts ( Hyndman and Athanasopoulos, 2014 ). However, we begin by introducing regression methods separate from ARIMA modeling, temporarily setting aside the issue of autocorrelation. This is done in order to better focus on the implementation of these models, but also because violating this assumption has minimal effects on the substance of the analysis: The parameter estimates remain unbiased and can still be used for prediction. Its forecasts will not be “wrong,” but inefficient —i.e., ignoring the information represented by the autocorrelation that could be used to obtain better predictions ( Hyndman and Athanasopoulos, 2014 ). Additionally, generalized least squares estimation (as opposed to ordinary least squares) takes into account the effects of autocorrelation which otherwise lead to underestimated standard errors ( Cowpertwait and Metcalfe, 2009 , p. 98). This estimation procedure was used for each of the regression models below. For further information on regression methods for time series, the reader is directed to Hyndman and Athanasopoulos (2014 , chaps. 4, 5) and McCleary et al. (1980) , which are very accessible introductions to the topic, as well as Cowpertwait and Metcalfe (2009 , chap. 5) and Cryer and Chan (2008 , chaps. 3, 11) for more mathematically-oriented treatments.

## Modeling Trends through Regression

Modeling an observed trend in a time series through regression is appropriate when the trend is deterministic —i.e., the trend is due to the constant, deterministic effects of a few causal forces ( McCleary et al., 1980 , p. 34). As a result, a deterministic trend is generally stable across time. Expecting any trend to continue indefinitely is often unrealistic, but for a deterministic trend, linear extrapolation can provide accurate forecasts for several periods ahead, as forecasting generally assumes that trends will continue and change relatively slowly ( Cowpertwait and Metcalfe, 2009 , p. 6). Thus, when the trend is deterministic, it is desirable to use a regression model that includes the hypothesized causal factors as predictors ( Cowpertwait and Metcalfe, 2009 , p. 91; McCleary et al., 1980 , p. 34).

Deterministic trends stand in contrast to stochastic trends, those that arise simply from the random movement of the variable over time (long runs of similar values due to autocorrelation; Cowpertwait and Metcalfe, 2009 , p. 91). As a result, stochastic trends often exhibit frequent and inexplicable changes in both slope and direction. When the trend is deemed to be stochastic, it is often removed through differencing. There are also methods for forecasting using stochastic trends (e.g., random walk and exponential smoothing models) discussed in Cowpertwait and Metcalfe (2009 , chaps. 3, 4) and Hyndman and Athanasopoulos (2014 , chap. 7). However, the reader should be aware that these are predictive models only, as there is nothing about a stochastic trend that can be explained through external, theoretically interesting factors (i.e., it is a trend attributable to randomness). Therefore, attempting to model it deterministically as a function of time or other substantive variables via regression can lead to spurious relationships ( Kuljanin et al., 2011) and inaccurate forecasts, as the trend is unlikely to remain stable over time.

Returning to the example Google time series of Figure 1 , the evident trend in the seasonally adjusted series might appear to be stochastic: It is not constant but changes at several points within the series. However, we have strong theoretical reasons for modeling it deterministically, as the 2008 economic crisis is one causal factor that likely had a profound impact on the series. Thus, this theoretical rationale implies that the otherwise inexplicable changes in its trend are due to systematic forces that can be appropriately modeled within an explanatory approach (i.e., as a deterministic function of predictors).

## The Linear Regression Model

As noted in the literature review, psychological researchers are often directly interested in describing an underlying trend. For example, Fuller et al. (2003) examined the strain of university employees using a time series design. They found that each self-report item displayed the same deterministic trend: Globally, strain increased over time even though the perceived severity of the stressful events did not increase. Levels of strain also decreased at spring break and after finals week, during which mood and job satisfaction also exhibited rising levels. This finding cohered with prior theory on the accumulating nature of stress and the importance of regular strain relief (e.g., Bolger et al., 1989 ; Carayon, 1995 ). Furthermore, Wagner et al. (1988) examined the trend in employee productivity after the implementation of an incentive-based wage system. In addition to discovering an immediate increase in productivity, it was found that productivity increased over time as well (i.e., a continuing deterministic trend). This trend gradually diminished over time, but was still present at the end of the study period—nearly 6 years after the intervention first occurred.

By visually examining a time series, an analyst can describe how a trend changes as a function of time. However, one can formally assess the behavior of a trend by regressing the series on a variable that represents time (e.g., 1–50 for 50 equally-spaced observations). In the simplest case, the trend can be modeled as a linear function of time, which is conceptually identical to a regression model for cross-sectional data using a single predictor:

y_t = b_0 + b_1(t) + ε_t     (3)

where the coefficient b 1 estimates the amount of change in the time series associated with a one-unit increase in time, t is the time variable, and ε t is random error. The constant, b 0 , estimates the level of the series when t = 0.

If a deterministic trend is fully accounted for by a linear regression model, the residual error series (i.e., the collection of residuals which themselves form a time series) will not contain any remaining trend component; that is, this non-stationary behavior of the series will have been accounted for ( Cowpertwait and Metcalfe, 2009 , p. 121). Returning to our empirical example, the linear regression model displayed in Equation (3) was fit to the seasonally adjusted Google job search data. This is displayed in the top left panel of Figure 6 . The regression line of best-fit is superimposed, and the residual error series is shown in the panel directly to the right. Here, time is a significant predictor ( b 1 = 0.32, p < 0.001), and the model accounts for 67% of the seasonally-adjusted series variance ( R 2 = 0.67, p < 0.001). However, the residual error series displays a notable amount of remaining trend that has been left unaccounted for; the first half of the error series has a striking downward trend that begins to rise at around 2007. This is because the regression line is constrained to linearity and therefore systematically underestimates and overestimates the values of the series when the trend exhibits runs of high and low values, respectively. Importantly, the forecasts from the simple linear model will most likely be very poor as well. Although there is a spike at the end of the series, the linear model predicts that values further ahead in time will be even higher. By contrast, we actually expect these values to decrease, similar to how there was a decreasing trend in 2008 right after the first spike. Thus, despite accounting for a considerable amount of variance and serving as a general approximation of the series trend, the linear model is insufficient in several systematic ways, manifesting in inaccurate forecasts and a significant remaining trend in the residual error series.
A method for improving this model is to add in a higher-order polynomial term; modeling the trend as quadratic, cubic, or an even higher-order function may lead to a better-fitting model, but the analyst must be vigilant of overfitting the series—i.e., including so many parameters that the statistical noise becomes modeled. Thus, striking a balance between parsimony and explanatory capability should always be a consideration when modeling time series (and statistical modeling in general). Although a simple linear regression on time is often adequate to approximate a trend ( Cowpertwait and Metcalfe, 2009 , p. 5), in this particular instance a higher-order term may provide a better fit to the complex deterministic trend seen within this series.
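A concrete sketch of fitting Equation (3) follows, written in Python on a small hypothetical series (the paper's Supplementary Material uses R on the Google data). Note that this sketch uses ordinary least squares and ignores autocorrelation, consistent with this section's simplification, whereas the paper's reported models use generalized least squares.

```python
def linear_trend(y):
    """OLS fit of y_t = b0 + b1*t + e_t via the closed-form formulas."""
    n = len(y)
    t = list(range(1, n + 1))
    mt, my = sum(t) / n, sum(y) / n
    b1 = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
          / sum((ti - mt) ** 2 for ti in t))
    b0 = my - b1 * mt
    fitted = [b0 + b1 * ti for ti in t]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot   # intercept, slope, R^2

# Hypothetical series growing by roughly one unit per period:
y = [5.1, 5.9, 7.2, 7.8, 9.1, 9.8, 11.2, 11.9]
b0, b1, r2 = linear_trend(y)
print(round(b1, 2))   # estimated change per one-unit increase in time
print(r2 > 0.99)      # a nearly perfect linear trend
```

Here b1 is interpreted exactly as in Equation (3): the estimated change in the series per one-unit increase in time.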

Figure 6. Three different regression models with time as the regressor and their associated residual error series .

## Polynomial Regression Models

When describing the trend in the Google data earlier, it was noted that the series began to display a rising trend approximately a third of the way into the series, implying that a quadratic regression model (i.e., a single bend) may yield a good fit to the data. Furthermore, our initial hypothesis was that job search behavior proceeded at a generally constant rate and then spiked once the economic crisis began—also implying a quadratic trend. In some time series, the trend over time will be non-linear, and the predictor terms can be specified to reflect such higher-order terms (quadratic, cubic, etc.). Just like when modeling cross-sectional data, non-linear terms can be incorporated into the statistical model by squaring the predictor (here, time) 6 :

y_t = b_0 + b_1(t) + b_2(t^2) + ε_t
The center panels in Figure 6 show the quadratic model and its residual error series. In line with the initial hypothesis, both the quadratic term ( b 2 = 0.003, p < 0.001) and linear term ( b 1 = 0.32, p < 0.001) were statistically significant. Thus, modeling the trend as a quadratic function of time explained an additional 4% of the series variance relative to the more parsimonious linear model ( R 2 = 0.71, p < 0.001). However, examination of this series and its residuals shows that it is not as different from the linear model as was expected; although the first half of the residual error series has a more stable mean level, noticeable trends remain there, and the forecasts implied by this model are even higher than those of the linear model. Therefore, a cubic trend may provide an even better fit, as there are two apparent bends in the series:

y_t = b_0 + b_1(t) + b_2(t^2) + b_3(t^3) + ε_t
After fitting this model to the Google data, 87% of the series variance is accounted for ( R 2 = 0.87, p < 0.001), and all three coefficients are statistically significant: b 1 = 0.69, p < 0.001, b 2 = 0.003, p = 0.05, and b 3 = −0.0003, p < 0.001. Furthermore, the forecasts implied by the model are much more realistic. Ultimately, it is unlikely that this model will provide accurate forecasts many periods into the future (as is often the case for regression models; Cowpertwait and Metcalfe, 2009 , p. 6; Hyndman and Athanasopoulos, 2014 ). It is more likely that either (a) a negative trend will return the series back to more moderate levels or (b) the series will simply continue at a generally high level. Furthermore, relative to the linear model, the residual error series of this model appears much closer to stationarity (e.g., Figure 4 ), as the initial downward trend of the time series is captured. Therefore, modeling the series as a cubic function of time is the most successful in terms of accounting for the trend, and an even higher-order polynomial term would have little remaining variance to explain (<15%) and would likely lead to an overfitted model. Thus, relative to the two previous models, the cubic model strikes a balance between relative parsimony and descriptive capability. However, any forecasts from this model could be improved upon by removing the remaining trend and including other terms that account for any autocorrelation in the data, topics discussed in an upcoming section on ARIMA modeling.
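The polynomial comparison above can be sketched in Python (the Supplement's analyses are in R) by solving the least-squares normal equations for a design matrix of powers of time. The data here are hypothetical, with a single bend, so the quadratic fit should outperform the linear one in the same way the higher-order models improved on the linear fit above.

```python
def ols(X, y):
    """Least-squares coefficients b solving the normal equations (X'X)b = X'y."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k
    for i in reversed(range(k)):              # back substitution
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return b

def r_squared(y, fitted):
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def poly_trend_r2(y, degree):
    t = range(1, len(y) + 1)
    X = [[ti ** p for p in range(degree + 1)] for ti in t]   # powers of time
    b = ols(X, y)
    fitted = [sum(bp * ti ** p for p, bp in enumerate(b)) for ti in t]
    return r_squared(y, fitted)

# Hypothetical series: a quadratic trend (one bend) plus a small alternating wobble.
y = [0.1 * ti ** 2 - ti + (0.5 if ti % 2 else -0.5) for ti in range(1, 31)]
r2_linear = poly_trend_r2(y, 1)
r2_quadratic = poly_trend_r2(y, 2)
print(r2_quadratic > r2_linear)   # the t^2 term captures the bend
```

As in the paper's comparison, the gain in R² from adding a higher-order term must be weighed against the risk of overfitting noise.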

## Interrupted Time Series Analysis

Although we are interested in describing the underlying trend within the Google time series as a function of time, we are also interested in the effect of a critical event, represented by the following question: “Did the 2008 economic crisis result in elevated rates of job search behaviors?” In psychological science, many research questions center on the impact of an event, whether it be a relationship change, job transition, or major stressor or uplift ( Kanner et al., 1981 ; Dalal et al., 2014 ). In the survey of how time series analysis had been previously used in psychological research, examining the impact of an event was one of its most common uses. In time series methodology, questions regarding the impact of events can be analyzed through interrupted time series analysis (or intervention analysis ; Glass et al., 1975 ), in which the time series observations are “interrupted” by an intervention, treatment, or incident occurring at a known point in time ( Cook and Campbell, 1979 ).

In both academic and applied settings, psychological researchers are often constrained to correlational, cross-sectional data. As a result, researchers rarely have the ability to implement control groups within their study designs and are less capable of drawing conclusions regarding causality. In the majority of cases, it is the theory itself that provides the rationale for drawing causal inferences ( Shmueli, 2010 , p. 290). In contrast, an interrupted time series is the strongest quasi-experimental design to evaluate the longitudinal impact of an event ( Wagner et al., 2002 , p. 299). In a review of previous research on the efficacy of interventions, Beer and Walton (1987) stated, “much of the research overlooks time and is not sufficiently longitudinal. By assessing the events and their impact at only one nearly contemporaneous moment, the research cannot discuss how permanent the changes are” (p. 343). Interrupted time series analysis ameliorates this problem by taking multiple measurements both before and after the event, thereby allowing the analyst to examine the pre- and post-event trend.

Collecting data at multiple time points also offers advantages relative to cross-sectional comparisons based on pre- and post-event means. A longitudinal interrupted time series design allows the analyst to control for the trend prior to the event, which may turn out to be the cause of any alleged intervention effect. For instance, in the field of industrial/organizational psychology, Pearce et al. (1985) found a positive trend in four measures of organizational performance over the course of the 4 years under study. However, after incorporating the effects of the pre-event trend in the analysis, neither the implementation of the policy nor the first year of merit-based rewards yielded any additional effects. That is, the post-event trends were almost totally attributable to the pre-event behavior of the series. Thus, a time series design and analysis yielded an entirely different and more parsimonious conclusion than might have otherwise been drawn. In contrast, Wagner et al. (1988) was able to show that for non-managerial employees, an incentive-based wage system substantially increased employee productivity in both its baseline level and post-intervention slope (the baseline level jumped over 100%). Thus, interrupted time series analysis is an ideal method for examining the impacts of such events and can be generalized to other criteria of interest.

## Modeling an Interrupted Time Series

Statistical modeling of an interrupted time series can be accomplished through segmented regression analysis ( Wagner et al., 2002 , p. 300). Here, the time series is partitioned into two parts: the pre- and post-event segments whose levels (intercepts) and trends (slopes) are both estimated. A change in these parameters represents an effect of the event: A significant change in the level of the series indicates an immediate change, and a change in trend reflects a more gradual change in the outcome (and of course, both are possible; Wagner et al., 2002 , p. 300). The formal model reflects these four parameters of interest:

y_t = b_0 + b_1(t) + b_2(event_t) + b_3(t after event_t) + ε_t
Here, b 0 represents the pre-event baseline level, t is the predictor time (in our example, coded 1–90), and its coefficient, b 1, estimates the trend prior to the event ( Wagner et al., 2002 , p. 31). The dummy variable event t codes for whether or not each time point occurred before or after the event (0 for all points prior to the event; 1 for all points after). Its coefficient, b 2 , assesses the post-event baseline level (intercept). The variable t after event represents how many units after the event the observation took place (0 for all points prior to the event; 1, 2, 3 … for subsequent time points), and its coefficient, b 3 , estimates the change in trend over the two segments. Therefore, the sum of the pre-event trend ( b 1 ) and its estimated change ( b 3 ) yields the post-event slope ( Wagner et al., 2002 , p. 301).
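The coding of these predictors can be sketched in Python on simulated data with a known interruption (the paper's analyses use R with GLS); all parameter values below are hypothetical. Because the break is built into the data, the fit should recover the level change as b 2 and the post-event slope as b 1 + b 3.

```python
def ols(X, y):
    """Least-squares coefficients b solving the normal equations (X'X)b = X'y."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):                       # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k
    for i in reversed(range(k)):               # back substitution
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return b

event_time = 20                                # a priori point of interruption
n = 40
t = list(range(1, n + 1))
event = [1 if ti > event_time else 0 for ti in t]    # 0 before the event, 1 after
t_after = [max(0, ti - event_time) for ti in t]      # periods elapsed since the event

# Simulated series: flat at level 10 before the event; afterwards the
# level jumps by 5 and a trend of 2 units per period begins.
y = [10 + 5 * ev + 2 * ta for ev, ta in zip(event, t_after)]

X = [[1, ti, ev, ta] for ti, ev, ta in zip(t, event, t_after)]
b0, b1, b2, b3 = ols(X, y)
print(round(b2, 6))        # immediate level change
print(round(b1 + b3, 6))   # post-event slope = pre-event trend + estimated change
```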

Importantly, this analysis requires that the time of event occurrence be specified a priori; otherwise a researcher may search the series in an “exploratory” fashion and discover a time point that yields a notable effect, resulting in potentially spurious results ( McCleary et al., 1980 , p. 143). In our example, the event of interest was the economic crisis of 2008. However, as is often the case when analyzing large-scale social phenomena, it was not a discrete, singular incident, but rather unfolded over time. Thus, no exact point in time can perfectly represent its moment of occurrence. In other topics of psychological research, the event of interest may be discrete, such that a unique post-event time can be identified. Although interrupted time series analysis requires that events be discrete, this conceptual problem can be easily managed in practice; selecting a point of demarcation that generally reflects when the event occurred will still allow the statistical model to assess the impact of the event on the level and trend of the series. Therefore, due to prior theory and for simplicity, we specified the pre- and post-crisis segments to be separated at January 2008, representing the beginning of the economic crisis and acknowledging that this demarcation was imperfect, but one that would still allow the substantive research question of interest to be answered.

Although not utilized in our analysis, when analyzing an interrupted time series using segmented regression one has the option of specifying the start of the post-event segment at a point after the actual event occurred. The rationale behind this is to accommodate the time it takes for the causal effect of the event itself to manifest in the time series—the equilibration period (see Mitchell and James, 2001 , p. 539; Wagner et al., 2002 , p. 300). Although an equilibration period is likely a component of all causal phenomena (i.e., causal effects probably never fully manifest at once), two prior reviews have illustrated that researchers account for it only infrequently, both theoretically and empirically ( Kelly and McGrath, 1988 ; Mitchell and James, 2001 ). Statistically, this is accomplished through the segmented regression model above, but simply coding the event as occurring later in the series. Comparing models with different post-event start times can also allow competitive tests of the equilibration period.

## Empirical Example

For our working example, a segmented regression model was fit to the seasonally adjusted Google time series: A linear trend was fit to the first segment and a quadratic trend to the second, due to the noted curvilinear form of the second half of the series. Thus, a new variable and coefficient were added to the formal model to account for this non-linearity: t after event 2 and b 4 , respectively. The results of the analysis indicated that there was a practically significant effect of the crisis: The parameter representing an immediate change in the post-event level was b 2 = 8.66, p < 0.001. Although the level (i.e., intercept) differed across segments, the post-crisis trend appears to be the most notable change in the series. That is, the real effect of the crisis unfolded over time rather than having an immediately abrupt impact. This is reflected in the other coefficients of the model: The pre-crisis trend was estimated to be near zero ( b 1 = −0.03, p = 0.44), and the post-crisis trend terms were b 3 = 0.70, p < 0.001 for the linear component, and b 4 = −0.02, p < 0.001 for the quadratic term, indicating that there was a marked change in trend, but also that it was concave (i.e., on the whole, slowly decreasing over time). Graphically, the model seems to capture the underlying trend of both segments exceptionally well ( R 2 = 0.87, p < 0.001), as the residual error series has almost reached stationarity ( ADF = −3.38, p = 0.06). Both are shown in Figure 7 below.

Figure 7. A segmented regression model used to assess the effect of the 2008 economic crisis on the time series and its associated residual error series .

## Estimating Seasonal Effects

Up until now, we have chosen to remove any seasonal effects by working with the seasonally adjusted time series in order to more fully investigate a trend of substantive interest. This was consistent with the following adage of time series modeling: When a systematic trend or seasonal pattern is present, it must either be modeled or removed. However, psychological researchers may also be interested in the presence and nature of a seasonal effect, and seasonal adjustment would only serve to remove this component of interest. Seasonality was defined earlier as any regular pattern of fluctuation (i.e., movement up or down in the level of the series) associated with some aspect of the calendar. For instance, although online job searches exhibited an underlying trend across years in our data, they also displayed the same pattern of movement within each year (i.e., across months; see Figure 1 ). Following the need for more time-based theory and empirical research, seasonal effects are also increasingly recognized as significant for psychological science. In a recent conceptual review, Dalal et al. (2014) noted that, “mood cycles… are likely to occur simultaneously over the course of a day (relatively short term) and over the course of a year (long term)” (p. 1401). Relatedly, Larsen and Kasimatis (1990) used time series methods to examine the stability of mood fluctuations across individuals. They uncovered a regular weekly fluctuation that was stronger for introverted individuals than for extraverts (due to the latter's sensation-seeking behavior that resulted in greater mood variability).

Furthermore, many systems of interest exhibit rhythmicity. This can be readily observed across a broad spectrum of phenomena that are of interest to psychological researchers. At the individual level, there is a long history in biopsychology exploring the cyclical pattern of human behavior as a function of biological processes. Prior research has consistently shown that humans possess many common physiological and behavioral cycles that range from 90-min to 365-days ( Aschoff, 1984 ; Almagor and Ehrlich, 1990 ) and may affect important psychological outcomes. For instance, circadian rhythms are particularly well-known and are associated with physical, mental, and behavioral changes within a 24-h period ( McGrath and Rotchford, 1983 ). It has been suggested that peak motivation levels may occur at specific points in the day ( George and Jones, 2000 ), and longer cyclical fluctuations of emotion, sensitivity, intelligence, and physical characteristics over days and weeks have been identified (for a review, see Conroy and Mills, 1970 ; Luce, 1970 ; Almagor and Ehrlich, 1990 ). Such cycles have been found to affect intelligence test performance and other physical and cognitive tasks (e.g., Latman, 1977 ; Kumari and Corr, 1996 ).

## Regression with Seasonal Indicators

As previously stated, when seasonal effects are theoretically important, seasonal adjustment is undesirable because it removes the time series component pertinent to the research question at large. An alternative is to qualitatively describe the seasonal pattern or formally specify a regression model that includes a variable which estimates the effect of each season. If a simple linear approximation is used for the trend, the formal model can be expressed as:

y_t = b_0(t) + b_s(t) + ε_t

where b 0 is now the estimate of the linear relationship between the dependent variable and time, s(t) indexes the season in which observation t falls, and the parameters b 1:S are estimates of the S seasonal effects (e.g., S = 12 for yearly data; Cowpertwait and Metcalfe, 2009 , p. 100). Put more intuitively, this model can still be conceived of as a linear model but with a different estimated intercept for each season that represents its effect (notice that the b 1:S parameters are not slope coefficients but season-specific constants).

As an example, the model above was fit to the original, non-seasonally adjusted Google data. Although modeling the series as a linear function of time was found to produce inaccurate forecasts, it can be used when estimating seasonal effects because this component of the model does not affect the estimates of the seasonal effects. For our data, the estimates of each monthly effect were: b 1 = 67.51, b 2 = 59.43, b 3 = 60.11, b 4 = 60.66, b 5 = 63.59, b 6 = 66.77, b 7 = 63.70, b 8 = 62.38, b 9 = 60.49, b 10 = 56.88, b 11 = 52.13, b 12 = 45.66 (Each effect was statistically significant at p < 0.001). The pattern of these intercepts mirrors the pattern of movement qualitatively described in the discussion on the seasonal component: Online job search behavior begins at its highest level in January ( b 1 = 67.51), likely due to the end of holiday employment, and then drops significantly in February ( b 2 = 59.43). Subsequently, its level rises during the next 4 months until June ( b 6 = 66.77), after which the series decreases each successive month until reaching its lowest point in December ( b 12 = 45.66).
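The seasonal-indicator model can be sketched in Python on simulated quarterly data (S = 4 for brevity, rather than the monthly S = 12 of the Google series; the paper's analyses are in R). The hypothetical per-season levels and the common trend are recovered by ordinary least squares over a design matrix with one indicator column per season.

```python
def ols(X, y):
    """Least-squares coefficients b solving the normal equations (X'X)b = X'y."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):                       # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k
    for i in reversed(range(k)):               # back substitution
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return b

S = 4                                   # four seasons (quarterly data, for brevity)
n = 40
t = list(range(1, n + 1))
season = [(ti - 1) % S for ti in t]     # season index 0..S-1 of each observation

effects = [10.0, 20.0, 15.0, 5.0]       # hypothetical per-season levels
y = [0.5 * ti + effects[s] for ti, s in zip(t, season)]   # trend of 0.5 plus seasonal level

# Design matrix: a time column plus S indicator columns (no global intercept,
# so each seasonal coefficient is directly the level for that season).
X = [[ti] + [1 if s == j else 0 for j in range(S)] for ti, s in zip(t, season)]
b = ols(X, y)

print(round(b[0], 6))                   # common trend coefficient
print([round(bj, 6) for bj in b[1:]])   # recovered seasonal levels
```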

## Harmonic Seasonal Models

Another approach to modeling seasonal effects is to fit a harmonic seasonal model that uses sine and cosine functions to describe the pattern of fluctuations seen across periods. Seasonal effects often vary in a smooth, continuous fashion, and instead of estimating a discrete intercept for each season, this approach can provide a more realistic model of seasonal change (see Cowpertwait and Metcalfe, 2009 , pp. 101–108). Formally, the model is:
y_t = m_t + Σ_{i=1}^{S/2} [ s_i sin(2π i t / S) + c_i cos(2π i t / S) ] + ε_t
where m t is the estimate of the trend at t (approximated as a linear or polynomial function of time), s i and c i are the unknown parameters of interest, S is the number of seasons within the time period (e.g., 12 months for a yearly period), i is an index that ranges from 1 to S/2 , and t is a variable that is coded to represent time (e.g., 1:90 for 90 equally-spaced observations). Although this model is complex, it can be conceived as including a predictor for each season that contains a sine and/or cosine term. For yearly data, this means that six s and six c coefficients estimate the seasonal pattern ( S/2 coefficients for each parameter type). Importantly, after this initial model is estimated, the coefficients that are not statistically significant can be dropped, which often results in fewer parameters relative to the seasonal indicator model introduced first ( Cowpertwait and Metcalfe, 2009 , p. 104). For our data, the above model was fit using a linear approximation for the trend, and five of the original twelve seasonal coefficients were statistically significant and thus retained: c 1 = −5.08, p < 0.001, s 2 = 2.85, p = 0.005, s 3 = 2.68, p = 0.009, c 3 = −2.25, p = 0.03, c 5 = −2.97, p = 0.004. This model also explained a substantial amount of the series variance ( R 2 = 0.75, p < 0.001). Pre-made and annotated R code for this analysis can be found in the Supplementary Material.
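The sine and cosine regressors underlying the harmonic model can be illustrated with a small Python sketch. Under a simplifying assumption not required by the model in general (a series with no trend and whole periods of data, so that the harmonic columns are mutually orthogonal), each s i and c i can be recovered by direct projection rather than a full regression; the data below are synthetic, built from two known harmonics:

```python
import math

# Harmonic seasonal sketch: recover s_i and c_i by projecting the
# mean-centered series onto sin(2*pi*i*t/S) and cos(2*pi*i*t/S).
# Valid here because the data contain whole periods and no trend;
# the Nyquist frequency i = S/2 is omitted for simplicity.

def harmonic_coefficients(y, S):
    """Return (s_i, c_i) estimates for i = 1 .. S//2 - 1."""
    n = len(y)
    mean = sum(y) / n
    out = []
    for i in range(1, S // 2):
        s_i = (2.0 / n) * sum((y[t] - mean) * math.sin(2 * math.pi * i * (t + 1) / S)
                              for t in range(n))
        c_i = (2.0 / n) * sum((y[t] - mean) * math.cos(2 * math.pi * i * (t + 1) / S)
                              for t in range(n))
        out.append((s_i, c_i))
    return out

# Synthetic monthly series over 5 full years built from two harmonics:
# amplitude 4.0 on the first sine, -2.5 on the second cosine.
S = 12
y = [50 + 4.0 * math.sin(2 * math.pi * (t + 1) / S)
     - 2.5 * math.cos(2 * math.pi * 2 * (t + 1) / S) for t in range(60)]
coefs = harmonic_coefficients(y, S)
```

Because the two generating harmonics are orthogonal to all the others over complete cycles, the projections recover 4.0 and −2.5 exactly while every other coefficient is zero, mirroring how non-significant terms are dropped from the fitted model.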

## Time Series Forecasting: ARIMA ( p, d, q ) Modeling

In the preceding section, a number of descriptive and explanatory regression models were introduced that addressed various topics relevant to psychological research. First, we sought to determine how the trend in the series could best be described as a function of time. Three models were fit to the data, and modeling the trend as a cubic function provided the best fit: It was the most parsimonious model that explained a very large amount of variation in the series, it did not systematically over- or underestimate many successive observations, and its forecasts were clearly superior to those of the simpler linear and quadratic models. In the subsequent section, a segmented regression analysis was conducted in order to examine the impact of the 2008 economic crisis on job search behavior. It was found that there was both a significant immediate increase in the baseline level of the series (intercept) and a concomitant increase in its trend (i.e., slope) that gradually decreased over time. Finally, the seasonal effects of online search behavior were estimated and mirrored the pattern of job employment rates described in a prior section.

From these analyses, it can be seen that the main features of many time series are the trend and seasonal components, which must either be modeled as deterministic functions of predictors or removed from the series. However, as previously described, another critical feature of time series data is its autocorrelation , and a large portion of time series methodology is aimed at explaining this component ( Dettling, 2013 , p. 2). Accounting for autocorrelation primarily entails fitting an ARIMA model to the original series, or adding ARIMA terms to a previously fit regression model; ARIMA models are the most general class of models that seek to explain the autocorrelation frequently found in time series data ( Hyndman and Athanasopoulos, 2014 ). Without these terms, a regression model will ignore the pattern of autocorrelation among the residuals and produce less accurate forecasts ( Hyndman and Athanasopoulos, 2014 ). ARIMA models are therefore predictive forecasting models . Time series models that include both regression and ARIMA terms are referred to as dynamic models and may be a primary type of time series model used by psychological researchers.

Although not strongly emphasized within psychological science, forecasting is an important aspect of scientific verification ( Popper, 1968 ). Standard cross-sectional and longitudinal models are generally used in an explanatory fashion (e.g., estimating the relationships among constructs and testing null hypotheses), but they are quite capable of prediction as well. Given the apparent movement toward more time-based empirical research and theory, predicting future values will likely become a more important aspect of statistical modeling, as it can validate psychological theory ( Weiss and Cropanzano, 1996 ) and computational models ( Tobias, 2009 ) that specify effects over time.

At the outset, it is helpful to note that the regression and ARIMA modeling approaches are not substantially different: They both formalize the variation in the time series variable as a function of predictors and some stochastic noise (i.e., the error term). The only practical difference is that while regression models are generally built from prior research or theory, ARIMA models are developed empirically from the data (as will be seen presently; McCleary et al., 1980 , p. 20). In describing ARIMA modeling, the following sections take the form of those discussing regression methods: Conceptual and mathematical treatments are provided in complement in order to provide the reader with a more holistic understanding of these methodologies.

## Introduction

The first step in ARIMA modeling is to visually examine a plot of the series' ACF (autocorrelation function) to see whether any autocorrelation is present that can be used to improve the regression model; otherwise, the analyst may end up adding unnecessary terms. The ACF for the Google data is shown in Figure 3 . Again, we will work with the seasonally adjusted series for simplicity. More formally, if a regression model has been fit, the Durbin–Watson test can be used to assess whether there is autocorrelation among the residuals and whether ARIMA terms could improve its forecasts. The Durbin–Watson test evaluates the null hypothesis that there is no lag-1 autocorrelation in the residuals; thus, a rejection of the null means that ARIMA terms can be included (the Ljung–Box test described below can also be used; Hyndman and Athanasopoulos, 2014 ).
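The Durbin–Watson statistic itself is simple to compute from a residual series: it is roughly 2(1 − r1), so values near 2 suggest no lag-1 autocorrelation, values near 0 strong positive autocorrelation, and values near 4 strong negative autocorrelation. A minimal Python sketch on two contrived residual series (not the article's data):

```python
import math

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by their sum of squares. DW ~ 2(1 - r1).

def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

# Strongly positively autocorrelated residuals (a slow sine drift)...
auto = [math.sin(t / 5.0) for t in range(200)]
# ...vs. alternating, negatively autocorrelated residuals.
alt = [(-1) ** t for t in range(200)]

dw_auto = durbin_watson(auto)   # near 0: positive autocorrelation
dw_alt = durbin_watson(alt)     # near 4: negative autocorrelation
```

The statistic is a quick screen only; its formal significance test depends on the design matrix, which is why dedicated routines (such as those in the R packages the article relies on) are used in practice.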

Although the modeling techniques described in the present and following sections can be applied to any one of these models, due to space constraints we continue the tutorial on time series modeling using the cubic model of the first section. A model with only one predictor (viz., time) will allow more focus on the additional model terms that will be added to account for the autocorrelation in the data.

## I( d ): Integrated Terms

ARIMA is an acronym formed by the three constituent parts of these models. The AR( p ) and MA( q ) components are predictors that explain the autocorrelation. In contrast, the integrated (I[ d ]) portion of an ARIMA model does not add predictors to the forecasting equation. Rather, it indicates the order of differencing that has been applied to the time series in order to remove any trend in the data and render it stationary. Before any AR or MA terms can be included, the series must be stationary . Thus, ARIMA models allow non-stationary series to be modeled thanks to this "integrated" component (an advantage over simpler ARMA models that do not include such terms; Cowpertwait and Metcalfe, 2009 , p. 137). A time series that has been made stationary by taking the d th difference of the original series is notated as I( d ). For instance, an I(1) model indicates that the series has been made stationary by taking its first differences; I(2), by the second differences (i.e., the first differences of the first differences); and so on. Thus, the order of integrated terms in an ARIMA model merely specifies how many iterations of differencing were performed in order to make the series stationary so that AR and MA terms may be included.

## Identifying the Order of Differencing

Identifying the appropriate order of differencing to stationarize the series is the first and perhaps most important step in selecting an ARIMA model ( Nau, 2014 ). It is also relatively straightforward. As stated previously, the order of differencing rarely needs to be greater than two in order to stationarize the series. Therefore, in practice the choice comes down to whether the series is transformed into its first or its second differences, the optimal choice being the order of differencing that results in the lowest series variance (while avoiding the increase in variance that characterizes overdifferencing).
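This variance heuristic is easy to see on synthetic data. In the Python sketch below (illustrative values only, not the article's series), a linear trend plus white noise has a large variance; one difference removes the trend and minimizes the variance; a second, unnecessary difference inflates it again, the signature of overdifferencing:

```python
import random
import statistics

# Differencing to remove trend: compare the series variance at d = 0, 1, 2.

def diff(y, d=1):
    """Return the series after d rounds of first differencing."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

random.seed(7)
y = [0.8 * t + random.gauss(0, 1) for t in range(300)]   # trend + white noise

v0 = statistics.variance(y)           # dominated by the trend: very large
v1 = statistics.variance(diff(y, 1))  # trend removed: roughly 2 * sigma^2
v2 = statistics.variance(diff(y, 2))  # overdifferenced: roughly 6 * sigma^2
```

Choosing d = 1 here follows directly from the rule in the text: v1 is the minimum, and going to second differences only adds variance.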

## AR( p ): Autoregressive Terms

The first part of an ARIMA model is the AR( p ) component, which stands for autoregressive . As correlation is to regression, autocorrelation is to autoregression. That is, in regression, variables that are correlated with the criterion can be used for prediction, and the model specifies the criterion as a function of the predictors. Similarly, with a variable that is autocorrelated (i.e., correlated with itself across time periods), past values can serve as predictors, and the values of the time series are modeled as a function of previous values (thus, autoregression ). In other words, an ARIMA ( p, d, q ) model with p AR terms is simply a linear regression of the time series values against the preceding p observations. Thus, an ARIMA(1, d, q ) model includes one predictor, the observation immediately preceding the current value, and an ARIMA(2, d, q ) model includes two predictors, the first and second preceding observations. The number of these autoregressive terms is called the order of the AR component of the ARIMA model. The following equation uses one AR term (an AR[1] model) in which the preceding value in the time series is used as a regressor:
y_t = ϕ y_{t−1} + ε_t
where ϕ is the autoregressive coefficient (interpretable as a regression coefficient), and y t−1 is the immediately preceding observation. More generally, a model with AR( p ) terms is expressed as:
y_t = ϕ_1 y_{t−1} + ϕ_2 y_{t−2} + … + ϕ_p y_{t−p} + ε_t
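Because an AR model is literally a regression on lagged values, ϕ can be estimated with an ordinary least squares slope. A minimal Python sketch, simulating an AR(1) process with a known ϕ = 0.7 and recovering it by regressing y t on y t−1 (synthetic data, not the article's series):

```python
import random

# AR terms as ordinary regression on lagged values: simulate y_t = 0.7*y_{t-1} + e_t
# and recover the autoregressive coefficient with the closed-form OLS slope.

random.seed(1)
phi = 0.7
y = [0.0]
for _ in range(5000):
    y.append(phi * y[-1] + random.gauss(0, 1))

x, z = y[:-1], y[1:]          # predictor: y_{t-1}; criterion: y_t
mx = sum(x) / len(x)
mz = sum(z) / len(z)
phi_hat = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
           / sum((a - mx) ** 2 for a in x))   # OLS slope, an estimate of phi
```

With 5000 observations the estimate lands close to the true 0.7; an AR( p ) model simply extends this regression to p lagged predictors.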
## Selecting the Number of Autoregressive Terms

The number of autoregressive terms required depends on how many lagged observations explain a significant amount of unique autocorrelation in the time series. Again, an analogy can be made to multiple linear regression: Each predictor should account for a significant amount of variance after controlling for the others. However, a significant autocorrelation at higher lags may be attributable to an autocorrelation at a lower lag. For instance, if a strong autocorrelation exists at lag-1, then a significant lag-3 autocorrelation (i.e., a correlation of time t with t -3) may be a result of t being correlated with t -1, t -1 with t -2, and t -2 with t -3 (and so forth). That is, a strong autocorrelation at an early lag can “persist” throughout the time series, inducing significant autocorrelations at higher lags. Therefore, instead of inspecting the ACF which displays zero-order autocorrelations, a plot of the partial autocorrelation function (PACF) across different lags is the primary method in determining which prior observations explain a significant amount of unique autocorrelation, and accordingly, how many AR terms (i.e., lagged observations as predictors) should be included. Put simply, the PACF displays the autocorrelation of each lag after controlling for the autocorrelation due to all preceding lags ( McCleary et al., 1980 , p. 75). A conventional rule is that if there is a sharp drop in the PACF after p lags, then the previous p -values are responsible for the autocorrelation in the series, and the model should include p autoregressive terms (the partial autocorrelation coefficient typically being the value of the autoregressive coefficient, ϕ; Cowpertwait and Metcalfe, 2009 , p. 81). Additionally, the ACF of such a series will gradually decay (i.e., reduce) toward zero as the lag increases.

Applying this knowledge to the empirical example, Figure 3 depicted the ACF of the seasonally adjusted Google time series, and Figure 8 displays its PACF. Here, only one lagged partial autocorrelation is statistically significant (lag-1), despite over a dozen autocorrelations in the ACF reaching significance. Thus, it is probable that the lag-1 autocorrelation is responsible for the chain of autocorrelation that persists throughout the series. Although the series is considerably non-stationary (i.e., there is a marked trend and seasonal component), if the series were already stationary, then a model with a single AR term (an AR[1] model) would likely provide the best fit, given the single significant partial autocorrelation at lag-1. The ACF in Figure 3 also displays the characteristics of an AR(1) series: It has many significant autocorrelations that gradually reduce toward zero. This coheres with the notion that one AR term is often sufficient for a residual time series ( Cowpertwait and Metcalfe, 2009 , p. 121). However, if the pattern of autocorrelation is more complex, then additional AR terms may be required. Importantly, if a particular number of AR terms has been successful in explaining the autocorrelation of a stationary series, the residual error series should appear as entirely random white noise (as in Figure 4 ).

Figure 8. A plot of the partial autocorrelation function (PACF) of the seasonally adjusted time series of Google job searches .

## MA( q ): Moving Average Terms

In the preceding section, it was shown that one can account for the autocorrelation in the data by regressing on prior values in the series (AR terms). However, sometimes the autocorrelation is more easily explained by the inclusion of MA terms; the use of MA terms to explain the autocorrelation—either on their own or in combination with AR components—can result in greater parameter parsimony (i.e., fewer parameters), relative to relying solely on AR terms ( Cowpertwait and Metcalfe, 2009 , p. 127). As noted above, ARIMA models assume that any systematic components have either been modeled or removed and that the time series is stationary—i.e., a stochastic process. In time series theory, the values of stochastic processes are determined by two forces: prior values, described in the preceding section, and random shocks (i.e., errors; McCleary et al., 1980 , pp. 18–19). Random shocks are the myriad variables that vary across time and interact with such complexity that their behavior is ostensibly random (e.g., white noise; McCleary et al., 1980 , p. 40). Each shock can be conceived of as an unobserved value at each point in time that influences each observed value of the time series. Thus, autocorrelation in the data may be explained by the persistence of prior values (or outputs , as in AR terms) or, alternatively, the lingering effects of prior unobserved shocks (i.e., the inputs , in MA terms). Therefore, if prior random shocks are related to the value of the series, then these can be included in the prediction equation to explain the autocorrelation and improve the efficiency of the forecasts generated by the model. In other words, just as AR terms can be conceived as a linear regression on previous time series values, MA terms are conceptually a linear regression of the current value of the series against prior random shocks. For instance, an MA(1) model can be expressed as:
y_t = ε_t + θ ε_{t−1}
where ε t is the value of the random shock at time t, ε t− 1 is the value of the previous random shock, and θ is its coefficient (again, interpretable as a regression coefficient). More generally, the order of MA terms is conventionally denoted as q , and an MA( q ) model can be expressed as:
y_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + … + θ_q ε_{t−q}
## Selecting the Number of MA Terms

Selecting the number of MA terms in the model is conceptually similar to the process of identifying the number of AR terms: One examines plots of the autocorrelation (ACF) and partial autocorrelation functions (PACF) and then specifies an appropriate model. However, while the number of AR terms could be identified by the PACF of the series (more specifically, the point at which the PACF dropped), the number of appropriate MA terms is usually identified by the ACF. Specifically, if the ACF is non-zero for the first q lags and then drops toward zero, then q MA terms should be included in the model ( McCleary et al., 1980 , p. 79). All successive lags of the ACF are expected to be zero, and the PACF of such a series will gradually decay ( McCleary et al., 1980 , p. 79). Thus, relative to AR terms, the roles of the ACF and PACF are essentially reversed when determining the number of MA terms. Furthermore, in practice most social processes can be sufficiently modeled by a single MA term; models of order q = 2 are less common, and higher-order models are extremely rare ( McCleary et al., 1980 , p. 63).
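These reversed "signatures" can be demonstrated by simulation. The Python sketch below (synthetic processes with hypothetical coefficients of 0.8) computes sample autocorrelations for an MA(1) and an AR(1) series: the MA(1) ACF is substantial at lag 1 and essentially zero afterward, whereas the AR(1) ACF decays gradually, roughly as ϕ^k:

```python
import random

# ACF signatures: MA(1) cuts off after lag 1; AR(1) decays gradually.

def acf(y, k):
    """Lag-k sample autocorrelation."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y)
    return sum((y[t] - m) * (y[t + k] - m) for t in range(n - k)) / c0

random.seed(3)
e = [random.gauss(0, 1) for _ in range(20001)]          # random shocks
ma1 = [e[t] + 0.8 * e[t - 1] for t in range(1, len(e))]  # MA(1), theta = 0.8

ar1 = [0.0]                                              # AR(1), phi = 0.8
for _ in range(20000):
    ar1.append(0.8 * ar1[-1] + random.gauss(0, 1))
```

For the MA(1) series the theoretical lag-1 autocorrelation is θ/(1 + θ²) ≈ 0.49 with zero at lag 2 and beyond; for the AR(1) series the lag-1 autocorrelation is near 0.8 and remains sizable several lags out.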

## Model Building and Further Notes on ARIMA ( p, d, q ) Models

The components of ARIMA models (autoregressive, integrated, and moving average) are aimed at explaining the autocorrelation in a series that is either stationary or can be made so through differencing (i.e., I[ d ] integrated terms). Though already stated, the importance of the following point warrants reiteration: After a successful ARIMA( p, d, q ) model has been fit to the autocorrelated data, the residual error series should be a white noise series. That is, after a good-fitting model has been specified, the residual error series should display no significant autocorrelations and should have a mean of zero and constant variance; i.e., there should be no remaining signal that can be used to improve the model's forecasts. Thus, after specifying a particular model, visual inspection of the ACF and PACF of the error series is critical in order to assess model adequacy ( McCleary et al., 1980 , p. 93). All autocorrelations are expected to be zero, with about 5% expected to be statistically significant due to sampling error.

Furthermore, just as there are formal methods to test that a series is stationary before fitting an ARIMA model, there are also statistical tests for the presence of autocorrelation after the model has been fit. The Ljung–Box test ( Ljung and Box, 1978 ) is one commonly-applied method in which the null hypothesis is that the errors are uncorrelated across many lags ( Cryer and Chan, 2008 , p. 184; Hyndman and Athanasopoulos, 2014 ). Thus, failing to reject the null provides evidence that the model has succeeded in explaining the remaining autocorrelation in the data.
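The Ljung–Box statistic itself is straightforward: Q = n(n + 2) Σ r_k²/(n − k) over the first h lags, referred to a chi-square distribution with (approximately) h degrees of freedom. The Python sketch below (synthetic residual series, not the article's models) computes Q for white-noise residuals and for residuals left autocorrelated by an inadequate model:

```python
import random

# Ljung-Box Q statistic for residual diagnostics. Autocorrelated residuals
# produce a Q far out in the chi-square tail; white-noise residuals typically
# yield a Q near its expectation of h.

def acf(y, k):
    """Lag-k sample autocorrelation."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y)
    return sum((y[t] - m) * (y[t + k] - m) for t in range(n - k)) / c0

def ljung_box_q(resid, h):
    n = len(resid)
    return n * (n + 2) * sum(acf(resid, k) ** 2 / (n - k) for k in range(1, h + 1))

random.seed(11)
white = [random.gauss(0, 1) for _ in range(500)]   # adequate model: white noise
ar = [0.0]                                         # inadequate model: AR(1) errors
for _ in range(499):
    ar.append(0.6 * ar[-1] + random.gauss(0, 1))

q_white = ljung_box_q(white, 10)
q_ar = ljung_box_q(ar, 10)
```

Against the chi-square(10) 95% cutoff of about 18.3, the autocorrelated residuals yield a Q in the hundreds (a clear rejection), while the white-noise residuals typically fall near 10, consistent with an adequate model.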

If both formal and informal methods indicate that the residual error series is not a series of white noise terms (i.e., there is remaining autocorrelation), then the analyst must reassess the pattern of autocorrelation and re-specify a new model. Thus, in contrast to regression approaches, ARIMA modeling is an exploratory, iterative process in which the data is examined, models are specified, checked for adequacy, and then re-specified as needed. However, selecting the most appropriate order of AR, I, and MA terms can prove to be a difficult process ( Hyndman and Athanasopoulos, 2014 ). Fortunately, model comparison can be easily performed by comparing the Akaike information criterion (AIC) across models ( Akaike, 1974 ) 7 . This statistic is based on the fit of a model and its number of parameters, and models with lower values should be selected. Generally, models within two AIC values are considered comparable, a difference of 4–7 points indicates considerable support for the better-fitting model, and a difference of 10 points or greater signifies full support of that model ( Burnham and Anderson, 2004 , p. 271). Additionally, the “forecast” R package ( Hyndman, 2014 ) contains a function to automatically derive the best-fitting ARIMA model based on the AIC or other fit criteria (see Hyndman and Khandakar, 2008 ). This procedure is discussed in the Supplementary Material.

Furthermore, a particular pattern of autocorrelation can often be explained by either AR or MA terms. Generally, AR terms are preferable to MA terms because the interpretation of their parameters is more straightforward (e.g., a regression coefficient associated with a previous time series value rather than a coefficient associated with an unobserved random shock). However, a more central concern is parameter parsimony: If a model using MA terms (or a combination of AR and MA terms) can explain the autocorrelation with fewer parameters than one that relies solely on AR terms, then that model is generally preferable.

Finally, although a mixed ARIMA model containing both AR and MA terms can result in greater parameter parsimony ( Cowpertwait and Metcalfe, 2009 , p. 127), in practice, non-mixed models (i.e., those with either AR or MA terms alone) should always be ruled out prior to fitting these more complex models ( McCleary et al., 1980 , p. 66). Unnecessary model complexity (i.e., redundant parameters) may not become evident at all during the process of model checking, whereas the inadequacy of simpler models is often easily identified (e.g., noticeable remaining autocorrelation in the ACF plot).

## Fitting a Dynamic Regression Model with ARIMA Terms

In this final section, we illustrate how a predictive ARIMA approach to time series modeling can be combined with regression methods through specification of a dynamic regression model. These models can be fit to the data in order to generate accurate forecasts, as well as explain or examine an underlying trend or seasonal effect (as opposed to their removal). We then analyze the predictions from this model and discuss methods of assessing forecast accuracy. For simplicity, we continue with the regression model that modeled the series as a cubic function of time.

## Preliminaries

When the predictor is time, specification of a dynamic regression model should begin by examining the residual error series after the regression model has been fit, in order to detect any autocorrelation in the model residuals that would warrant the inclusion of ARIMA terms. The residual error series is of interest because a dynamic regression model can be thought of as a hybrid model that includes a correction for autocorrelated errors. That is, whatever the regression model does not account for (trend, autocorrelation, etc.) can be supplemented by ARIMA modeling. Analytically, this is performed by re-specifying the initial regression model as an ARIMA model with regressors (sometimes called an "ARIMAX" model, the "X" denoting external predictors) and selecting the order of ARIMA terms that fits the autocorrelation structure in the residuals ( Nau, 2014 ).
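The hybrid logic can be sketched end to end in a few lines of Python. The example below is a deliberately minimal stand-in for the article's analysis (synthetic data with a linear rather than cubic trend, and residuals modeled with a single AR term rather than a full ARIMA fit): regress on time, estimate ϕ from the residual error series, and combine both parts for a one-step-ahead forecast:

```python
import random

# Minimal dynamic-regression sketch: linear trend by OLS, then an AR(1)
# correction for the autocorrelated residuals, combined when forecasting.

random.seed(5)
n = 200
resid = [0.0]                                    # AR(1) errors, phi = 0.7
for _ in range(n - 1):
    resid.append(0.7 * resid[-1] + random.gauss(0, 0.5))
y = [10 + 0.3 * t + resid[t] for t in range(n)]  # trend slope 0.3

# 1) Linear trend via the closed-form simple regression of y on t.
t_mean = (n - 1) / 2.0
y_mean = sum(y) / n
slope = (sum((t - t_mean) * (y[t] - y_mean) for t in range(n))
         / sum((t - t_mean) ** 2 for t in range(n)))
intercept = y_mean - slope * t_mean
e = [y[t] - (intercept + slope * t) for t in range(n)]  # residual error series

# 2) AR(1) coefficient for the residuals (OLS slope on the lagged residual).
phi = (sum(e[t] * e[t - 1] for t in range(1, n))
       / sum(e[t - 1] ** 2 for t in range(1, n)))

# 3) One-step-ahead forecast: trend component plus the predicted residual.
forecast = intercept + slope * n + phi * e[-1]
```

Both structural parameters are recovered (slope near 0.3, ϕ near 0.7), and the forecast uses the regression part for the systematic trend and the AR part for the still-predictable portion of the last residual, exactly the division of labor described above.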

## Identifying the Order of Differencing: I( d ) Terms

As noted previously, the residual error series of the cubic regression model exhibited a remaining trend and autocorrelation (see Figure 6 ). A significant Durbin–Watson test formally confirms that this is the case (i.e., the error terms are not uncorrelated; DW = 0.47, p < 0.001). Thus, ARIMA terms are necessary to (a) stationarize the series (I[ d ] terms) and (b) generate more accurate forecasts (AR[ p ] and/or MA[ q ] terms). As stated above, the conventional first step when formulating an ARIMA model is determining the number of I( d ) terms (i.e., the order of differencing) required to remove any remaining trend and render the series stationary. We note that, in this case, the systematic seasonal effects have already been removed through seasonal adjustment. It was previously noted that, in practice, removing a trend is almost always accomplished by taking either the first or second differences, whichever transformation results in the lowest variance and avoids overdifferencing (i.e., an increase in the series variance). Because the residual trend does not have a markedly changing slope, it is likely that only one order of differencing will be required. The results indicate that this is indeed the case: After first differencing, the series variance is reduced from 13.56 to 6.45, and an augmented Dickey–Fuller test rejects the null hypothesis of a non-stationary series ( ADF = −4.50, p < 0.01). Taking the second differences also results in stationarity (i.e., the trend is removed) but leads to an overdifferenced series with a variance that is inflated to a level higher than that of the original error series ( s 2 = 14.90).

## Identification of AR( p ) and MA( q ) Terms

After the order of I( d ) terms has been identified (here, 1), the next step is to determine whether the pattern of autocorrelation can be better explained by AR terms, MA terms, or a combination of both. As noted, AR terms are often preferred to MA terms because their interpretation is more straightforward, and simpler models with either AR or MA terms alone are preferable to mixed models. We therefore begin by examining plots of the ACF and PACF for the residual error series, shown in Figure 9 , in order to see whether they display either an AR or an MA "signature" (e.g., drop-offs or slow decays).

Figure 9. ACF and PACF of the cubic model residuals used to determine the number of AR and MA terms in an ARIMA model .

From Figure 9 , we can see that there are many high autocorrelations in the ACF plot that slowly decay, indicating that AR terms are probably most suitable (a sharp drop in the ACF would indicate that the autocorrelation is probably better explained by MA terms). As stated earlier, the PACF gives the autocorrelation for a lag after controlling for all earlier lags; a significant drop in the PACF at a particular lag indicates that this lagged value is largely responsible for the large zero-order autocorrelations in the ACF. Based on this PACF, the number of terms to include is less clear; aside from the lag-0 autocorrelation, there is no perceptible drop-off in the PACF, and there are no strong partial autocorrelations to which the persistence of the autocorrelation seen in the ACF can be attributed. However, we know that there is autocorrelation in the model residuals and that one or two AR terms are typically sufficient to account for it ( Cowpertwait and Metcalfe, 2009 , p. 121). We therefore suspect that a single AR term will suffice. After fitting an ARIMA(1, 1, 0) model, a failure to reject the null hypothesis in a Ljung–Box test indicated that the model residuals were indistinguishable from a random white noise series ( χ 2 = 0.005, p = 0.94), and fewer than 5% of the autocorrelations in the ACF were statistically significant (the AIC of this model was 419.80). For illustrative purposes, several other models were fit to the data that included additional AR or MA terms, or a combination of both. Their relative fit was analyzed, and the results are shown in Table 2 . As can be seen, the ARIMA(1, 1, 0) model provided a level of fit that exceeded that of all the other models (i.e., the smallest AIC difference among models was 4, showing considerable support). Thus, this model parsimoniously accounted for the systematic trend through a combination of regression modeling and first differencing and successfully extracted all of the autocorrelation (i.e., signal) from the data in order to achieve more efficient forecasts.

Table 2. Comparison of different ARIMA models .

## Forecasting Methods and Diagnostics

Because forecasts into the future cannot be directly assessed for accuracy until the actual values are observed, it is important that the analyst establish the adequacy of the model prior to forecasting. To do this, the analyst can partition the data into two parts: the estimation period , comprising about 80% of the initial observations and used to estimate the model parameters, and the validation period , usually the remaining 20% of the data and used to ensure that the model predictions are accurate. These percentages may shift depending on the length of the series (see Nau, 2014 ), but the size of the validation period should at least equal the number of periods ahead that the analyst wishes to forecast ( Hyndman and Athanasopoulos, 2014 ). The predictions generated by the model are then compared to the observed data in the validation period to assess their accuracy. Evaluating forecast accuracy is accomplished by examining the residuals for any systematic patterns of misspecification. The observed values should ideally fall within the 95% confidence limits of the forecasts, and formal statistics can be calculated from the model residuals in order to evaluate the model's adequacy. A popular and intuitive statistic is the mean absolute error (MAE): the average absolute deviation of the observed values from the predicted values. However, this value cannot be used to compare models across series, as it is scale-dependent (e.g., a residual with an absolute value of 10 is much less egregious when forecasting from a series whose mean is 10,000 than from a series whose mean is 10). Another statistic, the mean absolute percentage error (MAPE), is useful for such comparisons and is defined as the average percentage by which the forecasts deviate from the observed values. Other methods and statistics, such as the root mean squared error (RMSE) and the mean absolute scaled error (MASE), can aid model evaluation and selection and are accessibly discussed by Hyndman and Athanasopoulos (2014 , chap. 2). Once a forecasting model has been deemed sufficiently accurate through these methods, forecasts into the future can be calculated with relative confidence.
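The estimation/validation split and the two accuracy statistics can be sketched in a few lines of Python. The series and the forecasting rule here are deliberately trivial placeholders (hypothetical values and a naive "last observed value" forecast), chosen only to make the arithmetic of MAE and MAPE transparent:

```python
# Forecast-accuracy sketch: 80/20 estimation-validation split, then score
# the holdout with MAE (scale-dependent) and MAPE (scale-free percentage).

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

series = [50, 52, 55, 53, 56, 58, 60, 59, 62, 64]   # hypothetical observations
split = int(len(series) * 0.8)                      # 80% estimation period
train, test = series[:split], series[split:]

naive_forecast = [train[-1]] * len(test)            # flat "no change" forecast
err_mae = mae(test, naive_forecast)                 # average absolute error
err_mape = mape(test, naive_forecast)               # average percentage error
```

Here the naive forecast misses the two holdout values by 3 and 5 points, giving MAE = 4.0 and MAPE ≈ 6.3%; a real model (such as the dynamic regression above) would be scored on its validation period in exactly the same way.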

Because we have the benefit of hindsight in our example, all observations were used for estimation, and six forecasts were generated for the remainder of the 2011 year and compared to the actual observed values. The point forecasts (blue line), 80%, and 95% confidence limits are displayed in Figure 10 juxtaposed against the actual values in red. As can be seen, this forecasting model is generally successful: Each observed value lies within the 80% limits, and the residuals have a low mean absolute error ( MAE = 2.03) relative to the series mean ( M = 75.47), as well as a low mean absolute percentage error ( MAPE = 2.33). Additional statistics verified the accuracy of these predictions, and the full results of the analysis can be obtained from the first author.

Figure 10. Forecasts from the dynamic regression model compared to the observed values . The blue line represents the forecasts, and the red dotted line indicates the observed values. The darker gray region denotes the 80% confidence region; the lighter gray, the 95%.

As a final note on ARIMA modeling, if the sole goal of the analysis is to produce accurate forecasts, then the seasonal and trend components represent a priori barriers to this goal and should be removed through seasonal adjustment and the I( d ) terms of an appropriate ARIMA model, respectively. Such predictive models are often easier to implement, as there are no systematic components of interest to describe or estimate; they are simply removed through transformations in order to achieve a stationary series. Finally, we close this section with two tables. The first, Table 3 , compiles the general steps involved in ARIMA time series modeling described above, from selecting the optimal order of ARIMA terms to assessing forecast accuracy. The second, Table 4 , provides a reference for the various time series terms introduced in the current paper.

Table 3. Steps for specifying an ARIMA forecasting model .

Table 4. Glossary of time series terms .

## Addendum: Further Time Series Techniques and Resources

Finally, because time series analysis encompasses a wide range of analytic techniques, there was not room to cover them all here (or in any introductory article, for that matter). For a discussion of computing correlations between time series (i.e., the cross-correlation function), the reader is directed to McCleary et al. (1980) . For a general introduction to regression modeling, Cowpertwait and Metcalfe (2009) and Ostrom (1990) offer excellent discussions, the latter describing the process of identifying lagged effects. For a highly accessible exposition of identifying cycles or seasonal effects within the data through periodogram and spectral analysis , the reader should consult Warner (1998) , a text written for social scientists that also describes cross-spectral analysis , a method for assessing how well cycles within two series align. For regression modeling using other time series as substantive predictors, the analyst can use transfer function or dynamic regression modeling and is referred to Pankratz (1991) and Shumway and Stoffer (2006) for further reading. For additional information on forecasting with ARIMA models and other methods, we refer the reader to Hyndman and Athanasopoulos (2014) and McCleary et al. (1980) . Finally, multivariate time series analysis can model reciprocal causal relations among time series through vector ARMA models; for discussions, we recommend Liu (1986) , Wei (2006) , and the introduction in Pankratz (1991 , chap. 10). Future work should attempt to incorporate these analytic frameworks within psychological research, as the analysis of time series brings a host of complex issues (e.g., detecting cycles, guarding against spurious regression and correlation) that must be handled appropriately for proper data analysis and the development of psychological theory.

Time series analysis has proved integral to many disciplines over many decades. As time series data become more accessible to psychologists, these methods will be increasingly central to addressing substantive research questions in psychology as well. Indeed, we believe that such shifts have already started and that an introduction to time series analysis is therefore of substantial importance. By integrating time series methodologies within psychological research, scholars will be impelled to think about how variables at various psychological levels may exhibit trends, cyclical or seasonal patterns, or a dependence on prior states (i.e., autocorrelation). Furthermore, when examining the influence of salient events or "shocks," essential questions such as "What was the pre-event trend?," "How long did the effects endure?," and "What was their trajectory?" will become natural extensions. In other words, researchers will think in an increasingly longitudinal manner and will possess the necessary statistical knowledge to answer any resulting research questions, the importance of which was demonstrated above.

The ultimate goal of this introductory paper is to foster such fruitful lines of conceptualizing research. The more proximal goal is to provide an accessible yet comprehensive exposition of a number of time series modeling techniques fit for addressing a wide range of research questions. These models span descriptive, explanatory, and predictive frameworks, all three of which are necessary to accommodate the complex, dynamic nature of psychological theory and its data.

## Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.00727/abstract

1. ^ These journals were: Psychological Review, Psychological Bulletin, Journal of Personality and Social Psychology, Journal of Abnormal Psychology, Cognition, American Psychologist, Journal of Applied Psychology, Psychological Science, Perspectives on Psychological Science, Current Directions in Psychological Science, Journal of Experimental Psychology: General, Cognitive Psychology, Trends in Cognitive Sciences, Personnel Psychology, and Frontiers in Psychology .

2. ^ The specific search term was, “jobs – Steve Jobs” which excluded the popular search phrase “Steve Jobs” that would have otherwise unduly influenced the data.

3. ^ Thus, the highest value in the series must be set at 100—i.e., 100% of itself. Furthermore, although measuring a variable in terms of percentages can be misleading when assessing practical significance (e.g., a change from 1 to 4 yields a 400% increase, but may not be a large change in practice), the presumably large raw numbers of searches that include the term “jobs” entail that even a single point increase or decrease in the data is notable.

4. ^ In addition to the two classical models (additive and multiplicative) described above, there are further techniques for time series decomposition that lie beyond the scope of this introduction (e.g., STL or X-12-ARIMA decomposition). These overcome the known shortcomings of classical decomposition (e.g., the trend component is not estimated for the first and last several observations; Hyndman and Athanasopoulos, 2014 ), although classical decomposition still remains the most commonly used method. For information regarding these alternative methods the reader is directed to Cowpertwait and Metcalfe (2009 , pp. 19–22) and Hyndman and Athanasopoulos (2014 , chap. 6).

5. ^ Importantly, the current paper discusses dynamic models that specify time as the regressor (either as a linear or polynomial function). For modeling substantive predictors, more sophisticated techniques are necessary, and the reader is directed to Pankratz (1991) for a description of this method.

6. ^ Just like in traditional regression, the parent term t is centered before creating the polynomial term in order to mitigate collinearity.

7. ^ The use of additional fit indices, such as the AICc (a variant of the AIC for small samples) and the Bayesian information criterion (BIC), is also recommended, but we focus on the AIC here for simplicity.

Aguinis, H., Gottfredson, R. K., and Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organ. Res. Methods 16, 270–301. doi: 10.1177/1094428112470848

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Contr . 19, 716–723. doi: 10.1109/TAC.1974.1100705

Almagor, M., and Ehrlich, S. (1990). Personality correlates and cyclicity in positive and negative affect. Psychol. Rep . 66, 1159–1169.

Anderson, O. (1976). Time Series Analysis and Forecasting: The Box-Jenkins Approach . London: Butterworths.

Aschoff, J. (1984). “A survey of biological rhythms,” in Handbook of Behavioral Neurobiology, Vol. 4, Biological Rhythms , ed J. Aschoff (New York, NY: Plenum), 3–10.

Beer, M., and Walton, A. E. (1987). Organization change and development. Annu. Rev. Psychol . 38, 339–367.

Bell, W. R., and Hillmer, S. C. (1984). Issues involved with the seasonal adjustment of time series. J. Bus. Econ. Stat . 2, 291–320. doi: 10.2307/1391266

Bolger, N., DeLongis, A., Kessler, R. C., and Schilling, E. A. (1989). Effects of daily stress on negative mood. J. Pers. Soc. Psychol . 57, 808–818.

Burnham, K., and Anderson, D. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res . 33, 261–304. doi: 10.1177/0049124104268644

Busk, P. L., and Marascuilo, L. A. (1988). Autocorrelation in single-subject research: a counterargument to the myth of no autocorrelation. Behav. Assess . 10, 229–242.

Carayon, P. (1995). “Chronic effect of job control, supervisor social support, and work pressure on office worker stress,” in Organizational Risk Factors for Job Stress , eds S. L. Sauter and L. R. Murphy (Washington, DC: American Psychological Association), 357–370.

Chatfield, C. (2004). The Analysis of Time Series: An Introduction, 6th Edn . New York, NY: Chapman and Hall/CRC.

Conroy, R. T., and Mills, W. L. (1970). Human Circadian Rhythms . Baltimore, MD: The Williams and Wilkins Company.

Cook, T. D., and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings . Boston, MA: Houghton Mifflin.

Cowpertwait, P. S., and Metcalfe, A. (2009). Introductory Time Series with R . New York, NY: Springer-Verlag.

Cryer, J. D., and Chan, K.-S. (2008). Time Series Analysis: With Applications in R, 2nd Edn . New York, NY: Springer.

Dalal, R. S., Bhave, D. P., and Fiset, J. (2014). Within-person variability in job performance: a theoretical review and research agenda. J. Manag . 40, 1396–1436. doi: 10.1177/0149206314532691

Dettling, M. (2013). Applied Time Series Analysis [PDF Document] . Available online at: http://stat.ethz.ch/education/semesters/ss2012/atsa/ATSA-Scriptum-SS2012-120521.pdf

Fairbairn, C. E., and Sayette, M. A. (2013). The effect of alcohol on emotional inertia: a test of alcohol myopia. J. Abnorm. Psychol . 122, 770–781. doi: 10.1037/a0032980

Friston, K. J., Holmes, A. P., Poline, J. B., Grasby, P. J., Williams, S. C. R., Frackowiak, R. S. J., et al. (1995). Analysis of fMRI time series revisited. Neuroimage 2, 45–53.

Friston, K. J., Josephs, O., Zarahn, E., Holmes, A. P., Rouquette, S., and Poline, J-B. (2000). To smooth or not to smooth? Bias and efficiency in fMRI time series analysis. Neuroimage 12, 196–208. doi: 10.1006/nimg.2000.0609

Fuller, J. A., Stanton, J. M., Fisher, G. G., Spitzmuller, C., and Russell, S. S. (2003). A lengthy look at the daily grind: time series analysis of events, mood, stress, and satisfaction. J. Appl. Psychol . 88, 1019–1033. doi: 10.1037/0021-9010.88.6.1019

George, J. M., and Jones, G. R. (2000). The role of time in theory and theory building. J. Manage . 26, 657–684. doi: 10.1177/014920630002600404

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., and Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014. doi: 10.1038/nature07634

Glass, G. V., Willson, V. L., and Gottman, J. M. (1975). Design and Analysis of Time Series Experiments . Boulder, CO: Colorado Associated University Press.

Hartmann, D. P., Gottman, J. M., Jones, R. R., Gardner, W., Kazdin, A. E., and Vaught, R. S. (1980). Interrupted time-series analysis and its application to behavioral data. J. Appl. Behav. Anal . 13, 543–559.

Hays, W. L. (1981). Statistics, 2nd Edn . New York, NY: Holt, Rinehart, and Winston.

Hyndman, R. J. (2014). Forecast: Forecasting Functions for Time Series and Linear Models . R Package Version 5.4. Available online at: http://CRAN.R-project.org/package=forecast

Hyndman, R. J., and Athanasopoulos, G. (2014). Forecasting: Principles and Practice . OTexts. Available online at: http://otexts.org/fpp/

Hyndman, R. J., and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. J. Stat. Softw . 26, 1–22.

Jones, R. R., Vaught, R. S., and Weinrott, M. (1977). Time series analysis in operant research. J. Appl. Behav. Anal . 10, 151–166.

Kanner, A. D., Coyne, J. C., Schaefer, C., and Lazarus, R. S. (1981). Comparisons of two modes of stress measurement: daily hassles and uplifts versus major life events. J. Behav. Med . 4, 1–39.

Kelly, J. R., and McGrath, J. E. (1988). On Time and Method . Newbury Park, CA: Sage.

Kerlinger, F. N. (1973). Foundations of Behavioral Research, 2nd Edn . New York, NY: Holt, Rinehart.

Killingsworth, M. A., and Gilbert, D. T. (2010). A wandering mind is an unhappy mind. Science 330, 932–933. doi: 10.1126/science.1192439

Kuljanin, G., Braun, M. T., and DeShon, R. P. (2011). A cautionary note on applying growth models to longitudinal data. Psychol. Methods 16, 249–264. doi: 10.1037/a0023348

Kumari, V., and Corr, P. J. (1996). Menstrual cycle, arousal-induction, and intelligence test performance. Psychol. Rep . 78, 51–58.

Larsen, R. J., and Kasimatis, M. (1990). Individual differences in entrainment of mood to the weekly calendar. J. Pers. Soc. Psychol . 58, 164–171.

Larson, R., and Csikszentmihalyi, M. (1983). The experience sampling method. New Dir. Methodol. Soc. Behav. Sci . 15, 41–56.

Latman, N. (1977). Human sensitivity, intelligence and physical cycles and motor vehicle accidents. Accid. Anal. Prev . 9, 109–112.

Liu, L.-M. (1986). Multivariate Time Series Analysis using Vector ARMA Models . Lisle, IL: Scientific Computing Associates.

Ljung, G. M., and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297–303.

Luce, G. G. (1970). Biological Rhythms in Psychiatry and Medicine . Public Health Service Publication (Public Health Service Publication No. 2088). U.S. Institute of Mental Health.

McCleary, R., Hay, R. A., Meidinger, E. E., and McDowall, D. (1980). Applied Time Series Analysis for the Social Sciences . Beverly Hills, CA: Sage.

McGrath, J. E., and Rotchford, N. L. (1983). Time and behavior in organizations. Res. Organ. Behav . 5, 57–101.

Meko, D. M. (2013). Applied Time Series Analysis [PDF Documents]. Available online at: http://www.ltrr.arizona.edu/~dmeko/geos585a.html#chandout

Mills, T. C., and Markellos, R. N. (2008). The Econometric Modeling of Financial Time Series, 3rd Edn . Cambridge: Cambridge University Press.

Mitchell, T. R., and James, L. R. (2001). Building better theory: time and the specification of when things happen. Acad. Manage. Rev . 26, 530–547. doi: 10.5465/AMR.2001.5393889

Muenchen, R. A. (2013). The Popularity of Data Analysis Software . Available online at: http://r4stats.com/articles/popularity/

Nau, R. (2014). Statistical Forecasting , [Online Lecture Notes]. Available online at: http://people.duke.edu/~rnau/411home.htm

Ostrom, C. W. (1990). Time Series Analysis: Regression Techniques . Newbury Park, CA: Sage.

Pankratz, A. (1991). Forecasting with Dynamic Regression Models . New York, NY: Wiley.

Pearce, J. L., Stevenson, W. B., and Perry, J. L. (1985). Managerial compensation based on organizational performance: a time series analysis of the effects of merit pay. Acad. Manage. J . 28, 261–278.

Persons, W. M. (1919). Indices of business conditions. Rev. Econ. Stat . 1, 5–107.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and R Core Team. (2014). NLME: Linear and Nonlinear Mixed Effects Models . R Package Version 3.1-117. Available online at: http://CRAN.R-project.org/package=nlme

Polgreen, P. M., Chen, Y., Pennock, D. M., and Forrest, N. D. (2008). Using internet searches for influenza surveillance. Clin. Infect. Dis . 47, 1443–1448.

Popper, K. R. (1968). The Logic of Scientific Discovery . New York, NY: Harper and Row.

R Development Core Team. (2011). R: A Language and Environment for Statistical Computing . Vienna: R Foundation for Statistical Computing. Available online at: http://www.R-project.org/

Rothman, P. (ed.). (1999). Nonlinear Time Series Analysis of Economic and Financial Data . Dordrecht: Kluwer Academic Publishers.

Said, S. E., and Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika 71, 599–607.

Sharpe, D. (2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychol. Methods 18, 572–582. doi: 10.1037/a0034177

Shmueli, G. (2010). To explain or to predict? Stat. Sci . 25, 289–310. doi: 10.1214/10-STS330

Shumway, R. H., and Stoffer, D. S. (2006). Time Series Analysis and Its Applications with R Examples, 2nd Edn . New York, NY: Springer.

Stanton, J. M., and Rogelberg, S. G. (2001). Using Internet/Intranet web pages to collect psychological research data. Organ. Res. Methods 4, 200–217. doi: 10.1177/109442810143002

Tobias, R. (2009). Changing behavior by memory aids: a social–psychological model of prospective memory and habit development tested with dynamic field data. Psychol. Rev . 116, 408-438. doi: 10.1037/a0015512

Trapletti, A., and Hornik, K. (2013). Tseries: Time Series Analysis and Computational Finance . R Package Version 0.10-32.

United States Department of Labor, Bureau of Labor Statistics. (2014). Labor Force Statistics from the Current Population Survey [Data Set]. Available online at: http://data.bls.gov/timeseries/LNU04000000

Wagner, A. K., Soumerai, S. B., Zhang, F., and Ross-Degnan, D. (2002). Segmented regression analysis of interrupted time series studies in medication use research. J. Clin. Pharm. Ther . 27, 299–309. doi: 10.1046/j.1365-2710.2002.00430.x

Wagner, J. A., Rubin, P. A., and Callahan, T. J. (1988). Incentive payment and nonmanagerial productivity: an interrupted time series analysis of magnitude and trend. Organ. Behav. Hum. Decis. Process . 42, 47–74.

Warner, R. M. (1998). Spectral Analysis of Time-Series Data . New York, NY: Guilford Press.

Wei, W. S. (2006). Time Series Analysis: Univariate and Multivariate Methods, 2nd Edn . London: Pearson.

Weiss, H. M., and Cropanzano, R. (1996). "Affective events theory: a theoretical discussion of the structure, causes and consequences of affective experiences at work," in Research in Organizational Behavior: An Annual Series of Analytical Essays and Critical Reviews , eds B. M. Staw and L. L. Cummings (Greenwich, CT: JAI Press), 1–74.

West, S. G., and Hepworth, J. T. (1991). Statistical issues in the study of temporal data: daily experience. J. Pers . 59, 609–662.

Wood, P., and Brown, D. (1994). The study of intraindividual differences by means of dynamic factor models: rationale, implementation, and interpretation. Psychol. Bull . 116, 166–186.

Zaheer, S., Albert, S., and Zaheer, A. (1999). Time-scale and psychological theory. Acad. Manage. Rev . 24, 725–741.

Zeileis, A., and Hothorn, T. (2002). Diagnostic checking in regression relationships. R News 3, 7–10.

Keywords: time series analysis, longitudinal data analysis, forecasting, regression analysis, ARIMA

Citation: Jebb AT, Tay L, Wang W and Huang Q (2015) Time series analysis for psychological research: examining and forecasting change. Front. Psychol . 6 :727. doi: 10.3389/fpsyg.2015.00727

Received: 19 March 2015; Accepted: 15 May 2015; Published: 09 June 2015.

Copyright © 2015 Jebb, Tay, Wang and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Andrew T. Jebb, Department of Psychological Sciences, Purdue University, 703 Third Street, West Lafayette, IN 47907, USA, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

## Evaluating time series forecasting models: an empirical study on performance estimation methods

- Published: 13 October 2020
- Volume 109 , pages 1997–2028, ( 2020 )

Vitor Cerqueira, Luis Torgo, and Igor Mozetič


Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. This process is a fundamental stage in any machine learning project. In this paper we study the application of these methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in this type of data. Currently, there is no consensual approach. We contribute to the literature by presenting an extensive empirical study which compares different performance estimation methods for time series forecasting tasks. These methods include variants of cross-validation, out-of-sample (holdout), and prequential approaches. Two case studies are analysed: one with 174 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that blocked cross-validation can be applied to stationary time series. However, when the time series are non-stationary, the most accurate estimates are produced by out-of-sample methods, particularly the holdout approach repeated in multiple testing periods.


## 1 Introduction

Performance estimation denotes the process of using the available data to estimate the loss that a predictive model will incur on new, yet unseen, observations. Estimating the performance of a predictive model is a fundamental stage in any machine learning project. Practitioners carry out performance estimation to select the most appropriate model and its parameters. Crucially, the process of performance estimation is one of the most reliable approaches to analysing the generalisation ability of predictive models. Such analysis is important not only for selecting the best model, but also for verifying that the respective model solves the underlying predictive task.

Choosing an appropriate performance estimation method usually depends on the characteristics of the data set. When observations are independent and identically distributed (i.i.d.), cross-validation is one of the most widely used approaches (Geisser 1975 ). One of the reasons for its popularity is its efficient use of data (Arlot and Celisse 2010 ). However, many data sets in real-world applications are not i.i.d.; time series are a prominent example. Time series forecasting is an important machine learning problem with high practical utility in organizations across many domains of application.

When the observations in a data set are not i.i.d., the standard cross-validation approach is not directly applicable. Cross-validation breaks the temporal order of time series observations, which may lead to unrealistic estimates of predictive performance. In effect, when dealing with this type of data, practitioners typically apply an out-of-sample (also known as holdout) approach to estimate the performance of predictive models. Essentially, the predictive model is built on the initial part of the data, and the subsequent observations are used for testing. Notwithstanding, there are particular scenarios in which cross-validation may be beneficial, for example, when the time series is stationary, or when the sample size is small and data efficiency becomes important (Bergmeir et al. 2018 ).

Several approaches have been developed in recent decades to estimate the performance of forecasting models. However, there is no consensus on which is most appropriate. In this context, we contribute to the literature by carrying out an extensive empirical study comparing several approaches that are often used in practice.

We compare a set of estimation methods which can be broadly split into three categories: out-of-sample (OOS), prequential, and cross-validation (CVAL). OOS approaches are commonly used to estimate the performance of models when the data comprise some degree of temporal dependency. The core idea behind these approaches is to leave the last part of the data for testing. Although this type of approach does not make full use of the available data, it preserves the temporal order of observations. This aspect may be important to cope with the temporal correlation among consecutive observations, and to mimic a realistic deployment scenario. Prequential approaches (Dawid 1984 ) are also common in incremental or high-frequency data sets such as data streams (Gama et al. 2014 ). Prequential denotes an evaluation procedure in which an observation (or a set of observations) is first used for testing, and then to re-train or update the model.

CVAL approaches make more efficient use of the available data, as each observation is used both to train and to test a model over the different iterations of the procedure. This property may be beneficial in scenarios in which the sample size is small (Bergmeir et al. 2018 ). Although classical K-fold cross-validation assumes the data to be i.i.d., variants of it have been developed that mitigate this problem. In effect, some of these variants have been shown to provide better estimates of performance than OOS methods in time series tasks (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

The key factor that distinguishes OOS and prequential approaches from CVAL ones is that the former always preserve the temporal order of observations. This means that a model is never tested on data that precede its training set. The central research question we address in this paper is the following: How do estimation methods compare with each other in terms of performance estimation ability for different types of time series data? To answer it, we applied different estimation methods in two case studies: one comprises 174 real-world time series with potential non-stationarities, and the other is a stationary synthetic environment (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

The results suggest that, as Bergmeir et al. ( 2018 ) point out, cross-validation approaches can be applied to stationary time series. However, many real-world phenomena comprise complex non-stationary sources of variation. In these cases, applying holdout over several testing periods yields the most accurate estimates.

This paper is an extension of a previously published article (Cerqueira et al. 2017 ). In this work, we substantially expand the experimental setup in both the methods and the data sets used; we provide additional analyses, such as the impact of stationarity; and we offer a more in-depth and critical discussion of the results.

This paper is structured as follows. The literature on performance estimation for time series forecasting tasks is reviewed in Sect. 2 . Materials and methods are described in Sect. 3 , including the predictive task, time series data sets, performance estimation methodology, and experimental design. The results of the experiments are reported in Sect. 4 . A discussion of our results is carried out in Sect. 5 . Finally, the conclusions of our empirical study are provided in Sect. 6 .

In the interest of reproducibility, the methods and data sets are publicly available. Footnote 1

## 2 Background

In this section we provide a background to this paper. We review the typical estimation methods used in time series forecasting and explain the motivation for this study.

In general, performance estimation methods for time series forecasting tasks are designed to cope with the dependence between observations. This is typically accomplished by having a model tested on observations future to the ones used for training.

## 2.1 Out-of-sample (OOS) approaches

When using OOS performance estimation procedures, a time series is split into two parts: an initial fit period in which a model is trained, and a (temporally) subsequent testing period held out for estimating the loss of that model. This simple approach ( Holdout ) is depicted in Fig. 1 . However, within this type of procedure one can adopt different strategies regarding the train/test split point, growing or sliding window settings, and the eventual updating of the models. In order to produce a robust estimate of predictive performance, Tashman ( 2000 ) recommends employing these strategies in multiple test periods. One might create different sub-samples according to, for example, business cycles (Fildes 1989 ). For a more general setting one can also adopt a randomized approach. This is similar to random sub-sampling (or repeated holdout) in the sense that it consists of repeating a learning-plus-testing cycle several times using different, but possibly overlapping, data samples ( Rep-Holdout ). This idea is illustrated in Fig. 2 , where one iteration of a repeated holdout is shown. A point a is randomly chosen from the available sampling window (constrained by the training and testing sizes) of a time series Y. This point then marks the end of the training set and the start of the testing set.

Simple out-of-sample procedure: an initial part of the available observations are used for fitting a predictive model. The last part of the data is held out, where the predictive model is tested

Example of one iteration of the repeated holdout procedure. A point a is chosen from the available window. Then, a previous part of observations are used for training, while a subsequent part of observations are used for testing
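To make these procedures concrete, the two split schemes can be sketched as functions that generate train/test index sets. This is a minimal illustration in Python assuming NumPy; the function names, default sizes, and the 60%/10% train/test proportions are our own choices for the sketch, not part of the original procedures.

```python
import numpy as np

def holdout_split(n, test_frac=0.3):
    """Simple out-of-sample split: the first part of the series trains,
    the last part is held out for testing."""
    cut = int(n * (1 - test_frac))
    return np.arange(cut), np.arange(cut, n)

def rep_holdout_splits(n, n_reps=10, train_frac=0.6, test_frac=0.1, seed=0):
    """Repeated holdout: each repetition draws a random cut point `a` inside
    the admissible sampling window; the observations just before `a` train
    and the observations just after `a` test."""
    rng = np.random.default_rng(seed)
    tr, te = int(n * train_frac), int(n * test_frac)
    splits = []
    for _ in range(n_reps):
        a = rng.integers(tr, n - te + 1)  # cut point constrained by both sizes
        splits.append((np.arange(a - tr, a), np.arange(a, a + te)))
    return splits
```

In each repetition the model trains on the observations immediately before the random cut point and is tested on the observations immediately after it, so the temporal order of the series is always preserved.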

## 2.2 Prequential

OOS approaches are similar to prequential or interleaved-test-then-train evaluation (Dawid 1984 ). Prequential evaluation is typically used in data stream mining. The idea is that each observation is first used to test the model, and then to train or update it. This can be applied in blocks of sequential instances (Modha and Masry 1998 ). In the initial iteration, only the first two blocks are used: the first for training and the second for testing. In the next iteration, the second block is merged with the first, and the third block is used for testing. This procedure continues until all blocks are tested ( Preq-Bls ). The procedure is exemplified in the left side of Fig. 3 , in which the data is split into 5 blocks.

Variants of prequential approach applied in blocks for performance estimation. This strategy can be applied using a growing window (left, right), or a sliding window (middle). One can also introduce a gap between the training and test sets

A variant of this idea is illustrated in the middle scheme of Fig. 3 . Instead of merging the blocks after each iteration (growing window), one can forget the older blocks in a sliding-window fashion ( Preq-Sld-Bls ). This idea is typically adopted when past data become outdated, which is common in non-stationary environments. Another variant of the prequential approach is represented in the right side of Fig. 3 . This illustrates a prequential approach applied in blocks, where a gap block is introduced ( Preq-Bls-Gap ). The rationale behind this idea is to increase the independence between training and test sets.

The prequential approaches can be regarded as variations of the holdout procedure applied in multiple testing periods ( Rep-Holdout ). The core distinction is that the train and test prequential splits are not randomized, but pre-determined by the number of blocks or repetitions.
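The three block-based prequential variants differ only in how the training window is formed, which a single helper function can make explicit. The following is a minimal sketch in Python (NumPy assumed); the function name and defaults are ours.

```python
import numpy as np

def prequential_block_splits(n, n_blocks=5, window="growing", gap=0):
    """Prequential evaluation in blocks over n observations.
    window="growing": all blocks before the test block train (Preq-Bls).
    window="sliding": only the block just before the (optional) gap trains
    (Preq-Sld-Bls). gap=1 skips one block between train and test
    (Preq-Bls-Gap)."""
    blocks = np.array_split(np.arange(n), n_blocks)
    splits = []
    for i in range(1 + gap, n_blocks):  # every reachable block is tested once
        if window == "growing":
            train = np.concatenate(blocks[:i - gap])
        else:  # sliding window keeps only the most recent block
            train = blocks[i - gap - 1]
        splits.append((train, blocks[i]))
    return splits
```

With five blocks and a growing window this reproduces the scheme on the left of Fig. 3: the first iteration trains on block 1 and tests on block 2, and the final iteration trains on blocks 1–4 and tests on block 5.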

## 2.3 Cross-validation approaches

The typical approach when using K-fold cross-validation is to randomly shuffle the data and split it into K equally-sized folds or blocks. Each fold is a subset of the data comprising t / K randomly assigned observations, where t is the number of observations. After splitting the data into K folds, each fold is iteratively picked for testing: a model is trained on the other K-1 folds and its loss is estimated on the left-out fold ( CV ). In fact, the initial random shuffle of observations before splitting into different blocks is not intrinsic to cross-validation (Geisser 1975 ). Notwithstanding, random shuffling is common practice among data science professionals. This approach to cross-validation is illustrated in the left side of Fig. 4 .

Variants of cross-validation estimation procedures

## 2.3.1 Variants designed for time-dependent data

Some variants of K-fold cross-validation have been proposed that are specially designed for dependent data, such as time series (Arlot and Celisse 2010 ). Theoretical problems arise when applying the standard technique directly to this type of data: the dependency among observations is not taken into account, since cross-validation assumes the observations to be i.i.d. This might lead to overly optimistic estimates and, consequently, poor generalisation of predictive models to new observations. For example, prior work has shown that cross-validation yields poor estimates for the task of choosing the bandwidth of a kernel estimator in correlated data (Hart and Wehrly 1986 ). To overcome this issue and approximate independence between the training and test sets, several methods have been proposed as variants of this procedure. We will focus on variants designed to cope with temporal dependency among observations.

The Blocked Cross-Validation (Snijders 1988 ) ( CV-Bl ) procedure is similar to the standard form described above. The difference is that there is no initial random shuffling of observations. In time series, this renders K blocks of contiguous observations. The natural order of observations is kept within each block, but broken across them. This approach to cross-validation is also illustrated in the left side of Fig. 4 . Since the random shuffle of observations is not being illustrated, the figure for CV-Bl is identical to the one shown for CV .
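In code, CV-Bl amounts to standard K-fold cross-validation without the initial shuffle, so each fold is a contiguous run of observations. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def blocked_cv_splits(n, k=5):
    """Blocked K-fold CV (CV-Bl): K contiguous folds, no initial shuffle.
    Each fold is held out once; the remaining folds form the training set."""
    folds = np.array_split(np.arange(n), k)
    splits = []
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        splits.append((train, folds[i]))
    return splits
```

Equivalently, scikit-learn's `KFold(n_splits=k, shuffle=False)` produces the same contiguous folds.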

The Modified CV procedure (McQuarrie and Tsai 1998 ) ( CV-Mod ) works by removing observations from the training set that are correlated with the test set. The data is initially randomly shuffled and split into K equally-sized folds, as in K-fold cross-validation. Afterwards, observations from the training set within a certain temporal range of the observations of the test set are removed. This ensures independence between the training and test sets. However, when a significant number of observations is removed from training, this may lead to model under-fitting. This approach is also described as non-dependent cross-validation (Bergmeir and Benítez 2012 ). The graph in the middle of Fig. 4 illustrates this approach.
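To make the removal step concrete, the following Python sketch (the function name and the symmetric removal window of p steps are our own simplifications; implementations may define the temporal range differently) computes the CV-Mod training indices for one fold:

```python
import numpy as np

def modified_cv_train(t, test_idx, p):
    """CV-Mod training indices for one fold: exclude the test observations
    and any observation within p time steps of a test observation."""
    excluded = {i + d for i in test_idx for d in range(-p, p + 1)}
    return np.array([i for i in range(t) if i not in excluded])
```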

The hv-Blocked Cross-Validation ( CV-hvBl ) proposed by Racine ( 2000 ) extends blocked cross-validation to further increase the independence among observations. Specifically, besides blocking the observations in each fold (i.e., no initial random shuffling of observations), it also removes adjacent observations between the training and test sets. Effectively, this creates a gap between both sets. This idea is depicted in the right side of Fig. 4 .
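The gap mechanism can be sketched as follows (our own illustration; the gap size h is a free parameter):

```python
import numpy as np

def hv_blocked_splits(t, K, h):
    """CV-hvBl: contiguous test blocks, with a gap of h observations
    removed from the training set on each side of the test block."""
    splits = []
    for block in np.array_split(np.arange(t), K):
        lo, hi = block[0] - h, block[-1] + h
        train = np.array([i for i in range(t) if i < lo or i > hi])
        splits.append((train, block))
    return splits
```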

## 2.3.2 Usefulness of cross-validation approaches

Recently there has been some work on the usefulness of cross-validation procedures for time series forecasting tasks. Bergmeir and Benítez ( 2012 ) present a comparative study of estimation procedures using stationary time series. Their empirical results show evidence that, in such conditions, cross-validation procedures yield more accurate estimates than an OOS approach. Despite the theoretical issue of applying standard cross-validation, they found no practical problem in their experiments. Nevertheless, blocked cross-validation is recommended for performance estimation using stationary time series.

Bergmeir et al. ( 2014 ) extended their previous work for directional time series forecasting tasks. These tasks are related to predicting the direction (upward or downward) of the observable. The results from their experiments suggest that the hv-Blocked CV procedure provides more accurate estimates than the standard out-of-sample approach. These were obtained by applying the methods on stationary time series.

Finally, Bergmeir et al. ( 2018 ) present a simulation study comparing standard cross-validation to out-of-sample evaluation. They used three data generating processes and performed 1000 Monte Carlo trials in each of them. For each trial and generating process, a stationary time series with 200 values was created. The results from the simulation suggest that cross-validation systematically yields more accurate estimates, provided that the model is correctly specified.

In a related empirical study (Mozetič et al. 2018 ), Mozetič et al. compare estimation procedures on several large time-ordered Twitter datasets. They find no significant difference between the best cross-validation and out-of-sample evaluation procedures. However, they do find that standard, randomized cross-validation is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.

Despite the results provided by these previous works, we argue that they are limited in two ways. First, the experimental procedure used is biased towards cross-validation approaches. While these produce several error estimates (one per fold), the OOS approach is evaluated in a one-shot estimation, where the last part of the time series is withheld for testing. OOS methods can be applied in several windows for more robust estimates, as recommended by Tashman ( 2000 ). By using a single origin, one is prone to particular issues related to that origin.

Second, the results are based on stationary time series, most of them artificial. Time series stationarity is equivalent to identical distribution in the terminology of more traditional predictive tasks. Hence, the synthetic data generation processes and especially the stationary assumption limit interesting patterns that can occur in real-world time series. Our working hypothesis is that in more realistic scenarios one is likely to find time series with complex sources of non-stationary variations.

In this context, this paper provides an extensive comparative study using a wide set of methods for evaluating the performance of univariate time series forecasting models. The analysis is carried out using a real-world scenario as well as a synthetic case study used in the works described previously (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

## 2.4 Related work on performance estimation with dependent data

The problem of performance estimation has also been under research in different scenarios. While we focus on time series forecasting problems, the following works study performance estimation methods in different predictive tasks.

## 2.4.1 Spatio-temporal dependencies

Geo-referenced time series are becoming more prevalent due to the increase of data collection from sensor networks. In these scenarios, the most appropriate estimation procedure is not obvious as spatio-temporal dependencies are at play. Oliveira et al. ( 2018 ) presented an extensive empirical study of performance estimation for forecasting problems with spatio-temporal time series. The results reported by the authors suggest that both cross-validation and out-of-sample methods are applicable in these scenarios. Like previous work in time-dependent domains (Bergmeir and Benítez 2012 ; Mozetič et al. 2018 ), Oliveira et al. suggest the use of blocking when using a cross-validation estimation procedure.

## 2.4.2 Data streams mining

Data streams mining is concerned with predictive models that evolve continuously over time in response to concept drift (Gama et al. 2014 ). Gama et al. ( 2013 ) provide a thorough overview of the evaluation of predictive models for data streams mining. The authors defend the usage of the prequential estimator with a forgetting mechanism, such as a fading factor or a sliding window.

This work is related to ours in the sense that it deals with performance estimation using time-dependent data. Notwithstanding, the paradigm of data streams mining is in line with sequential analysis (Wald 1973 ). As such, the assumption is that the sample size is not fixed in advance, and predictive models are evaluated as observations are collected. In our setting, given a time series data set, we want to estimate the loss that a predictive model will incur on unseen observations beyond that data set.

## 3 Materials and methods

In this section we present the materials and methods used in this work. First, we define the prediction task. Second, the time series data sets are described. We then formalize the methodology employed for performance estimation. Finally, we overview the experimental design.

## 3.1 Predictive task definition

A time series is a temporal sequence of values \(Y = \{y_1, y_2, \dots , y_t\}\) , where \(y_i\) is the value of Y at time i and t is the length of Y . We remark that we use the term time series assuming that Y is a numeric variable, i.e., \(y_i \in \mathbb {R}, \forall\) \(y_i \in Y\) .

Time series forecasting denotes the task of predicting the next value of the time series, \(y_{t+1}\) , given the previous observations of Y . We focus on a purely auto-regressive modelling approach, predicting future values of time series using its past lags.

To be more precise, we use time delay embedding (Takens 1981 ) to represent Y in a Euclidean space with embedding dimension p . Effectively, we construct a set of observations based on the past p lags of the time series. Each observation is composed of a feature vector \(x_i \in \mathbb {X} \subset \mathbb {R}^p\) , which denotes the previous p values, and a target value \(y_i \in \mathbb {Y} \subset \mathbb {R}\) , which represents the value we want to predict. The objective is to construct a model \(f : \mathbb {X} \rightarrow \mathbb {Y}\) , where f denotes the regression function.

Summarizing, we generate the following matrix:

\[
\begin{bmatrix}
y_1 & y_2 & \dots & y_p & y_{p+1} \\
y_2 & y_3 & \dots & y_{p+1} & y_{p+2} \\
\vdots & & & & \vdots \\
y_{t-p} & y_{t-p+1} & \dots & y_{t-1} & y_t
\end{bmatrix}
\]

Taking the first row of the matrix as an example, the target value is \(y_{p+1}\) , while the attributes (predictors) are the previous p values \(\{y_1, \dots , y_{p}\}\) .
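This construction can be sketched in a few lines of Python (an illustrative implementation of time delay embedding, not the code used in the paper):

```python
import numpy as np

def time_delay_embedding(y, p):
    """Return the feature matrix X (p past lags per row) and the target
    vector (the value immediately following each lag window)."""
    y = np.asarray(y)
    windows = np.vstack([y[i:i + p + 1] for i in range(len(y) - p)])
    return windows[:, :p], windows[:, p]
```

For a series of length t, this yields t - p (feature, target) pairs, matching the description above.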

## 3.2 Time series data

Two different case studies are used to analyse the performance estimation methods: a scenario comprised of real-world time series and a synthetic setting used in prior works (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ) for addressing the issue of performance estimation for time series forecasting tasks.

## 3.2.1 Real-world time series

Regarding real-world time series, we use a set of time series from the benchmark database tsdl (Hyndman and Yang 2019 ). From this database, we selected all the univariate time series with at least 500 observations and no missing values. This query returned 149 time series. We also included 25 time series used in previous work by Cerqueira et al. ( 2019 ): from the set of 62 time series used by the authors, we selected those with at least 500 observations that were not originally from the tsdl database (the latter are already retrieved as described above). In summary, our database of real-world time series comprises 174 time series. The threshold of 500 observations is imposed so that learning algorithms have enough data to build a good predictive model. The 174 time series represent phenomena from different domains of application, including finance, physics, economics, energy, and meteorology, and cover distinct sampling frequencies, such as hourly or daily. In terms of sample size, the distribution ranges from 506 to 23741 observations; however, we truncated the time series to a maximum of 4000 observations to speed up computations. The database is available online (c.f. footnote 1). We refer to the sources (Hyndman and Yang 2019 ; Cerqueira et al. 2019 ) for further information on these time series.

Stationarity

We analysed the stationarity of the time series comprising the real-world case study. Essentially, a time series is said to be stationary if its characteristics do not depend on the time that the data is observed (Hyndman and Athanasopoulos 2018 ). In this work we consider stationarity of order 2. This means that a time series is considered stationary if it has constant mean, constant variance, and an auto-covariance that does not depend on time. Henceforth we will refer to a time series as stationary if it is stationary of order 2.

In order to test if a given time series is stationary, we follow the wavelet spectrum test described by Nason ( 2013 ). This test starts by computing an evolutionary wavelet spectral approximation. Then, for each scale of this approximation, the coefficients of the Haar wavelet are computed. Any large Haar coefficient is evidence of non-stationarity. A hypothesis test is carried out to assess if a coefficient is large enough to reject the null hypothesis of stationarity. In particular, we apply a multiple hypothesis test with a Bonferroni correction and a false discovery rate (Nason 2013 ).
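As a rough, language-agnostic illustration of what order-2 stationarity means in practice, the following Python heuristic (our own construction, explicitly NOT the wavelet spectrum test used in the paper) compares the mean and variance of contiguous segments of a series:

```python
import numpy as np

def crude_stationarity_check(y, n_splits=4, tol=0.5):
    """Heuristic only: flag a series as (order-2) stationary when the
    means and variances of contiguous segments stay close, relative to
    the overall spread. This is NOT the wavelet spectrum test."""
    y = np.asarray(y, dtype=float)
    segments = np.array_split(y, n_splits)
    means = np.array([s.mean() for s in segments])
    varis = np.array([s.var() for s in segments])
    mean_spread = np.ptp(means) / (y.std() + 1e-12)
    var_spread = np.ptp(varis) / (y.var() + 1e-12)
    return bool(mean_spread < tol and var_spread < tol)
```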

Application of the wavelet spectrum test to a non-stationary time series. Each red horizontal arrow denotes a non-stationarity identified by the test

Figure 5 shows an example of the application of the wavelet spectrum test to a non-stationary time series. In the graphic, each red horizontal arrow denotes a non-stationarity found by the test. The left-hand axis denotes the scale of the time series. The right-hand axis represents the scale of the wavelet periodogram, where the non-stationarities are found. Finally, the lengths of the arrows denote the scale of the Haar wavelet coefficient whose null hypothesis was rejected. For a thorough description of this method we refer to the work by Nason ( 2013 ). Out of the 174 time series used in this work, 97 are stationary, while the remaining 77 are non-stationary.

## 3.2.2 Synthetic time series

We use three synthetic use cases defined in previous works by Bergmeir et al. ( 2014 , 2018 ). The data generating processes are all stationary and are designed as follows:

A stable auto-regressive process with lag 3, i.e., the next value of the time series is dependent on the past 3 observations;

An invertible moving average process with lag 1;

A seasonal auto-regressive process with lag 12 and seasonal lag 1.

For the first two cases, S1 and S2, real-valued roots of the characteristic polynomial are sampled from the uniform distribution on \([-r,-1.1]\cup [1.1,r]\) , where r is set to 5 (Bergmeir and Benítez 2012 ). Afterwards, the roots are used to estimate the models and create the time series. The data is then processed by making all values positive, which is accomplished by subtracting the minimum value and adding 1. The third case, S3, is created by fitting a seasonal auto-regressive model to a time series of monthly total accidental deaths in the USA (Brockwell and Davis 2013 ). For a complete description of the data generating processes we refer to the works by Bergmeir and Benítez ( 2012 ) and Bergmeir et al. ( 2018 ). Similarly to Bergmeir et al., for each use case we performed 1000 Monte Carlo simulations. In each repetition a time series with 200 values was generated.
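As an illustration of the kind of generating process used for S1, the sketch below simulates a stable AR(3) series and applies the same positivity shift; the coefficients here are fixed illustrative values rather than being derived from sampled roots of the characteristic polynomial:

```python
import numpy as np

def gen_ar3(n, phi=(0.4, -0.3, 0.2), burn_in=100, seed=0):
    """Simulate an AR(3) process (the chosen coefficients satisfy
    sum(|phi|) < 1, a sufficient condition for stability), then make all
    values positive by subtracting the minimum and adding 1."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=n + burn_in)
    y = np.zeros(n + burn_in)
    for i in range(3, n + burn_in):
        y[i] = phi[0] * y[i-1] + phi[1] * y[i-2] + phi[2] * y[i-3] + e[i]
    y = y[burn_in:]          # discard burn-in to remove initialisation effects
    return y - y.min() + 1.0  # positivity shift, as in S1/S2
```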

## 3.3 Performance estimation methodology

Performance estimation addresses the issue of estimating the predictive performance of a model. Frequently, the objective is to compare different solutions for a predictive task, including selecting among different learning algorithms and tuning the hyper-parameters of a particular one.

Training a learning model and evaluating its predictive ability on the same data has been shown to produce biased results due to overfitting (Arlot and Celisse 2010 ). Consequently, several methods for performance estimation have been proposed in the literature, which use new data to estimate the performance of models. Usually, new data is simulated by splitting the available data: part of the data is used for training the learning algorithm, and the remaining data is used to test and estimate the performance of the model.

For many predictive tasks the most widely used of these methods is K-fold cross-validation (Stone 1974 ) (c.f. Sect. 2 for a description). The main advantages of this method are its universal splitting criterion and efficient use of all the data. However, cross-validation is based on the assumption that observations in the underlying data are independent. When this assumption is violated, for example in time series data, theoretical problems arise that prevent the proper use of this method in such scenarios. As described in Sect. 2 , several methods have been developed to cope with this issue, from out-of-sample approaches (Tashman 2000 ) to variants of standard cross-validation, e.g., blocked cross-validation (Snijders 1988 ).

Our goal in this paper is to compare a wide set of estimation procedures and test their suitability for different types of time series forecasting tasks. In order to emulate a realistic scenario, we split each time series into two parts. The first part is used to estimate the loss that a given learning model will incur on unseen future observations. This part is further split into training and test sets as described before. The second part is used to compute the true loss that the model incurred. This strategy allows the computation of unbiased estimates of error since the model is always tested on unseen observations.

The workflow described above is summarised in Fig. 6 . A time series Y is split into an estimation set \(Y^{est}\) and a subsequent validation set \(Y^{val}\) . First, \(Y^{est}\) is used to calculate \(\hat{g}\) , which represents the estimate of the loss that a predictive model m will incur on future new observations. This is accomplished by further splitting \(Y^{est}\) into training and test sets according to the respective estimation procedure \(g_i\) , \(i \in \{1,\dots ,u\}\) . The model m is built on the training set and \(\hat{g}_i\) is computed on the test set.

Second, in order to evaluate the estimates \(\hat{g}_i\) produced by the methods \(g_i\) , \(i \in \{1,\dots ,u\}\) , the model m is re-trained using the complete set \(Y^{est}\) and tested on the validation set \(Y^{val}\) . Effectively, we obtain \(L^m\) , the ground truth loss that m incurs on new data.

In summary, the goal of an estimation method \(g_i\) is to approximate \(L^m\) by \(\hat{g}_i\) as well as possible. In Sect. 3.4.3 we describe how to quantify this approximation.
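The workflow can be expressed compactly as follows (a Python sketch under our own simplifications: the series is represented by its raw values, and `fit`, `loss` and the estimation procedure `g_hat_fn` are placeholder callables):

```python
import numpy as np

def holdout_estimate(y, fit, loss, train_frac=0.7):
    """A simple holdout used as the estimation procedure g."""
    cut = int(len(y) * train_frac)
    return loss(fit(y[:cut]), y[cut:])

def evaluate_estimator(y, fit, loss, g_hat_fn, est_frac=0.7):
    """Split Y into estimation and validation sets; compute the estimate
    g_hat within the estimation set, and the ground-truth loss L_m by
    retraining on the whole estimation set and testing on validation."""
    cut = int(len(y) * est_frac)
    y_est, y_val = y[:cut], y[cut:]
    g_hat = g_hat_fn(y_est, fit, loss)   # estimate from the estimation set
    L_m = loss(fit(y_est), y_val)        # true loss on unseen data
    return g_hat, L_m, abs(g_hat - L_m)  # the last value is the APAE
```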

Experimental comparison procedure (Cerqueira et al. 2017 ): a time series is split into an estimation set \(Y^{est}\) and a subsequent validation set \(Y^{val}\) . The first is used to estimate the error \(\hat{g}\) that the model m will incur on unseen data, using u different estimation methods. The second is used to compute the actual error \(L^m\) incurred by m . The objective is to approximate \(L^m\) by \(\hat{g}\) as well as possible

## 3.4 Experimental design

The experimental design was devised to address the following research question: How do the predictive performance estimates of cross-validation methods compare to the estimates of out-of-sample approaches for time series forecasting tasks?

Existing empirical evidence suggests that cross-validation methods provide more accurate estimations than traditionally used OOS approaches in stationary time series forecasting (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ) (see Sect. 2 ). However, many real-world time series comprise complex structures. These include cues from the future that may not have been revealed in the past. In such cases, preserving the temporal order of observations when estimating the predictive ability of models may be an important component.

## 3.4.1 Embedding dimension and estimation set size

We estimate the optimal embedding dimension ( p ) (the value which minimises generalisation error) using the method of False Nearest Neighbours (Kennel et al. 1992 ). This method analyses the behaviour of the nearest neighbours as p increases. According to Kennel et al. ( 1992 ), with a low, sub-optimal p many of the nearest neighbours will be false; as p increases towards an optimal embedding dimension, those false neighbours disappear. We set the tolerance of false nearest neighbours to 1%. For the synthetic case study, we fixed the embedding dimension to 5 in order to follow the experimental setup of Bergmeir et al. ( 2018 ).

The estimation set ( \(Y^{est}\) ) in each time series is the first 70% observations of the time series – see Fig. 6 . The validation period is comprised of the subsequent 30% observations ( \(Y^{val}\) ). These values are typically used when partitioning data sets for performance estimation.

## 3.4.2 Estimation methods

In the experiments we apply a total of 11 performance estimation methods, which are divided into cross-validation, out-of-sample, and prequential approaches. The cross-validation methods are the following:

CV Standard, randomized K-fold cross-validation;

CV-Bl Blocked K-fold cross-validation;

CV-Mod Modified K-fold cross-validation;

CV-hvBl hv-Blocked K-fold cross-validation;

The out-of-sample approaches are the following:

Holdout A simple OOS approach–the first 70% of \(Y^{est}\) is used for training and the subsequent 30% is used for testing;

Rep-Holdout OOS applied in nreps testing periods via a Monte Carlo simulation. In each repetition, a random point is picked from within a window comprising 70% of the total observations t of the time series. The window comprising 60% of t before that point is used for training and the following window of 10% of t is used for testing.

Finally, we include the following prequential approaches:

Preq-Bls Prequential evaluation in blocks in a growing fashion;

Preq-Sld-Bls Prequential evaluation in blocks in a sliding fashion–the oldest block of data is discarded after each iteration;

Preq-Bls-Gap Prequential evaluation in blocks in a growing fashion with a gap block–this is similar to the method above, but comprises a block separating the training and testing blocks in order to increase the independence between the two parts of the data;

Preq-Grow and Preq-Slide As baselines we also include the exhaustive prequential methods in which an observation is first used to test the predictive model and then to train it. We use both a growing/landmark window ( Preq-Grow ) and a sliding window ( Preq-Slide ).

We refer to Sect. 2 for a complete description of the methods. The number of folds K or repetitions nreps in these methods is 10, which is a commonly used setting in the literature. The number of observations removed in CV-Mod and CV-hvBl (c.f. Sect. 2 ) is the embedding dimension p in each time series.
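Among these methods, Rep-Holdout is perhaps the least standard; the following Python sketch (our own reading of the procedure, with `fit` and `loss` as placeholder callables) illustrates it:

```python
import numpy as np

def rep_holdout(y, fit, loss, nreps=10, train_frac=0.6, test_frac=0.1, seed=0):
    """Repeated holdout: draw nreps random origins such that a training
    window of train_frac*t before the origin and a testing window of
    test_frac*t after it fit inside the series; average the losses."""
    rng = np.random.default_rng(seed)
    t = len(y)
    tr, te = int(train_frac * t), int(test_frac * t)
    errors = []
    for _ in range(nreps):
        a = int(rng.integers(tr, t - te + 1))   # random origin
        model = fit(y[a - tr:a])
        errors.append(loss(model, y[a:a + te]))
    return float(np.mean(errors))
```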

## 3.4.3 Evaluation metrics

Our goal is to study which estimation method provides a \(\hat{g}\) that best approximates \(L^m\) . Let \(\hat{g}^m_i\) denote the loss estimated for the learning model m using the estimation method \(g_i\) on the estimation set, and \(L^m\) denote the ground truth loss of learning model m on the test set. The objective is to analyze how well \(\hat{g}^m_i\) approximates \(L^m\) . This is quantified by the absolute predictive accuracy error (APAE) metric and the predictive accuracy error (PAE) (Bergmeir et al. 2018 ):

\[ \text{APAE} = |\hat{g}^m_i - L^m| \qquad \text{PAE} = \hat{g}^m_i - L^m \]

The APAE metric evaluates the error size of a given estimation method. On the other hand, PAE measures the error bias, i.e., whether a given estimation method is under-estimating or over-estimating the true error.

Another question regarding evaluation is how a given learning model is evaluated regarding its forecasting accuracy. In this work we evaluate models according to root mean squared error (RMSE). This metric is traditionally used for measuring the differences between the estimated values and the actual values.

## 3.4.4 Learning algorithms

We applied the following learning algorithms:

RBR A rule-based regression algorithm from the Cubist R package (Kuhn et al. 2014 ), which is a variant of the model tree by Quinlan ( 1993 ). The main parameter, the number of committees (c.f. Kuhn et al. 2014 ), was set to 5;

RF A Random Forest algorithm, which is an ensemble of decision trees (Breiman 2001 ). We resort to the implementation from the ranger R package (Wright 2015 ). The number of trees in the ensemble was set to 100;

GLM A generalized linear model (McCullagh 2019 ) regression with a Gaussian distribution and a ridge penalty. We used the implementation of the glmnet R package (Friedman et al. 2010 ).

The remaining parameters of each method were set to defaults. These three learning algorithms are widely used in regression tasks. For time series forecasting in particular, Cerqueira et al. ( 2019 ) showed their usefulness when applied as part of a dynamic heterogeneous ensemble, with the RBR method showing the best performance among 50 approaches.

## 4 Empirical experiments

## 4.1 Results with synthetic case study

In this section we start by analysing the average rank, and respective standard deviation, of each estimation method for each synthetic scenario (S1, S2, and S3), according to the APAE metric. For example, a rank of 1 in a given Monte Carlo repetition means that the respective method was the best estimator in that repetition. These analyses are reported in Figs. 7 , 8 , 9 . This initial experiment is devised to reproduce the results by Bergmeir et al. ( 2018 ). Later, we will analyse how these results compare when using real-world time series.

The results shown by the average ranks corroborate those presented by Bergmeir and Benítez ( 2012 ), Bergmeir et al. ( 2014 ), Bergmeir et al. ( 2018 ). Cross-validation approaches, blocked ones in particular, perform better (i.e., show a lower average rank) relative to the simple out-of-sample procedure Holdout . This can be concluded from all three scenarios: S1, S2, and S3.

Average rank and respective standard deviation of each estimation method in case study S1 using the RBR learning method

Focusing on scenario S1, the estimation method with the best average rank is Preq-Bls-Gap , followed by the other two prequential variants ( Preq-Sld-Bls , and Preq-Bls ). Although the Holdout procedure is clearly a relatively poor estimator, the repeated holdout in multiple testing periods ( Rep-Holdout ) shows a better average rank than the standard cross validation approach ( CV ). Among cross validation procedures, CV-hvBl presents the best average rank.

Scenario S2 shows a seemingly different story relative to S1. In this problem, the blocked cross validation procedures show the best estimation ability. Among all, CV-hvBl shows the best average rank.

Average rank and respective standard deviation of each estimation method in case study S2 using the RBR learning method

Regarding the scenario S3, the outcome is less clear than the previous two scenarios. The methods show a closer average rank among them, with large standard deviations. Preq-Sld-Bls shows the best estimation ability, followed by the two blocked cross-validation approaches, CV-Bl and CV-hvBl .

Average rank and respective standard deviation of each estimation method in case study S3 using the RBR learning method

In summary, this first experiment corroborates the experiment carried out by Bergmeir et al. ( 2018 ). Notwithstanding, other methods that the authors did not test show an interesting estimation ability in these particular scenarios, namely the prequential variants. In this section, we focused on the RBR learning algorithm. In Sect. 1 of the appendix, we include the results for the GLM and RF learning methods. Overall, the results are similar across the learning algorithms.

The synthetic scenarios comprise time series that are stationary. However, real-world time series often comprise complex dynamics that break stationarity, and when choosing a performance estimation method one should take this issue into consideration. To account for it, in the next section we analyze the estimation methods using real-world time series.

## 4.2 Results with real-world data

In this section we analyze the performance estimation ability of each method using a case study which includes 174 real-world time series from different domains.

Rank distribution of each estimation method across the 174 real-world time series

First, in Fig. 10 , we show the rank distribution of each performance estimation method across the 174 time series. The figure shows a large dispersion of ranks across the methods. This outcome indicates that no single performance estimation method is the most appropriate for all time series. Crucially, this result also motivates the study of which time series characteristics (e.g. stationarity) influence which method is more adequate for a given task.

## 4.2.1 Stationary time series

In Fig. 11 , we start by analyzing the average rank, and respective standard deviation, of each estimation method using the APAE metric. We focus on the 97 stationary time series in the database.

Similarly to the synthetic case study, the blocked cross-validation approaches CV-Bl and CV-hvBl show a good estimation ability in terms of average rank. Conversely, the other cross-validation approaches are the worst estimators. This outcome highlights the importance of blocking when using cross-validation. Rep-Holdout is the best estimator among the OOS approaches, while Preq-Bls shows the best score among the prequential methods.

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the RBR learning method

We also study the statistical significance of the obtained results in terms of error size (APAE) according to a Bayesian analysis (Benavoli et al. 2017 ). Particularly, we applied the Bayes signed-rank test to compare pairs of methods across multiple problems. We arbitrarily define the region of practical equivalence (Benavoli et al. 2017 ) (ROPE) to be the interval [-2.5%, 2.5%] in terms of APAE. Essentially, this means that two methods show indistinguishable performance if the difference in performance between them falls within this interval. For a thorough description of the Bayesian analysis for comparing predictive models we refer to the work by Benavoli et al. ( 2017 ). In this analysis, it is necessary to use a scale invariant measure of performance. Therefore, we transform the metric APAE into the percentage difference of APAE relative to a baseline. In this experiment we fix the method Rep-Holdout as the baseline.

According to the illustration in Fig. 12 , the probability of Rep-Holdout winning (i.e., showing a significantly better estimation ability) is generally larger than the opposite. The exception is when it is compared with the blocked cross-validation approaches CV-Bl and CV-hvBl .

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RBR learning method

For stationary time series, the blocked cross-validation approach CV-Bl seems to be the best estimation method among those analysed. The average rank analysis suggests that, on average, the relative position of this method is better than the other approaches. A more rigorous statistical analysis (using the Bayes signed-rank method) suggests that both CV-Bl and CV-hvBl are significantly better choices relative to Rep-Holdout .

## 4.2.2 Non-stationary time series

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the RBR learning method

In Fig. 13 we present a similar analysis for the 77 non-stationary time series, whose results are considerably different from those for stationary time series. In this scenario, Rep-Holdout shows the best average rank, followed by the blocked cross-validation approaches. Again, the standard cross-validation approach CV shows the worst score.

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test using the RBR learning method

Figure 14 shows the results of the Bayes signed-rank test. This analysis suggests that Rep-Holdout is a significantly better estimator relative to the other approaches when we are dealing with non-stationary time series.

To summarise, we compared the performance of estimation methods in a large set of real-world time series and controlled for stationarity. The results suggest that, for stationary time series, the blocked cross-validation approach CV-Bl is the best option. However, when the time series are non-stationary, the OOS approach Rep-Holdout is significantly better than the others.

On top of this, the results from the experiments also suggest the following outcomes:

As Tashman ( 2000 ) pointed out, applying the holdout approach in multiple testing periods leads to better performance relative to a single partition of the data set. Specifically, Rep-Holdout shows better performance relative to Holdout in the two scenarios;

Prequential applied in blocks with a growing window ( Preq-Bls ) is the best prequential approach. Specifically, its average rank is better than that of the online approaches Preq-Slide and Preq-Grow , and better than the blocked approaches in which the window slides rather than grows ( Preq-Sld-Bls ) or in which a gap is introduced between the training and testing sets ( Preq-Bls-Gap ).

In Sect. 1 of the appendix, we show the results for the GLM and RF learning methods. The conclusions drawn from these algorithms are similar to those reported above.

## 4.2.3 Error bias

In order to study the direction of the estimation error, in Fig. 15 we present for each method the (log-scaled) percentage difference between the estimation error and the true error according to the PAE metric. In this graphic, values below the zero line denote under-estimations of error, while values above the zero line represent over-estimations. In general, the estimation methods tend to over-estimate the error (i.e. they are pessimistic estimators). The online prequential approaches Preq-Slide and Preq-Grow, and the non-blocked versions of cross-validation, CV and CV-Mod, tend to under-estimate the error (i.e. they are optimistic estimators).
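The plotted quantity can be sketched as a signed, log-scaled percentage difference between the estimated and the true loss. The exact transform applied to the PAE metric in the paper may differ; this is an assumed formulation for illustration.

```python
import math

# Signed, log-scaled percentage difference between estimated and true loss.
# Negative -> the method under-estimates the error (optimistic);
# positive -> it over-estimates the error (pessimistic).
def log_pct_diff(estimated_loss, true_loss):
    pct = 100.0 * (estimated_loss - true_loss) / true_loss
    # log1p compresses the scale while copysign preserves the direction.
    return math.copysign(math.log1p(abs(pct)), pct)

print(log_pct_diff(0.12, 0.10) > 0)  # over-estimation  -> True
print(log_pct_diff(0.08, 0.10) < 0)  # under-estimation -> True
```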

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the real-world case study, and using the RBR learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error

## 4.2.4 Impact of sample size

Preserving the temporal order of observations, albeit more realistic, comes at a cost, since less data is available for estimating predictive performance. As Bergmeir et al. (2018) argue, this may be important for small data sets, where a more efficient use of the data (e.g. CV ) may be beneficial.

Previous work on time series forecasting has shown that sample size matters when selecting the predictive model (Cerqueira et al. 2019 ). Learning algorithms with a flexible functional form, e.g. decision trees, tend to work better for larger sample sizes relative to traditional forecasting approaches, for example, exponential smoothing.

We carried out an experiment to test whether the sample size of a time series has any effect on the relative performance of the estimation methods. In this analysis, we focus on the subset of 90 time series (out of the 174) with at least 1000 observations, which we truncated to exactly 1000 observations so that all series have equal size. We then repeated the performance estimation methodology described in Sect. 3.3 for increasing sample sizes. In the first iteration, each time series comprises an estimation set of 100 observations and a validation set of the subsequent 100 observations. In each following iteration, the estimation set grows by 100 observations, and the validation set again comprises the subsequent 100 points. The process is repeated until the time series is fully processed.
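The growing-window protocol just described can be sketched as follows. This is a minimal illustration with assumed half-open index ranges, not the authors' implementation.

```python
# Enumerate the (estimation, validation) windows of the sample-size
# experiment: at iteration k the estimation set is the first 100*k
# observations and the validation set is the following 100 observations.
def growing_windows(n_obs=1000, step=100):
    k = 1
    while k * step + step <= n_obs:
        est = (0, k * step)                   # half-open [start, end)
        val = (k * step, k * step + step)
        yield est, val
        k += 1

windows = list(growing_windows())
print(len(windows))   # 9 iterations for a 1000-point series
print(windows[0])     # ((0, 100), (100, 200))
print(windows[-1])    # ((0, 900), (900, 1000))
```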

Average rank of each performance estimation method with an increasing training sample size

We measure the average rank of each estimation method across the 90 time series after each iteration according to the APAE metric. These scores are illustrated in Fig. 16. The estimation set size is shown on the x-axis, while the average rank score is on the y-axis. Overall, the relative positions of the methods remain stable as the sample size grows. This outcome suggests that sample size is not a crucial characteristic when selecting the estimation method.

We remark that this analysis is restricted to the sample sizes tested (up to 1000 observations). For large-scale data sets, the recommendation by Dietterich (1998), usually adopted in practice, is to apply a simple out-of-sample estimation procedure ( Holdout ).

## 4.2.5 Descriptive model

What makes an estimation method appropriate for a given time series is related to the characteristics of the data. For example, in the previous section we analyzed the impact of stationarity on which estimation method is best.

The real-world time series case study comprises a set of time series from different domains. In this section we present, as a descriptive analysis, a tree-based model that relates characteristics of a time series to the most appropriate estimation method for it. Basically, we create a predictive task in which the attributes are characteristics of a time series and the categorical target variable is the estimation method that best approximates the true loss on that series. We use the CART (classification and regression trees) algorithm (Breiman 2017) to obtain the model for this task. The characteristics used as predictor variables are the following summary statistics:

Trend, estimated as the ratio between the standard deviation of the time series and the standard deviation of the differenced time series;

Skewness, for measuring the symmetry of the distribution of the time series;

Kurtosis , as a measure of flatness of the distribution of the time series relative to a normal distribution;

5-th and 95-th Percentiles (Perc05, Perc95) of the standardized time series;

Inter-quartile range ( IQR ), as a measure of the spread of the standardized time series;

Serial correlation, estimated using a Box-Pierce test statistic;

Long-range dependence, using a Hurst exponent estimation with wavelet transform;

Maximum Lyapunov Exponent, as a measure of the level of chaos in the time series;

A boolean variable, indicating whether or not the respective time series is stationary according to the wavelet spectrum test (Nason 2013).
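A few of the summary statistics above can be computed as in the following hedged sketch. The paper's estimators for serial correlation, long-range dependence, chaos, and stationarity rely on specialized tests that are not reproduced here, and the resulting feature vectors would then be paired with the per-series best estimation method to train the CART classifier.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Compute a subset of the time series characteristics used as predictors.
def ts_features(y):
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()  # standardized series
    return {
        # ratio of std of the series to std of the differenced series
        "trend": y.std() / np.diff(y).std(),
        "skewness": skew(y),                        # symmetry of distribution
        "kurtosis": kurtosis(y),                    # flatness vs. normal
        "perc05": np.percentile(z, 5),              # 5th percentile
        "perc95": np.percentile(z, 95),             # 95th percentile
        "iqr": np.percentile(z, 75) - np.percentile(z, 25),  # spread
    }

rng = np.random.default_rng(1)
feats = ts_features(np.cumsum(rng.normal(size=500)))  # a random walk
print(sorted(feats))
```

For a trending series such as a random walk, the "trend" ratio is well above 1, since differencing removes most of the variance.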

These characteristics have been shown to be useful in other problems concerning time series forecasting (Wang et al. 2006 ). The features used in the final decision tree are written in boldface. The decision tree is shown in Fig. 17 . The numbers below the name of the method in each node denote the number of times the respective method is best over the number of time series covered in that node.

Decision tree that maps the characteristics of time series to the most appropriate estimation method. Graphic created using the rpart.plot framework (Milborrow 2018 )

Some of the estimation methods do not appear in the tree model. The tree leaves, each of which represents a decision, include the estimation methods CV-Bl , Rep-Holdout , Preq-Slide , and Preq-Bls-Gap .

The estimation method in the root node is CV-Bl , the method that is best most often across the 174 time series. The first split is performed on the kurtosis of the time series: if the kurtosis is not above 2, the tree leads to a leaf node with Preq-Slide as the most appropriate estimation method; otherwise, the tree continues with further tests to find the most suitable estimation method for each particular scenario.

## 5 Discussion

## 5.1 Impact of the results

In the experimental evaluation we compare several performance estimation methods in two distinct scenarios: (1) a synthetic case study in which artificial data generating processes are used to create stationary time series; and (2) a real-world case study comprising 174 time series from different domains. The synthetic case study is based on the experimental setup used in previous studies by Bergmeir et al. for the same purpose of evaluating performance estimation methods for time series forecasting tasks (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

Bergmeir et al. show in previous studies (Bergmeir and Benitez 2011; Bergmeir and Benítez 2012) that the blocked form of cross-validation, denoted here as CV-Bl , yields more accurate estimates than a simple out-of-sample evaluation ( Holdout ) for stationary time series forecasting tasks. The method CV is also suggested to be “a better choice than OOS [ Holdout ] evaluation” as long as the data are well fitted by the model (Bergmeir et al. 2018). Part of the results from our experiments corroborate these conclusions to some extent. Specifically, this is verified by the APAE incurred by the estimation procedures in the synthetic case studies, where we found that prequential variants provide good estimation ability, often better than that of the cross-validation variants. However, the results in the synthetic stationary case studies do not carry over to real-world time series. On the one hand, we corroborate the conclusion of previous work (Bergmeir and Benítez 2012) that blocked cross-validation ( CV-Bl ) is applicable to stationary time series. On the other hand, when dealing with non-stationary data sets, holdout applied with multiple randomized testing periods ( Rep-Holdout ) provides the most accurate performance estimates.

In a real-world environment we are prone to deal with time series with complex structures and different sources of non-stationary variation. These comprise nuances of the future that may not have revealed themselves in the past (Tashman 2000). Consequently, we believe that in these scenarios Rep-Holdout is a better performance estimation option than cross-validation approaches.

## 5.2 Scope of the real-world case study

In this work we center our study on univariate numeric time series. Nevertheless, we believe that the conclusions of our study are independent of this assumption and should extend to other types of time series. The objective is to predict the next value of the time series, assuming immediate feedback from the environment. Moreover, we focus on time series with a high sampling frequency, for example, hourly or daily data. The main reason is that a high sampling frequency is typically associated with more data, which is important for fitting predictive models from a machine learning point of view. Standard forecasting benchmark data are typically centered on low-sampling-frequency time series, for example the M competition data (Makridakis et al. 1982).

## 5.3 Future work

We showed that stationarity is a crucial time series property to take into account when selecting the performance estimation method. On the other hand, data sample size appears not to have a significant effect, though the analysis is restricted to time series of up to 1000 data points.

We also explored whether other time series characteristics may be relevant for performance estimation. We built a descriptive model, which partitions the best estimation method according to different time series characteristics, such as kurtosis or trend. We believe this approach may be interesting for further scientific enquiry. For example, one could leverage this type of model to automatically select the most appropriate performance estimation method. Such a model could be embedded into an automated machine learning framework.

Our conclusions were drawn from a database of 174 time series from distinct domains of application. In future work, it would be interesting to carry out a similar analysis on time series from specific domains, for example, finance. The stock market contains rich financial data which attracts a lot of attention. Studying the most appropriate approach to evaluate predictive models embedded within trading systems is an interesting research direction.

## 6 Final remarks

In this paper we analyse the ability of different methods to approximate the loss that a given predictive model will incur on unseen data. We focus on performance estimation for time series forecasting tasks. Since there is currently no settled approach for these problems, our objective is to compare the available methods and test their suitability.

We analyse several methods that can be generally split into out-of-sample approaches and cross-validation methods. These were applied to two case studies: a synthetic environment with stationary time series and a real-world scenario with potential non-stationarities.

In stationary time series, the blocked cross-validation method ( CV-Bl ) is shown to have a competitive estimation ability. However, when non-stationarities are present, the out-of-sample holdout procedure applied in multiple testing periods ( Rep-Holdout ) is a significantly better choice.

https://github.com/vcerqueira/performance_estimation.

Arlot, S., Celisse, A., et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys , 4 , 40–79.

Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through bayesian analysis. The Journal of Machine Learning Research , 18 (1), 2653–2688.

Bergmeir, C., & Benitez, J.M. (2011) Forecaster performance evaluation with cross-validation and variants. In: 2011 11th international conference on intelligent systems design and applications (ISDA), pp. 849–854. IEEE.

Bergmeir, C., & Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences , 191 , 192–213.

Bergmeir, C., Costantini, M., & Benítez, J. M. (2014). On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis , 76 , 132–143.

Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis , 120 , 70–83.

Breiman, L. (2001). Random forests. Machine Learning , 45 (1), 5–32.

Breiman, L. (2017). Classification and Regression Trees . New York: Routledge.

Brockwell, P. J., & Davis, R. A. (2013). Time series: Theory and methods . Berlin: Springer Science & Business Media.

Cerqueira, V., Torgo, L., Pinto, F., & Soares, C. (2019). Arbitrage of forecasting experts. Machine Learning , 108 (6), 913–944.

Cerqueira, V., Torgo, L., Smailović, J., & Mozetič, I. (2017) A comparative study of performance estimation methods for time series forecasting. In 2017 IEEE international conference on data science and advanced analytics (DSAA) (pp. 529–538). IEEE.

Cerqueira, V., Torgo, L., & Soares, C. (2019). Machine learning vs statistical methods for time series forecasting: Size matters. arXiv preprint arXiv:1909.13316 .

Dawid, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General) , 147 (2), 278–290.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation , 10 (7), 1895–1923.

Fildes, R. (1989). Evaluation of aggregate and individual forecast method selection rules. Management Science , 35 (9), 1056–1065.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software , 33 (1), 1–22.

Gama, J., Sebastião, R., & Rodrigues, P. P. (2013). On evaluating stream learning algorithms. Machine Learning , 90 (3), 317–346.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR) , 46 (4), 44.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American statistical Association , 70 (350), 320–328.

Hart, J. D., & Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. Journal of the American Statistical Association , 81 (396), 1080–1088.

Hyndman, R., & Yang, Y. (2019) tsdl: Time series data library. https://github.com/FinYang/tsdl .

Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.

Kennel, M. B., Brown, R., & Abarbanel, H. D. (1992). Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A , 45 (6), 3403.

Kuhn, M., Weston, S., Keefer, C., & Quinlan, R. (2014). Cubist: Rule- and instance-based regression modeling. R package version 0.0.18.

Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting , 1 (2), 111–153.

McCullagh, P. (2019). Generalized linear models . New York: Routledge.

McQuarrie, A. D., & Tsai, C. L. (1998). Regression and time series model selection . Singapore: World Scientific.

Milborrow, S. (2018). rpart.plot: Plot ’rpart’ Models: An Enhanced Version of ’plot.rpart’. https://CRAN.R-project.org/package=rpart.plot . R package version 3.0.6.

Modha, D. S., & Masry, E. (1998). Prequential and cross-validated regression estimation. Machine Learning , 33 (1), 5–39.

Mozetič, I., Torgo, L., Cerqueira, V., & Smailović, J. (2018). How to evaluate sentiment classifiers for Twitter time-ordered data? PLoS ONE , 13 (3), e0194317.

Nason, G. (2013). A test for second-order stationarity and approximate confidence intervals for localized autocovariances for locally stationary time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 75 (5), 879–904.

Oliveira, M., Torgo, L., & Costa, V.S. (2018) Evaluation procedures for forecasting with spatio-temporal data. In Joint European conference on machine learning and knowledge discovery in databases (pp. 703–718). Berlin: Springer.

Quinlan, J.R. (1993). Combining instance-based and model-based learning. In Proceedings of the tenth international conference on machine learning (pp. 236–243).

Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics , 99 (1), 39–61.

Snijders, T.A. (1988). On cross-validation for predictor evaluation in time series. In On model uncertainty and its statistical implications (pp. 56–69). Berlin: Springer.

Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika , 61 , 509–515.

Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980 (pp. 366–381). Berlin, Heidelberg: Springer. https://doi.org/10.1007/BFb0091924 .

Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: An analysis and review. International Journal of Forecasting , 16 (4), 437–450.

Wald, A. (1973). Sequential analysis . Philadelphia: Courier Corporation.

Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery , 13 (3), 335–364.

Wright, M. N. (2015). ranger: A fast implementation of random forests. R package version 0.3.0.

## Acknowledgements

The authors would like to acknowledge the valuable input of anonymous reviewers. The work of V. Cerqueira was financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020. The work of L. Torgo was undertaken, in part, thanks to funding from the Canada Research Chairs program. The work of I. Mozetič was supported by the Slovene Research Agency through research core funding no. P2-103, and by the EC REC project IMSyPP (Grant No. 875263). The results of this publication reflect only the author’s view and the Commission is not responsible for any use that may be made of the information it contains.

## Author information

Authors and affiliations.

LIAAD-INESC TEC, Porto, Portugal

Vitor Cerqueira & Luis Torgo

University of Porto, Porto, Portugal

Dalhousie University, Halifax, Canada

Jozef Stefan Institute, Ljubljana, Slovenia

Igor Mozetič

## Corresponding author

Correspondence to Vitor Cerqueira .

## Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Larisa Soldatova, Joaquin Vanschoren.

## 1.1 Synthetic time series results with GLM and RF

See Figs. 18 , 19 , 20 , 21 , 22 , 23 , 24 .

Average rank and respective standard deviation of each estimation method in case study S1, using the GLM learning method (complement to Fig. 7)

Average rank and respective standard deviation of each estimation method in case study S1, using the RF learning method (complement to Fig. 7)

Average rank and respective standard deviation of each estimation method in case study S2, using the GLM learning method (complement to Fig. 8)

Average rank and respective standard deviation of each estimation method in case study S2, using the RF learning method (complement to Fig. 8)

Average rank and respective standard deviation of each estimation method in case study S3, using the GLM learning method (complement to Fig. 9)

Average rank and respective standard deviation of each estimation method in case study S3, using the RF learning method (complement to Fig. 9)

## 1.2 Real-world time series results with GLM and RF

## 1.2.1 Stationary time series

See Figs. 24 , 25 , 26 , 27 .

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the GLM learning method (complement to Fig. 11 )

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the RF learning method (complement to Fig. 11 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the GLM learning method (complement to Fig. 12 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RF learning method (complement to Fig. 12 )

## 1.2.2 Non-stationary time series

See Figs. 28 , 29 , 30 , 31 .

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the GLM learning method (complement to Fig. 13 )

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the RF learning method (complement to Fig. 13 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the GLM learning method (complement to Fig. 14 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RF learning method (complement to Fig. 14 )

## 1.2.3 Error bias using the GLM and RF methods

See Figs. 32 , 33 .

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the RWTS case study, and using the GLM learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error (complement to Fig. 15 )

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the RWTS case study, and using the RF learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error (complement to Fig. 15 )

## About this article

Cerqueira, V., Torgo, L. & Mozetič, I. Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn 109 , 1997–2028 (2020). https://doi.org/10.1007/s10994-020-05910-7

Received : 30 May 2019

Revised : 01 June 2020

Accepted : 25 August 2020

Published : 13 October 2020

Issue Date : November 2020

DOI : https://doi.org/10.1007/s10994-020-05910-7

- Performance estimation
- Model selection
- Cross validation
- Time series
- Forecasting

Advertisement

- Find a journal
- Publish with us
- Track your research

## Time Series Analysis - Science topic

- Recruit researchers
- Join for free
- Login Email Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google Welcome back! Please log in. Email · Hint Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google No account? Sign up

## Subscribe to the PwC Newsletter

Join the community, add a new evaluation result row, time series forecasting.

478 papers with code • 71 benchmarks • 31 datasets

Time Series Forecasting is the task of fitting a model to historical, time-stamped data in order to predict future values. Traditional approaches include moving average, exponential smoothing, and ARIMA, though models as various as RNNs, Transformers, or XGBoost can also be applied. The most popular benchmark is the ETTh1 dataset. Models are typically evaluated using the Mean Square Error (MSE) or Root Mean Square Error (RMSE).

( Image credit: ThaiBinh Nguyen )

## Benchmarks Add a Result

--> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> --> -->Trend | Dataset | Best Model | Paper | Code | Compare |
---|---|---|---|---|---|

D-PAD | |||||

SegRNN | |||||

PatchMixer | |||||

PatchTST/64 | |||||

PatchMixer | |||||

AutoCon | |||||

PatchMixer | |||||

PatchMixer | |||||

AutoCon | |||||

SegRNN | |||||

SegRNN | |||||

PatchMixer | |||||

SegRNN | |||||

SegRNN | |||||

GBRT | |||||

SegRNN | |||||

PatchMixer | |||||

PatchMixer | |||||

SegRNN | |||||

SegRNN | |||||

MoLE-RMLP | |||||

PatchMixer | |||||

SCINet | |||||

SCINet | |||||

TSMixer | |||||

MoLE-RMLP | |||||

GBRT | |||||

STGCN-Cov | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

SCINet | |||||

Informer | |||||

MoLE-DLinear | |||||

SCNN | |||||

SCNN | |||||

TSMixer | |||||

MoLE-DLinear | |||||

TSMixer | |||||

TSMixer | |||||

TSMixer | |||||

TSMixer | |||||

AA-Forecast | |||||

MoLE-RLinear | |||||

MoLE-DLinear | |||||

MoLE-RLinear | |||||

MoLE-RLinear | |||||

GA-LSTM | |||||

AA-Forecast | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

MoLE-DLinear | |||||

TSMixer | |||||

TSMixer | |||||

TSMixer | |||||

TSMixer | |||||

TSMixer |

## Most implemented papers

Sequence to sequence learning with neural networks.

Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.

## Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i. e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target.

## Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks

laiguokun/multivariate-time-series-data • 21 Mar 2017

Multivariate time series forecasting is an important machine learning problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation.

## DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Probabilistic forecasting, i. e. estimating the probability distribution of a time series' future given its past, is a key enabler for optimizing business processes.

## N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

We focus on solving the univariate times series point forecasting problem using deep learning.

## Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

Spatiotemporal forecasting has various applications in neuroscience, climate and transportation domain.

## iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp.

## GluonTS: Probabilistic Time Series Models in Python

We introduce Gluon Time Series (GluonTS, available at https://gluon-ts. mxnet. io), a library for deep-learning-based time series modeling.

## Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning.

## Deep and Confident Prediction for Time Series at Uber

Reliable uncertainty estimation for time series prediction is critical in many fields, including physics, biology, and manufacturing.

## IEEE Account

- Change Username/Password
- Update Address

## Purchase Details

- Payment Options
- Order History
- View Purchased Documents

## Profile Information

- Communications Preferences
- Profession and Education
- Technical Interests
- US & Canada: +1 800 678 4333
- Worldwide: +1 732 981 0060
- Contact & Support
- About IEEE Xplore
- Accessibility
- Terms of Use
- Nondiscrimination Policy
- Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Help | Advanced Search

## Computer Science > Machine Learning

Title: transformers in time-series analysis: a tutorial.

Abstract: Transformer architecture has widespread applications, particularly in Natural Language Processing and computer vision. Recently Transformers have been employed in various aspects of time-series analysis. This tutorial provides an overview of the Transformer architecture, its applications, and a collection of examples from recent research papers in time-series analysis. We delve into an explanation of the core components of the Transformer, including the self-attention mechanism, positional encoding, multi-head, and encoder/decoder. Several enhancements to the initial, Transformer architecture are highlighted to tackle time-series tasks. The tutorial also provides best practices and techniques to overcome the challenge of effectively training Transformers for time-series analysis.

Comments: | 28 pages, 17 figures |

Subjects: | Machine Learning (cs.LG) |

Cite as: | [cs.LG] |

(or [cs.LG] for this version) | |

Focus to learn more arXiv-issued DOI via DataCite | |

: | Focus to learn more DOI(s) linking to related resources |



In time series analysis, the autocorrelation coefficient across many lags is called the autocorrelation function (ACF) and plays a significant role in model selection and evaluation (as discussed later). A plot of the ACF of the Google job search time series after seasonal adjustment is presented in the bottom panel of Figure 3. In an ACF plot, the y-axis displays the strength of the autocorrelation at each lag, and the x-axis displays the lag number.
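The lagged autocorrelations that make up an ACF can be computed directly. The sketch below uses a simulated AR(1) series as a stand-in for a real seasonally adjusted series such as the Google job search data; the data and parameter values are hypothetical.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function: correlation of a series with itself at each lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([1.0] + [np.sum(x[k:] * x[:-k]) / denom
                             for k in range(1, max_lag + 1)])

# Simulated AR(1) process with autoregressive coefficient 0.7.
rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal()

r = acf(y, max_lag=10)
print(np.round(r[:3], 2))  # r[0] is 1 by definition; r[1] estimates the lag-1 autocorrelation
```

For an AR(1) process, the theoretical ACF decays geometrically (0.7, 0.49, 0.343, ...), and the sample ACF should roughly trace that decay, which is exactly the kind of signature used for model selection.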


## 2.1 Definition of a Time Series

A time series is a sequential set of data points, typically measured over successive times. It is mathematically defined as a set of vectors x(t), t = 0, 1, 2, ..., where t represents the elapsed time [21, 23, 31]. The variable x(t) is treated as a random variable.
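As a small concrete illustration of this definition (the index and values below are hypothetical), a univariate time series can be represented as a pandas Series whose observations x(t) are bound to an explicit, ordered time index:

```python
import numpy as np
import pandas as pd

# Twelve monthly observations: each data point x(t) is paired with its time t.
idx = pd.date_range("2024-01-01", periods=12, freq="MS")  # month-start timestamps
values = 10 + 0.5 * np.arange(12)                         # a simple upward trend
y = pd.Series(values, index=idx, name="monthly_metric")
print(y.head(3))
```

Keeping the index explicit (rather than relying on array position) is what lets later operations such as lagging, differencing, and seasonal decomposition respect the temporal ordering.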


Another important characteristic of a time series is stationarity. A time series is called stationary if its statistical properties (e.g., mean, standard deviation) remain constant over time. This is highly important because a stationary series is likely to repeat its past behavior in the future, and it is therefore easier to forecast (Jain, 2016).
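One informal way to probe stationarity, sketched below with simulated data, is to compare the mean and spread of the first and second halves of a series. This is a heuristic, not a formal procedure such as the augmented Dickey-Fuller test, and the series here are hypothetical.

```python
import numpy as np

def halves_summary(x):
    """Crude stationarity check: (mean, std) of each half of the series."""
    x = np.asarray(x, dtype=float)
    first, second = np.array_split(x, 2)
    return (first.mean(), first.std()), (second.mean(), second.std())

rng = np.random.default_rng(2)
stationary = rng.normal(0.0, 1.0, 400)                        # white noise: stable mean and spread
trending = np.arange(400) / 50.0 + rng.normal(0.0, 1.0, 400)  # linear trend: the mean drifts upward

(m1, s1), (m2, s2) = halves_summary(stationary)
(n1, _), (n2, _) = halves_summary(trending)
print(round(m2 - m1, 3), round(n2 - n1, 3))  # the gap for the trending series is far larger
```

A large shift in the half-sample means (as in the trending series) is a sign the series should be detrended or differenced before fitting models that assume stationarity.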
