METHODS article

Time series analysis for psychological research: examining and forecasting change.

\r\nAndrew T. Jebb*

  • 1 Department of Psychological Sciences, Purdue University, West Lafayette, IN, USA
  • 2 Department of Psychology, University of Central Florida, Orlando, FL, USA
  • 3 Department of Statistics, Purdue University, West Lafayette, IN, USA

Psychological research has increasingly recognized the importance of integrating temporal dynamics into its theories, and innovations in longitudinal designs and analyses have allowed such theories to be formalized and tested. However, psychological researchers may be relatively unequipped to analyze such data, given its many characteristics and the general complexities involved in longitudinal modeling. The current paper introduces time series analysis to psychological research, an analytic domain that has been essential for understanding and predicting the behavior of variables across many diverse fields. First, the characteristics of time series data are discussed. Second, different time series modeling techniques are surveyed that can address various topics of interest to psychological researchers, including describing the pattern of change in a variable, modeling seasonal effects, assessing the immediate and long-term impact of a salient event, and forecasting future values. To illustrate these methods, an illustrative example based on online job search behavior is used throughout the paper, and a software tutorial in R for these analyses is provided in the Supplementary Materials.

Although time series analysis has been frequently used many disciplines, it has not been well-integrated within psychological research. In part, constraints in data collection have often limited longitudinal research to only a few time points. However, these practical limitations do not eliminate the theoretical need for understanding patterns of change over long periods of time or over many occasions. Psychological processes are inherently time-bound, and it can be argued that no theory is truly time-independent ( Zaheer et al., 1999 ). Further, its prolific use in economics, engineering, and the natural sciences may perhaps be an indicator of its potential in our field, and recent technological growth has already initiated shifts in data collection that proliferate time series designs. For instance, online behaviors can now be quantified and tracked in real-time, leading to an accessible and rich source of time series data (see Stanton and Rogelberg, 2001 ). As a leading example, Ginsberg et al. (2009) developed methods of influenza tracking based on Google queries whose efficiency surpassed conventional systems, such as those provided by the Center for Disease Control and Prevention. Importantly, this work was based in prior research showing how search engine queries correlated with virological and mortality data over multiple years ( Polgreen et al., 2008 ).

Furthermore, although experience sampling methods have been used for decades ( Larson and Csikszentmihalyi, 1983 ), nascent technologies such as smartphones allow this technique to be increasingly feasible and less intrusive to respondents, resulting in a proliferation of time series data. As an example, Killingsworth and Gibert (2010) presented an iPhone (Apple Incorporated, Cupertino, California) application which tracks various behaviors, cognitions, and affect over time. At the time their study was published, their database contained almost a quarter of a million psychological measurements from individuals in 83 countries. Finally, due to the growing synthesis between psychology and neuroscience (e.g., affective neuroscience, social-cognitive neuroscience) the ability to analyze neuroimaging data, which is strongly linked to time series methods (e.g., Friston et al., 1995 , 2000 ), is a powerful methodological asset. Due to these overarching trends, we expect that time series data will become increasingly prevalent and spur the development of more time-sensitive psychological theory. Mindful of the growing need to contribute to the methodological toolkit of psychological researchers, the present article introduces the use of time series analysis in order to describe and understand the dynamics of psychological change over time.

In contrast to these current trends, we conducted a survey of the existing psychological literature in order to quantify the extent to which time series methods have already been used in psychological science. Using the PsycINFO database, we searched the publication histories of 15 prominent journals in psychology 1 for the term “time series” in the abstract, keywords, and subject terms. This search yielded a small sample of 36 empirical papers that utilized time series modeling. Further investigation revealed the presence of two general analytic goals: relating a time series to other substantive variables (17 papers) and examining the effects of a critical event or intervention (9 papers; the remaining papers consisted of other goals). Thus, this review not only demonstrates the relative scarcity of time series methods in psychological research, but also that scholars have primarily used descriptive or causal explanatory models for time series data analysis ( Shmueli, 2010 ).

The prevalence of these types of models is typical of social science, but in fields where time series analysis is most commonly found (e.g., econometrics, finance, the atmospheric sciences), forecasting is often the primary goal because it bears on important practical decisions. As a result, the statistical time series literature is dominated by models that are aimed toward prediction , not explanation ( Shmueli, 2010 ), and almost every book on applied time series analysis is exclusively devoted to forecasting methods ( McCleary et al., 1980 , p. 205). Although there are many well-written texts on time series modeling for economic and financial applications (e.g., Rothman, 1999 ; Mills and Markellos, 2008 ), there is a lack of formal introductions geared toward psychological issues (see West and Hepworth, 1991 for an exception). Thus, a psychologist looking to use these methodologies may find themselves with resources that focus on entirely different goals. The current paper attempts to amend this by providing an introduction to time series methodologies that is oriented toward issues within psychological research. This is accomplished by first introducing the basic characteristics of time series data: the four components of variation (trend, seasonality, cycles, and irregular variation), autocorrelation, and stationarity. Then, various time series regression models are explicated that can be used to achieve a wide range of goals, such as describing the process of change through time, estimating seasonal effects, and examining the effect of an intervention or critical event. Not to overlook the potential importance of forecasting for psychological research, the second half of the paper discusses methods for modeling autocorrelation and generating accurate predictions—viz., autoregressive integrative moving average (ARIMA) modeling. The final section briefly describes how regression techniques and ARIMA models can be combined in a dynamic regression model that can simultaneously explain and forecast a time series variable. Thus, the current paper seeks to provide an integrative resource for psychological researchers interested in analyzing time series data which, given the trends described above, are poised to become increasingly prevalent.

The Current Illustrative Application

In order to better demonstrate how time series analysis can accomplish the goals of psychological research, a running practical example is presented throughout the current paper. For this particular illustration, we focused on online job search behaviors using data from Google Trends, which compiles the frequency of online searches on Google over time. We were particularly interested in the frequency of online job searches in the United States 2 and the impact of the 2008 economic crisis on these rates. Our primary research hypothesis was that this critical event resulted in a sharp increase in the series that persisted over time. The monthly frequencies of these searches from January 2004 to June 2011 were recorded, constituting a data set of 90 total observations. Figure 1 displays a plot of this original time series that will be referenced throughout the current paper. Importantly, the values of the series do not represent the raw number of Google searches, but have been normalized (0–100) in order to yield a more tractable data set; each monthly value represents its percentage relative to the maximum observed value 3 .

www.frontiersin.org

Figure 1. A plot of the original Google job search time series and the series after seasonal adjustment .

A Note on Software Implementation

Conceptual expositions of new analytical methods can often be undermined by the practical issue of software implementation ( Sharpe, 2013 ). To preempt this obstacle, for each analysis we provide accompanying R code in the Supplementary Material, along with an intuitive explanation of the meanings and rationale behind the various commands and arguments. On account of its versatility, the open-source statistical package R ( R Development Core Team, 2011 ) remains the software platform of choice for performing time series analyses, and a number of introductory texts are oriented solely toward this program, such as Introductory Time Series with R ( Cowpertwait and Metcalfe, 2009 ), Time Series Analysis with Applications in R ( Cryer and Chan, 2008 ), and Time Series Analysis and Its Applications with R Examples ( Shumway and Stoffer, 2006 ). In recent years, R has become increasingly recognized within the psychological sciences as well ( Muenchen, 2013 ). We believe that psychological researchers with even a minimal amount of experience with R will find this tutorial both informative and accessible.

An Introduction to Time Series Data

Before introducing how time series analyses can be used in psychological research, it is necessary to first explicate the features that characterize time series data. At its simplest, a time series is a set of time-ordered observations of a process where the intervals between observations remain constant (e.g., weeks, months, years, and minor deviations in the intervals are acceptable; McCleary et al., 1980 , p. 21; Cowpertwait and Metcalfe, 2009 ). Time series data is often distinguished from other types of longitudinal data by the number and source of the observations; a univariate time series contains many observations originating from a single source (e.g., an individual, a price index), while other forms of longitudinal data often consist of several observations from many sources (e.g., a group of individuals). The length of time series can vary, but are generally at least 20 observations long, and many models require at least 50 observations for accurate estimation ( McCleary et al., 1980 , p. 20). More data is always preferable, but at the very least, a time series should be long enough to capture the phenomena of interest.

Due to its unique structure, a time series exhibits characteristics that are either absent or less prominent in the kinds of cross-sectional and longitudinal data typically collected in psychological research. In the next sections, we review these features that include autocorrelation and stationarity . However, we begin by delineating the types of patterns that may be present within a time series. That is, the variation or movement in a series can be partitioned into four parts: the trend, seasonal, cyclical , and irregular components ( Persons, 1919 ).

The Four Components of Time Series

Trend refers to any systematic change in the level of a series—i.e., its long-term direction ( McCleary et al., 1980 , p. 31; Hyndman and Athanasopoulos, 2014 ). Both the direction and slope (rate of change) of a trend may remain constant or change throughout the course of the series. Globally, the illustrative time series shown in Figure 1 exhibits a positive trend: The level of the series at the end is systematically higher than at its beginning. However, there are sections in this particular series that do not exhibit the same rate of increase. The beginning of the series displays a slight negative trend, and starting approximately at 2006, the series significantly rises until 2009, after which a small downward trend may even be present.

Because a trend in the data represents a significant source of variability, it must be accounted for when performing any time series analysis. That is, it must be either (a) modeled explicitly or (b) removed through mathematical transformations (i.e., detrending ; McCleary et al., 1980 , p. 32). The former approach is taken when the trend is theoretically interesting—either on its own or in relation to other variables. Conversely, removing the trend (through methods discussed later) is performed when this component is not pertinent to the goals of the analysis (e.g., strict forecasting). The decision of whether to model or remove systematic components like a trend represents an important aspect of time series analysis. The various characteristics of time series data are either of theoretical interest—in which case they should be modeled—or not, in which case they should be removed so that the aspects that are of interest can be more easily analyzed. Thus, it is incumbent upon the analyst to establish the goals of the analysis and determine which components of a time series are of interest and treat them accordingly. This topic will be revisited throughout the forthcoming sections.

Seasonality

Unlike the trend component, the seasonal component of a series is a repeating pattern of increase and decrease in the series that occurs consistently throughout its duration. More specifically, it can be defined as a cyclical or repeating pattern of movement within a period of 1 year or less that is attributed to “seasonal” factors—i.e., those related to an aspect of the calendar (e.g., the months or quarters of a year or the days of a week; Cowpertwait and Metcalfe, 2009 , p. 6; Hyndman and Athanasopoulos, 2014 ). For instance, restaurant attendance may exhibit a weekly seasonal pattern such that the weekends routinely display the highest levels within the series across weeks (i.e., the time period), and the first several weekdays are consistently the lowest. Retail sales often display a monthly seasonal pattern, where each month across yearly periods consistently exhibits the same relative position to the others: viz., a spike in the series during the holiday months and a marked decrease in the following months. Importantly, the pattern represented by a seasonal effect remains constant and occurs over the same duration on each occasion ( Hyndman and Athanasopoulos, 2014 ).

Although its underlying pattern remains fixed, the magnitude of a seasonal effect may vary across periods. Seasonal effects can also be embedded within overarching trends. Along with a marked trend, the series in Figure 1 exhibits noticeable seasonal fluctuations as well; at the beginning of each year (i.e., after the holiday months), online job searches spike and then fall significantly in February. After February, they continue to rise until about July or August, after which the series significantly drops for the remainder of the year, representing the effects of seasonal employment. Notice the consistency of both the form (i.e., pattern of increase and decrease) and magnitude of this seasonal effect. The fact that online job search behavior exhibits seasonal patterns supports the idea that this behavior (and this example in particular) is representative of job search behavior in general. In the United States, thousands of individuals engage in seasonal work which results in higher unemployment rates in the beginning of each year and in the later summer months (e.g., July and August; The United States Department of Labor, Bureau of Labor Statistics, 2014 ), manifesting in a similar seasonal pattern of job search behavior.

One may be interested in the presence of seasonal effects, but once identified, this source of variation is often removed from the time series through a procedure known as seasonal adjustment ( Cowpertwait and Metcalfe, 2009 , p. 21). This is in keeping with the aforementioned theme: Once a systematic component has been identified, it must either be modeled or removed. The popularity of seasonal adjustment is due to the characteristics of seasonal effects delineated above: Unlike other more dynamic components of a time series, seasonal patterns remain consistent across periods and are generally similar in magnitude ( Hyndman and Athanasopoulos, 2014 ). Their effects may also obscure other important features of time series—e.g., a previously unnoticed trend or cycles described in the following section. Put simply, “seasonal adjustment is done to simplify data so that they may be more easily interpreted…without a significant loss of information” ( Bell and Hillmer, 1984 , p. 301). Unemployment rates are often seasonally adjusted to remove the fluctuations due to the effects of weather, harvests, and school schedules that remain more or less constant across years. In our data, the seasonal effects of job search behavior are not of direct theoretical interest relative to other features of the data, such as the underlying trend and the impact of the 2008 economic crisis. Thus, we may prefer to work with the simpler seasonally adjusted series. The lower panel of Figure 1 displays the original Google time series after seasonal adjustment, and the Supplementary Material contains a description of how to implement this procedure in R. It can be seen that the trend is made notably clearer after removing the seasonal effects. Despite the spike at the very end, the suspected downward trend in the later part of the series is much more evident. This insight will prove to be important when selecting an appropriate time series model in the upcoming sections.

A cyclical component in a time series is conceptually similar to a seasonal component: It is a pattern of fluctuation (i.e., increase or decrease) that reoccurs across periods of time. However, unlike seasonal effects whose duration is fixed across occurrences and are associated with some aspect of the calendar (e.g., days, months), the patterns represented by cyclical effects are not of fixed duration (i.e., their length often varies from cycle to cycle) and are not attributable to any naturally-occurring time periods ( Hyndman and Athanasopoulos, 2014 ). Put simply, cycles are any non-seasonal component that varies in a recognizable pattern (e.g., business cycles; Hyndman and Athanasopoulos, 2014 ). In contrast to seasonal effects, cycles generally occur over a period lasting longer than 2 years (although they may be shorter), and the magnitude of cyclical effects is generally more variable than that of seasonal effects ( Hyndman and Athanasopoulos, 2014 ). Furthermore, just as the previous two components—trend and seasonality—can be present with or without the other, a cyclical component may be present with any combination of the other two. For instance, a trend with an intrinsic seasonal effect can be embedded within a greater cyclical pattern that occurs over a period of several years. Alternatively, a cyclical effect may be present without either of these two systematic components.

In the 7 years that constitute the time series of Figure 1 , there do not appear to be any cyclical effects. This is expected, as there are no strong theoretical reasons to believe that online or job search behavior is significantly influenced by factors that consistently manifest across a period of over one year. We have significant a priori reasons to believe that causal factors related to seasonality exist (e.g., searching for work after seasonal employment), but the same does not hold true for long-term cycles, and the time series is sufficiently long enough to capture any potential cyclical behavior.

Irregular Variation (Randomness)

While the previous three components represented three systematic types of time series variability (i.e., signal ; Hyndman and Athanasopoulos, 2014 ), the irregular component represents statistical noise and is analogous to the error terms included in various types of statistical models (e.g., the random component in generalized linear modeling). It constitutes any remaining variation in a time series after these three systematic components have been partitioned out. In time series parlance, when this component is completely random (i.e., not autocorrelated), it is referred to as white noise , which plays an important role in both the theory and practice of time series modeling. Time series are assumed to be in part driven by a white noise process (explicated in a future section), and white noise is vital for judging the adequacy of a time series model. After a model has been fit to the data, the residuals form a time series of their own, called the residual error series . If the statistical model has been successful in accounting for all the patterns in the data (e.g., systematic components such as trend and seasonality), the residual error series should be nothing more than unrelated white noise error terms with a mean of zero and some constant variance. In other words, the model should be successful in extracting all the signal present in the data with only randomness left over ( Cowpertwait and Metcalfe, 2009 , p. 68). This is analogous to evaluating the residuals of linear regression, which should be normally distributed around a mean of zero.

Time Series Decomposition

To visually examine a series in an exploratory fashion, time series are often formally partitioned into each of these components through a procedure referred to as time series decomposition . Figure 2 displays the original Google time series (top panel) decomposed into its constituent parts. This figure depicts what is referred to as classical decomposition , when a time series is conceived of comprising three components: a trend-cycle, seasonal, and random component. (Here, the trend and cycle are combined because the duration of each cycle is unknown; Hyndman and Athanasopoulos, 2014 ). The classic additive decomposition model ( Cowpertwait and Metcalfe, 2009 , p. 19) describes each value of the time series as the sum of these three components:

www.frontiersin.org

Figure 2. The original time series decomposed into its trend, seasonal, and irregular (i.e., random) components . Cyclical effects are not present within this series.

The additive decomposition model is most appropriate when the magnitude of the trend-cycle and seasonal components remain constant over the course of the series. However, when the magnitude of these components varies but still appears proportional over time (i.e., it changes by a multiplicative factor), the series may be better represented by the multiplicative decomposition model, where each observation is the product of the trend-cycle, seasonal, and random components:

In either decomposition model, each component is sequentially estimated and then removed until only the stochastic error component remains (the bottom panel of Figure 2 ). The primary purpose of time series decomposition is to provide the analyst with a better understanding of the underlying behavior and patterns of the time series which can be valuable in determining the goals of the analysis. Decomposition models can be used to generate forecasts by adding or multiplying future estimates of the seasonal and trend-cycle components ( Hyndman and Athanasopoulos, 2014 ). However, such models are beyond the scope of this present paper, and the ARIMA forecasting models discussed later are generally superior 4 .

Autocorrelation

In psychological research, the current state of a variable may partially depend on prior states. That is, many psychological variables exhibit autocorrelation : when a variable is correlated with itself across different time points (also referred to as serial dependence ). Time series designs capture the effect of previous states and incorporate this potentially significant source of variance within their corresponding statistical models. Although the main features of many time series are its systematic components such as trend and seasonality, a large portion of time series methodology is aimed at explaining the autocorrelation in the data ( Dettling, 2013 , p. 2).

The importance of accounting for autocorrelation should not be overlooked; it is ubiquitous in social science phenomena ( Kerlinger, 1973 ; Jones et al., 1977 ; Hartmann et al., 1980 ; Hays, 1981 ). In a review of 44 behavioral research studies with a total of 248 independent sets of repeated measures data, Busk and Marascuilo (1988) found that 80% of the calculated autocorrelations ranged from 0.1 to 0.49, and 40% exceeded 0.25. More specific to the psychological sciences, it has been proposed that state-related constructs at the individual-level, such as emotions and arousal, are often contingent on prior states ( Wood and Brown, 1994 ). Using autocorrelation analysis, Fairbairn and Sayette (2013) found that alcohol use reduces emotional inertia, the extent to which prior affective states determine current emotions. Through this, they were able to marshal support for the theory of alcohol myopia , the intuitive but largely untested idea that alcohol allows a greater enjoyment of the present, and thus formally uncovered an affective motivation for alcohol use (and misuse). Further, using time series methods, Fuller et al. (2003) found that job stress in the present day was negatively related to the degree of stress in the preceding day. Accounting for autocorrelation can therefore reveal new information on the phenomenon of interest, as the Fuller et al. (2003) analysis led to the counterintuitive finding that lower stress was observed after prior levels had been high.

Statistically, autocorrelation simply represents the Pearson correlation for a variable with itself at a previous time period, referred to as the lag of the autocorrelation. For instance, the lag-1 autocorrelation of a time series is the correlation of each value with the immediately preceding observation; a lag-2 autocorrelation is the correlation with the value that occurred two observations before. The autocorrelation with respect to any lag can be computed (e.g., a lag-20 autocorrelation), and intuitively, the strength of the autocorrelation generally diminishes as the length of the lag increases (i.e., as the values become further removed in time).

Strong positive autocorrelation in a time series manifests graphically by “runs” of values that are either above or below the average value of the time series. Such time series are sometimes called “persistent” because when the series is above (or below) the mean value it tends to remain that way for several periods. Conversely, negative autocorrelation is characterized by the absence of runs—i.e., when positive values tend to follow negative values (and vice versa). Figure 3 contains two plots of time series intended to give the reader an intuitive understanding of the presence of autocorrelation: The series in the top panel exhibits positive autocorrelation, while the center panel illustrates negative autocorrelation. It is important to note that the autocorrelation in these series is not obscured by other components and that in real time series, visual analysis alone may not be sufficient to detect autocorrelation.

www.frontiersin.org

Figure 3. Two example time series displaying exaggerated positive (top panel) and negative (center panel) autocorrelation . The bottom panel depicts the ACF of the Google job search time series after seasonal adjustment.

In time series analysis, the autocorrelation coefficient across many lags is called the autocorrelation function (ACF) and plays a significant role in model selection and evaluation (as discussed later). A plot of the ACF of the Google job search time series after seasonal adjustment is presented in the bottom panel of Figure 3 . In an ACF plot, the y-axis displays the strength of the autocorrelation (ranging from positive to negative 1), and the x-axis represents the length of the lags: from lag-0 (which will always be 1) to much higher lags (here, lag-19). The dotted horizontal line indicates the p < 0.05 criterion for statistical significance.

Stationarity

Definition and purpose.

A complication with time series data is that its mean, variance, or autocorrelation structure can vary over time. A time series is said to be stationary when these properties remain constant ( Cryer and Chan, 2008 , p. 16). Thus, there are many ways in which a series can be non-stationary (e.g., an increasing variance over time), but it can only be stationary in one-way (viz., when all of these features do not change).

Stationarity is a pivotal concept in time series analysis because descriptive statistics of a series (e.g., its mean and variance) are only accurate population estimates if they remain constant throughout the series ( Cowpertwait and Metcalfe, 2009 , pp. 31–32). With a stationary series, it will not matter when the variable is observed: “The properties of one section of the data are much like those of any other” ( Chatfield, 2004 , p. 13). As a result, a stationary series is easy to predict: Its future values will be similar to those in the past ( Nua, 2014 ). As a result, stationarity is the most important assumption when making predictions based on past observations ( Cryer and Chan, 2008 , p. 16), and many times series models assume the series already is or can be transformed to stationarity (e.g., the broad class of ARIMA models discussed later).

In general, a stationary time series will have no predictable patterns in the long-term; plots will show the series to be roughly horizontal with some constant variance ( Hyndman and Athanasopoulos, 2014 ). A stationary time series is illustrated in Figure 4 , which is a stationary white noise series (i.e., a series of uncorrelated terms). The series hovers around the same general region (i.e., its mean) with a consistent variance around this value. Despite the observations having a constant mean, variance, and autocorrelation, notice how such a process can generate outliers (e.g., the low extreme value after t = 60), as well as runs of values that are both above or below the mean. Thus, stationarity does not preclude these temporary and fluctuating behaviors of the series, although any systematic patterns would.

www.frontiersin.org

Figure 4. An example of a stationary time series (specifically, a series of uncorrelated white noise terms) . The mean, variance, and autocorrelation are all constant over time, and the series displays no systematic patterns, such as trends or cycles.

However, many time series in real life are dominated by trends and seasonal effects that preclude stationarity. A series with a trend cannot be stationary because, by definition, a trend is when the mean level of the series changes over time. Seasonal effects also preclude stationarity, as they are reoccurring patterns of change in the mean of the series within a fixed time period (e.g., a year). Thus, trend and seasonality are the two time series components that must be addressed in order to achieve stationarity.

Transforming a Series to Stationarity

When a time series is not stationary, it can be made so after accounting for these systematic components within the model or through mathematical transformations. The procedure of seasonal adjustment described above is a method that removes the systematic seasonal effects on the mean level of the series.

The most important method of stationarizing the mean of a series is through a process called differencing , which can be used to remove any trend in the series which is not of interest. In the simplest case of a linear trend, the slope (i.e., the change from one period to the next) remains relatively constant over time. In such a case, the difference between each time period and its preceding one (referred to as the first differences ) are approximately equal. Thus, one can effectively “detrend” the series by transforming the original series into a series of first differences ( Meko, 2013 ; Hyndman and Athanasopoulos, 2014 ). The underlying logic is that forecasting the change in a series from one period to the next is just as useful in practice as predicting the original series values.

However, when the time series exhibits a trend that itself changes (i.e., a non-constant slope), then even transforming a series into a series of its first differences may not render it completely stationary. This is because when the slope itself is changing (e.g., an exponential trend), the difference between periods will be unequal. In such cases, taking the first differences of the already differenced series (referred to as the second differences ) will often stationarize the series. This is because each successive differencing has the effect of reducing the overall variance of the series ( Anderson, 1976 ), as deviations from the mean level are increasingly reduced through this subtractive process. The second differences (i.e., the first differences of the already differenced series) will therefore further stabilize the mean. There are general guidelines on how many orders of differencing are necessary to stationarize a series. For instance, the first or second differences will nearly always stationarize the mean, and in practice it is almost never necessary to go beyond second differencing ( Cryer and Chan, 2008 ; Hyndman and Athanasopoulos, 2014 ). However, for series that exhibit higher-degree polynomial trends, the order of differencing required to stationarize the series is typically equal to that degree (e.g., two orders of differencing for an approximately quadratic trend, three orders for a cubic trend; Cowpertwait and Metcalfe, 2009 , p. 93).

A common mistake in time series modeling to “overdifference” the series, when more orders of differencing than are required to achieve stationarity are performed. This can complicate the process of building an adequate and parsimonious model (see McCleary et al., 1980 , p. 97). Fortunately, overdifferencing is relatively easy to identify; differencing a series with a trend will have the effect of reducing the variance of the series, but an unnecessary degree of differencing will increase its variance ( Anderson, 1976 ). Thus, the optimal order of differencing is that which results in the lowest variance of the series.

If the variance of a times series is not constant over time, a common method of making the variance stationary is through a logarithmic transformation of the series ( Cowpertwait and Metcalfe, 2009 , pp. 109–112; Hyndman and Athanasopoulos, 2014 ). Taking the logarithm has the practical effect of reducing each value at an exponential rate. That is, the larger the value, the more its value is reduced. Thus, this transformation stabilizes the differences across values (i.e., its variance) which is also why it is frequently used to mitigate the effect of outliers (e.g., Aguinis et al., 2013 ). It is important to remember that if one applies a transformation, any forecasts generated by the selected model will be in these transformed units. However, once the model is fitted and the parameters estimated, one can reverse these transformations to obtain forecasts in its original metric.

Finally, there are also formal statistical tests for stationarity, termed unit root tests. A very popular procedure is the augmented Dickey–Fuller test (ADF; Said and Dickey, 1984 ) which tests the null hypothesis that the series is non-stationary. Thus, rejection of the null provides evidence for a stationary series. Table 1 below contains information regarding the ADF test, as well as descriptions of other various statistical tests frequently used in time series analysis that will be discussed in the remainder of the paper. By using the ADF test in conjunction with the transformations described above (or the modeling procedures delineated below), an analyst can ensure that a series conforms to stationarity.

www.frontiersin.org

Table 1. Common tests in time series analysis .

Time Series Modeling: Regression Methods

The statistical time series literature is dominated by methodologies aimed at forecasting the behavior of a time series ( Shmueli, 2010 ). Yet, as the survey in the introduction illustrated, psychological researchers are primarily interested in other applications, such as describing and accounting for an underlying trend, linking explanatory variables to the criterion of interest, and assessing the impact of critical events. Thus, psychological researchers will primarily use descriptive or explanatory models, as opposed to predictive models aimed solely at generating accurate forecasts. In time series analysis, each of the aforementioned goals can be accomplished through the use of regression methods in a manner very similar to the analysis of cross-sectional data. After having explicated the basic properties of time series data, we now discuss these specific modeling approaches that are able fulfill these purposes. The next four sections begin by first providing an overview of each type of regression model, how psychological research stands to gain from the use of these methods, and their corresponding statistical models. We include mathematical treatments, but also provide conceptual explanations so that they may be understood in an accessible and intuitive manner. Additionally, Figure 5 presents a flowchart depicting different time series models and which approaches are best for addressing the various goals of psychological research. As the current paper continues, the reader will come to understand the meaning and structure of these models and their relation to substantive research questions.

www.frontiersin.org

Figure 5. A flowchart depicting various time series modeling approaches and how they are suited to address various goals in psychological research .

It is important to keep in mind that time series often exhibit strong autocorrelation which often manifests in correlated residuals after a regression model has been fit. This violates the standard assumption of independent (i.e., uncorrelated) errors. In the section that follows these regression approaches, we describe how the remaining autocorrelation can be included in the model by building a dynamic regression model that includes ARIMA terms 5 . That is, a regression model can be first fit to the data for explanatory or descriptive modeling, and ARIMA terms can be fit to the residuals in order to account for any remaining autocorrelation and improve forecasts ( Hyndman and Athanasopoulos, 2014 ). However, we begin by introducing regression methods separate from ARIMA modeling, temporarily setting aside the issue of autocorrelation. This is done in order to better focus on the implementation of these models, but also because violating this assumption has minimal effects on the substance of the analysis: The parameter estimates remain unbiased and can still be used for prediction. Its forecasts will not be “wrong,” but inefficient —i.e., ignoring the information represented by the autocorrelation that could be used to obtain better predictions ( Hyndman and Athanasopoulos, 2014 ). Additionally, generalized least squares estimation (as opposed to ordinary least squares) takes into account the effects of autocorrelation which otherwise lead to underestimated standard errors ( Cowpertwait and Metcalfe, 2009 , p. 98). This estimation procedure was used for each of the regression models below. For further information on regression methods for time series, the reader is directed to Hyndman and Athanasopoulos (2014 , chaps. 4, 5) and McCleary et al. (1980) , which are very accessible introductions to the topic, as well as Cowpertwait and Metcalfe (2009 , chap. 5) and Cryer and Chan (2008 , chaps. 3, 11) for more mathematically-oriented treatments.

Modeling Trends through Regression

Modeling an observed trend in a time series through regression is appropriate when the trend is deterministic —i.e., the trend is due to the constant, deterministic effects of a few causal forces ( McCleary et al., 1980 , p. 34). As a result, a deterministic trend is generally stable across time. Expecting any trend to continue indefinitely is often unrealistic, but for a deterministic trend, linear extrapolation can provide accurate forecasts for several periods ahead, as forecasting generally assumes that trends will continue and change relatively slowly ( Cowpertwait and Metcalfe, 2009 , p. 6). Thus, when the trend is deterministic, it is desirable to use a regression model that includes the hypothesized causal factors as predictors ( Cowpertwait and Metcalfe, 2009 , p. 91; McCleary et al., 1980 , p. 34).

Deterministic trends stand in contrast to stochastic trends, those that arise simply from the random movement of the variable over time (long runs of similar values due to autocorrelation; Cowpertwait and Metcalfe, 2009 , p. 91). As a result, stochastic trends often exhibit frequent and inexplicable changes in both slope and direction. When the trend is deemed to be stochastic, it is often removed through differencing. There are also methods for forecasting using stochastic trends (e.g., random walk and exponential smoothing models) discussed in Cowpertwait and Metcalfe (2009 , chaps. 3, 4) and Hyndman and Athanasopoulos (2014 , chap. 7). However, the reader should be aware that these are predictive models only, as there is nothing about a stochastic trend that can be explained through external, theoretically interesting factors (i.e., it is a trend attributable to randomness). Therefore, attempting to model it deterministically as a function of time or other substantive variables via regression can lead to spurious relationships ( Kuljanin et al., 2011) and inaccurate forecasts, as the trend is unlikely to remain stable over time.

Returning to the example Google time series of Figure 1 , the evident trend in the seasonally adjusted series might appear to be stochastic: It is not constant but changes at several points within the series. However, we have strong theoretical reasons for modeling it deterministically, as the 2008 economic crisis is one causal factor that likely had a profound impact on the series. Thus, this theoretical rationale implies that the otherwise inexplicable changes in its trend are due to systematic forces that can be appropriately modeled within an explanatory approach (i.e., as a deterministic function of predictors).

The Linear Regression Model

As noted in the literature review, psychological researchers are often directly interested in describing an underlying trend. For example, ( Fuller et al. (2003) examined the strain of university employees using a time series design. They found that each self-report item displayed the same deterministic trend: Globally, strain increased over time even though the perceived severity of the stressful events did not increase. Levels of strain also decreased at spring break and after finals week, during which mood and job satisfaction also exhibited rising levels. This finding cohered with prior theory on the accumulating nature of stress and the importance of regular strain relief (e.g., Bolger et al., 1989 ; Carayon, 1995) . Furthermore, Wagner et al. (1988) examined the trend in employee productivity after the implementation of an incentive-based wage system. In addition to discovering an immediate increase in productivity, it was found that productivity increased over time as well (i.e., a continuing deterministic trend). This trend gradually diminished over time, but was still present at the end of the study period—nearly 6 years after the intervention first occurred.

By visually examining a time series, an analyst can describe how a trend changes as function of time. However, one can formally assess the behavior of a trend by regressing the series on a variable that represents time (e.g., 1–50 for 50 equally-spaced observations). In the simplest case, the trend can be modeled as a linear function of time, which is conceptually identical to a regression model for cross-sectional data using a single predictor:

where the coefficient b 1 estimates the amount of change in the time series associated with a one-unit increase in time, t is the time variable, and ε t is random error. The constant, b 0 , estimates the level of the series when t = 0.

If a deterministic trend is fully accounted for by a linear regression model, the residual error series (i.e., the collection of residuals which themselves form a time series) will not contain any remaining trend component; that is, this non-stationary behavior of the series will have been accounted for Cowpertwait and Metcalfe (2009) , (p. 121). Returning to our empirical example, the linear regression model displayed in Equation (3) was fit to the seasonally adjusted Google job search data. This is displayed in the top left panel of Figure 6 . The regression line of best-fit is superimposed, and the residual error series is shown in the panel directly to the right. Here, time is a significant predictor ( b 1 = 0.32, p < 0.001), and the model accounts for 67% of the seasonally-adjusted series variance ( R 2 = 0.67, p < 0.001). However, the residual error series displays a notable amount of remaining trend that has been left unaccounted for; the first half of the error series has a striking downward trend that begins to rise at around 2007. This is because the regression line is constrained to linearity and therefore systematically underestimates and overestimates the values of the series when the trend exhibits runs of high and low values, respectively. Importantly, the forecasts from the simple linear model will most likely be very poor as well. Although there is a spike at the end of the series, the linear model predicts that values further ahead in time will be even higher. By contrast, we actually expect these values to decrease, similar to how there was a decreasing trend in 2008 right after the first spike. Thus, despite accounting for a considerable amount of variance and serving as a general approximation of the series trend, the linear model is insufficient in several systematic ways, manifesting in inaccurate forecasts and a significant remaining trend in the residual error series. A method for improving this model is to add in a higher-order polynomial term; modeling the trend as quadratic, cubic, or an even higher-order function may lead to a better-fitting model, but the analyst must be vigilant of overfitting the series—i.e., including so many parameters that the statistical noise becomes modeled. Thus, striking a balance between parsimony and explanatory capability should always be a consideration when modeling time series (and statistical modeling in general). Although a simple linear regression on time is often adequate to approximate a trend ( Cowpertwait and Metcalfe, 2009 , p. 5), in this particular instance a higher-order term may provide a better fit to the complex deterministic trend seen within this series.

www.frontiersin.org

Figure 6. Three different regression models with time as the regressor and their associated residual error series .

Polynomial Regression Models

When describing the trend in the Google data earlier, it was noted that the series began to display a rising trend approximately a third of the way into the series, implying that a quadratic regression model (i.e., a single bend) may yield a good fit to the data. Furthermore, our initial hypothesis was that job search behavior proceeded at a generally constant rate and then spiked once the economic crisis began—also implying a quadratic trend. In some time series, the trend over time will be non-linear, and the predictor terms can be specified to reflect such higher-order terms (quadratic, cubic, etc.). Just like when modeling cross-sectional data, non-linear terms can be incorporated into the statistical model by squaring the predictor (here, time) 6 :

The center panels in Figure 6 show the quadratic model and its residual error series. In line with the initial hypothesis, both the quadratic term ( b 2 = 0.003, p < 0.001) and linear term ( b 1 = 0.32, p < 0.001) were statistically significant. Thus, modeling the trend as a quadratic function of time explained an additional 4% of the series variance relative to the more parsimonious linear model ( R 2 = 0.71, p < 0.001). However, examination of this series and its residuals shows that it is not as different from the linear model than was expected; although the first half of the residual error series has a more stable mean level, there are still noticeable trends in the first half of the residual error series, and the forecasts implied by this model are even higher than those of the linear model. Therefore, a cubic trend may provide an even better fit, as there are two apparent bends in the series:

After fitting this model to the Google data, 87% of the series variance is accounted for ( R 2 = 0.87 p < 0.001), and all three coefficients are statistically significant: b 1 = 0.69, p < 0.001, b 2 = 0.003, p = 0.05, and b 3 = −0.0003, p < 0.001. Furthermore, the forecasts implied by the model are much more realistic. Ultimately, it is unlikely that this model will provide accurate forecasts many periods into the future (as is often the case for regression models; Cowpertwait and Metcalfe, 2009 , p. 6; Hyndman and Athanasopoulos, 2014 ). It is more likely that either (a) a negative trend will return the series back to more moderate levels or (b) the series will simply continue at a generally high level. Furthermore, relative to the linear model, the residual error series of this model appears much closer to stationarity (e.g., Figure 4 ), as the initial downward trend of the time series is captured. Therefore, modeling the series as a cubic function of time is the most successful in terms of accounting for the trend, and adding an even higher-order polynomial term has little remaining variance to explain (<15%) and would likely lead to an overfitted model. Thus, relative to the two previous models, the cubic model strikes a balance between relative parsimony and descriptive capability. However, any forecasts from this model could be improved upon by removing the remaining trend and including other terms that account for any autocorrelation in the data, topics discussed in an upcoming section on ARIMA modeling.

Interrupted Time Series Analysis

Although we are interested in describing the underlying trend within the Google time series as a function of time, we are also interested in the effect of a critical event, represented by the following question: “Did the 2008 economic crisis result in elevated rates job search behaviors?” In psychological science, many research questions center on the impact of an event, whether it be a relationship change, job transition, or major stressor or uplift ( Kanner et al., 1981 ; Dalal et al., 2014 ). In the survey of how time series analysis had been previously used in psychological research, examining the impact of an event was one of its most common uses. In time series methodology, questions regarding the impact of events can be analyzed through interrupted time series analysis (or intervention analysis ; Glass et al., 1975 ), in which the time series observations are “interrupted” by an intervention, treatment, or incident occurring at a known point in time ( Cook and Campbell, 1979 ).

In both academic and applied settings, psychological researchers are often constrained to correlational, cross-sectional data. As a result, researchers rarely have the ability to implement control groups within their study designs and are less capable of drawing conclusions regarding causality. In the majority of cases, it is the theory itself that provides the rationale for drawing causal inferences ( Shmueli, 2010 , p. 290). In contrast, an interrupted time series is the strongest quasi-experimental design to evaluate the longitudinal impact of an event ( Wagner et al., 2002 , p. 299). In a review of previous research on the efficacy of interventions, Beer and Walton (1987) stated, “much of the research overlooks time and is not sufficiently longitudinal. By assessing the events and their impact at only one nearly contemporaneous moment, the research cannot discuss how permanent the changes are” (p. 343). Interrupted time series analysis ameliorates this problem by taking multiple measurements both before and after the event, thereby allowing the analyst to examine the pre- and post-event trend.

Collecting data at multiple time points also offers advantages relative to cross-sectional comparisons based on pre- and post-event means. A longitudinal interrupted time series design allows the analyst to control for the trend prior to the event, which may turn out to be the cause of any alleged intervention effect. For instance, in the field of industrial/organizational psychology, Pearce et al. (1985) found a positive trend in four measures of organizational performance over the course of the 4 years under study. However, after incorporating the effects of the pre-event trend in the analysis, neither the implementation of the policy nor the first year of merit-based rewards yielded any additional effects. That is, the post-event trends were almost totally attributable to the pre-event behavior of the series. Thus, a time series design and analysis yielded an entirely different and more parsimonious conclusion that might have otherwise been drawn. In contrast, Wagner et al. (1988) was able to show that that for non-managerial employees, an incentive-based wage system substantially increased employee productivity in both its baseline level and post-intervention slope (the baseline level jumped over 100%). Thus, interrupted time series analysis is an ideal method for examining the impacts of such events and can be generalized to other criteria of interest.

Modeling an Interrupted Time Series

Statistical modeling of an interrupted time series can be accomplished through segmented regression analysis ( Wagner et al., 2002 , p. 300). Here, the time series is partitioned into two parts: the pre- and post-event segments whose levels (intercepts) and trends (slopes) are both estimated. A change in these parameters represents an effect of the event: A significant change in the level of the series indicates an immediate change, and a change in trend reflects a more gradual change in the outcome (and of course, both are possible; Wagner et al., 2002 , p. 300). The formal model reflects these four parameters of interest:

Here, b 0 represents the pre-event baseline level, t is the predictor time (in our example, coded 1–90), and its coefficient, b 1, estimates the trend prior to the event ( Wagner et al., 2002 , p. 31). The dummy variable event t codes for whether or not each time point occurred before or after the event (0 for all points prior to the event; 1 for all points after). Its coefficient, b 2 , assesses the post-event baseline level (intercept). The variable t after event represents how many units after the event the observation took place (0 for all points prior to the event; 1, 2, 3 … for subsequent time points), and its coefficient, b 3 , estimates the change in trend over the two segments. Therefore, the sum of the pre-event trend ( b 1 ) and its estimated change ( b 3 ) yields the post-event slope ( Wagner et al., 2002 , p. 301).

Importantly, this analysis requires that the time of event occurrence be specified a priori, otherwise a researcher may search the series in an “exploratory” fashion and discover a time point that yields a notable effect, resulting in potentially spurious results ( McCleary et al., 1980 , p. 143). In our example, the event of interest was the economic crisis of 2008. However, as is often the case when analyzing large-scale social phenomena, it was not a discrete, singular incident, but rather unfolded over time. Thus, no exact point in time can perfectly represent its moment of occurrence. In other topics of psychological research, the event of interest is a unique post-event time may be identified. Although interrupted time series analysis requires that events be discrete, this conceptual problem can be easily managed in practice; selecting a point of demarcation that generally reflects when the event occurred will still allow the statistical model to assess the impact of the event on the level and trend of the series. Therefore, due to prior theory and for simplicity, we specified the pre- and post-crisis segments to be separated at January 2008, representing the beginning of the economic crisis and acknowledging that this demarcation was imperfect, but one that would still allow the substantive research question of interest to be answered.

Although not utilized in our analysis, when analyzing an interrupted time series using segmented regression one has the option of actually specifying the post-event segment after the actual event occurred. The rationale behind this is to accommodate the time it takes for the causal effect of the event itself manifest in the time series—the equilibration period (see Mitchell and James, 2001 , p. 539; Wagner et al., 2002 , p. 300). Although an equilibration period is likely a component of all causal phenomena (i.e., causal effects probably never fully manifest at once), two prior reviews have illustrated that researchers account for it only infrequently, both theoretically and empirically ( Kelly and McGrath, 1988 ; Mitchell and James, 2001 ). Statistically, this is accomplished through the segmented regression model above, but simply coding the event as occurring later in the series. Comparing models with different post-event start times can also allow competitive tests of the equilibration period.

Empirical Example

For our working example, a segmented regression model was fit to the seasonally adjusted Google time series: A linear trend estimated the first segment and a quadratic trend was fit to the second due to the noted curvilinear form of the second half of the series. Thus, a new variable and coefficient were added to the formal model to account for this non-linearity: t after event 2 and b 4 , respectively. The results of the analysis indicated that there was a practically significant effect of the crisis: The parameter representing an immediate change in the post-event level was b 2 = 8.66, p < 0.001. Although the level (i.e., intercept) differed across segments, the post-crisis trend appears to be the most notable change in the series. That is, the real effect of the crisis unfolded over time rather than having an immediately abrupt impact. This is reflected in the other coefficients of the model: The pre-crisis trend was estimated to be near zero ( b 1 = −0.03, p = 0.44), and the post-crisis trend terms were b 3 = 0.70, p < 0.001 for the linear component, and b 4 = −0.02, p < 0.001 for the quadratic term, indicating that there was a marked change in trend, but also that it was concave (i.e., on the whole, slowly decreasing over time). Graphically the model seems to capture the underlying trend of both segments exceptionally well ( R 2 = 0.87, p < 0.001), as the residual error series has almost reached stationarity ( ADF = −3.38, p = 0.06). Both are shown in Figure 7 below.

www.frontiersin.org

Figure 7. A segmented regression model used to assess the effect of the 2008 economic crisis on the time series and its associated residual error series .

Estimating Seasonal Effects

Up until now, we have chosen to remove any seasonal effects by working with the seasonally adjusted time series in order to more fully investigate a trend of substantive interest. This was consistent with the following adage of time series modeling: When a systematic trend or seasonal pattern is present, it must either be modeled or removed. However, psychological researchers may also be interested in the presence and nature of a seasonal effect, and seasonal adjustment would only serve to remove this component of interest. Seasonality was defined earlier as any regular pattern of fluctuation (i.e., movement up or down in the level of the series) associated with some aspect of the calendar. For instance, although online job searchers exhibited an underlying trend in our data across years, they also display the same pattern of movement within each year (i.e., across months; see Figure 1 ). Following the need for more time-based theory and empirical research, seasonal effects are also increasingly recognized as significant for psychological science. In a recent conceptual review Dalal et al. (2014) noted that, “mood cycles… are likely to occur simultaneously over the course of a day (relatively short term) and over the course of a year (long term)” (p. 1401). Relatedly, Larsen and Kasimatis (1990) used time series methods to examine the stability of mood fluctuations across individuals. They uncovered a regular weekly fluctuation that was stronger for introverted individuals than for extraverts (due to the latter's sensation-seeking behavior that resulted in greater mood variability).

Furthermore, many systems of interest exhibit rhythmicity. This can be readily observed across a broad spectrum of phenomena that are of interest to psychological researchers. At the individual level, there is a long history in biopsychology exploring the cyclical pattern of human behavior as a function of biological processes. Prior research has consistently shown that humans possess many common physiological and behavioral cycles that range from 90-min to 365-days ( Aschoff, 1984 ; Almagor and Ehrlich, 1990 ) and may affect important psychological outcomes. For instance, circadian rhythms are particularly well-known and are associated with physical, mental, and behavioral changes within a 24-h period ( McGrath and Rotchford, 1983 ). It has been suggested that peak motivation levels may occur at specific points in the day ( George and Jones, 2000 ), and longer cyclical fluctuations of emotion, sensitivity, intelligence, and physical characteristics over days and weeks have been identified (for a review, see Conroy and Mills, 1970 ; Luce, 1970 ; Almagor and Ehrlich, 1990 ). Such cycles have been found to affect intelligence test performance and other physical and cognitive tasks (e.g., Latman, 1977 ; Kumari and Corr, 1996 ).

Regression with Seasonal Indicators

As previously stated, when seasonal effects are theoretically important, seasonal adjustment is undesirable because it removes the time series component pertinent to the research question at large. An alternative is to qualitatively describe the seasonal pattern or formally specify a regression model that includes a variable which estimates the effect of each season. If a simple linear approximation is used for the trend, the formal model can be expressed as:

where b 0 is now the estimate of the linear relationship between the dependent variable and time, and the coefficients b 1:S are estimates of the S seasonal effects (e.g., S = 12 for yearly data; Cowpertwait and Metcalfe, 2009 , p. 100). Put more intuitively, this model can still be conceived of as a linear model but with a different estimated intercept for each season that represents its effect (Notice that the b 1:S parameters are not coefficients but constants).

As an example, the model above was fit to the original, non-seasonally adjusted Google data. Although modeling the series as a linear function of time was found to produce inaccurate forecasts, it can be used when estimating seasonal effects because this component of the model does not affect the estimates of the seasonal effects. For our data, the estimates of each monthly effect were: b 1 = 67.51, b 2 = 59.43, b 3 = 60.11, b 4 = 60.66, b 5 = 63.59, b 6 = 66.77, b 7 = 63.70, b 8 = 62.38, b 9 = 60.49, b 10 = 56.88, b 11 = 52.13, b 12 = 45.66 (Each effect was statistically significant at p < 0.001). The pattern of these intercepts mirrors the pattern of movement qualitatively described in the discussion on the seasonal component: Online job search behaviors begin at its highest levels in January ( b 1 = 67.51), likely due to the end of holiday employment, and then dropped significantly in February ( b 2 = 59.43). Subsequently, its level continued to rise during the next 4 months until June ( b 6 = 66.77), after which the series decreased each successive month until reaching its lowest point in December ( b 12 = 45.66).

Harmonic Seasonal Models

Another approach to modeling seasonal effects is to fit a harmonic seasonal model that uses sine and cosine functions to describe the pattern of fluctuations seen across periods. Seasonal effects often vary in a smooth, continuous fashion, and instead of estimating a discrete intercept for each season, this approach can provide a more realistic model of seasonal change (see Cowpertwait and Metcalfe, 2009 , pp. 101–108). Formally, the model is:

where m t is the estimate of the trend at t (approximated as a linear or polynomial function of time), s i and c i are the unknown parameters of interest, S is the number of seasons within the time period (e.g., 12 months for a yearly period), i is an index that ranges from 1 to S/2 , and t is a variable that is coded to represent time (e.g., 1:90 for 90 equally-spaced observations). Although this model is complex, it can be conceived as including a predictor for each season that contains a sine and/or cosine term. For yearly data, this means that six s and six c coefficients estimate the seasonal pattern ( S/2 coefficients for each parameter type). Importantly, after this initial model is estimated, the coefficients that are not statistically significant can be dropped, which often results in fewer parameters relative to the seasonal indicator model introduced first ( Cowpertwait and Metcalfe, 2009 , p. 104). For our data, the above model was fit using a linear approximation for the trend, and five of the original twelve seasonal coefficients were statistically significant and thus retained: c 1 = −5.08, p < 0.001, s 2 = 2.85, p = 0.005, s 3 = 2.68, p = 0.009, c 3 = −2.25, p = 0.03, c 5 = −2.97, p = 0.004. This model also explained a substantial amount of the series variance ( R 2 = 0.75, p < 0.001). Pre-made and annotated R code for this analysis can be found in the Supplementary Material.

Time Series Forecasting: ARIMA ( p, d, q ) Modeling

In the preceding section, a number of descriptive and explanatory regression models were introduced that addressed various topics relevant to psychological research. First, we sought to determine how the trend in the series could be best described as a function of time. Three models were fit to the data, and modeling the trend as a cubic function provided the best fit: It was the most parsimonious model that explained a very large amount of variation in the series, it did not systematically over or underestimate many successive observations, and any potential forecasts were clearly superior relative to those of the simpler linear and quadratic models. In the subsequent section, a segmented regression analysis was conducted in order to examine the impact of the 2008 economic crisis on job search behavior. It was found that there was both a significant immediate increase in the baseline level of the series (intercept) and a concomitant increase in its trend (i.e., slope) that gradually decreased over time. Finally, the seasonal effects of online search behavior were estimated and mirrored the pattern of job employment rates described in a prior section.

From these analyses, it can be seen that the main features of many times series are the trend and seasonal components that must either be modeled as deterministic functions of predictors or removed from the series. However, as previously described, another critical feature in time series data is its autocorrelation , and a large portion of time series methodology is aimed at explaining this component ( Dettling, 2013 , p. 2). Primarily, accounting for autocorrelation entails fitting an ARIMA model to the original series, or adding ARIMA terms to a previously fit regression model; ARIMA models are the most general class of models that seek to explain the autocorrelation frequently found in time series data ( Hyndman and Athanasopoulos, 2014 ). Without these terms, a regression model will ignore the pattern of autocorrelation among the residuals and produce less accurate forecasts ( Hyndman and Athanasopoulos, 2014 ). Therefore, ARIMA models are predictive forecasting models . Time series models that include both regression and ARIMA terms are referred to as dynamic models and may be a primary type of time series models used by psychological researchers.

Although not strongly emphasized within psychological science, forecasting is an important aspect of scientific verification ( Popper, 1968 ). Standard cross-sectional and longitudinal models are generally used in an explanatory fashion (e.g., estimating the relationships among constructs and testing null hypotheses), but they are quite capable of prediction as well. Because of the ostensible movement to more time-based empirical research and theory, predicting future values will likely become a more important aspect of statistical modeling, as it can validate psychological theory ( Weiss and Cropanzano, 1996 ) and computational models ( Tobias, 2009 ) that specify effects over time.

At the outset, it is helpful to note that the regression and ARIMA modeling approaches are not substantially different: They both formalize the variation in the time series variable as a function of predictors and some stochastic noise (i.e., the error term). The only practical difference is that while regression models are generally built from prior research or theory, ARIMA models are developed empirically from the data (as will be seen presently; McCleary et al., 1980 , p. 20). In describing ARIMA modeling, the following sections take the form of those discussing regression methods: Conceptual and mathematical treatments are provided in complement in order to provide the reader with a more holistic understanding of these methodologies.

Introduction

The first step in ARIMA modeling is to visually examine a plot of the series' ACF (autocorrelation function) to see if there is any autocorrelation present that can be used to improve the regression model—or else the analyst may end up adding unnecessary terms. The ACF for the Google data is shown in Figure 3 . Again, we will work with the seasonally adjusted series for simplicity. More formally, if a regression model has been fit, the Durbin–Watson test can be used to assess if there is autocorrelation among the residuals and if ARIMA terms can be included to improve its forecasts. The Durbin–Watson test tests the null hypothesis that there is no lag-1 autocorrelation present in the residuals. Thus, a rejection of the null means that ARIMA terms can be included (the Ljung–Box test described below can also be used; Hyndman and Athanasopoulos, 2014 ).

Although the modeling techniques described in the present and following sections can be applied to any one of these models, due to space constraints we continue the tutorial on time series modeling using the cubic model of the first section. A model with only one predictor (viz., time) will allow more focus on the additional model terms that will be added to account for the autocorrelation in the data.

I( d ): integrated

ARIMA is an acronym formed by the three constituent parts of these models. The AR( p ) and MA( q ) components are predictors that explain the autocorrelation. In contrast, the integrated (I[ d ]) portion of ARIMA models does not add predictors to the forecasting equation. Rather, it indicates the order of differencing that has been applied to the time series in order to remove any trend in the data and render it stationary. Before any AR or MA terms can be included, the series must be stationary . Thus, ARIMA models allow non-stationary series to be modeled due to this “integrated” component (an advantage over simpler ARMA models that do not include such terms; Cowpertwait and Metcalfe, 2009 , p. 137). A time series that has been made stationary by taking the d difference of the original series is notated as I( d ). For instance, an I(1) model indicates that the series that has been made stationary by taking its first differences, I(2), by the second differences (i.e., the first differences of the first differences), etc. Thus, the order of integrated terms in an ARIMA model merely specifies how many iterations of differencing were performed in order to make the series stationary so that AR and MA terms may be included.

Identifying the Order of Differencing

Identifying the appropriate order of differencing to stationarize the series is the first and perhaps most important step in selecting an ARIMA model ( Nua, 2014 ). It is also relatively straightforward. As stated previously, the order of differencing rarely needs to be greater than two in order to stationarize the series. Therefore, in practice the choice comes down to whether the series is transformed into either its first or second differences, the optimal choice being the order of differencing that results in the lowest series variance (and does not result in an increase in variance that characterizes overdifferencing).

AR( p ): Autoregressive Terms

The first part of an ARIMA model is the AR( p ) component, which stands for autoregressive . As correlation is to regression, autocorrelation is to autoregression. That is, in regression, variables that are correlated with the criterion can be used for prediction, and the model specifies the criterion as a function of the predictors. Similarly, with a variable that is autocorrelated (i.e., correlated with itself across time periods), past values can serve as predictors, and the values of the time series are modeled as a function of previous values (thus, autoregression ). In other words, an ARIMA ( p, d, q ) model with p AR terms is simply a linear regression of the time series values against the preceding p observations. Thus, an ARIMA(1, d, q ) model includes one predictor, the observation immediately preceding the current value, and an ARIMA(2, d, q ) model includes two predictors, the first and second preceding observations. The number of these autoregressive terms is called the order of the AR component of the ARIMA model. The following equation uses one AR term (an AR[1] model) in which the preceding value in the time series is used as a regressor:

where ϕ is the autoregressive coefficient (interpretable as a regression coefficient), and y t−1 is the immediately preceding observation. More generally, a model with AR( p ) terms is expressed as:

Selecting the Number of Autoregressive Terms

The number of autoregressive terms required depends on how many lagged observations explain a significant amount of unique autocorrelation in the time series. Again, an analogy can be made to multiple linear regression: Each predictor should account for a significant amount of variance after controlling for the others. However, a significant autocorrelation at higher lags may be attributable to an autocorrelation at a lower lag. For instance, if a strong autocorrelation exists at lag-1, then a significant lag-3 autocorrelation (i.e., a correlation of time t with t -3) may be a result of t being correlated with t -1, t -1 with t -2, and t -2 with t -3 (and so forth). That is, a strong autocorrelation at an early lag can “persist” throughout the time series, inducing significant autocorrelations at higher lags. Therefore, instead of inspecting the ACF which displays zero-order autocorrelations, a plot of the partial autocorrelation function (PACF) across different lags is the primary method in determining which prior observations explain a significant amount of unique autocorrelation, and accordingly, how many AR terms (i.e., lagged observations as predictors) should be included. Put simply, the PACF displays the autocorrelation of each lag after controlling for the autocorrelation due to all preceding lags ( McCleary et al., 1980 , p. 75). A conventional rule is that if there is a sharp drop in the PACF after p lags, then the previous p -values are responsible for the autocorrelation in the series, and the model should include p autoregressive terms (the partial autocorrelation coefficient typically being the value of the autoregressive coefficient, ϕ; Cowpertwait and Metcalfe, 2009 , p. 81). Additionally, the ACF of such a series will gradually decay (i.e., reduce) toward zero as the lag increases.

Applying this knowledge to the empirical example, Figure 3 depicted the ACF of the seasonally adjusted Google time series, and Figure 8 displays its PACF. Here, only one lagged partial autocorrelation is statistically significant (lag-6), despite over a dozen autocorrelations in the ACF reaching significance. Thus, it is probable that early lags—and the lag-6 in particular—are responsible for the chain of autocorrelation that persists throughout the series. Although the series is considerably non-stationary (i.e., there is a marked trend and seasonal component), if the series was already stationary, then a model with a single AR term (an AR[1] model) would likely provide the best fit, given a single significant partial autocorrelation at lag-6. The ACF in Figure 3 also displays the characteristics of an AR(1) series: It has many significant autocorrelations that gradually reduce toward zero. This coheres with the notion that one AR term is often sufficient for a residual time series ( Cowpertwait and Metcalfe, 2009 , p. 121). However, if the pattern of autocorrelation is more complex, then additional AR terms may be required. Importantly, if a particular number of AR terms have been successful in explaining the autocorrelation of a stationary series, the residual error series should appear as entirely random white noise (as in Figure 4 ).

www.frontiersin.org

Figure 8. A plot of the partial autocorrelation function (PACF) of the seasonally adjusted time series of Google job searches .

MA( q ): Moving Average Terms

In the preceding section, it was shown that one can account for the autocorrelation in the data by regressing on prior values in the series (AR terms). However, sometimes the autocorrelation is more easily explained by the inclusion of MA terms; the use of MA terms to explain the autocorrelation—either on their own or in combination with AR components—can result in greater parameter parsimony (i.e., fewer parameters), relative to relying solely on AR terms ( Cowpertwait and Metcalfe, 2009 , p. 127). As noted above, ARIMA models assume that any systematic components have either been modeled or removed and that the time series is stationary—i.e., a stochastic process. In time series theory, the values of stochastic processes are determined by two forces: prior values, described in the preceding section, and random shocks (i.e., errors; McCleary et al., 1980 , pp. 18–19). Random shocks are the myriad variables that vary across time and interact with such complexity that their behavior is ostensibly random (e.g., white noise; McCleary et al., 1980 , p. 40). Each shock can be conceived of as an unobserved value at each point in time that influences each observed value of the time series. Thus, autocorrelation in the data may be explained by the persistence of prior values (or outputs , as in AR terms) or, alternatively, the lingering effects of prior unobserved shocks (i.e., the inputs , in MA terms). Therefore, if prior random shocks are related to the value of the series, then these can be included in the prediction equation to explain the autocorrelation and improve the efficiency of the forecasts generated by the model. In other words, just as AR terms can be conceived as a linear regression on previous time series values, MA terms are conceptually a linear regression of the current value of the series against prior random shocks. For instance, an MA(1) model can be expressed as:

where ε t is the value of the random shock at time t, ε t− 1 is the value of the previous random shock, and θ is its coefficient (again, interpretable as a regression coefficient). More generally, the order of MA terms is conventionally denoted as q , and an MA( q ) model can be expressed as:

Selecting the Number of MA Terms

Selecting the number of MA terms in the model is conceptually similar to the process of identifying the number of AR terms: One examines plots of the autocorrelation (ACF) and partial autocorrelation functions (PACF) and then specifies an appropriate model. However, while the number of AR terms could be identified by the PACF of the series (more specifically, the point at which the PACF dropped), the number of appropriate MA terms is usually identified by the ACF. Specifically, if the ACF is non-zero for the first q lags and then drops toward zero, then q MA terms should be included in the model ( McCleary et al., 1980 , p. 79). All successive lags of the ACF are expected to be zero, and the PACF of such a series will be gradually decaying ( McCleary et al., 1980 , p. 79). Thus, relative to AR terms, the roles of the ACF and PACF are essentially reversed when determining the number of MA terms. Furthermore, in practice most social processes can be sufficiently modeled by a single MA term; models of order q = 2 are less common, and higher-order models are extremely rare ( McCleary et al., 1980 , p. 63).

Model Building and Further Notes on ARIMA ( p, d, q ) Models

The components of ARIMA models—autoregressive, integrated, and moving average—are aimed at explaining the autocorrelation in a series that is either stationary or can be made so through differencing (i.e., I[ d ] integrated terms). Though already stated, the importance of the following point warrants reiteration: After a successful ARIMA( p, d, q ) model has been fit to the autocorrelated data, the residual error series should be a white noise series. That is, after a good-fitting model has been specified, the residual error series should not display any significant autocorrelations, have a mean of zero, and some constant variance; i.e., there should be no remaining signal that can be used to improve the model's forecasts. Thus, after specifying a particular model, visual inspection of the ACF and PACF of the error series is critical in order to assess model adequacy ( McCleary et al., 1980 , p. 93). All autocorrelations are expected to be zero with 5% expected to be statistically significant due to sampling error.

Furthermore, just as there are formal methods to test that a series is stationary before fitting an ARIMA model, there are also statistical tests for the presence of autocorrelation after the model has been fit. The Ljung–Box test ( Ljung and Box, 1978 ) is one commonly-applied method in which the null hypothesis is that the errors are uncorrelated across many lags ( Cryer and Chan, 2008 , p. 184; Hyndman and Athanasopoulos, 2014 ). Thus, failing to reject the null provides evidence that the model has succeeded in explaining the remaining autocorrelation in the data.

If both formal and informal methods indicate that the residual error series is not a series of white noise terms (i.e., there is remaining autocorrelation), then the analyst must reassess the pattern of autocorrelation and re-specify a new model. Thus, in contrast to regression approaches, ARIMA modeling is an exploratory, iterative process in which the data is examined, models are specified, checked for adequacy, and then re-specified as needed. However, selecting the most appropriate order of AR, I, and MA terms can prove to be a difficult process ( Hyndman and Athanasopoulos, 2014 ). Fortunately, model comparison can be easily performed by comparing the Akaike information criterion (AIC) across models ( Akaike, 1974 ) 7 . This statistic is based on the fit of a model and its number of parameters, and models with lower values should be selected. Generally, models within two AIC values are considered comparable, a difference of 4–7 points indicates considerable support for the better-fitting model, and a difference of 10 points or greater signifies full support of that model ( Burnham and Anderson, 2004 , p. 271). Additionally, the “forecast” R package ( Hyndman, 2014 ) contains a function to automatically derive the best-fitting ARIMA model based on the AIC or other fit criteria (see Hyndman and Khandakar, 2008 ). This procedure is discussed in the Supplementary Material.

Furthermore, a particular pattern of autocorrelation can often be explained by either AR or MA terms. Generally, AR terms are preferable to MA terms because their interpretation of these parameters is more straightforward (e.g., the regression coefficient associated with a previous time series value rather than a coefficient associated with an unobserved random shock). However, a more central concern is parameter parsimony; if a model using MA terms (or a combination of AR and MA terms) can explain the autocorrelation with fewer parameters than one that relies solely on AR terms, then these models are generally preferable.

Finally, although a mixed ARIMA model containing both AR and MA terms can result in greater parameter parsimony ( Cowpertwait and Metcalfe, 2009 , p. 127), in practice, non-mixed models (i.e., those with either with AR or MA terms alone) should always be ruled out prior to fitting these more complex models ( McCleary et al., 1980 , p. 66). Unnecessary model complexity (i.e., redundant parameters) may not become evident at all during the process of model checking, while the inadequacy of simpler models is often easily identified (e.g., noticeable remaining autocorrelation in the ACF plot).

Fitting a Dynamic Regression Model with ARIMA Terms

In this final section, we illustrate how a predictive ARIMA approach to time series modeling can be combined with regression methods through specification of a dynamic regression model. These models can be fit to the data in order to generate accurate forecasts, as well as explain or examine an underlying trend or seasonal effect (as opposed to their removal). We then analyze the predictions from this model and discuss methods of assessing forecast accuracy. For simplicity, we continue with the regression model that modeled the series as a cubic function of time.

Preliminaries

When the predictor is time, one should begin specification of a dynamic regression model by first examining the residual error series after the regression model has been fit. This is done in order to first detect if there is any autocorrelation in the model residuals that would warrant the inclusion of ARIMA terms. The residual error series are of interest because a dynamic regression model can be thought of as a hybrid model that includes a correction for autocorrelated errors. That is, whatever the regression model does not account for (trend, autocorrelation, etc.) can be supplemented by ARIMA modeling. Analytically, this is performed by re-specifying the initial regression model as an ARIMA model with regressors (sometimes called an “ARIMAX” model, the “X” denoting external predictors) and selecting the appropriate order of ARIMA terms that fit the autocorrelation structure in the residuals ( Nua, 2014 ).

Identifying the Order of Differencing: I( d ) Terms

As noted previously, the residual error series of the cubic regression model exhibited a remaining trend and autocorrelation (see Figure 6 ). A significant Durbin–Watson test formally confirms this is the case (i.e., the error terms are not uncorrelated; DW = 0.47, p < 0.001). Thus, ARIMA terms are necessary to (a) stationarize the series (I[ d ] terms) and (b) generate more accurate forecasts (AR[ p ] and/or MA[ q ] terms). As stated above, the conventional first step when formulating an ARIMA is determining the number of I( d ) terms (i.e., order of differencing) required to remove any remaining trend and render the series stationary. We note that, in this case, the systematic seasonal effects have already been removed through seasonal adjustment. It was previously noted that in practice, removing a trend is accomplished almost always by taking either the first or second differences—whichever transformation results in the lowest variance and avoids overdifferencing (i.e., an increase in the series variance). Because the residual trend does not have a markedly changing slope, it is likely that only one order of differencing will be required. The results indicate that this is indeed the case: After first differencing, the series variance is reduced from 13.56 to 6.45, and an augmented Dickey–Fuller test rejects the null hypothesis of a non-stationary series ( ADF = −4.50, p < 0.01). Taking the second differences also results in stationarity (i.e., the trend is removed), but leads to an overdifferenced series with a variance that is inflated to a level higher than the original error series ( s 2 = 14.90).

Identification of AR( p ) and MA( q ) Terms

After the order of I( d ) terms has been identified (here, 1), the next step is to determine whether the pattern of autocorrelation can be better explained by AR terms, MA terms, or a combination of both. As noted, AR terms are often preferred to MA terms because their interpretation is more straightforward, and simpler models with either AR or MA terms are preferable to mixed models. We therefore begin by examining plots of the ACF and PACF for the residual error series shown Figure 9 in order to see if they display either an AR or MA “signature” (e.g., drop-offs or slow decays).

www.frontiersin.org

Figure 9. ACF and PACF of the cubic model residuals used to determine the number of AR and MA terms in an ARIMA model .

From Figure 9 , we can see that there are many high autocorrelations in the ACF plot that slowly decay, indicative that AR terms are probably most suitable (A sharp drop in the ACF would indicate that the autocorrelation is probably better explained by MA terms). As stated earlier, the PACF gives the autocorrelation for a lag after controlling for all earlier lags; a significant drop in the PACF at a particular lag indicates that this lagged value is largely responsible for the large zero-order autocorrelations in the ACF. Based on this PACF, the number of terms to include is less clear; aside from the lag-0 autocorrelation, there is no perceptible drop-off in the PACF, and there are no strong partial autocorrelations to attribute the persistence of the autocorrelation seen in the ACF. However, we know that there is autocorrelation in the model residuals, and that either one or two AR terms are typically sufficient for accounting for any autocorrelation ( Cowpertwait and Metcalfe, 2009 , p. 121). Therefore, we suspect that a single AR term can account for it. After fitting an ARIMA (1, 1, 0) model, a failure to reject the null hypothesis in a Ljung–Box test indicated that the model residuals were indistinguishable from a random white noise series ( χ 2 = 0.005, p = 0.94), and less than 5% of the autocorrelations in the ACF were statistically significant (The AIC of this model was 419.80). For illustrative purposes, several other models were fit to the data that either included additional AR or MA terms, or a combination of both. Their relative fit was analyzed and the results are shown in Table 2 . As can be seen, the ARIMA (1, 1, 0) model provided a level of fit that exceeded all of the other models (i.e., the smallest AIC difference among models was 4, showing considerable support). Thus, this model parsimoniously accounted for the systematic trend through a combination of regression modeling and first differencing and successfully extracted all the autocorrelation (i.e., signal) from the data in order to achieve more efficient forecasts.

www.frontiersin.org

Table 2. Comparison of different ARIMA models .

Forecasting Methods and Diagnostics

Because forecasts into the future cannot be directly assessed for accuracy until the actual values are observed, it is important that the analyst establish the adequacy of the model prior to forecasting. To do this, the analyst can partition the data into two parts: the estimation period , comprising about 80% of the initial observations and used to estimate the model parameters, and the validation period , usually about 20% of the data and used to ensure that the model predictions are accurate. These percentages may shift depending on the length of the series (see Nua, 2014 ), but the size of the validation period should at least equal the number of periods ahead the analyst wishes to forecast ( Hyndman and Athanasopoulos, 2014 ). The predictions generated by the model are then compared to the observed data in the validation period to assess their accuracy. Evaluating forecast accuracy is accomplished by examining the residuals for any systematic patterns of misspecification. Forecasts should ideally be located within the 95% confidence limits, and formal statistics can be calculated from the model residuals in order to evaluate its adequacy. A popular and intuitive statistic is the mean absolute error (MAE): the average absolute deviation from the predicted values. However, this value cannot be used to compare models, as it is scale-dependent (e.g., a residual with an absolute value of 10 is much less egregious when forecasting from a series whose mean is 10,000 relative to a series with a mean of 10). Another statistic, the mean absolute percentage error (MAPE) is useful for comparing across models and is defined as the average percentage that the forecasts deviated from the observed values. Other methods and statistics, such as the root mean squared error (RMSE) and the mean absolute scaled error (MASE) can aid model evaluation and selection and are accessibly discussed by Hyndman and Athanasopoulos (2014 , chap. 2). Once a forecasting model has been deemed sufficiently accurate through these methods, forecasts into the future can then be calculated with relative confidence.

Because we have the benefit of hindsight in our example, all observations were used for estimation, and six forecasts were generated for the remainder of the 2011 year and compared to the actual observed values. The point forecasts (blue line), 80%, and 95% confidence limits are displayed in Figure 10 juxtaposed against the actual values in red. As can be seen, this forecasting model is generally successful: Each observed value lies within the 80% limits, and the residuals have a low mean absolute error ( MAE = 2.03) relative to the series mean ( M = 75.47), as well as a low mean absolute percentage error ( MAPE = 2.33). Additional statistics verified the accuracy of these predictions, and the full results of the analysis can be obtained from the first author.

www.frontiersin.org

Figure 10. Forecasts from the dynamic regression model compared to the observed values . The blue line represents the forecasts, and the red dotted line indicates the observed values. The darker gray region denotes the 80% confidence region, the lighter gray, 90%.

As a final note on ARIMA modeling, if the sole goal of the analysis is to produce accurate forecasts, then the seasonal and trend components represent a priori barriers to this goal and should be removed through seasonal adjustment and the I( d ) terms of an appropriate ARIMA model, respectively. Such predictive models are often easier to implement, as there are no systematic components of interest to describe or estimate; they are simply removed through transformations in order to achieve a stationary series. Finally, we close this section with two tables. The first, Table 3 , compiles the general steps involved in ARIMA time series modeling described above, from selecting the optimal order of ARIMA terms to assessing forecast accuracy. The second, Table 4 , provides a reference for the various time series terms introduced in the current paper.

www.frontiersin.org

Table 3. Steps for specifying an ARIMA forecasting model .

www.frontiersin.org

Table 4. Glossary of time series terms .

Addendum: Further Time Series Techniques and Resources

Finally, because time series analysis contains a wide range of analytic techniques, there was not room to cover them all here (or in any introductory article for that matter). For a discussion of computing correlations between time series (i.e., the cross-correlation function), the reader is directed to McCleary et al. (1980) . For a general introduction to regression modeling, Cowpertwait and Metcalfe (2009) and Ostrom (1990) have excellent discussions, the latter describing the process of identifying lagged effects. For a highly accessible exposition of identifying and cycles or seasonal effects within the data through periodogram and spectral analysis , the reader should consult Warner (1998) , a social scientist-based text which also describes cross-spectral analysis , a method for assessing how well cycles within two series align. For regression modeling using other time series as substantive predictors, the analyst can use transfer function or dynamic regression modeling and is referred to Pankratz (1991) and Shumway and Stoffer (2006) for further reading. For additional information on forecasting with ARIMA models and other methods, we refer the reader to Hyndman and Athanasopoulos (2014) and McCleary et al. (1980) . Finally, multivariate time series analysis can model reciprocal causal relations among time series in a modeling technique called vector ARMA models, and for discussions we recommend Liu (1986) , Wei (2006) , and the introduction in Pankratz (1991 , chap. 10). Future work should attempt to incorporate these analytic frameworks within psychological research, as the analysis of time series brings in a host of complex issues (e.g., detecting cycles, guarding against spurious regression and correlation) that must be handled appropriately for proper data analysis and the development of psychological theory.

Time series analysis has proved to be integral for many disciplines over many decades. As time series data becomes more accessible to psychologists, these methods will be increasingly central to addressing substantive research questions in psychology as well. Indeed, we believe that such shifts have already started and that at an introduction to time series data is substantially important. By integrating time-series methodologies within psychological research, scholars will be impelled to think about how variables at various psychological levels may exhibit trends, cyclical or seasonal patterns, or a dependence on prior states (i.e., autocorrelation). Furthermore, when examining the influence of salient events or “shocks,” essential questions, such as “What was the pre-event trend?” and “How long did its effects endure, and what was its trajectory?” will become natural extensions. In other words, researchers will think in an increasingly longitudinal manner and will possess the necessary statistical knowledge to answer any resulting research questions—the importance of which was demonstrated above.

The ultimate goal of this introductory paper is to foster such fruitful lines of conceptualizing research. The more proximal goal is to provide an accessible yet comprehensive exposition of a number of time series modeling techniques fit for addressing a wide range of research questions. These models were based in descriptive, explanatory, and predictive frameworks—all three of which are necessary to accommodate the complex, dynamic nature of psychological theory and its data.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.00727/abstract

1. ^ These journals were: Psychological Review, Psychological Bulletin, Journal of Personality and Social Psychology, Journal of Abnormal Psychology, Cognition, American Psychologist, Journal of Applied Psychology, Psychological Science, Perspectives on Psychological Science, Current Directions in Psychological Science, Journal of Experimental Psychology: General, Cognitive Psychology, Trends in Cognitive Sciences, Personnel Psychology, and Frontiers in Psychology .

2. ^ The specific search term was, “jobs – Steve Jobs” which excluded the popular search phrase “Steve Jobs” that would have otherwise unduly influenced the data.

3. ^ Thus, the highest value in the series must be set at 100—i.e., 100% of itself. Furthermore, although measuring a variable in terms of percentages can be misleading when assessing practical significance (e.g., a change from 1 to 4 yields a 400% increase, but may not be a large change in practice), the presumably large raw numbers of searches that include the term “jobs” entail that even a single point increase or decrease in the data is notable.

4. ^ In addition to the two classical models (additive and multiplicative) described above, there are further techniques for time series decomposition that lie beyond the scope of this introduction (e.g., STL or X-12-ARIMA decomposition). These overcome the known shortcomings of classical decomposition (e.g., the first and last several estimates of the trend component are not calculated; Hyndman and Athanasopoulos, 2014 ) which still remains the most commonly used method for time series decomposition. For information regarding these alternative methods the reader is directed to Cowpertwait and Metcalfe (2009 , pp. 19–22) and Hyndman and Athanasopoulos (2014 , chap. 6).

5. ^ Importantly, the current paper discusses dynamic models that specify time as the regressor (either as a linear or polynomial function). For modeling substantive predictors, more sophisticated techniques are necessary, and the reader is directed to Pankratz (1991) for a description of this method.

6. ^ Just like in traditional regression, the parent term t is centered before creating the polynomial term in order to mitigate collinearity.

7. ^ The use of additional fit indices, such as the AIC c (a variant of the AIC for small samples) and Bayesian information criterion (BIC) is also recommended, but we focus on the AIC here for simplicity.

Aguinis, H., Gottfredson, R. K., and Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Psychol. Res. Methods 16, 270–301. doi: 10.1177/1094428112470848

CrossRef Full Text | Google Scholar

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Contr . 19, 716–723. doi: 10.1109/TAC.1974.1100705

Almagor, M., and Ehrlich, S. (1990). Personality correlates and cyclicity in positive and negative affect. Psychol. Rep . 66, 1159–1169.

PubMed Abstract | Google Scholar

Anderson, O. (1976). Time Series Analysis and Forecasting: The Box-Jenkins Approach . London: Butterworths.

Google Scholar

Aschoff, J. (1984). “A survey of biological rhythms,” in Handbook of Behavioral Neurobiology, Vol. 4, Biological Rhythms , ed J. Aschoff (New York, NY: Plenum), 3–10.

Beer, M., and Walton, A. E. (1987). Organization change and development. Annu. Rev. Psychol . 38, 339–367.

Bell, W. R., and Hillmer, S. C. (1984). Issues involved with the seasonal adjustment of time series. J. Bus. Econ. Stat . 2, 291–320. doi: 10.2307/1391266

Bolger, N., DeLongis, A, Kessler, R. C., and Schilling, E. A. (1989). Effects of daily stress on negative mood. J. Pers. Soc. Psychol . 57, 808–818.

Burnham, K., and Anderson, D. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res . 33, 261–304. doi: 10.1177/0049124104268644

Busk, P. L., and Marascuilo, L. A. (1988). Autocorrelation in single-subject research: a counterargument to the myth of no autocorrelation. Behav. Assess . 10, 229–242.

Carayon, P. (1995). “Chronic effect of job control, supervisor social support, and work pressure on office worker stress,” in Organizational Risk Factors for Job Stress , eds S. L. Sauter and L. R. Murphy (Washington, DC: American Psychological Association), 357–370.

Chatfield, C. (2004). The Analysis of Time Series: An Introduction, 6th Edn . New York, NY: Chapman and Hall/CRC.

Conroy, R. T., and Mills, W. L. (1970). Human Circadian Rhythms . Baltimore, MD: The Williams and Wilkins Company.

Cook, T. D., and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings . Boston, MA: Houghton Mifflin.

Cowpertwait, P. S., and Metcalfe, A. (2009). Introductory Time Series with R . New York, NY: Springer-Verlag.

Cryer, J. D., and Chan, K.-S. (2008). Time Series Analysis: With Applications in R, 2nd Edn . New York, NY: Springer.

Dalal, R. S., Bhave, D. P., and Fiset, J. (2014). Within-person variability in job performance: A theoretical review and research agenda. J. Manag . 40, 1396–1436. doi: 10.1177/0149206314532691

Dettling, M. (2013). Applied Time Series Analysis [PDF Document] . Available online at: http://stat.ethz.ch/education/semesters/ss2012/atsa/ATSA-Scriptum-SS2012-120521.pdf

Fairbairn, C. E., and Sayette, M. A. (2013). The effect of alcohol on emotional inertia: a test of alcohol myopia. J. Abnorm. Psychol . 122, 770–781. doi: 10.1037/a0032980

PubMed Abstract | CrossRef Full Text | Google Scholar

Friston, K. J., Holmes, A. P., Poline, J. B., Grasby, P. J., Williams, S. C. R., Frackowiak, R. S. J., et al. (1995). Analysis of fMRI time series revisited. Neuroimage 2, 45–53.

Friston, K. J., Josephs, O., Zarahn, E., Holmes, A. P., Rouquette, S., and Poline, J-B. (2000). To smooth or not to smooth? Bias and efficiency in fMRI time series analysis. Neuroimage 12, 196–208. doi: 10.1006/nimg.2000.0609

Fuller, J. A., Stanton, J. M., Fisher, G. G., Spitzmuller, C., and Russell, S. S. (2003). A lengthy look at the daily grind: time series analysis of events, mood, stress, and satisfaction. J. Appl. Psychol . 88, 1019–1033. doi: 10.1037/0021-9010.88.6.1019

George, J. M., and Jones, G. R. (2000). The role of time in theory and theory building. J. Manage . 26, 657–684. doi: 10.1177/014920630002600404

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., and Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014. doi: 10.1038/nature07634

Glass, G. V., Willson, V. L., and Gottman, J. M. (1975). Design and Analysis of Time Series Experiments . Boulder, CO: Colorado Associated University Press.

Hartmann, D. P., Gottman, J. M., Jones, R. R., Gardner, W., Kazdin, A. E., and Vaught, R. S. (1980). Interrupted time-series analysis and its application to behavioral data. J. Appl. Behav. Anal . 13, 543–559.

Hays, W. L. (1981). Statistics, 2nd Edn . New York, NY: Holt, Rinehart, and Winston.

Hyndman, R. J. (2014). Forecast: Forecasting Functions for Time Series and Linear Models . R Package Version 5.4. Available online at: http://CRAN.R-project.org/package=forecast

Hyndman, R. J., and Athanasopoulos, G. (2014). Forecasting: Principles and Practice . OTexts. Available online at: http://otexts.org/fpp/

Hyndman, R. J., and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. J. Stat. Softw . 26, 1–22.

Jones, R. R., Vaught, R. S., and Weinrott, M. (1977). Time series analysis in operant research. J. Appl. Behav. Anal . 10, 151–166.

Kanner, A. D., Coyne, J. C., Schaefer, C., and Lazarus, R. S. (1981). Comparisons of two modes of stress measurement: daily hassles and uplifts versus major life events. J. Behav. Med . 4, 1–39.

Kelly, J. R., and McGrath, J. E. (1988). On Time and Method . Newbury Park, CA: Sage.

Kerlinger, F. N. (1973). Foundations of Behavioral Research, 2nd Edn . New York, NY: Holt, Rinehart.

Killingsworth, M. A., and Gibert, D. T. (2010). A wandering mind is an unhappy mind. Science 330, 932–933. doi: 10.1126/science.1192439

Kuljanin, G., Braun, M. T., and DeShon, R. P. (2011). A cautionary note on applying growth models to longitudinal data. Psychol. Methods 16, 249–264. doi: 10.1037/a0023348

PubMed Abstract | CrossRef Full Text

Kumari, V., and Corr, P. J. (1996). Menstrual cycle, arousal-induction, and intelligence test performance. Psychol. Rep . 78, 51–58.

Larsen, R. J., and Kasimatis, M. (1990). Individual differences in entrainment of mood to the weekly calendar. J. Pers. Soc. Psychol . 58, 164–171.

Larson, R., and Csikszentmihalyi, M. (1983). The experience sampling method. New Dir. Methodol. Soc. Behav. Sci . 15, 41–56.

Latman, N. (1977). Human sensitivity, intelligence and physical cycles and motor vehicle accidents. Accid. Anal. Prev . 9, 109–112.

Liu, L.-M. (1986). Multivariate Time Series Analysis using Vector ARMA Models . Lisle, IL: Scientific Computing Associates.

Ljung, G. M., and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297–303.

Luce, G. G. (1970). Biological Rhythms in Psychiatry and Medicine . Public Health Service Publication (Public Health Service Publication No. 2088). U.S. Institute of Mental Health.

McCleary, R., Hay, R. A., Meidinger, E. E., and McDowall, D. (1980). Applied Time Series Analysis for the Social Sciences . Beverly Hills, CA: Sage.

McGrath, J. E., and Rotchford, N. L. (1983). Time and behavior in organizations. Res. Psychol. Behav . 5, 57–101.

Meko, D. M. (2013). Applied Time Series Analysis [PDF Documents]. Available online at: http://www.ltrr.arizona.edu/~dmeko/geos585a.html#chandout

Mills, T. C., and Markellos, R. N. (2008). The Econometric Modeling of Financial Time Series, 3rd Edn . Cambridge: Cambridge University Press.

Mitchell, T. R., and James, L. R. (2001). Building better theory: time and the specification of when things happen. Acad. Manage. Rev . 26, 530–547. doi: 10.5465/AMR.2001.5393889

Muenchen, R. A. (2013). The Popularity of Data Analysis Software . Available online at: http://r4stats.com/articles/popularity/

Nua, R. (2014). Statistical Forecasting , [Online Lecture Notes]. Available online at: http://people.duke.edu/~rnau/411home.htm

Ostrom, C. W. (1990). Time Series Analysis: Regression Techniques . Newbury Park, CA: Sage.

Pankratz, A. (1991). Forecasting with Dynamic Regression Models . New York, NY: Wiley.

Pearce, J. L., Stevenson, W. B., and Perry, J. L. (1985). Managerial compensation based on psychological performance: a time series analysis of the effects of merit pay. Acad. Manage. J . 28, 261–278.

Persons, W. M. (1919). Indices of business conditions. Rev. Econ. Stat . 1, 5–107.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and R Core Team. (2014). NLME: Linear and Nonlinear Mixed Effects Models . R Package Version 3.1-117. Available online at: http://CRAN.R-project.org/package=nlme

Polgreen, P. M., Chen, Y., Pennock, D. M., and Forrest, N. D. (2008). Using internet searches for influenza surveillance. Clin. Infect. Dis . 47, 1443–1448.

Popper, K. R. (1968). The Logic of Scientific Discovery . New York, NY: Harper and Row.

R Development Core Team. (2011). R: A Language and Environment for Statistical Computing . Vienna: R Foundation for Statistical Computing. Available online at: http://www.R-project.org/

Rothman, P. (eds.). (1999). Nonlinear Time Series Analysis of Economic and Financial Data . Dordrecht: Kluwer Academic Publishers.

Said, S. E., and Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika 71, 599–607.

Sharpe, D. (2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychol. Methods 18, 572–582. doi: 10.1037/a0034177

Shmueli, G. (2010). To explain or to predict? Stat. Sci . 25, 289–310. doi: 10.1214/10-STS330

Shumway, R. H., and Stoffer, D. S. (2006). Time Series Analysis and Its Applications with R Examples, 2nd Edn . New York, NY: Springer.

Stanton, J. M., and Rogelberg, S. G. (2001). Using Internet/Intranet web pages to collect psychological research data. Psychol. Res. Methods 4, 200–217. doi: 10.1177/109442810143002

Tobias, R. (2009). Changing behavior by memory aids: a social–psychological model of prospective memory and habit development tested with dynamic field data. Psychol. Rev . 116, 408-438. doi: 10.1037/a0015512

Trapletti, A., and Hornik, K. (2013). Tseries: Time Series Analysis and Computational Finance . R Package Version 0.10-32.

United States Department of Labor, Bureau of Labor Statistics. (2014). Labor Force Statistics from the Current Population Survey [Data Set]. Available online at: http://data.bls.gov/timeseries/LNU04000000

Wagner, A. K., Soumerai, S. B., Zhang, F., and Ross-Degnan, D. (2002). Segmented regression analysis of interrupted time series studies in medication use research. J. Clin. Pharm. Ther . 27, 299–309. doi: 10.1046/j.1365-2710.2002.00430.x

Wagner, J. A., Rubin, P. A., and Callahan, T. J. (1988). Incentive payment and nonmanagerial productivity: an interrupted time series analysis of magnitude and trend. Organ. Behav. Hum. Decis. Process . 42, 47–74.

Warner, R. M. (1998). Spectral Analysis of Time-Series Data . New York, NY: Guilford Press.

Wei, W. S. (2006). Time Series Analysis: Univariate and Multivariate Methods, 2nd Edn . London: Pearson.

Weiss, H. M., and Cropanzano, R. (1996). “Affective events theory: a theoretical discussion of the structure, causes and consequences of affective experiences at work,” in Research in Psychological Behavior: An Annual Series of Analytical Essays and Critical Reviews , eds B. M. Staw and L. L. Cummings (Greenwich, CT: JAI Press), 1–74.

West, S. G., and Hepworth, J. T. (1991). Statistical issues in the study of temporal data: daily experience. J. Pers . 59, 609–662.

Wood, P., and Brown, D. (1994). The study of intraindividual differences by means of dynamic factor models: rationale, implementation, and interpretation. Psychol. Bull . 116, 166–186.

Zaheer, S., Albert, S., and Zaheer, A. (1999). Time-scale and psychological theory. Acad. Manage. Rev . 24, 725–741.

Zeileis, A., and Hothorn, T. (2002). Diagnostic checking in regression relationships. R News 3, 7–10.

Keywords: time series analysis, longitudinal data analysis, forecasting, regression analysis, ARIMA

Citation: Jebb AT, Tay L, Wang W and Huang Q (2015) Time series analysis for psychological research: examining and forecasting change. Front. Psychol . 6 :727. doi: 10.3389/fpsyg.2015.00727

Received: 19 March 2015; Accepted: 15 May 2015; Published: 09 June 2015.

Reviewed by:

Copyright © 2015 Jebb, Tay, Wang and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Andrew T. Jebb, Department of Psychological Sciences, Purdue University, 703 Third Street, West Lafayette, IN 47907, USA, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Time series analysis for psychological research: examining and forecasting change

Affiliations.

  • 1 Department of Psychological Sciences, Purdue University West Lafayette, IN, USA.
  • 2 Department of Psychology, University of Central Florida Orlando, FL, USA.
  • 3 Department of Statistics, Purdue University West Lafayette, IN, USA.
  • PMID: 26106341
  • PMCID: PMC4460302
  • DOI: 10.3389/fpsyg.2015.00727

Psychological research has increasingly recognized the importance of integrating temporal dynamics into its theories, and innovations in longitudinal designs and analyses have allowed such theories to be formalized and tested. However, psychological researchers may be relatively unequipped to analyze such data, given its many characteristics and the general complexities involved in longitudinal modeling. The current paper introduces time series analysis to psychological research, an analytic domain that has been essential for understanding and predicting the behavior of variables across many diverse fields. First, the characteristics of time series data are discussed. Second, different time series modeling techniques are surveyed that can address various topics of interest to psychological researchers, including describing the pattern of change in a variable, modeling seasonal effects, assessing the immediate and long-term impact of a salient event, and forecasting future values. To illustrate these methods, an illustrative example based on online job search behavior is used throughout the paper, and a software tutorial in R for these analyses is provided in the Supplementary Materials.

Keywords: ARIMA; forecasting; longitudinal data analysis; regression analysis; time series analysis.

Forecast Methods for Time Series Data: A Survey

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Subscribe to the PwC Newsletter

Join the community, add a new evaluation result row, time series forecasting.

396 papers with code • 66 benchmarks • 28 datasets

Time Series Forecasting is the task of fitting a model to historical, time-stamped data in order to predict future values. Traditional approaches include moving average, exponential smoothing, and ARIMA, though models as various as RNNs, Transformers, or XGBoost can also be applied. The most popular benchmark is the ETTh1 dataset. Models are typically evaluated using the Mean Square Error (MSE) or Root Mean Square Error (RMSE).

( Image credit: ThaiBinh Nguyen )

research papers on time series analysis

Benchmarks Add a Result

research papers on time series analysis

Most implemented papers

Sequence to sequence learning with neural networks.

research papers on time series analysis

Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.

Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i. e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target.

Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks

laiguokun/multivariate-time-series-data • 21 Mar 2017

Multivariate time series forecasting is an important machine learning problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation.

DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Probabilistic forecasting, i. e. estimating the probability distribution of a time series' future given its past, is a key enabler for optimizing business processes.

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

We focus on solving the univariate times series point forecasting problem using deep learning.

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

Spatiotemporal forecasting has various applications in neuroscience, climate and transportation domain.

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning.

Deep and Confident Prediction for Time Series at Uber

Reliable uncertainty estimation for time series prediction is critical in many fields, including physics, biology, and manufacturing.

GluonTS: Probabilistic Time Series Models in Python

research papers on time series analysis

We introduce Gluon Time Series (GluonTS, available at https://gluon-ts. mxnet. io), a library for deep-learning-based time series modeling.

Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting

Timely accurate traffic forecast is crucial for urban traffic control and guidance.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • World J Clin Cases
  • v.11(29); 2023 Oct 16
  • PMC10631421

Applications of time series analysis in epidemiology: Literature review and our experience during COVID-19 pandemic

Latchezar tomov.

Department of Informatics, New Bulgarian University, Sofia 1618, Bulgaria. moc.liamg@trasehcul

Lyubomir Chervenkov

Department of Diagnostic Imaging, Medical University Plovdiv, Plovdiv 4000, Bulgaria

Dimitrina Georgieva Miteva

Department of Genetics, Faculty of Biology, Sofia University "St. Kliment Ohridski", Sofia 1164, Bulgaria

Hristiana Batselova

Department of Epidemiology and Disaster Medicine, Medical University, University Hospital "St George", Plovdiv 4000, Bulgaria

Tsvetelina Velikova

Department of Medical Faculty, Sofia University, St. Kliment Ohridski, Sofia 1407, Bulgaria

Supported by European Union-NextGenerationEU, Through the National Recovery and Resilience Plan of the Republic of Bulgaria, No. BG-RRP-2.004-0008-C01.

Corresponding author: Latchezar Tomov, PhD, Academic Research, Department of Informatics, New Bulgarian University, Montevideo 21 Str, Sofia 1618, Bulgaria. moc.liamg@trasehcul

Time series analysis is a valuable tool in epidemiology that complements the classical epidemiological models in two different ways: Prediction and forecast. Prediction is related to explaining past and current data based on various internal and external influences that may or may not have a causative role. Forecasting is an exploration of the possible future values based on the predictive ability of the model and hypothesized future values of the external and/or internal influences. The time series analysis approach has the advantage of being easier to use (in the cases of more straightforward and linear models such as Auto-Regressive Integrated Moving Average). Still, it is limited in forecasting time, unlike the classical models such as Susceptible-Exposed-Infectious-Removed. Its applicability in forecasting comes from its better accuracy for short-term prediction. In its basic form, it does not assume much theoretical knowledge of the mechanisms of spreading and mutating pathogens or the reaction of people and regulatory structures (governments, companies, etc. ). Instead, it estimates from the data directly. Its predictive ability allows testing hypotheses for different factors that positively or negatively contribute to the pandemic spread; be it school closures, emerging variants, etc. It can be used in mortality or hospital risk estimation from new cases, seroprevalence studies, assessing properties of emerging variants, and estimating excess mortality and its relationship with a pandemic.

Core tip: Time-series analysis allows us to do easily and, in less time, precise short-term forecasting in novel pandemics by estimating directly from data. These models do not need extensive knowledge of pandemic mechanisms and interactions between peoples, societal structures, and pathogens. Its secondary but equally important role is distinguishing factors contributing to the spread or slowing it down. Of course, the time series analysis approach cannot give a forecast for an end of a pandemic, nor the precise moment of its peak, but it is invaluable for fast response based on sound statistical methodology.

INTRODUCTION

Time series analysis studies are consecutive collections of observations in time to predict or forecast behavior[ 1 ]. Prediction is related to studying the possible relationships between variables or factors that influence each other or are correlated. Forecasting predicts future values based on past values of the variable and possible future values of other variables or factors, towards which we regress[ 1 ]. Suppose we do not know the future values of the variables that influence our variable. In that case, we investigate different scenarios or the structure “IF–THEN”. If an external factor takes value A, we forecast the variable value B.

Since correlation cannot be estimated directly for nonstationary processes via the standard regression techniques, the so-called spurious correlations[ 2 ], the existing number of time series models such as Box-Jenkins оr Auto-Regressive Integrated Moving Average (ARIMA) models that can deal with nonstationarity. Why do we use them in epidemiology? First, these are linear models that are simpler and easier to use than the classical nonlinear epidemiological models such as Susceptible-Infectious-Removed (SIR), Susceptible-Exposed-Infectious-Removed (SEIR), etc. , for which no closed-form exact analytical solutions exist and need numerical simulations or special techniques for approximation for the long-term behavior of the model[ 3 ].

However, the difficulty in predicting cases and deaths during the coronavirus disease 2019 (COVID-19) pandemic was also emphasized by Roda et al [ 4 ]. They raised concerns about utilizing the affirmed case information as nonidentifiability in model alignments[ 4 ].

Talking about the classical nonlinear epidemiological models for epidemics, SIR is able to propose a simple model based on the two-reaction mechanism. In this way, the conditions for epidemic development, the course of a simple closed epidemic, as well as the mitigation strategies could be explained[ 5 ]. SIR has been used successfully to estimate the number of cases and deaths in outbreaks such as influenza H1N1 (2009-2010) and Ebola (2014-2016) viruses, examining the early growth, including with modifications, such as SIR with reactive behavioral changes and SIR with inhomogeneous mixing.

The SEIR model was also used mostly for influenza epidemics. Zhan et al [ 6 ] additionally used the COVID-19 historical data of 367 Chinese cities to create the transmission mechanisms and contact topology utilizing a set of profile codes. Then the method was applied to South Korea, Italy, and Iran, to predict the infection peaks before the end of March 2020[ 7 ].

However, when comparing the SIR and the ARIMA models, it was shown that ARIMA outperformed the SIR in predicting the cases of COVID-19[ 8 ].

Abolmaali and Shirzaei[ 7 ] were among the first to compare multiple epidemiological methods that can be used during the COVID-19 pandemic to monitor and even prevent the spread of infection. The authors demonstrated that predictive models, such as SIR, SEIR, ARIMA, etc. , have proven useful and effective in predicting the incidence of infections. However, it was shown that SIR could not provide helpful early prediction in cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, giving a significant error in estimating for countries such as Brazil, United States, India and Russia. Moreover, SIR was useful mainly for predictions in the short term. On the contrary, SEIR showed better results than SIR in the long term in forecasting COVID-19[ 9 ].

Regarding ARIMA, it was shown that this model requires further information for a more precise point-by-point forecast, but the spread of COVID-19 was forecasted ultimately. Additionally, ARIMA itself helps the data to remain stationary, resulting in modeling flexibility while capturing any changes at every stage. Last but not least, the ARIMA model managed to maintain minimum error in the forecasting. Based on all these outcomes, ARIMA outperforms all other compared models[ 7 ].

Many other research groups confirmed the effectiveness and usefulness of the ARIMA model for COVID-19 pandemic forecasting[ 10 - 12 ].

Additionally, the ARIMA model has three parameters: p, d and q, which define the order of its three components-autoregression, differencing and moving average. The autoregressive part captures the dependence on past values or the process dynamics-the variable is regressed on its previous or lagged values. The moving average part indicates that regression errors are not independent but are a linear combination of error terms from different earlier moments[ 13 ]. The differencing part is used to transform the process into a broad sense stationary process vs consecutive differencing of its values. The combinations of these three parts with proper orders depending on the process allow us to make linear regression (LR) that accounts for the dependence of error terms for nonstationarity and assess dependence with previous moments of time. The ARIMA model can be combined with external regressors or factors. Thus, it can be applied as regression with ARIMA errors for the process, made stationary by differencing its values. The use of LR for prediction is, therefore, straightforward.

One example is the model in our recent research[ 13 ]. We predicted some weekly deaths from COVID-19 in Bulgaria from new cases by age groups with different lags. We used the viral variants as external regressors.

Another example is the analysis of the incidence and death of AIDS and HIV in China[ 14 ]. Time series models such as ARIMA need historical data to be able to extrapolate future trends. They are beneficial for short-term forecast[ 15 ] and for prediction, even for complex dynamics, such as for COVID-19, with different policies to control the spread and different viral variants in various stages of the pandemic[ 16 ]. In addition, there is a practical advantage of time series analysis over theoretical epidemiological models, and no detailed knowledge of the causal mechanism is required; instead, the structure is inferred from real-life data. This allows fast development and deployment of models during an ongoing pandemic with limited knowledge of key parameters related to the influence of measures or the characteristics of newly emerging variants.

Based on the characteristics and effectiveness of ARIMA, we created the following objectives for our study: (1) To apply time series analysis for assessment of SARS-CoV-2 variants spread; (2) to assess the excess mortality from COVID-19 in Bulgaria; (3) to demonstrate the importance of time series analysis during the COVID-19 pandemic in radiology departments; and (4) to determine the SARS-CoV-2 seroprevalence with time series analysis.

TIME SERIES ANALYSIS FOR ASSESSMENT OF SPREAD OF SARS-CoV-2 VARIANTS

Time series regression studies have been extensively used in environmental epidemiology, especially to assess short-term associations between exposures[ 15 ]. The most commonly studied factors are air pollution, weather, and pollen, but health outcomes like mortality or disease-specific hospital admissions could also be investigated. Typically, data are available at regular time intervals for both exposure and result ( e.g. , daily pollution levels and daily death counts). The goal is to investigate short-term relationships between them.

In line with this, since COVID-19 has expanded globally, this results in a continuous pandemic, imposing limitations and expenditures on many governments. Therefore, anticipating the number of new cases and fatalities throughout this period can be crucial in predicting future expenses and facilities[ 17 ]. Furthermore, it is necessary to analyze the data during the pandemic to expect one or other intervention strategies, mitigations, etc. In such a way, the pandemic could be monitored, controlled, and properly managed. Many studies demonstrated that among the epidemiological models, the ARIMA model showed the desired precision in predicting the number of cases and fatalities with minimum error[ 10 , 18 ].

Although some researchers hesitantly avoided ARIMA for analyzing COVID-19 epidemiological data, Alabdulrazzaq et al [ 19 ] demonstrated that the ARIMA technique showed accurate and valid forecasting; especially the ARIMA best-fit model for predicting the confirmed and recovered cases of COVID-19. Despite the many dynamic aspects based on the novelty of the virus and the nature of the disease, the actual values for most of the periods were within the model prediction of a 95% confidence interval. Pearson’s correlation showed high correlations between the forecast points and the actual recorded data ( r = 0.996)[ 19 ]. This confirms why ARIMA is one of the best-suited models with satisfactory results and minimum error.

Different methods in time series analyses can be used, such as the hybrid machine learning approach (using multiple simple algorithms to complement and facilitate each other) to anticipate the number of infected people and mortality rate[ 20 ]; LR (based on using regression models enabling subject-matter interpretation of the data); Least Absolute Shrinkage and Selection Operator (a model that uses over regression methods for more accurate predictions); support vector machine (using optimal hyperplane in an N-dimensional space, separating the data points in different classes); exponential smoothing (forecasting univariate time series data) to determine the affected by the virus people and the deceased cases[ 21 ]; numerical modeling to assess the effect of the population age on the mortality rate[ 22 ]; numerical modeling methods such as polynomial regression (fitting of a nonlinear relationship between the value of something and the condition mean of other); Bayesian Edge (estimating probability influenced by the belief of the likelihood of a certain outcome) and long short-term memory (having the ability to learn long term sequences of observations) to estimate the prevalence of SARS-CoV-2 infection and to predict the scale of the pandemic along with the mortality rate[ 23 ]; a deep learning system for the prediction of the COVID-19 time series[ 24 ]; mathematical model about the spread of COVID-19[ 25 ]; a stochastic model considering comorbidities and age[ 26 ]; an SIQR model made stochastic, considering the uncertainty of infection progress[ 27 ]; a fractional-order dynamical system[ 28 ]; fractional calculus and natural decomposition[ 29 ]; Caputo-Fabrizio fractional derivative[ 30 , 31 ]; and a nonpharmaceutical intervention approach to reduce the outbreak of COVID-19[ 32 ] .

EXCESS MORTALITY FROM COVID-19 IN BULGARIA: OUR EXPERIENCE

We describe an example of application of time series analysis in epidemiology: Analysis of overall mortality and its dependence on the number of registered cases of COVID-19. In other words, we used ARIMA and tried to establish which part of the COVID-19 deaths were unregistered and which part of the overall mortality was influenced by the pandemic. In our modeling, we did not try to define and determine which part of the mortality was excess-this needs serious theoretical modeling that includes seasonal climate variations, dependence on population growth models, identification of other causes of mortality and their distribution to determine expected mortality, etc. With time series analysis, we could identify the excess part directly from data, by testing how much of the mortality variance for 2020-2021 could be explained by new COVID-19 cases. We used weekly data for deaths from the National Statistical Institute[ 33 ] and from John Hopkins University for total cases[ 34 ].

Material and methods

We added to the model categorical variable to account for the changes of dominant variants in Bulgaria during the first 2 years of the pandemic wild type, alpha and delta. The coevolutionary arms race between the acquired immunity for the survival of infections and reinfections and the mutating virus did not allow capturing of a clean and powerful connection between mortality and the disease severity caused by the virus variants because mortality was the product of interaction between the two coevolving agents.

Since the immune system adapts to decrease mortality, the mortality from previous waves also suppressed further mortality via natural selection. Thus, any positive correlation between variants and mortality in time-series analysis models carries more information than it appears purely from the coefficients and their standard estimation errors. Even the slightest edge of the variants over the immune system and the process of natural selection should be treated as significant here. This was the reason to include not only weekly COVID-19 cases with lags of 0-7 and 7-14 d (L1 and L2) but also variants as factor variables with the same lags in our optimal model.

The model, described in Tables ​ Tables1 1 and ​ and2 2 and Figures ​ Figures1 1 and ​ and2, 2 , resulted from selecting different lags of the two chosen variables that produced the best fit (Figure ​ (Figure2). 2 ). Also, the model was developed with the language R. It contained differencing of order I, as suggested by the ndiffs function via different unit root tests to achieve stationarity and meaningful correlations[ 35 ].

An external file that holds a picture, illustration, etc.
Object name is WJCC-11-6974-g001.jpg

Residuals of our model of mortality. ACF: Autocorrelation function.

An external file that holds a picture, illustration, etc.
Object name is WJCC-11-6974-g002.jpg

Model prediction of mortality based on new coronavirus disease 2019 weekly cases. COVID-19: Coronavirus disease 2019.

Correlations between the variables for Model I-factors contributing to the spread among children

Regression models with Auto-Regressive Integrated Moving Average (0, 1, 1) errors-Model I-overall mortality and coronavirus disease 2019

RMSE: Root mean square error; MAPE: Mean absolute percentage error.

Factor variables such as variants L1 and L2 were not different. There was a significant influence on cases with lags L1 and L2 with a coefficient ratio to standard error over 3:1 (2:1 is required as a rule) and a sufficiently small correlation coefficient between their first differences of 0.69. Variants L1 and L2 had positive values but high standard errors of estimation.

Nonetheless, we considered their appearance significant after three different variants and the enormous increase of mortality over the 2 years of nearly 25% (on average) over the mean for 2015–2019 and with a high number of officially registered cases which was 11.5% of the overall population, with polymerase chain reaction (PCR) positivity on the average 12.14% (maximum 33.75%) indicating many more unregistered cases. Even this slight edge that we detected here indicated increasing severity with variants up to and including delta. Our model explains > 95% of the variation in deaths even though there was considerable variation in the mean age of new cases during these 2 years, and the exponentially increasing hospitalization (and therefore, mortality) risk with age[ 36 ]. This is evidence that most of the mortality increase in 2020 and 2021 was due to COVID-19. The model had shallow bias and mean absolute percentage error.

Conclusions from this model. Time-series analysis can serve as a first step in studying causal connections between an epidemic and excess mortality. Models such as ARIMA can show whether two or more nonstationary processes are moving together so that we can predict one behavior from another. Our models allowed us to catch when several different processes contributed with additional time lags to a resulting process, helping us uncover the link between two or more processes invisible to the naked eye. In this case, new weekly cases from the previous 2 wk, together with the changes in viral variants (factor variables), could explain 95% of the excess mortality. This is relevant as an answer to the often-appearing question of which part of the mortality during an epidemic is hidden from the official figures.

Moreover, during a pandemic with a high hospital burden, other patients have delayed treatment and are collateral victims of the pandemic. Time-series analysis can help quantify the excess mortality caused by a pandemic. It can help answer the question: Is the excess mortality due to closures or other mitigation measures, or due to the pandemic itself (although we did not try to answer this question in our study)? Closures could be added as a factor variable via the Oxford Stringency Index[ 37 ], and in a similar fashion, viral variants were added by us. Our research tried to check if variant evolution contributed to increased deaths. Still, it cannot be conclusively shown-the standard error of the relevant parameters, variants L1 and variants L2, is not small enough for that purpose (it should be at most half of the absolute value of the coefficient). This is possibly due to the only delta variant being significantly deadlier for that period[ 38 ].

However, it is essential to mention that virus variants change over time along with the clinical and epidemiological picture, as it was discussed recently by Miteva et al [ 39 ].

RADIOLOGY DEPARTMENTS DURING SARS-CoV-2 PANDEMIC AND THE IMPORTANCE OF TIMES SERIES ANALYSIS

Medical imaging is crucial for initial diagnosis, staging, and follow-up. Therefore, the organization in departments of radiology is a significant topic because there has to be a separation of COVID-19 and possible COVID-19 patients from other patients in the hospital. Usually, it is done by arranging so-called COVID corridors in the hospital, when only COVID-19 patients are being scanned. This causes an interruption of regular hospital activity.

Usually, there is a delay in diagnosing patients with other diseases, which is a problem, especially with emergency patients. In the COVID corridor, the personnel in the radiology department are fully equipped to diagnose patients. However, the equipment was not always available, especially at the beginning of the pandemic. Moreover, it was costly which limited its use. Also, deep cleaning of the department is done after the end of the corridor, which causes even more delays in the other patients’ diagnoses and more expenses.

Forecasting the COVID-19 waves is crucial because the management of the radiology departments can be done according to it. A durable prediction model allows the departmental heads to organize the necessary COVID-19 corridors according to the expected wave. The duration and the exact hours of the corridors can be correctly adjusted, thus providing the required diagnosis of COVID-19 and non-COVID-19 patients with minimization of the delay of diagnosis for each group.

Also, the personnel shifts can be arranged according to the predicted model, providing enough X-ray technicians and radiologists. Furthermore, the necessary equipment will be provided in advance, reducing the needed time for changing clothes, and lowering expenses. The heads of the departments in which patients require diagnostic imaging can organize their work according to the COVID-19 corridor active hours, thus providing a safe and calm environment. Ambulatory patients can also arrange their examination according to the available hours.

In 2020, a genetic programming prediction model was introduced and further developed into a gene expression programming model. This model predicts the cases according to two parameters-confirmed cases and number of deaths[ 40 ]. Another prediction model used in India is ARIMA, which has value in predicting cases and shows the effect of unlocking after lockdown. The ARIMA model relies on the number of positive cases, the number of performed tests per day and the average positive percentage[ 41 ]. In the United Kingdom, weighted interval scoring was used for the prediction model, which used the data from the linear progression of 7-day cases[ 42 ]. In Chile, ARIMA (henceforth), exponential smoothing techniques, and Poisson models for time-dependent count data are used[ 43 ].

TIME SERIES ANALYSIS FOR SARS-CoV-2 SEROPREVALENCE

SARS-CoV-2 serology is used to identify previous infections, both in individuals and in populations. For this purpose, changes in antibody levels against SARS-CoV-2 are monitored to assess how they change over time and how long the protective immunity is preserved[ 44 - 46 ].

Many seroepidemiological studies are aimed at specific populations, such as health staff, police officers, and hospitalized patients with chronic diseases and COVID-19. Sometimes they use poorly and not well-validated laboratory methods[ 47 ] and mainly aim to study only the immunoglobulin G (IgG) response[ 48 ]. Therefore, various antibody responses began to be tested to improve the evaluation and diagnosis.

Many studies demonstrated that people with confirmed COVID-19 infection developed IgA, IgM and IgG against the S1 domain of the spike protein and nucleocapsid protein within 2 wk of symptoms[ 49 , 50 ]. Specific IgM antibodies are detected after 5-7 d of symptoms. After approximately 14 d, IgG begins to appear. IgA responses are detected almost simultaneously with IgM or earlier[ 51 - 53 ].

It was also found that the levels of antibodies correlated with the severity of the disease[ 54 - 57 ]. Previous research has found that with time, immunity to SARS-CoV-2 natural infection is short-lived and leads to a risk of reinfection[ 58 - 60 ].

А cohort study was conducted among health workers at the first SARS-CoV-2 epidemic peak in London[ 61 ]. They were tested weekly for symptoms, with RT-PCR and blood samples for 16-21 wk. Serological analysis was for IgG to the S1-domain of the spike protein and nucleocapsid protein. Asymptomatic or mild SARS-CoV-2 infection has been shown to elicit faster heterogeneous responses, and antibodies are cleared more quickly, which may affect the longevity of humoral immunity to SARS-CoV-2.

Another population-based study in Catalonia was conducted from February to November 2020. A multiplex serological test was used on 5000 participants from blood samples. Responses to 15 isotype-antigen combinations were monitored, and seroprevalence of 18.1% was found in adults and 15.3% in a simulation of the total population of Catalonia[ 62 ]. Based on the severity of the disease, immune profiles reveal that with increasing severity of infection, serum responses are more stable. The age and sex of the participants, overweight/obesity, compared with normal weight, and if they were or were not smokers when included in the study. There were no significant differences in seroprevalence between the two sexes. The seroprevalence among children was lower than in adults. Children were at lower risk of seropositivity than their parents in one family. Overweight/obese participants had higher antibody levels than those with normal weight. This is confusing because of suggesting that higher levels were adjusted for the severity of the disease[ 62 ].

A study on the seroprevalence of SARS-CoV-2 is currently being conducted in Barcelona from February 2021 to March 2022. The SeroCAP sentinel monitoring system is being used. IgG detection against SARS-CoV-2 spike protein will be performed monthly from blood samples collected from three hospitals. About 3000 samples are taken per month, and the prevalence will be assessed by age, sex and time in the three health zones in Barcelona. A complete analysis of the prevalence of SARS-CoV-2 infections will be performed, considering the demographic, social and economic factors. The correlation between seroprevalence confirmed cases of COVID-19. All measures applied so far will be studied[ 63 ].

A systematic review analyzed 47 studies from 23 countries[ 64 ]. In addition, other representative population-based studies at the national or regional level have been published[ 65 - 68 ]. The data showed that the SARS-CoV-2 seroprevalence in the general population varied from 0.37% to 22.1%. Biological, behavioral and social factors, including vaccine coverage, influence these percentages. We must acknowledge that the titers of protection against SARS-CoV-2 are currently unknown. However, virus-neutralizing antibodies are needed to protect and control the infection[ 69 ].

Mathematical modeling of pandemics is of vital importance for several reasons. One is to study the mechanisms of interactions between people, societal structures, and pathogens for past epidemics to develop knowledge that would help predict and control future ones. These are the classical epidemiological models, such as SIR and SEIR. Another reason is to enable us to manage a current epidemic by distinguishing productive from unproductive measurements and delivering precise short-term forecasts for the number of new cases, the number of new hospital admissions, expected deaths, etc.

This is the application of time series analysis, which relies on past values to predict future ones without extensive use of theoretical knowledge with all its uncertainties during an ongoing epidemic. It is used in mortality risk estimation, seroprevalence studies, reliable short-term forecasting for the healthcare system burden, and excess mortality estimation and analysis. It has a broad spectrum of linear and nonlinear, and single and multidimensional models. This allows one to choose ease of use vs capabilities according to different contexts where they are applied. Time series analysis supplements the classical epidemiological models in predictive and forecasting capabilities. It reinforces our decisions on how to act vs an epidemic with additional analytical approaches and results on which to step on.

Conflict-of-interest statement: The authors declare no conflict of interest.

Provenance and peer review: Invited article; Externally peer reviewed.

Peer-review model: Single blind

Peer-review started: April 26, 2023

First decision: July 27, 2023

Article in press: September 4, 2023

Specialty type: Medicine, research and experimental

Country/Territory of origin: Bulgaria

Peer-review report’s scientific quality classification

Grade A (Excellent): 0

Grade B (Very good): 0

Grade C (Good): C

Grade D (Fair): 0

Grade E (Poor): 0

P-Reviewer: Gupta L, Indonesia S-Editor: Fan JR L-Editor: Kerr C P-Editor: Fan JR

Contributor Information

Latchezar Tomov, Department of Informatics, New Bulgarian University, Sofia 1618, Bulgaria. moc.liamg@trasehcul .

Lyubomir Chervenkov, Department of Diagnostic Imaging, Medical University Plovdiv, Plovdiv 4000, Bulgaria.

Dimitrina Georgieva Miteva, Department of Genetics, Faculty of Biology, Sofia University "St. Kliment Ohridski", Sofia 1164, Bulgaria.

Hristiana Batselova, Department of Epidemiology and Disaster Medicine, Medical University, University Hospital "St George", Plovdiv 4000, Bulgaria.

Tsvetelina Velikova, Department of Medical Faculty, Sofia University, St. Kliment Ohridski, Sofia 1407, Bulgaria.

time series analysis Recently Published Documents

Total documents.

  • Latest Documents
  • Most Cited Documents
  • Contributed Authors
  • Related Sources
  • Related Keywords

What Will Be the Economic Impact of the New Medical Device Regulation? An Interrupted Time-Series Analysis of Foreign Trade Data

Use of interrupted time series analysis in understanding the course of the congenital syphilis epidemic in brazil, feedback between surface and deep processes: insight from time series analysis of sedimentary record, assessment of land degradation and restoration in coal mines of central india: a time series analysis, belief entropy-of-entropy and its application in the cardiac interbeat interval time series analysis, time series analysis and spatial distribution map of aggregate risk index due to tropospheric no2 and o3 based on satellite observation, pitfalls in using the geochronological information from the earthchem portal for precambrian time-series analysis, impact of the expert consensus on carbapenem consumption trends and patterns in public healthcare institutes: an interrupted time series analysis, 2017–2020.

Background: Carbapenems are considered the last line of defence against bacterial infections, but their high consumption and the resulting antibacterial resistance are an increasing global concern. In this context, the Chinese health authority issued an expert consensus on the clinical applications of carbapenems. However, the long- and short-term effects of the expert consensus on carbapenem use are not clear.Methods: This study was conducted in Shaanxi, a northwest province of China. We collected all available carbapenem procurement data between January 2017 and December 2020 from the Provincial Drug Centralized Bidding Procurement System. A quasi-experimental interrupted time series analysis was used to evaluate the longitudinal effectiveness of expert consensus by measuring the change in the Defined Daily Dosesper 1,000 inhabitants per day (DID), the percentage of carbapenem expenditures to total antimicrobial expenditure, the total carbapenem expenditure, and the defined daily cost (DDDc). We used Stata SE version 15.0 for data analysis, and p &lt; 0.05 was considered statistically significant.Results: After the distribution of the expert consensus, the level (p = 0.769) and trend (p = 0.184) of DID decreased, but the differences were not statistically significant. The percentage of carbapenem expenditures to total antimicrobial expenditure decreased abruptly (p &lt; 0.001) after the intervention, but the long-term trend was still upward. There was no statistically significant relationship between the release of the expert consensus and carbapenem expenditure in the long term, but there was a decreasing trend (p = 0.032). However, the expert consensus had a positive impact on the economic burden of carbapenem usage in patients, as the level (p &lt; 0.001), and trend (p = 0.003) of DDDc significantly decreased.Conclusion: The long-term effects of the distribution of the expert consensus on the use and expenditure of carbapenems in public health institutions in Shaanxi Province were not optimal. It is time to set up more administrative measures and scientific supervision to establish a specific index to limit the application of carbapenems.

Monitoring land subsidence using interferometric synthetic aperture radar time-series analysis: a case of Shanghai Pilot Free Trade Zone

Ramadan is not associated with increased infection risk in pakistani and bangladeshi populations: findings from controlled interrupted time series analysis of uk primary care data.

Background The effect of fasting on immunity is unclear. Prolonged fasting is thought to increase the risk of infection due to dehydration. This study describes antibiotic prescribing patterns before, during, and after Ramadan in a primary care setting within the Pakistani and Bangladeshi populations in the UK, most of whom are Muslims, compared to those who do not observe Ramadan. Method Retrospective controlled interrupted time series analysis of electronic health record data from primary care practices. The study consists of two groups: Pakistanis/Bangladeshis and white populations. For each group, we constructed a series of aggregated, daily prescription data from 2007 to 2017 for the 30 days preceding, during, and after Ramadan, respectively. Findings Controlling for the rate in the white population, there was no evidence of increased antibiotic prescription in the Pakistani/Bangladeshi population during Ramadan, as compared to before Ramadan (IRR: 0.994; 95% CI: 0.988–1.001, p = 0.082) or after Ramadan (IRR: 1.006; 95% CI: 0.999–1.013, p = 0.082). Interpretation In this large, population-based study, we did not find any evidence to suggest that fasting was associated with an increased susceptibility to infection.

Export Citation Format

Share document.

Numbers, Facts and Trends Shaping Your World

Read our research on:

Full Topic List

Regions & Countries

  • Publications
  • Our Methods
  • Short Reads
  • Tools & Resources

Read Our Research On:

What the data says about abortion in the U.S.

Pew Research Center has conducted many surveys about abortion over the years, providing a lens into Americans’ views on whether the procedure should be legal, among a host of other questions.

In a  Center survey  conducted nearly a year after the Supreme Court’s June 2022 decision that  ended the constitutional right to abortion , 62% of U.S. adults said the practice should be legal in all or most cases, while 36% said it should be illegal in all or most cases. Another survey conducted a few months before the decision showed that relatively few Americans take an absolutist view on the issue .

Find answers to common questions about abortion in America, based on data from the Centers for Disease Control and Prevention (CDC) and the Guttmacher Institute, which have tracked these patterns for several decades:

How many abortions are there in the U.S. each year?

How has the number of abortions in the u.s. changed over time, what is the abortion rate among women in the u.s. how has it changed over time, what are the most common types of abortion, how many abortion providers are there in the u.s., and how has that number changed, what percentage of abortions are for women who live in a different state from the abortion provider, what are the demographics of women who have had abortions, when during pregnancy do most abortions occur, how often are there medical complications from abortion.

This compilation of data on abortion in the United States draws mainly from two sources: the Centers for Disease Control and Prevention (CDC) and the Guttmacher Institute, both of which have regularly compiled national abortion data for approximately half a century, and which collect their data in different ways.

The CDC data that is highlighted in this post comes from the agency’s “abortion surveillance” reports, which have been published annually since 1974 (and which have included data from 1969). Its figures from 1973 through 1996 include data from all 50 states, the District of Columbia and New York City – 52 “reporting areas” in all. Since 1997, the CDC’s totals have lacked data from some states (most notably California) for the years that those states did not report data to the agency. The four reporting areas that did not submit data to the CDC in 2021 – California, Maryland, New Hampshire and New Jersey – accounted for approximately 25% of all legal induced abortions in the U.S. in 2020, according to Guttmacher’s data. Most states, though,  do  have data in the reports, and the figures for the vast majority of them came from each state’s central health agency, while for some states, the figures came from hospitals and other medical facilities.

Discussion of CDC abortion data involving women’s state of residence, marital status, race, ethnicity, age, abortion history and the number of previous live births excludes the low share of abortions where that information was not supplied. Read the methodology for the CDC’s latest abortion surveillance report , which includes data from 2021, for more details. Previous reports can be found at  stacks.cdc.gov  by entering “abortion surveillance” into the search box.

For the numbers of deaths caused by induced abortions in 1963 and 1965, this analysis looks at reports by the then-U.S. Department of Health, Education and Welfare, a precursor to the Department of Health and Human Services. In computing those figures, we excluded abortions listed in the report under the categories “spontaneous or unspecified” or as “other.” (“Spontaneous abortion” is another way of referring to miscarriages.)

Guttmacher data in this post comes from national surveys of abortion providers that Guttmacher has conducted 19 times since 1973. Guttmacher compiles its figures after contacting every known provider of abortions – clinics, hospitals and physicians’ offices – in the country. It uses questionnaires and health department data, and it provides estimates for abortion providers that don’t respond to its inquiries. (In 2020, the last year for which it has released data on the number of abortions in the U.S., it used estimates for 12% of abortions.) For most of the 2000s, Guttmacher has conducted these national surveys every three years, each time getting abortion data for the prior two years. For each interim year, Guttmacher has calculated estimates based on trends from its own figures and from other data.

The latest full summary of Guttmacher data came in the institute’s report titled “Abortion Incidence and Service Availability in the United States, 2020.” It includes figures for 2020 and 2019 and estimates for 2018. The report includes a methods section.

In addition, this post uses data from StatPearls, an online health care resource, on complications from abortion.

An exact answer is hard to come by. The CDC and the Guttmacher Institute have each tried to measure this for around half a century, but they use different methods and publish different figures.

The last year for which the CDC reported a yearly national total for abortions is 2021. It found there were 625,978 abortions in the District of Columbia and the 46 states with available data that year, up from 597,355 in those states and D.C. in 2020. The corresponding figure for 2019 was 607,720.

The last year for which Guttmacher reported a yearly national total was 2020. It said there were 930,160 abortions that year in all 50 states and the District of Columbia, compared with 916,460 in 2019.

  • How the CDC gets its data: It compiles figures that are voluntarily reported by states’ central health agencies, including separate figures for New York City and the District of Columbia. Its latest totals do not include figures from California, Maryland, New Hampshire or New Jersey, which did not report data to the CDC. ( Read the methodology from the latest CDC report .)
  • How Guttmacher gets its data: It compiles its figures after contacting every known abortion provider – clinics, hospitals and physicians’ offices – in the country. It uses questionnaires and health department data, then provides estimates for abortion providers that don’t respond. Guttmacher’s figures are higher than the CDC’s in part because they include data (and in some instances, estimates) from all 50 states. ( Read the institute’s latest full report and methodology .)

While the Guttmacher Institute supports abortion rights, its empirical data on abortions in the U.S. has been widely cited by  groups  and  publications  across the political spectrum, including by a  number of those  that  disagree with its positions .

These estimates from Guttmacher and the CDC are results of multiyear efforts to collect data on abortion across the U.S. Last year, Guttmacher also began publishing less precise estimates every few months , based on a much smaller sample of providers.

The figures reported by these organizations include only legal induced abortions conducted by clinics, hospitals or physicians’ offices, or those that make use of abortion pills dispensed from certified facilities such as clinics or physicians’ offices. They do not account for the use of abortion pills that were obtained  outside of clinical settings .

(Back to top)

A line chart showing the changing number of legal abortions in the U.S. since the 1970s.

The annual number of U.S. abortions rose for years after Roe v. Wade legalized the procedure in 1973, reaching its highest levels around the late 1980s and early 1990s, according to both the CDC and Guttmacher. Since then, abortions have generally decreased at what a CDC analysis called  “a slow yet steady pace.”

Guttmacher says the number of abortions occurring in the U.S. in 2020 was 40% lower than it was in 1991. According to the CDC, the number was 36% lower in 2021 than in 1991, looking just at the District of Columbia and the 46 states that reported both of those years.

(The corresponding line graph shows the long-term trend in the number of legal abortions reported by both organizations. To allow for consistent comparisons over time, the CDC figures in the chart have been adjusted to ensure that the same states are counted from one year to the next. Using that approach, the CDC figure for 2021 is 622,108 legal abortions.)

There have been occasional breaks in this long-term pattern of decline – during the middle of the first decade of the 2000s, and then again in the late 2010s. The CDC reported modest 1% and 2% increases in abortions in 2018 and 2019, and then, after a 2% decrease in 2020, a 5% increase in 2021. Guttmacher reported an 8% increase over the three-year period from 2017 to 2020.

As noted above, these figures do not include abortions that use pills obtained outside of clinical settings.

Guttmacher says that in 2020 there were 14.4 abortions in the U.S. per 1,000 women ages 15 to 44. Its data shows that the rate of abortions among women has generally been declining in the U.S. since 1981, when it reported there were 29.3 abortions per 1,000 women in that age range.

The CDC says that in 2021, there were 11.6 abortions in the U.S. per 1,000 women ages 15 to 44. (That figure excludes data from California, the District of Columbia, Maryland, New Hampshire and New Jersey.) Like Guttmacher’s data, the CDC’s figures also suggest a general decline in the abortion rate over time. In 1980, when the CDC reported on all 50 states and D.C., it said there were 25 abortions per 1,000 women ages 15 to 44.

That said, both Guttmacher and the CDC say there were slight increases in the rate of abortions during the late 2010s and early 2020s. Guttmacher says the abortion rate per 1,000 women ages 15 to 44 rose from 13.5 in 2017 to 14.4 in 2020. The CDC says it rose from 11.2 per 1,000 in 2017 to 11.4 in 2019, before falling back to 11.1 in 2020 and then rising again to 11.6 in 2021. (The CDC’s figures for those years exclude data from California, D.C., Maryland, New Hampshire and New Jersey.)

The CDC broadly divides abortions into two categories: surgical abortions and medication abortions, which involve pills. Since the Food and Drug Administration first approved abortion pills in 2000, their use has increased over time as a share of abortions nationally, according to both the CDC and Guttmacher.

The majority of abortions in the U.S. now involve pills, according to both the CDC and Guttmacher. The CDC says 56% of U.S. abortions in 2021 involved pills, up from 53% in 2020 and 44% in 2019. Its figures for 2021 include the District of Columbia and 44 states that provided this data; its figures for 2020 include D.C. and 44 states (though not all of the same states as in 2021), and its figures for 2019 include D.C. and 45 states.

Guttmacher, which measures this every three years, says 53% of U.S. abortions involved pills in 2020, up from 39% in 2017.

Two pills commonly used together for medication abortions are mifepristone, which, taken first, blocks hormones that support a pregnancy, and misoprostol, which then causes the uterus to empty. According to the FDA, medication abortions are safe  until 10 weeks into pregnancy.

Surgical abortions conducted  during the first trimester  of pregnancy typically use a suction process, while the relatively few surgical abortions that occur  during the second trimester  of a pregnancy typically use a process called dilation and evacuation, according to the UCLA School of Medicine.

In 2020, there were 1,603 facilities in the U.S. that provided abortions,  according to Guttmacher . This included 807 clinics, 530 hospitals and 266 physicians’ offices.

A horizontal stacked bar chart showing the total number of abortion providers down since 1982.

While clinics make up half of the facilities that provide abortions, they are the sites where the vast majority (96%) of abortions are administered, either through procedures or the distribution of pills, according to Guttmacher’s 2020 data. (This includes 54% of abortions that are administered at specialized abortion clinics and 43% at nonspecialized clinics.) Hospitals made up 33% of the facilities that provided abortions in 2020 but accounted for only 3% of abortions that year, while just 1% of abortions were conducted by physicians’ offices.

Looking just at clinics – that is, the total number of specialized abortion clinics and nonspecialized clinics in the U.S. – Guttmacher found the total virtually unchanged between 2017 (808 clinics) and 2020 (807 clinics). However, there were regional differences. In the Midwest, the number of clinics that provide abortions increased by 11% during those years, and in the West by 6%. The number of clinics  decreased  during those years by 9% in the Northeast and 3% in the South.

The total number of abortion providers has declined dramatically since the 1980s. In 1982, according to Guttmacher, there were 2,908 facilities providing abortions in the U.S., including 789 clinics, 1,405 hospitals and 714 physicians’ offices.

The CDC does not track the number of abortion providers.

In the District of Columbia and the 46 states that provided abortion and residency information to the CDC in 2021, 10.9% of all abortions were performed on women known to live outside the state where the abortion occurred – slightly higher than the percentage in 2020 (9.7%). That year, D.C. and 46 states (though not the same ones as in 2021) reported abortion and residency data. (The total number of abortions used in these calculations included figures for women with both known and unknown residential status.)

The share of reported abortions performed on women outside their state of residence was much higher before the 1973 Roe decision that stopped states from banning abortion. In 1972, 41% of all abortions in D.C. and the 20 states that provided this information to the CDC that year were performed on women outside their state of residence. In 1973, the corresponding figure was 21% in the District of Columbia and the 41 states that provided this information, and in 1974 it was 11% in D.C. and the 43 states that provided data.

In the District of Columbia and the 46 states that reported age data to  the CDC in 2021, the majority of women who had abortions (57%) were in their 20s, while about three-in-ten (31%) were in their 30s. Teens ages 13 to 19 accounted for 8% of those who had abortions, while women ages 40 to 44 accounted for about 4%.

The vast majority of women who had abortions in 2021 were unmarried (87%), while married women accounted for 13%, according to  the CDC , which had data on this from 37 states.

A pie chart showing that, in 2021, majority of abortions were for women who had never had one before.

In the District of Columbia, New York City (but not the rest of New York) and the 31 states that reported racial and ethnic data on abortion to  the CDC , 42% of all women who had abortions in 2021 were non-Hispanic Black, while 30% were non-Hispanic White, 22% were Hispanic and 6% were of other races.

Looking at abortion rates among those ages 15 to 44, there were 28.6 abortions per 1,000 non-Hispanic Black women in 2021; 12.3 abortions per 1,000 Hispanic women; 6.4 abortions per 1,000 non-Hispanic White women; and 9.2 abortions per 1,000 women of other races, the  CDC reported  from those same 31 states, D.C. and New York City.

For 57% of U.S. women who had induced abortions in 2021, it was the first time they had ever had one,  according to the CDC.  For nearly a quarter (24%), it was their second abortion. For 11% of women who had an abortion that year, it was their third, and for 8% it was their fourth or more. These CDC figures include data from 41 states and New York City, but not the rest of New York.

A bar chart showing that most U.S. abortions in 2021 were for women who had previously given birth.

Nearly four-in-ten women who had abortions in 2021 (39%) had no previous live births at the time they had an abortion,  according to the CDC . Almost a quarter (24%) of women who had abortions in 2021 had one previous live birth, 20% had two previous live births, 10% had three, and 7% had four or more previous live births. These CDC figures include data from 41 states and New York City, but not the rest of New York.

The vast majority of abortions occur during the first trimester of a pregnancy. In 2021, 93% of abortions occurred during the first trimester – that is, at or before 13 weeks of gestation,  according to the CDC . An additional 6% occurred between 14 and 20 weeks of pregnancy, and about 1% were performed at 21 weeks or more of gestation. These CDC figures include data from 40 states and New York City, but not the rest of New York.

About 2% of all abortions in the U.S. involve some type of complication for the woman , according to an article in StatPearls, an online health care resource. “Most complications are considered minor such as pain, bleeding, infection and post-anesthesia complications,” according to the article.

The CDC calculates  case-fatality rates for women from induced abortions – that is, how many women die from abortion-related complications, for every 100,000 legal abortions that occur in the U.S .  The rate was lowest during the most recent period examined by the agency (2013 to 2020), when there were 0.45 deaths to women per 100,000 legal induced abortions. The case-fatality rate reported by the CDC was highest during the first period examined by the agency (1973 to 1977), when it was 2.09 deaths to women per 100,000 legal induced abortions. During the five-year periods in between, the figure ranged from 0.52 (from 1993 to 1997) to 0.78 (from 1978 to 1982).

The CDC calculates death rates by five-year and seven-year periods because of year-to-year fluctuation in the numbers and due to the relatively low number of women who die from legal induced abortions.

In 2020, the last year for which the CDC has information , six women in the U.S. died due to complications from induced abortions. Four women died in this way in 2019, two in 2018, and three in 2017. (These deaths all followed legal abortions.) Since 1990, the annual number of deaths among women due to legal induced abortion has ranged from two to 12.

The annual number of reported deaths from induced abortions (legal and illegal) tended to be higher in the 1980s, when it ranged from nine to 16, and from 1972 to 1979, when it ranged from 13 to 63. One driver of the decline was the drop in deaths from illegal abortions. There were 39 deaths from illegal abortions in 1972, the last full year before Roe v. Wade. The total fell to 19 in 1973 and to single digits or zero every year after that. (The number of deaths from legal abortions has also declined since then, though with some slight variation over time.)

The number of deaths from induced abortions was considerably higher in the 1960s than afterward. For instance, there were 119 deaths from induced abortions in  1963  and 99 in  1965 , according to reports by the then-U.S. Department of Health, Education and Welfare, a precursor to the Department of Health and Human Services. The CDC is a division of Health and Human Services.

Note: This is an update of a post originally published May 27, 2022, and first updated June 24, 2022.

Portrait photo of staff

Support for legal abortion is widespread in many countries, especially in Europe

Nearly a year after roe’s demise, americans’ views of abortion access increasingly vary by where they live, by more than two-to-one, americans say medication abortion should be legal in their state, most latinos say democrats care about them and work hard for their vote, far fewer say so of gop, positive views of supreme court decline sharply following abortion ruling, most popular.

1615 L St. NW, Suite 800 Washington, DC 20036 USA (+1) 202-419-4300 | Main (+1) 202-857-8562 | Fax (+1) 202-419-4372 |  Media Inquiries

Research Topics

  • Age & Generations
  • Coronavirus (COVID-19)
  • Economy & Work
  • Family & Relationships
  • Gender & LGBTQ
  • Immigration & Migration
  • International Affairs
  • Internet & Technology
  • Methodological Research
  • News Habits & Media
  • Non-U.S. Governments
  • Other Topics
  • Politics & Policy
  • Race & Ethnicity
  • Email Newsletters

ABOUT PEW RESEARCH CENTER  Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of  The Pew Charitable Trusts .

Copyright 2024 Pew Research Center

Terms & Conditions

Privacy Policy

Cookie Settings

Reprints, Permissions & Use Policy

IMAGES

  1. (PDF) Time series analysis for psychological research: Examining and

    research papers on time series analysis

  2. (PDF) Time Series Analysis

    research papers on time series analysis

  3. Time series analysis with KNIME and Spark

    research papers on time series analysis

  4. (Complete Guide) Time Series Analysis: Types & Examples

    research papers on time series analysis

  5. Time Series Analysis

    research papers on time series analysis

  6. PPT

    research papers on time series analysis

VIDEO

  1. Time Series Analysis

  2. Time Series Analysis in R, and other advanced statistical tests & different models in R [4 of 4]

  3. Time Series Analysis using Excel

  4. Time Series 101: Data Considerations and Assumptions

  5. Lecture 1: Time Series analysis. The Nature of Time Series Data and Components of a Time Series

  6. Mass Spectrometry: Past Papers

COMMENTS

  1. (PDF) Time Series Analysis

    Time-series analysis is a statistical method of analyzing data from repeated observations on a single unit or individual at regular intervals over a large number of observations. Time-series ...

  2. Journal of Time Series Analysis

    The Journal of Time Series Analysis is the leading mathematical statistics journal focused on the important field of time series analysis. We welcome papers on both fundamental theory and applications in fields such as neurophysiology, astrophysics, economic forecasting, the study of biological data, control systems, signal processing, and communications and vibrations engineering.

  3. Time series analysis for psychological research: examining and

    The current paper introduces time series analysis to psychological research, an analytic domain that has been essential for understanding and predicting the behavior of variables across many diverse fields. First, the characteristics of time series data are discussed. ... Time series analysis in operant research. J. Appl. Behav. Anal.

  4. Frontiers

    In time series analysis, the autocorrelation coefficient across many lags is called the autocorrelation function (ACF) and plays a significant role in model selection and evaluation (as discussed later). A plot of the ACF of the Google job search time series after seasonal adjustment is presented in the bottom panel of Figure 3.In an ACF plot, the y-axis displays the strength of the ...

  5. An Introductory Study on Time Series Modeling and Forecasting

    2.1 Definition of A Time Series. A time series is a sequential set of data points, measured typically over successive times. It is mathematically defined as a set of vectors x ( t ), t = 0 ,1,2,... where t represents the time. elapsed [21, 23, 31]. The variable x (t ) is treated as a random variable.

  6. Time Series Analysis and Modeling to Forecast: a Survey

    Time series modeling for predictive purpose has been an active research area of machine learning for many years. However, no sufficiently comprehensive and meanwhile substantive survey was offered so far. This survey strives to meet this need. A unified presentation has been adopted for entire parts of this compilation. A red thread guides the reader from time series preprocessing to ...

  7. Time Series Analysis

    1878 papers with code • 3 benchmarks • 20 datasets. Time Series Analysis is a statistical technique used to analyze and model time-based data. It is used in various fields such as finance, economics, and engineering to analyze patterns and trends in data over time. The goal of time series analysis is to identify the underlying patterns ...

  8. [2202.07125] Transformers in Time Series: A Survey

    To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers. Accepted by 32nd International Joint Conference on Artificial Intelligence (IJCAI ...

  9. Time series analysis for psychological research: examining and ...

    The current paper introduces time series analysis to psychological research, an analytic domain that has been essential for understanding and predicting the behavior of variables across many diverse fields. First, the characteristics of time series data are discussed. Second, different time series modeling techniques are surveyed that can ...

  10. Highly comparative time-series analysis: the empirical structure of

    3.1. Empirical structure of time-series analysis methods. First, we analyse the structure in our library of time-series analysis operations when applied to a representative interdisciplinary set of 875 real-world and model-generated time series (a selection of time series in this set is illustrated in figure 1a). This carefully controlled set ...

  11. 105975 PDFs

    Forecasting of non-linear and non-stationary time series | Explore the latest full-text research PDFs, articles, conference papers, preprints and more on TIME SERIES ANALYSIS. Find methods ...

  12. Time Series Analysis: Research on Data Modeling Methods

    Since the autoregressive model was proposed by British statistician G. U. Yule in the early part of the last century, time series analysis has become a popular research direction for its wide application in the fields of economy, finance, engineering, and many others. The aim of this Special Issue is to bring together papers on the following ...

  13. Time-series forecasting of seasonal items sales using machine learning

    Another important characteristic of time-series is stationarity. A time series is called stationary if its statistical features (e.g., mean, standard deviation) continue steadily over time, and this is highly important because if a time-series is stationary, there is a high probability that it will repeat its behavior in the future, and therefore it will be easier to forecast (Jain, 2016).

  14. time series Latest Research Papers

    Time-series analysis, modeling and forecasting is an important research area that explores the hidden insights from larger set of time-bound data for arriving better decisions. In this work, data analysis on COVID-19 dataset is performed by comparing the top six populated countries in the world.

  15. Time series analysis and possible applications

    Data presented in form of time series as its analysis and applications recently have become increasingly important in different areas and domains. In this paper brief overview of some recently important standard problems, activities and models necessary for time series analysis and applications are presented. Paper also discusses some specific practical applications.

  16. PDF Neural Time Series Analysis with Fourier Transform: A Survey

    and efficiency of time series analysis. The ad-vantages of the Fourier transform for time series analysis, such as efficiency and global view, have been rapidly explored and exploited, exhibiting a promising deep learning paradigm for time series analysis. However, although increasing attention has been attracted and research is flourishing in this

  17. Forecast Methods for Time Series Data: A Survey

    Research on forecasting methods of time series data has become one of the hot spots. More and more time series data are produced in various fields. It provides data for the research of time series analysis method, and promotes the development of time series research. Due to the generation of highly complex and large-scale time series data, the construction of forecasting models for time series ...

  18. Time Series Forecasting

    5. Paper. Code. **Time Series Forecasting** is the task of fitting a model to historical, time-stamped data in order to predict future values. Traditional approaches include moving average, exponential smoothing, and ARIMA, though models as various as RNNs, Transformers, or XGBoost can also be applied. The most popular benchmark is the ETTh1 ...

  19. Applications of time series analysis in epidemiology: Literature review

    Time series analysis is a valuable tool in epidemiology that complements the classical epidemiological models in two different ways: Prediction and forecast. ... Previous research has found that with time, immunity to SARS-CoV-2 natural infection is short-lived and leads to a risk of reinfection[58-60].

  20. Time Series Analysis and Econometrics with Applications

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... Time series analysis and ...

  21. time series analysis Latest Research Papers

    Assessment of land degradation and restoration in coal mines of central India: A time series analysis. Ecological Engineering . 10.1016/j.ecoleng.2021.106493 . 2022 . Vol 175 . pp. 106493. Author (s): Tarun Kumar Thakur . Joystu Dutta .

  22. What the data says about abortion in the U.S.

    The CDC says that in 2021, there were 11.6 abortions in the U.S. per 1,000 women ages 15 to 44. (That figure excludes data from California, the District of Columbia, Maryland, New Hampshire and New Jersey.) Like Guttmacher's data, the CDC's figures also suggest a general decline in the abortion rate over time.

  23. A systematic review of Python packages for time series analysis

    This paper presents a systematic review of Python packages with a focus on time series analysis. The objective is to provide (1) an overview of the different time series analysis tasks and preprocessing methods implemented, and (2) an overview of the development characteristics of the packages (e.g., documentation, dependencies, and community size). This review is based on a search of ...

  24. Sensors

    With the acceleration of urbanisation, urban areas are subject to the combined effects of the accumulation of various natural factors, such as changes in temperature leading to the thermal expansion or contraction of surface materials (rock, soil, etc.) and changes in precipitation and humidity leading to an increase in the self-weight of soil due to the infiltration of water along the cracks ...

  25. PDF arXiv:2104.00164v2 [cs.LG] 26 Sep 2021

    The aim of time series segmentation is to identify the boundary points of segments in the data ow, to characterize the dynamical properties associated with each segment. Forecasting in time series is an important area of machine learning. Time series analysis helps in analyzing the past, which comes in handy to forecast the future.

  26. Economic analysis for impact of some monetary policy variables on the

    The research aims to analysis the impact of some economic policy variables on the value of agricultural resultant in Iraq for the period (2004-2020) using quarterly time series. The independent variables were used (foreign exchange window FC, narrow money supply M1, equilibrium exchange rate CE, interest rate imposed on agricultural loans R, value of agricultural imports M, while the value of ...