Autoregression as a means of assessing the strength of seasonality in a time series

Background The study of the seasonal variation of disease is receiving increasing attention from health researchers. Available statistical tests for seasonality typically indicate the presence or absence of statistically significant seasonality but do not provide a meaningful measure of its strength. Methods We propose the coefficient of determination of the autoregressive regression model fitted to the data () as a measure for quantifying the strength of the seasonality. The performance of the proposed statistic is assessed through a simulation study and using two data sets known to demonstrate statistically significant seasonality: atrial fibrillation and asthma hospitalizations in Ontario, Canada. Results The simulation results showed the power of the in adequately quantifying the strength of the seasonality of the simulated observations for all models. In the atrial fibrillation and asthma datasets, while the statistical tests such as Bartlett's Kolmogorov-Smirnov (BKS) and Fisher's Kappa support statistical evidence of seasonality for both, the quantifies the strength of that seasonality. Corroborating the visual evidence that asthma is more conspicuously seasonal than atrial fibrillation, the calculated for atrial fibrillation indicates a weak to moderate seasonality ( = 0.44, 0.28 and 0.45 for both genders, males and females respectively), whereas for asthma, it indicates a strong seasonality ( = 0.82, 0.78 and 0.82 for both genders, male and female respectively). Conclusions For the purposes of health services research, evidence of the statistical presence of seasonality is insufficient to determine the etiologic, clinical and policy relevance of findings. Measurement of the strength of the seasonal effect, as can be determined using the technique, is also important in order to provide a robust sense of seasonality.

Measurement of the strength of the seasonal effect, as can be determined using the technique, is also important in order to provide a robust sense of seasonality.

Background
Seasonality is an important component of disease manifestation. The presence of predictable seasonality is a clue to the possible etiology of disease, be it from microbial, environmental or social factors. Understanding seasonality is also essential for setting rational policy, particularly with respect to the planning for seasonal demands for health services.
For studying seasonality, several statistical methods are available ranging from simple graphical techniques to more advanced statistical methods. Additionally, autocorrelation functions can be examined to assess regularity of periodicity or seasonality. Several statistical tests have been introduced for studying the cyclical variation of time series data. For example, Edwards [1] developed a statistical test that locates weights corresponding to the number of observed cases for each month at 12 equally spaced points on a circle. The test is said to be significant if the calculated centre of the mass significantly deviates from the circle's centre. Jones et al [2] developed a test for determining whether incidence data for two or more groups have the same seasonal pattern. Further, Marrero [3] compared the performance of several tests for seasonality by simulation, which can be used as a guideline for selecting appropriate tests for a given data set based on the size of the data set and the shape of the sinusoidal curve. To apply any of these tests, however, observations must be aggregated into 12 monthly data points.
Several alternative tests, which do not require aggregated data, have also been developed. These include Fisher's Kappa (FK), which tests whether the largest periodogram is statistically different from the mean of periodograms; Bartlett's Kolmogorov-Smirnov (BKS) test, which statistically compares the normalized cumulative periodogram with the cumulative distribution function of a uniform zero and one random variable; and the X-11 procedure as used by the census bureaus in the United States and Canada [2,[4][5][6][7][8][9][10][11]. These tests utilize the frequency and time domain to detect seasonality. Each test provides an indication of the presence or absence of statistical significance of seasonality, however, they do not provide a sense of the magnitude of seasonality or how much variance is explained by seasonal occurrence in the data. This is particularly important in health care, as the presence of statistically significant seasonality may not translate into either etiologic or policy relevance.
In an effort to address the shortcomings in existing statistical methods, we propose the application of autoregressive regression models as a means for assessing the degree of accuracy to which a new observation can be predicted by stable (seasonal factors are constant over time) seasonal variation and use it for quantifying the strength of the seasonality within a set of serially correlated observations. In classical regression analysis the coefficient of determination, R 2 , is a standard statistical tool for estimating the proportion of total variation of the dependent variable, which can be explained by explanatory variables. A crucial point in standard regression is that observations are independent of one another. However, time series observations can be serially autocorrelated and this correlation must be taken into account.
Autoregressive regression models are a natural generalization of standard regression models for analyzing correlated data. For monthly data, one can use dummy variables for months in a regression model as a single predictor, and then, after correcting for the autocorrelation, calculate the coefficient of determination, . When the time series is stationary and the trend is eliminated, the statistical significance of the dummy variables (months) indicates seasonality. The relationship between the stable seasonal factors and the estimates of the regression equation parameters are as follows: suppose there are k years monthly, n = 12 k, trend removed and centred (mean deleted) observations. Let , i = 1,2,...,12 denote the monthly average. The monthly averages, s, can be interpreted as crude estimates of stable seasonal factors, therefore, the range of parameter estimates is a good estimate of the magnitude of seasonal variation.  The coefficient of determination, which lies between 0 and 1, can be used as a measure for the strength of the stable seasonality because it measures how well the next value can be predicted using month as the only predictor.
When is zero, there is no seasonality. When is equal to 1, observations can be perfectly predicted for each month, which means that the variable month explains 100% of the variation in the data. In other words, there is a perfect seasonality. In practice we may characterize the strength of the seasonality based on different ranges of values for . Similar to other measures of correlation or goodness of fit, we can interpret as follows: values ranging from 0 to less than 0.4 may be characterized as non-existent to weak seasonality, 0.4 to less than 0.7 represent moderate to strong seasonality, and values ranging from 0.7 to 1 represent strong to perfect seasonality. The coefficient of determination, , does not quantify the magnitude of the seasonal effect (the difference between peak and trough, which can be estimated by the difference between maximum and minimum parameter estimates of the regression equation) but rather it quantifies its strength (i.e., how well new observations can be predicted when month is the only predictor).
The purpose of this paper is to evaluate the utility of Rsquared autoregression in explaining variance in assessing stable seasonality. To this end, we have examined the performance of the R-squared autoregression through a simulation study and using two data sets known to demonstrate statistically significant weak and strong seasonality: monthly hospitalizations for atrial fibrillation and asthma.

Statistical methods
The autoregressive linear regression model for monthly observations is defined as: where Y t is the observed time series, X t is the design matrix (a k × 12 matrix of 0 and 1), β = (µ, β 1 , …, β 11 )' is the vector of parameters, ε t is the error term that follows an autoregressive model of order p. Also, we assume that e t is normally and independently distributed with mean zero and variance [12]. Standard statistical packages (e.g. SAS) can be used for estimating the parameters and the coefficient of determination, , after adjusting for correlated error terms.

Simulation
In order to assess the performance of the proposed for measuring the strength of the seasonality of a time series, a simulation study was conducted. Following this, the proposed technique was applied to two real data sets. The SAS software, version 8.2 (SAS Institute Inc. Cary, North Carolina) is used for simulating monthly observations and calculating .
We simulated 1000 replications of monthly observations over 10 years from the following model: and calculated the for each replication. By changing α, φ 1 , φ 2 , and φ 12 the coefficients, this model generates observations with a cyclical trend component of period 12, observations from a seasonal ARMA model with a seasonal period of 12, and a combination of both. By changing the coefficients of the model we can generate pure white noise to highly correlated data with seasonal patterns. For example, a model with φ 1 = φ 2 = φ 12 = 0 generates a series of observations, which is a mixture of white noise plus a cyclical trend. The size of α controls the contribution of the seasonal trend in the generated observations. Parameters φ 1 and φ 2 control the correlation structure of the simulated data. When φ 1 = 0.9, for example, highly correlated observations are generated. When parameter φ 12 is nonzero, the model generates observations with a stochastic seasonal component. Similarly, nonzero φ 12 combined with nonzero φ 1 or φ 2 generates a series with a stochastic seasonal component which is correlated with the non-seasonal components. When all parameters are nonzero the generated observations have a complex structure which depends on a cyclical trend, a stochastic seasonal component, and its correlation structure. In our simulation, we set the order of the error terms in the autoregression model to 2 (for all simulated observations). The mean and standard deviation of 1000 calculated procedures are given in Table 1.
The calculated increases as α increases regardless To investigate the linkages between the and p, the order of the autoregression model, we repeated the simulation experiments with p = 1,2,4,8, and 13. Also an additional simulation experiment using the stepwise autoregression method was conducted to select the order of the autoregressive error model whereby the maximum possible autoregressive order was set equal to 13. The stepwise autoregression method involves fitting a high order model and then sequentially removing parameters until all remaining autoregressive parameters remain statistically significant [13]. For the fixed order, p, simulation results (not shown here) showed that is robust to over-fitting the data and p does not significantly affect the estimated . Results from the stepwise experiments showed that relative to fixed orders, p, there were no significant differences in the . Results were, however, slightly more conservative using the stepwise method.
In order to apply the autoregression procedure to actual data it is important to eliminate the nonstationarity in the mean and variance of the observations. For the data used in the study the Dickey-Fuller unit root test [14] is used to test the stationarity of the series and determine the order of differencing required for the nonstationary series. The SAS procedure, AUTOREG, was used for calculating . In all cases p, the order of the autoregressive model for error terms was selected using the stepwise autoregression method. The maximum possible order was set equal to 13.

Data Sources
The data were derived from two retrospective, populationbased, cross-sectional time series studies assessing temporal patterns in all discharge separations for asthma (from April 1, 1988 to March 31, 2000) and atrial fibrillation (from April 1, 1988 to March 31, 2001) for the population of Ontario. Approximately 14 million residents of Ontario, Canada eligible for universal health care coverage during this time were included for analysis. The database used was the Canadian Institute for Health Information (CIHI) Discharge Abstract Database which records discharges from all Ontario acute care hospitals. All records with a most responsible discharge diagnosis of atrial fibrillation (ICD-9 code: 427.3) and asthma (ICD-9 code: 493) were selected. The numerator consisted of the total number of discharge separations for each month. Denominators were constructed from annual census data provided by Statistics Canada for each age group for residents of Ontario. Monthly population estimates were created through linear interpolation.

Atrial fibrillation
Overall, there were 90,199 (45,477 female and 44,472 male) discharge separations for all ages. Figure 1 shows the monthly rates of admission per 100,000 population. There is a conspicuous upward trend in admissions over the first four years (Figure 1). Visual inspection does not support conspicuous seasonality.

Parameters
(Std) The calculated for both genders, females, and males are 0.44, 0.28, and 0.45, respectively. The difference between maximum and minimum months parameter estimates are 2.79, 1.87, 3.85 for both genders, females and males respectively which can be interpreted directly as the difference in hospitalizations per 100,000 population between peak and trough months in one year. The small values for the amplitude of seasonal factors, and the low values of the indicate a weak to non-existent seasonality.

Asthma
In total, there were 206,561 (104,283 female and 102,278 male) asthma hospital discharges for all ages. Figure 2 shows the monthly rates of asthma per 100,000 population. The visual inspection of Figure 2 shows a clear autumn peak and summer trough seasonal pattern occurring every year over the 12 year period. The Dicky-Fuller unit root test confirms that the series is stationary. The Atrial fibrillation hospitalizations per 100,000 population Figure 1 Atrial fibrillation hospitalizations per 100,000 population

Discussion
The results of this study show that the can be use-  be predicted by using month as the only predictor. In other words it shows the contribution of seasonality in the total variation of the data.
When the technique was applied to the two data sets, it corroborated the visual evidence that asthma is more conspicuously seasonal than atrial fibrillation. The seasonality of asthma has been conclusively demonstrated in several studies and is likely a key to understanding the etiology of exacerbations of asthma [15][16][17]. The strength and consistency of the effect is likely of relevance to health policy and planning. The seasonality of atrial fibrillation has been reported, but outcomes were reported as relative risks [18]. The analysis provided here indicates that the seasonality of admissions for atrial fibrillation is not likely of policy or clinical significance as the magnitude is quite small.
One important question that remains to be answered is how the magnitude of seasonal factor changes over time affect the . For non-stable seasonal variation, a proper transformation such as log may be required to transfer a non-stable seasonal variation to a stable one.
The value of may change if the sampling period changes (e.g. monthly data converted to weekly data). By including year and month as predictors we can adjust for moving seasonality, however, further research is required.

Conclusions
The proposed autoregression method is a statistical technique well suited to the study of seasonality in health data. Although monthly data was used for this analysis, it can easily be applied to weekly or seasonal data. The approach allows researchers to quantify and compare the strength of the seasonality for different genders and age groups. The coefficient of determination is easy to calculate and interpret. And finally, it is well known to health care researchers and is frequently used as a measure for goodness of fit. For the purposes of health services research and population health measurement, evidence of the statistical presence of stable seasonality is insufficient to determine the etiologic, clinical and policy relevance of findings. Measurement of the strength of the seasonal effect is also required in order to provide a robust sense of seasonality. We believe that this autoregression technique, in concert with statistical testing, graphical representation and measures of the absolute magnitude of seasonal effect, is an important component to this robust approach.