Seasonality is an important component of disease manifestation. The presence of predictable seasonality is a clue to the possible etiology of disease, be it from microbial, environmental or social factors. Understanding seasonality is also essential for setting rational policy, particularly with respect to the planning for seasonal demands for health services.
For studying seasonality, several statistical methods are available ranging from simple graphical techniques to more advanced statistical methods. Additionally, autocorrelation functions can be examined to assess regularity of periodicity or seasonality. Several statistical tests have been introduced for studying the cyclical variation of time series data. For example, Edwards [1] developed a statistical test that locates weights corresponding to the number of observed cases for each month at 12 equally spaced points on a circle. The test is said to be significant if the calculated centre of the mass significantly deviates from the circle's centre. Jones et al [2] developed a test for determining whether incidence data for two or more groups have the same seasonal pattern. Further, Marrero [3] compared the performance of several tests for seasonality by simulation, which can be used as a guideline for selecting appropriate tests for a given data set based on the size of the data set and the shape of the sinusoidal curve. To apply any of these tests, however, observations must be aggregated into 12 monthly data points.
Several alternative tests, which do not require aggregated data, have also been developed. These include Fisher's Kappa (FK), which tests whether the largest periodogram is statistically different from the mean of periodograms; Bartlett's Kolmogorov-Smirnov (BKS) test, which statistically compares the normalized cumulative periodogram with the cumulative distribution function of a uniform zero and one random variable; and the X-11 procedure as used by the census bureaus in the United States and Canada [2, 4–11]. These tests utilize the frequency and time domain to detect seasonality. Each test provides an indication of the presence or absence of statistical significance of seasonality, however, they do not provide a sense of the magnitude of seasonality or how much variance is explained by seasonal occurrence in the data. This is particularly important in health care, as the presence of statistically significant seasonality may not translate into either etiologic or policy relevance.
In an effort to address the shortcomings in existing statistical methods, we propose the application of autoregressive regression models as a means for assessing the degree of accuracy to which a new observation can be predicted by stable (seasonal factors are constant over time) seasonal variation and use it for quantifying the strength of the seasonality within a set of serially correlated observations. In classical regression analysis the coefficient of determination, R
2, is a standard statistical tool for estimating the proportion of total variation of the dependent variable, which can be explained by explanatory variables. A crucial point in standard regression is that observations are independent of one another. However, time series observations can be serially autocorrelated and this correlation must be taken into account.
Autoregressive regression models are a natural generalization of standard regression models for analyzing correlated data. For monthly data, one can use dummy variables for months in a regression model as a single predictor, and then, after correcting for the autocorrelation, calculate the coefficient of determination,
. When the time series is stationary and the trend is eliminated, the statistical significance of the dummy variables (months) indicates seasonality. The relationship between the stable seasonal factors and the estimates of the regression equation parameters are as follows: suppose there are k years monthly, n = 12 k, trend removed and centred (mean deleted) observations. Let
, i = 1,2,...,12 denote the monthly average. The monthly averages,
s, can be interpreted as crude estimates of stable seasonal factors, therefore, the range of parameter estimates is a good estimate of the magnitude of seasonal variation. For estimating
one defines 11 dummy variables m
i
= 1 if month equals i, m
i
= 0 otherwise and then regress m
i
s on y
t
s. It is not difficult to show the ordinary least squared estimates of the parameters of the regression equation
are
and
. In practice
and parameters β
i
s are estimated simultaneously, therefore, the estimated parameters
s can be used as seasonal factor estimates.
The coefficient of determination, which lies between 0 and 1, can be used as a measure for the strength of the stable seasonality because it measures how well the next value can be predicted using month as the only predictor. When
is zero, there is no seasonality. When
is equal to 1, observations can be perfectly predicted for each month, which means that the variable month explains 100% of the variation in the data. In other words, there is a perfect seasonality. In practice we may characterize the strength of the seasonality based on different ranges of values for
. Similar to other measures of correlation or goodness of fit, we can interpret
as follows: values ranging from 0 to less than 0.4 may be characterized as non-existent to weak seasonality, 0.4 to less than 0.7 represent moderate to strong seasonality, and values ranging from 0.7 to 1 represent strong to perfect seasonality. The coefficient of determination,
, does not quantify the magnitude of the seasonal effect (the difference between peak and trough, which can be estimated by the difference between maximum and minimum parameter estimates of the regression equation) but rather it quantifies its strength (i.e., how well new observations can be predicted when month is the only predictor).
The purpose of this paper is to evaluate the utility of R-squared autoregression in explaining variance in assessing stable seasonality. To this end, we have examined the performance of the R-squared autoregression through a simulation study and using two data sets known to demonstrate statistically significant weak and strong seasonality: monthly hospitalizations for atrial fibrillation and asthma.