Evaluation of stability of directly standardized rates for sparse data using simulation methods

Background Directly standardized rates (DSRs) adjust for different age distributions in different populations and enable, say, the rates of disease between the populations to be directly compared. They are routinely published but there is concern that a DSR is not valid when it is based on a “small” number of events. The aim of this study was to determine the value at which a DSR should not be published when analyzing real data in England. Methods Standard Monte Carlo simulation techniques were used assuming the number of events in 19 age groups (i.e., 0–4, 5–9, ... 90+ years) follow independent Poisson distributions. The total number of events, age specific risks, and the population sizes in each age group were varied. For each of 10,000 simulations the DSR (using the 2013 European Standard Population weights), together with the coverage of three different methods (normal approximation, Dobson, and Tiwari modified gamma) of estimating the 95% confidence intervals (CIs), were calculated. Results The normal approximation was, as expected, not suitable for use when fewer than 100 events occurred. The Tiwari method and the Dobson method of calculating confidence intervals produced similar estimates and either was suitable when the expected or observed numbers of events were 10 or greater. The accuracy of the CIs was not influenced by the distribution of the events across categories (i.e., the degree of clustering, the age distributions of the sampling populations, and the number of categories with no events occurring in them). Conclusions DSRs should not be given when the total observed number of events is less than 10. The Dobson method might be considered the preferred method due to the formulae being simpler than that of the Tiwari method and the coverage being slightly more accurate.


Background
Directly standardized rates (DSRs) are routinely produced by national organizations to compare rates, such as for diseases, across different geographic areas. They are calculated by applying the observed age specific rates in each population to a population with a standard age distribution. As the rates usually apply to specific causes of death or disease, which are generally independent (apart from, for example, contagious diseases) and occur relatively infrequently, the Poisson distribution is used to model the occurrence of such events and to derive the confidence intervals for the standardized rates. There are four different approaches to estimating the confidence intervals for these rates: (i) using the normal approximation of the Poisson distribution, (ii) treating the DSR as a weighted sum of Poisson variables, (iii) assuming a gamma distribution, or (iv) assuming a beta distribution. The three most common methods of calculating confidence intervals for directly standardized rates are, firstly, the normal approximation for the total number of events [1] used, for example, in the IARC Cancer Incidence in Five Continents [2]. Secondly, the Dobson method which is an example of treating the DSR as a weighted sum of Poisson variables [3] and used, for example, by Public Health England in Official Statistics, such as for the Public Health Outcomes Framework [4]. Thirdly the Tiwari modified gamma method [5] which uses the beta distribution and is a modification of the gamma method proposed by Fay and Feuer [6] used by the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute in Bethesda, Maryland and the Italian Association of Cancer Registries [7].
It is known that the Normal approximation is appropriate only for large numbers of events (such as those occurring in whole countries) [8]. The Stata module "distrate" for calculating confidence intervals for directly standardized rates [7,9] provides two methods for calculating confidence intervals: the Tiwari method as the default method with the Dobson method available as an option. It states that the Tiwari method "produces valid confidence intervals even when the number of cases is very small." The Stata manual does not specify what "small" is [9]. There is uncertainty as to how "small" is too small, with the Australian Institute of Health and Welfare and the US Centers for Disease Control and Prevention recommending below 25 events [10,11].
However, small numbers of events often occur. For example, in the IARC Cancer Incidence in Five Continents publication, there were many cancers with less than ten cases in several populations [2]. A simulation study by Ng et al. [12] examined in detail the relative performance of many different models over several different scenarios. They concluded that the methods by Dobson and Tiwari performed the most consistently. However, their simulations were limited to only considering standardizing for six age groups, where all populations had more than 10 events and did not have widely varying age-specific event rates and age distributions. We were therefore interested in examining the sensitivity of the Dobson and Tiwari methods for fewer than 10 events occurring over 19 different age groups and with widely varying age-specific risks and population sizes. The aim of this study was to investigate the accuracy of the two methods compared with that of the normal approximation using simulation procedures which mirror a variety of plausible real-world scenarios occurring in the production of age-specific rates in England. Real data were then used to demonstrate the methods.

Definition of an accurate confidence interval
A confidence interval is considered "accurate" if the probability that it includes the true value (the "coverage") is close to the stated target probability; i.e., a 95% confidence interval should include the true value approximately 95% of the time. In this analysis we defined the terms "conservative," "liberal," and "accurate" as follows. If the coverage exceeds the stated probability by more than 40% the confidence interval is "conservative." If the coverage is under the stated probability by more than 40% the confidence interval is "liberal." Otherwise, it is accurate. In other words, a 95% confidence interval is considered accurate if its coverage is between 93% (100-(1.4 × 5))% and 97 (100-(0.6 × 5))%. It is generally considered preferable for confidence intervals to be conservative rather than liberal, but that can depend on the context of the analysis and the nature of any decisions to be based on the analysis.

Simulation methods
The number of events occurring in each of 19 5-year age specific categories (i.e., 0-4, 5-9, ... 90+ years) was estimated using the random Poisson generating function in Stata for the following scenarios and restrictions which are sufficient to uniquely specify the age specific rates (θ i ) and the size of the exposed population in each age specific category (d i ).
The scenarios considered and assumptions made were: 1. The size of the exposed population in each age category (d i ) was assumed to be a linear function of age with the ratio of the population in the youngest category to that in the oldest category being 1 (all age categories the same size), 5 or 50. This is not restrictive as when calculating the DSRs the order of the age groups is unimportant. The total exposed population was 190,000, with an average of 10,000 in each category. An additional scenario assumed the sample population had the same age distribution as that of the European Standard Population. 2. The age specific rates (θ i ) were assumed to be a linear function of age with the ratio of the rate (i.e., relative risk) in the oldest to that in the youngest being 1 (no association with age), 50, 500, or 5000. This is not restrictive as when calculating the DSRs the order of the age groups is unimportant.
For each scenario the observed DSRs were calculated using formula A below. The 95% confidence intervals were calculated for the normal approximation method, the Dobson method [3] and Tiwari method [5] using the formulae B, C, and D below. This was repeated to generate 10,000 sets of data. For each scenario: 1. the true DSR ( 1 P w i P w i θ i ) was calculated using the age specific rates (θ i ) and the European Standard Population weights (w i ) 2. the observed DSRs and confidence intervals were divided by the true DSR to enable simple comparisons between the scenarios to be made 3. the inclusion of 1 within the specified confidence intervals was noted 4. the variation of (European standard population/ sample population) divided by sum of (European standard population/sample population) 2 was calculated as a measure of how much the sample population differs from the European standard population.
Stata Version 14 was used to perform all simulations [9].

Calculating a directly standardized rate (DSR)
The DSR, R, is calculated as a weighted average of n age-specific rates ð x i d i Þ: Where x i are the observed age-specific numerator events; d i are the age-specific denominator populations; w i are taken from the 2013 European Standard Population (ESP): The normal approximation for calculating confidence intervals [4] The 100(1-α)% lower and upper confidence limits, LCL and UCL, are defined as:

The Dobson method of calculating confidence intervals [3]
The 100(1 − α)% lower and upper confidence limits, LCL and UCL, are defined as: The Tiwari modified gamma method of calculating confidence intervals [5] The 100(1 − α)% lower and upper confidence limits, LCL and UCL, are defined as: R, x i , d i and w i and Inv Χ 2 (π, ν) are defined as in A and C

Analysis of real data
The numbers of suicides occurring from 2013 to 2015 in the 326 local authority districts in England for males and females were analyzed using the procedure "distrate" in Stata Version 14. In each district the gender-specific DSRs were calculated using the 2013 European Standard Population weights and the Tiwari method to calculate the 95% confidence intervals. In addition, using the European Standard Population, the overall standardized rates for England were calculated for males and females separately. Suicides were chosen for the analysis because of the small numbers of deaths occurring at district level. Table 1 gives the results of the simulations for 30 scenarios: (Expected total number of events: 5,10,15,25 and 100) x (Sample populations: all age groups same size, 50 times larger in youngest compared with oldest, same distribution as European standard population) x (Relative risk in oldest vs. youngest: one and 5000).

Results
The Dobson and Tiwari methods differ in the method of adjusting for the effect of differing weights given to each age specific rate, with the Tiwari method in effect having a greater adjustment for the different weights. Both methods give identical confidence intervals and hence identical coverage when there is no weighting (the scenario in which the age distribution in the sample is the same as that in the standardized population) as shown in Fig. 1. For all three methods with very small numbers of events the coverage does not always improve with increasing sample size. This is because of the discrete nature of the data with only integer counts being able to occur. The coverage illustrated in Fig. 1 for a relative risk in the oldest age group vs. the youngest age group of 5000 is not materially different from that when the incidence is the same for all age groups (Table 1). Figure 2a shows that when the sample population has an equal number of people in each age group (very different from the standard population) and when the relative risk in the oldest age group vs. the youngest age group is 5000 the Tiwari method is able to provide more accurate coverage when the expected number of events is less than five, with both Dobson and Tiwari having accurate coverage when the expected number of events is six or more. Figure 2b plots the median values of the upper and lower 95% confidence limits and shows that the Dobson and Tiwari methods are very similar, with the Normal approximation predicting both the upper and lower confidence limits to be lower than those predicted by the other methods. In addition, using the Normal approximation to calculate the lower 95% confidence interval occasionally resulted in incorrect negative values being predicted.
In the situation where the youngest age groups have sample populations 50 times greater than the oldest age groups and the relative risks are the same in all age groups (Fig. 3), the Tiwari method will overcompensate for the differing weights (compensation is unnecessary as there is no age effect). Dobson is accurate for four or more expected number of events and Tiwari will be slightly conservative for fewer than 15 observations and accurate for 15 or more observations.
The effect of varying the distribution of incidence by age was examined by simulating data with the incidence ratio (oldest:youngest) varying from one (the events being evenly distributed across all categories) to 5000 (the events being strongly clustered towards the older ages). The effect of the degree of clustering was also examined by looking at the number of categories in which no events occurred (Fig. 4) using the Dobson method with the ratio of the incidence in the oldest compared with youngest category being 500. When 100 events are expected amongst 19 categories the lowest two groups are not expected to have any events occurring in them. This happened in 43% of all simulations. Seven percent of the simulations had four categories with zeroseven when this happened the estimated confidence intervals were "accurate." As the total number of events decreases the expected numbers of categories with zero events increases and therefore having a large number of zero categories does not automatically result in 95% CIs that do not reach the nominal coverage level (i.e., that are too liberal). In Fig. 4 the coverage appears to be lower when there are either a relatively high number of categories with zero events or a relatively low number of such categories. In fact, these situations occur rarely (the numbers on the graph indicate the relative frequency) and therefore it is the rarity of the situation that is associated with a low coverage rather than the number of categories per se with no events in them. The smaller the total number of events expected the greater the occurrence of extreme simulations (for 10 events, 10% of simulations had "inaccurate" 95% CIs whereas < 1% of simulations for 100 events had "inaccurate" 95% CIs).
In Figs. 1, 2, and 3 both Dobson and Tiwari provide either accurate or conservative 95% confidence intervals when the expected number of events is five or more. However, in most situations the expected number of events is unknown and all that is known is the observed number of events. Using the simulated data, Table 2 shows that 10 or more events are observed < 0.1% of the time if the expected number of events is < 3, 0.1% if the expected number is three and only 0.8% if the expected number is four. This means that if 10 or more events are observed the expected number of events is unlikely to be below five. Similarly if eight or more events are observed then 6.4% of the time the expected number is less than five, which is judged to be a relatively common occurrence. If nine events are observed then 2.4% of the time the expected number of events is less than five. The authors judge that it would be more cautious to insist that at least 10 events are observed, but clearly there could be situations were nine events might be considered a more appropriate cut-off. Therefore, if the observed number of events is 10 or more the 95% confidence interval is very likely to be "accurate." Figures 2b  and 3b show that when the expected total number of events was 10 or more, the upper confidence limit was slightly more than twice the DSR and the lower confidence limit was slightly less than 50% of the DSR. The median width of the confidence interval for 10 expected events is identical to the median width for 10 observed events. Therefore if 10 events are observed the 95% confidence interval is very likely to be accurate.
Real data analysis Table 3 shows that many of the local authority districts in England had very small numbers of suicides, particularly amongst females. Out of the 326 districts there were at least 10 male suicides occurring in 317 (97%) and at least 10 female suicides occurring in only 141 (43%). In those cases the confidence intervals calculated by the Dobson and Tiwari methods agreed to within +/− 1% for both the lower and upper bounds. For the cases with fewer than 10 suicides, as expected, the confidence intervals were less consistent with only around half agreeing to within 1%.
Amongst the nine districts in which fewer than 10 suicides in men were observed there were only two districts where the confidence interval (consistent for both methods) did not include the DSR for England of 1.58 per 10,000 (six and nine suicides in these two districts). Similarly amongst the 185 districts in which fewer than 10 suicides in women were observed there were only three districts where the confidence interval (consistent for both methods) did not include the DSR for England of 0.47 per 10,000 (one, one, and three suicides in these districts). In routine publication of DSRs the confidence intervals are presented to aid interpretation, with the guidance being that districts where the 95% CI includes the England comparator value should not assume there is any underlying reason for the difference as it is likely to be due to chance variation. For these data there are therefore only two districts in which this assumption may be misleading for the male suicide rates. For females the numbers of counts are so small that the confidence intervals are not reliable in many districts.

Discussion
The strength of this study is that it is based on realistic simulated data. Firstly, as is standard practice, the  European Standard Population weights for 19 5-year age groups were used. Secondly, the age specific rates were allowed to vary by as much as 5000 times for the oldest compared with the youngest age group. These variations do occur in real observed data: for example, the age specific rate of stroke deaths in men aged 90+ is over 5000 times that in men aged 15-19 [13]. Thirdly, the age distributions in the samples were allowed to vary considerably from the European Standard Population. This variation is measured by calculating the standardized variation of the ratio of the European Standard Population over the sample population for each age group (see methods section). The standardized variation ranged from 0 (for the scenario with the sample population having the same distribution as the standard population) to 0.00108 (ratio of size of the largest to smallest sample age group is equal to five) and 0.0133 (ratio of size of the largest to smallest sample age group is equal to 50).
The observed variations in 326 local authority districts in England for males and females from 2013 to 2015 were from 0.0000183 to 0.00175, smaller than the most extreme scenarios modeled.
In all the simulations the total population size was 190,000. This is similar to 170,000, which is the average population size in the 326 local authority districts in England and also allows 10,000 people in each of the 19 age groups. The results are not sensitive to this assumption.  Coverage of 95% confidence intervals according to the observed number of categories with no events [Numerical labels denote relative occurrence] and the total expected number of events for the Dobson method when the incidence in the oldest age is 500 times the incidence in the youngest age and the number of people is the same for each age in the sample population. [Grey band indicates "Accurate Coverage"] Rare scenarios, i.e., occurring < 1% of the time, have not been plotted Table 2 The proportion of times the observed number of events or more occurred according to the expected number of events when the incidence in the oldest age group is 500 times the incidence in the youngest age group  A limitation of this analysis is that it depends on the assumption that the occurrence of an event follows a Poisson distribution. Poisson distributions assume events are independent, an infinite number of events may occur over a long period of time, and that events occur only rarely in short periods of time. For most diseases and causes of death event independence is likely to be a reasonable assumption. It will not apply to rates of contagious diseases, for example, where the occurrences are not independent, or where extreme weather or events cause spikes in deaths from a single external factor. However, these confidence interval methods should not be used in such cases. The assumption of an infinite number of events occurring over a long period of time is also generally reasonable as for example when analysing a specific cancer the size of the population (i.e., the total number of deaths that could occur) is considerably greater than the numbers of actual deaths likely to occur and can therefore be thought of as infinite.
As stated in the methods section, specifying linear functions for the incidence and the population sizes is not restrictive, because when calculating the DSRs the order of the age groups is irrelevant. The effect of altering the incidence and age association is to create different amounts of clustering within the numbers of observed events; a ratio of 5000:1 in incidence between the highest and lowest categories will result in the lowest categories having very few events in contrast to a ratio of 1 which will spread the events evenly.
Both the Dobson and the Tiwari methods of calculating confidence intervals are influenced by the variation of the ratio of the population years in the standard population to the population years in the sample. Ng et al. [12] specified that if this variation was small (< 0.01as is the case in most of our modeled scenarios and the real data) then the Tiwari method was the recommended method if it was acceptable that the coverage was above the nominal valueas has been seen in these simulations. Ng et al. recommended the original gamma method proposed by Fay and Feuer [6] if one is more concerned with a symmetrical confidence interval that is closer to the nominal coverage.
The normal approximation for calculating confidence intervals was included in this study in order to provide a benchmark against which to judge the Dobson and Tiwari methods. In practice if the normal approximation was the method of choice then the use of continuity corrections to improve the fit to the normal distribution would need to be investigated [14], such as that suggested by Begaud et al. [15]. However as even standard spreadsheet packages can calculate the chi-squared distribution and can therefore calculate confidence intervals using both Dobson and Tiwari methods, efforts to improve the normal approximation were not considered further.
In agreement with Ng et al. [12] the coverage of the Dobson and Tiwari methods are both considered accurate for 10 or more observed events. Coverage from the Dobson method is consistently closer to 95%, with the coverage from the Tiwari method tending to be above 95%. However, as the confident limits from the two methods differ by less than 0.1% the differences in estimates are not significant.

Conclusion
The results from this simulation confirm those predicted from other studies [3,12,15] and lead to the recommendation that at least 10 events must have occurred for a directly standardized rate to be published. Both the Dobson and Tiwari methods produce "accurate" confidence intervals when 10 or more events are observed. As expected, the normal approximation should not be used for fewer than 25 events. The Dobson method might be considered the preferred method due to the formulae being simpler than that of the Tiwari method and the coverage being slightly closer to 95%.