Using multi-year national survey cohorts for period estimates: an application of weighted discrete Poisson regression for assessing annual national mortality in US adults with and without diabetes, 2000–2006

Background: Monitoring national mortality among persons with a disease is important to guide and evaluate progress in disease control and prevention. However, a method to estimate nationally representative annual mortality among persons with and without diabetes in the United States does not currently exist. The aim of this study is to demonstrate use of weighted discrete Poisson regression on national survey mortality follow-up data to estimate annual mortality rates among adults with diabetes.
Methods: To estimate mortality among US adults with diabetes, we applied a weighted discrete time-to-event Poisson regression approach with post-stratification adjustment to national survey data. Adult participants aged 18 or older with and without diabetes in the National Health Interview Survey 1997–2004 were followed up through 2006 for mortality status. We estimated mortality among all US adults, and by self-reported diabetes status at baseline. The time-varying covariates used were age and calendar year. Mortality among all US adults was validated using direct estimates from the National Vital Statistics System (NVSS).
Results: Using our approach, annual all-cause mortality among all US adults ranged from 8.8 deaths per 1,000 person-years (95% confidence interval [CI]: 8.0, 9.6) in year 2000 to 7.9 (95% CI: 7.6, 8.3) in year 2006. By comparison, the NVSS estimates ranged from 8.6 to 7.9 (correlation = 0.94). All-cause mortality among persons with diabetes decreased from 35.7 (95% CI: 28.4, 42.9) in 2000 to 31.8 (95% CI: 28.5, 35.1) in 2006. After adjusting for age, sex, and race/ethnicity, persons with diabetes had 2.1 (95% CI: 2.01, 2.26) times the risk of death of those without diabetes.
Conclusion: Period-specific national mortality can be estimated for people with and without a chronic condition using national surveys with mortality follow-up and a discrete time-to-event Poisson regression approach with post-stratification adjustment.


Background
National surveillance of incidence, prevalence, and mortality is key to guiding and evaluating progress in chronic disease control and prevention. The prevalence of a chronic disease like diabetes can be affected by increasing incidence among persons without the disease as well as decreasing mortality among persons with the disease. Good estimates of mortality among persons with a chronic disease improve understanding of secular changes in prevalence, incidence, and mortality and their relationships. Since diabetes is often not recorded on death certificates as a direct, underlying, or contributing cause of death, the impact of diabetes on deaths in the United States population could be underestimated [1,2]. The linkage of nationally representative surveys that include baseline disease status with mortality follow-up provides the opportunity to examine all-cause and cause-specific mortality among persons with diabetes or other chronic conditions.
From the policymaking and resource allocation perspectives, a cross-sectional estimate of mortality by calendar period (e.g., year) is highly desirable. Analyses of mortality follow-up data typically use survival approaches to examine the association between risk factors and death. In these analyses, the data are analyzed as a cohort covering the entire follow-up period, and the hazard of death is estimated for the cohort. However, this approach does not permit estimation of the hazard of death across time periods, nor does it provide valid annual or other calendar period estimates.
By following the conceptual framework of age-period-cohort (APC) analysis as represented by the Lexis diagram, multi-year cohort data can be decomposed into discrete time-to-event data and aggregated by calendar period [3,4]. Calendar period all-cause mortality rates can be calculated by simply dividing the total number of deaths by the total person-years in each calendar period. Poisson regression, a generalized linear model, is appropriate for modeling unadjusted and adjusted mortality rates of multiple periods [4]. Discrete Poisson regression yields estimates identical to those of the piecewise exponential model, which is another alternative to the Cox proportional hazards model [5]. Nevertheless, most discrete time-to-event studies use aggregated group data and categorized independent variables [6]; we are not aware of previous publications using discrete Poisson regression applied to multi-year mortality follow-up data from national sample surveys.
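The crude period rate described above (deaths divided by person-years within each calendar period) can be sketched in a few lines. This is only an illustrative sketch, not the code used in the study: the function name `period_rates` and the toy records are hypothetical, and sampling weights are ignored here.

```python
from collections import defaultdict

def period_rates(records, per=1000):
    """Crude mortality rate per calendar year.

    records: iterable of (calendar_year, person_years, died) tuples,
             one tuple per person per calendar year (hypothetical layout).
    Returns {year: deaths per `per` person-years}.
    """
    deaths = defaultdict(float)
    pyears = defaultdict(float)
    for year, py, died in records:
        deaths[year] += died
        pyears[year] += py
    return {y: per * deaths[y] / pyears[y] for y in sorted(pyears)}

# Toy, made-up discrete person-year records (not NHIS data).
records = [
    (2000, 1.0, 0), (2000, 0.5, 1), (2000, 1.0, 0),
    (2001, 1.0, 0), (2001, 0.25, 1),
]
rates = period_rates(records)
```

With these toy records, year 2000 has 1 death over 2.5 person-years and year 2001 has 1 death over 1.25 person-years.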
In this study, to raise awareness that cross-sectional period mortality can be estimated from multi-year national survey mortality follow-up data, we describe the construction of discrete survival time data in detail and demonstrate our approach from data preparation to data analysis. With diabetes as an example, we use population-weighted Poisson regression to model discrete survival time and estimate annual all-cause mortality by diabetes status. The US National Health Interview Survey (NHIS) 1997–2004 with mortality follow-up through 2006 was used to illustrate this approach; US mortality estimates from the National Vital Statistics System (NVSS) were used for comparison and validation.

Continuous time-to-event data
Survival analysis studies the occurrence and timing of events. Individual time-to-event data include three components: the time of study entry ($t_0$), the time of study exit ($t_1$), and the event (i.e., death (D) or censoring (C)). In this study, the total follow-up time is the difference between the end of follow-up ($t_1$, date of death or date of censoring, whichever came first) and the date of the NHIS baseline interview ($t_0$). We used a modified Lexis diagram to demonstrate the structure of continuous time-to-event survival data (Fig. 1, part 1a) [4]. The follow-up experience of each participant $i$, represented by the diagonal line segment from $(\text{year}_{t_{0i}}, \text{age}_{t_{0i}})$ to $(\text{year}_{t_{1i}}, \text{age}_{t_{1i}})$, is shown on the plot of age versus calendar year. For clinical trials with a short follow-up time and matched age, follow-up time is usually used as the time scale in the survival analysis. For observational epidemiological studies with much longer follow-up time and a diverse age distribution of the observed sample, it has become popular to use age during follow-up as the time scale [7–9]. Typical continuous time-to-event data have a single record for each participant (Table 1, Part I). For example, Person A entered the cohort on 02/06/2000 and had a total follow-up time of 3.5 years; Person B entered the cohort on 07/02/2003 and also had a total follow-up time of 3.5 years.

Discrete time-to-event survival data
To calculate a period-specific mortality rate, we divided the continuous survival times into discrete calendar years. Interviews did not all take place on the first day of the survey year. To make sure the survival time was allocated correctly, we added an individual-specific partial time period ($t_{\text{ext},i}$), calculated as the difference between the interview date and the first day of the year, and then divided the extended survival time $(t_1 - t_0) + t_{\text{ext},i}$ into years. That is, each person's total continuous survival time was discretized into multiple records, one for each calendar year. An individual's survival time for a given calendar year was between 0 and 1 year, and the survival time in the first year was $1 - t_{\text{ext},i}$. In the analysis, age and calendar year were treated as time-varying (i.e., time-dependent) covariates. The age during each discrete period was assigned as the age on the first day of that calendar year.
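The splitting step described above can be illustrated with a short sketch. It is not the authors' code: it assumes, for simplicity, that entry and exit times are expressed as decimal calendar years (so 2000.25 means one quarter of the way through 2000), and the function name `split_by_calendar_year` and the example person are hypothetical.

```python
import math

def split_by_calendar_year(entry, exit_, died, age_at_entry):
    """Split one person's follow-up into one record per calendar year.

    entry, exit_ : decimal calendar years (assumed representation).
    died         : True if follow-up ended in death, False if censored.
    age_at_entry : age (years) at the baseline interview.
    """
    rows = []
    year = math.floor(entry)
    while year < exit_:
        start = max(entry, float(year))       # first year starts at the interview
        stop = min(exit_, float(year + 1))    # last year stops at death/censoring
        is_last = stop >= exit_
        rows.append({
            "year": year,
            # Age on the first day of this calendar year.
            "age": age_at_entry + (year - entry),
            "person_years": stop - start,
            # The event is attached only to the final record.
            "died": 1 if (died and is_last) else 0,
        })
        year += 1
    return rows

# Hypothetical person: interviewed a quarter into 2000, followed 3.5 years, died.
rows = split_by_calendar_year(entry=2000.25, exit_=2003.75,
                              died=True, age_at_entry=50.0)
total = sum(r["person_years"] for r in rows)
```

Here the first record carries 0.75 person-years (that is, 1 minus the partial period before the interview), the two middle records carry a full year each, and the discretized person-years add back up to the 3.5 years of continuous follow-up.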
With each participant contributing multiple discrete person-years during the follow-up, the sum of a person's discretized annual person-years is equal to the total continuous survival time of that person ( Fig. 1

US national health interview survey and mortality follow-up
We used the NHIS mortality follow-up data to demonstrate our approach. The NHIS, conducted by the Centers for Disease Control and Prevention's National Center for Health Statistics (NCHS), is an annual ongoing nationally representative cross-sectional household interview survey of US non-institutionalized civilians of all ages [10]. The sampling plan covers the 50 states and the District of Columbia, and follows a multistage area probability design that permits the representative sampling of households and non-institutional group quarters. The annual response rate of NHIS is approximately 80% of the eligible households in the sample [10]. All information about sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and others), and diabetes status was self-reported. Participants were classified as
[Fig. 1 caption, part 1b: the time-to-event was split yearly; the y-axis shows the discretized age during the follow-up calendar year, using the age on the first day of the year. The discrete age increased yearly with the follow-up.]
Diabetic death was defined as a death with an associated International Classification of Diseases, 10th Revision (ICD-10) code of E10-E14. All-cause death with diabetes was defined as death from any cause in a person with diabetes. The total weighted person-time was used as the denominator for mortality calculation. We also estimated mortality by self-reported diagnosed diabetes at baseline. To validate our findings empirically, we compared the all-cause and diabetic mortality rates from NHIS with mortality rates from the NVSS, the fundamental source of US cause-of-death information. Mortality rates from the NVSS were directly calculated as the number of deaths (all-cause or diabetic death coded as E10 to E14) divided by the total population, using structured query language from CDC WONDER and following the step-by-step instructions on the WONDER website (http://wonder.cdc.gov/mortSQL.html).
To reduce potential selection bias due to respondents being healthier than non-respondents, we excluded each individual's first two years of follow-up. The final analytical discrete time-to-event data set included adults aged 20 years or older during the years 2000 to 2006.

Poisson regression
Poisson regression was used to analyze and estimate the mortality rate [11]. The mortality (hazard) rate can be estimated using the following equation when follow-up times (pt) vary across individuals:
$$\log(d) = \log(pt) + \sum_{i} \beta_i X_i$$
Here, the natural logarithm of the expected number of events, log(d), with an offset of the natural logarithm of follow-up time, log(pt), is a linear combination of the independent covariates, $X_i$, with regression parameters $\beta_i$.
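To make the role of the log person-time offset concrete, the following is a minimal sketch of a weighted Poisson rate model with an intercept and a single binary covariate, fitted by Newton-Raphson in plain Python. It is not the study's Stata analysis and does not reproduce design-based variance estimation; the function name `fit_poisson_rate`, the row layout, and the toy data are all assumptions made for illustration.

```python
import math

def fit_poisson_rate(rows, iters=25):
    """Fit log(E[d]) = log(pt) + b0 + b1*x by Newton-Raphson.

    rows: list of (d, pt, x, w) = (death indicator, person-time,
          binary covariate, analysis weight). Hypothetical layout.
    Each record's contribution to the score and information is
    multiplied by its weight (a weighted pseudo-likelihood).
    """
    tot_d = sum(w * d for d, _, _, w in rows)
    tot_pt = sum(w * pt for _, pt, _, w in rows)
    b0, b1 = math.log(tot_d / tot_pt), 0.0     # start at the crude rate
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d, pt, x, w in rows:
            mu = pt * math.exp(b0 + b1 * x)    # expected deaths; pt is the offset
            g0 += w * (d - mu)
            g1 += w * (d - mu) * x
            h00 += w * mu
            h01 += w * mu * x
            h11 += w * mu * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy, made-up person-year records: x = 1 marks diabetes at baseline.
rows = [
    (0, 1.0, 0, 1200.0), (1, 0.5, 0, 800.0), (0, 1.0, 0, 1000.0),
    (1, 0.75, 1, 900.0), (0, 1.0, 1, 1100.0),
]
b0, b1 = fit_poisson_rate(rows)
rate_without = math.exp(b0)        # fitted rate when x = 0
rate_with = math.exp(b0 + b1)      # fitted rate when x = 1
rate_ratio = math.exp(b1)
```

In this saturated two-group case, the fitted rates coincide with the weighted deaths divided by the weighted person-time in each group, which is a convenient check that the offset is doing what the equation says.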
Poisson regression provided the estimate of mortality for each calendar year/period. We used robust error variance estimation to account for overdispersion [12] and a polynomial function of calendar time to smooth year-to-year variation in mortality rates [6,13]. To smooth the variation in mortality due to low mortality rates in some age subgroups, the age at the beginning of a calendar year was defined as a continuous variable with polynomial terms (quadratic polynomial). The mortality rates in our study were estimated as predictive margins from the Poisson model.

Adjusted sampling weights for the discrete time-to-event data
The age of sampled participants in each survey cohort increased with the year of follow-up, and those multiyear survey cohorts also overlapped time periods. Without accounting for the demographic discrepancy between the participants from different cohorts and the US population at each specific year, the demographic distribution of a discrete period after the baseline year would not represent the demographic distribution of the US population at that specific year or time-period, and the total crude mortality of the US population would be biased toward the older population. In order to correct for these issues, we adjusted the sample weights using a post-stratification procedure in which sampled units were divided into subgroups based on age, sex, and race/ethnicity; we used the nationally representative weighted size of each subgroup of NHIS 2000 to 2006 at interview to estimate the US population size. The analysis weights for the discrete time-to-event data were reweighted proportionally. The adjusted analysis weights thus sum to the US population size within each subgroup. The sum of the analysis weights equaled the total non-institutionalized US population for each calendar year.
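The proportional reweighting described above can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the study's procedure in full: the stratum labels, population totals, and the function name `post_stratify` are hypothetical, and each record is assumed to already carry its calendar year and its age-sex-race/ethnicity stratum.

```python
from collections import defaultdict

def post_stratify(records, population):
    """Rescale weights so they sum to the population in each cell.

    records    : list of dicts with 'year', 'stratum', 'weight'.
    population : {(year, stratum): population size} (assumed known,
                 e.g. from the weighted interview-year survey).
    Returns new dicts with an added 'adj_weight'.
    """
    totals = defaultdict(float)
    for r in records:
        totals[(r["year"], r["stratum"])] += r["weight"]
    adjusted = []
    for r in records:
        key = (r["year"], r["stratum"])
        factor = population[key] / totals[key]   # proportional adjustment
        adjusted.append({**r, "adj_weight": r["weight"] * factor})
    return adjusted

# Toy, made-up follow-up records for one calendar year: the aged cohort
# over-represents the older stratum relative to the population.
records = [
    {"year": 2003, "stratum": "65+ F", "weight": 1500.0},
    {"year": 2003, "stratum": "65+ F", "weight": 2500.0},
    {"year": 2003, "stratum": "18-64 F", "weight": 5000.0},
]
population = {(2003, "65+ F"): 3000.0, (2003, "18-64 F"): 8000.0}
adjusted = post_stratify(records, population)
older_sum = sum(r["adj_weight"] for r in adjusted if r["stratum"] == "65+ F")
younger_sum = sum(r["adj_weight"] for r in adjusted if r["stratum"] == "18-64 F")
```

In this toy case the older stratum is scaled down and the younger stratum is scaled up, so the adjusted weights sum to the assumed population size in each cell and, in total, for the calendar year.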

Analysis
We used Stata 13.1 (StataCorp LP, College Station, Texas) to account for the complex multistage sampling design and to produce weighted estimates and 95% confidence intervals (CI).
For all comparisons we used a two-sided statistical test with significance defined as p value (p) <0.05 or a 95% CI that did not include the null value. The ggplot2 package of R was used to produce graphics [14].

Results
From 1997 to 2006, the US population increased in total numbers and mean age, decreased in the proportion non-Hispanic white, and increased in prevalence of diabetes (all p values <0.001). The unweighted total number of deaths by year in the NHIS follow-up sample increased from 614 in year 2000 to 2,046 in 2006 (Table 2). The weighted total numbers of deaths from the NHIS follow-up (data not shown) were less than, but very close to, the total numbers of deaths of all US adults aged 20 to 84 years using the NVSS (Table 2).
To show the importance of post-stratification reweighting, we compared the NHIS follow-up estimates with the results from NVSS and the NHIS estimates that used the original weights (Table 2). Mortality estimates using the original sampling weights without post-stratification adjustment were higher than the mortality estimates using adjusted sampling weights, because of the aging of cohorts during the follow-up. Mortality in each year from the NVSS was within the 95% CIs of mortality rates from the NHIS using the adjusted sampling weights. The average annual decrease in crude mortality (per 1,000 person-years) was 0.12 for both the NHIS and the NVSS. The correlation of NHIS and the NVSS mortality was 0.94. Age-sex-race/ethnicity-adjusted mortality decreased 2.6% per year (p < 0.001).
To demonstrate the flexibility of our approach, we calculated the sex-race/ethnicity-adjusted, age-specific, all-cause mortality by diabetes status at baseline using polynomial Poisson regression (Fig. 2). In summary, adults with diabetes at baseline had 2.31 (95% CI: 2.12, 2.50) times the risk of death compared with adults without diabetes (after adjusting for sex and race/ethnicity).

Discussion
Period mortality among persons with chronic conditions such as diabetes is an important surveillance indicator of disease prevention and control. However, since chronic disease status is not reported in many vital statistics registries, it is often not possible to use vital statistics data to estimate mortality of persons with and without the condition. This presents a particular limitation for diabetes-related death statistics because diabetic death is often not recorded on US death certificates as a direct underlying or contributing cause of death, and diabetic deaths in the US population could be underestimated by solely using death certificate information [1,2]. Assembly of national cohorts by linking national survey data with vital statistics provides a potential remedy to the data gap, but requires specific methods to permit estimation of period effects. In this study, we described the use of weighted discrete Poisson regression to estimate national mortality rates by diabetes status. The method that is most often used to analyze mortality cohort data is the Cox proportional hazards regression model, which is useful for analyzing the data from the association or cause-effect relationship perspective. However, it is cumbersome to use this method to calculate hazard rates for a large number of combinations of predictors. Alternatively, parametric survival models can be more convenient for prediction, but cannot deal easily with time-varying covariates [15].
Age-period-cohort (APC) analysis provides a third option. If a vital statistics registry includes complete information on disease status, the APC method can be used to estimate the annual/period mortality among persons with and without diabetes [4,6,11,[16][17][18]. In the US, diabetes status is not recorded in the national vital statistics registry system. So we cannot apply this method directly. However, US nationally representative survey mortality follow-up data provide information on both diabetes status and death status. The APC model and life table framework can be applied to these data.
The APC analysis has long been applied in demography, social science, and disease surveillance research using cross-sectional registration or survey data [19]. The data are usually cross-sectional and grouped for data analyses. One of the major purposes of these studies was to separate the age, period, or cohort effects using cross-sectional data [20]. In our study, we applied the concept and analytic framework of this widely used APC model. Compared to traditional APC models, our study had several differences. First, our study used longitudinal national complex survey mortality follow-up data. Second, the purpose was to estimate period mortality, which is a sum of the age and cohort effects. Finally, to account for the aging of the cohort during follow-up, we post-stratified the aggregated multiple segments from different survey cohorts using the US population structure at each period.
Both Poisson and logistic regression can be used for discretized time-to-event data analysis. Efron combined logistic regression with survival time discretized into 1-month intervals and obtained direct estimates of the hazard rates [21]. A polynomial or spline model can be used to smooth out the random variation/noise. This partial logistic regression gives good estimates when the discrete time interval is small. Nevertheless, a Poisson regression that accounts for person-time of follow-up gives more accurate hazard rate estimates for longer discrete time intervals than a logistic regression.
[Table footnote: Among adults without diabetes, diabetic mortality from the NHIS mortality follow-up was calculated as the weighted number of diabetic deaths divided by the weighted person-years of adults without diabetes.]
Poisson regression has been used frequently to compare mortality rates among different categories of cohorts in epidemiological studies and is a convenient alternative to Cox proportional hazards regression especially when the proportional hazards assumptions are not met [5]. Early studies on the analysis of cohort survival data showed that Poisson regression is a straightforward and intuitive approach for directly estimating the hazard rates while incorporating time scale as a covariate in the model [16,22]. We were interested in annual (or longer) time periods rather than monthly or daily periods and thus discrete Poisson regression was chosen for our analysis.
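The dependence of the logistic approximation on interval length can be illustrated numerically. The sketch below is not from the study; it assumes a constant hazard within each interval, and the function name `odds_based_rate` and the chosen hazard (roughly the magnitude of the diabetes mortality reported here, about 35 per 1,000 person-years) are illustrative choices.

```python
import math

def odds_based_rate(hazard, interval_years):
    """Rate implied by the odds of death within one interval.

    Under a constant hazard, the probability of dying within an
    interval is 1 - exp(-hazard * length). A logistic model works
    with the odds of that probability; dividing the odds by the
    interval length gives the rate that the odds would suggest.
    """
    p = 1.0 - math.exp(-hazard * interval_years)
    odds = p / (1.0 - p)
    return odds / interval_years

hazard = 0.035                                   # deaths per person-year (assumed)
monthly = odds_based_rate(hazard, 1.0 / 12.0)    # 1-month intervals
annual = odds_based_rate(hazard, 1.0)            # 1-year intervals
rel_err_monthly = (monthly - hazard) / hazard
rel_err_annual = (annual - hazard) / hazard
```

Under this assumption the odds-based rate always exceeds the true hazard, and the relative error grows with the interval length: it is small for monthly intervals and noticeably larger for annual ones, which is consistent with preferring a person-time-based Poisson model for annual periods.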
To obtain valid national estimates from a complex sample survey, it is critical to use proper statistical methods to account for the sample design and sampling weights. Our study shows that in later years, the distribution of age in the follow-up cohort shifted to the right; thus without post-stratification reweighting, the overall mortality rates combining all ages would have been overestimated. Using the US population as the standard population for post-stratification reweighting yielded all-cause and diabetic mortality estimates that were similar to the national registry estimates. Our study demonstrated that discrete Poisson regression with post-stratification is a feasible approach for estimating annual mortality for the US population with and without diabetes.
The major limitation of our approach is the amount of time needed to discretize and analyze a large sample with long follow-up time. Poisson regression using complex sample data is computationally time-consuming with large discretized person-time datasets because the data cannot be collapsed over covariates in a design-based analysis of complex sample data. Estimation based on a small number of events can create problems with model convergence. Without careful programming and reweighting, the results can be biased. In addition, the NHIS mortality data represented deaths among the civilian noninstitutionalized population with person-years as the denominator, whereas mortality data from the NVSS represented deaths among the entire US population with the whole population at risk as the denominator. Thus, mortality rates from the two systems might have subtle differences. To demonstrate our approach, we used self-reported diabetes. While any self-reported condition is subject to recall error, the self-report of diabetes is considered a valid measure of diagnosed diabetes [23]. Although it is recognized as being non-sensitive, it has been shown to be highly specific [24]. Another source of bias may arise from the lack of information about diabetes status between the baseline interview and death or censoring. Even though the rate of remission from diabetes to nondiabetes is likely small [25], the lack of information on incident cases would likely lead to an overestimation of diabetes duration. Furthermore, if incident cases have a higher mortality rate than non-cases and a lower mortality rate than prevalent cases, then lacking this information on incidence could lead to an overestimation of mortality rates for the populations both with and without diabetes. Future analyses with information from multiple follow-up visits could quantify the impact of this bias.
We demonstrated that weighted discrete Poisson regression is an efficient and applicable approach to estimate period mortality from national mortality follow-up data. To our knowledge, there has been no similar report, though all the steps of this approach are well established. Several reasons could explain the scant usage of the discrete Poisson regression approach, including lack of data availability, its absence from biostatistics educational curricula, the computing time required to analyze discrete time-to-event data, and the complex sampling design of national surveys, which further complicates using this approach. However, the increasing availability of more powerful statistical software and computing capabilities permits this method to be revisited for the analysis of national survey mortality follow-up data.

Conclusions
We conclude that combining national follow-up cohorts from multiple survey years and analyzing them using population weighted discrete Poisson regression can yield annual national mortality rates by disease status.

Abbreviations
CDC: Centers for Disease Control and Prevention; CDC-WONDER: Wide-ranging Online Data for Epidemiologic Research; NCHS: National Center for Health Statistics; NHIS: National Health Interview Survey; NVSS: National Vital Statistics System