A generic model for the assessment of disease epidemiology: the computational basis of DisMod II
© Barendregt et al; licensee BioMed Central Ltd. 2003
Received: 19 March 2003
Accepted: 14 April 2003
Published: 14 April 2003
Epidemiology as an empirical science has developed sophisticated methods to measure the causes and patterns of disease in populations. Nevertheless, for many diseases in many countries only partial data are available. When the partial data are insufficient, but data collection is not an option, it is possible to supplement the data by exploiting the causal relations between the various variables that describe a disease process. We present a simple generic disease model with incidence, one prevalent state, and case fatality and remission. We derive a set of equations that describes this disease process and allows calculation of the complete epidemiology of a disease given a minimum of three input variables. We give the example of asthma with age-specific prevalence, remission, and mortality as inputs. Outputs are incidence and case fatality, among others. The set of equations is embedded in a software package called 'DisMod II', which is made available to the public domain by the World Health Organization.
Assessment of the epidemiology of a disease is often very hard. Data on incidence, prevalence and disease specific mortality are frequently incomplete, not very reliable, or altogether lacking. The solution of choice is gathering good data, but this is time-consuming, often difficult, and always costly. When primary data collection is no real option, as in a burden of disease study where the goal is a comprehensive overview of the epidemiology of a large number of diseases, additional methods of assessing disease epidemiology are needed.
Additional information can be derived from the logical relations between the variables that describe a disease. By definition, a prevalent case must have been incident at some earlier time and age. Also, it is impossible to die or recover from a disease without having had the disease, however brief. These logical relations can be expressed as a formal model of a generic disease process. Such a formal disease model allows calculation of a complete and internally consistent description of disease epidemiology from partial data.
For the Global Burden of Disease 1990 study a generic formal disease model was implemented as a computer model called 'DisMod' [1, 2]. In that study and in subsequent country studies, DisMod has been used extensively to supplement missing data and to force consistency on the data that were available. DisMod is based on a set of differential equations that describe age specific incidence, remission, case fatality, and 'all other causes' mortality. With total mortality and three transition hazards – incidence, remission, and case fatality – as inputs, the equations were solved numerically using an iterative approximation method, the finite differences method. It was also possible to enter a relative risk on total mortality as an alternative input to case fatality, but, given total mortality, these are equivalent. Other disease variables, such as prevalence and disease specific mortality, were derived from this solution, but could not be used as inputs.
In the field of chronic diseases a similar, but simpler, model has been developed and used for assessing disease epidemiology and for calculating 'what if' scenarios [4, 5]. This disease model is simpler because it covers only chronic diseases such as diabetes, so remission can be ignored. This simplification allows analytical solutions of the differential equations to be used, instead of requiring a numerical approximation.
For the Global Burden of Disease 2000 study it was decided to develop a new computer model, called 'DisMod II', which would serve the same purposes as the original DisMod, but would have enhanced usability, such as an interactive graphical interface. An important new feature was to allow for a wider range of disease inputs than the three transition hazards used in DisMod (incidence, remission and case fatality). In particular, prevalence and disease specific mortality would be potential inputs in the new model.
To facilitate interactive use of such a model, speed of computation is essential, and therefore an analytical solution of the differential equations was preferred over a numerical one. Here we report a set of equations that represent the analytical solution of the differential equations. This set of equations forms the computational basis of DisMod II.
The problem and a solution
Mortality from all other causes poses a problem: it is an input to the disease model, but often it is not known. It could be calculated from the total mortality rate and the disease specific mortality, but frequently the disease specific mortality is not known either.
To get around this problem we use the property that hazards are unaffected by the presence or absence of other hazards that act on the same population. If it is assumed that mortality from all other causes is independent of the disease, i.e., that it is the same for healthy and diseased people, this implies that the transition hazards for incidence, remission and case fatality are not affected by the value of the 'all other causes' mortality. Therefore we can set the value of mortality from all other causes to 0 (i.e., leave it out of the equations) and still derive the right values for the disease rates. Disease prevalence, when reported as a proportion of the total population, will also be unaffected.
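The argument can be made explicit with a short derivation. In this sketch $S(a)$ and $C(a)$ are the numbers of healthy and diseased people at age $a$, $i$, $r$ and $f$ are the incidence, remission and case fatality hazards, and $m(a)$ is an other-cause mortality hazard acting equally on both states. The starred symbols and $M(a)$ are notation introduced only for this illustration.

```latex
% With other-cause mortality m(a) acting equally on healthy and diseased:
\begin{align*}
\frac{dS^*}{da} &= -(i + m)\,S^* + r\,C^*, &
\frac{dC^*}{da} &= i\,S^* - (r + f + m)\,C^* .
\end{align*}
% Let S, C solve the same system with m = 0 and the same starting values,
% and let M(a) = \int_0^a m(u)\,du. Then S^* = S e^{-M} and C^* = C e^{-M}
% satisfy the starred system, because
\begin{align*}
\frac{d}{da}\bigl(S e^{-M}\bigr)
  = \Bigl(\frac{dS}{da} - m S\Bigr) e^{-M}
  = -(i + m)\,S^* + r\,C^* ,
\end{align*}
% and likewise for C^*. The common factor cancels in the prevalence proportion:
\begin{equation*}
\frac{C^*}{S^* + C^*} = \frac{C\,e^{-M}}{(S + C)\,e^{-M}} = \frac{C}{S + C}.
\end{equation*}
```

The absolute numbers in each state do depend on $m$, which is why an output such as disease duration needs other-cause mortality to be restored.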
The assumption of independence of the mortality from all other causes implies that the disease-specific mortality in the model stands for all excess mortality caused by the disease, which is not necessarily the same as the disease-specific mortality reported by national statistical offices. This definition of disease-specific mortality complies with the methodology of burden of disease studies, which aims to attribute all excess mortality to the disease.
One of the DisMod II outputs – disease duration – is affected by mortality from all other causes. DisMod II therefore calculates results in two steps. First it calculates the numbers of people in the three states 'healthy', 'diseased' and 'dead from the disease' for all ages and derives disease rates such as incidence, prevalence and mortality. Next, from the disease specific mortality rate it calculates the mortality rate for all other causes, thus making it possible to calculate disease duration.
DisMod II calculates the disease starting with a cohort of 1000 people at age 0, and working up to the highest age considered. Within an age interval transition hazards are assumed constant, and to minimize the impact of this assumption the calculation is done in 1-year age intervals.
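For a single cohort the following three differential equations describe the conceptual model when 'all other causes' mortality is ignored:
$$\frac{dS_a}{da} = -i_a S_a + r_a C_a \tag{1}$$
$$\frac{dC_a}{da} = -(f_a + r_a)\,C_a + i_a S_a \tag{2}$$
$$\frac{dD_a}{da} = f_a C_a \tag{3}$$
where the three model parameters, representing the transition hazards, are:
i: incidence
r: remission
ƒ: case fatality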
and the three states are:
$S_a$: Number of healthy people at age $a$
$C_a$: Number of diseased people at age $a$
$D_a$: Number of dead people at age $a$
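Equations that express the number of people in each of the states S, C, and D at age a as a function of the parameters i, r, and ƒ were derived from equations 1–3 using Maple V. To simplify the derived equations we first define a number of intermediate variables:
$$l_a = i_a + r_a + f_a$$
$$q_a = \sqrt{i_a^2 + 2 i_a r_a - 2 i_a f_a + r_a^2 + 2 f_a r_a + f_a^2}$$
$$w_a = e^{-(l_a + q_a)/2}$$
$$v_a = e^{-(l_a - q_a)/2}$$
Using these intermediate variables the derived equations then become:
$$S_{a+1} = \frac{2 (v_a - w_a)\bigl(S_a (f_a + r_a) + C_a r_a\bigr) + S_a \bigl(v_a (q_a - l_a) + w_a (q_a + l_a)\bigr)}{2 q_a} \tag{4}$$
$$C_{a+1} = -\frac{(v_a - w_a)\bigl(2\bigl((f_a + r_a)(S_a + C_a) - l_a S_a\bigr) - C_a l_a\bigr) - C_a q_a (v_a + w_a)}{2 q_a} \tag{5}$$
$$D_{a+1} = \frac{(v_a - w_a)\bigl(2 f_a C_a - l_a (S_a + C_a)\bigr) - q_a (S_a + C_a)(v_a + w_a) + 2 q_a (S_a + C_a + D_a)}{2 q_a} \tag{6}$$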
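As an illustration, a one-year update of this kind can be sketched in Python. This is not the DisMod II source code: the function names `step_one_year` and `run_cohort` are hypothetical, the closed form is the standard solution of a two-state linear system with constant hazards, and the handling of the degenerate case `q = 0` (which occurs when remission is zero and incidence equals case fatality) is an assumption added here to avoid division by zero.

```python
import math

def step_one_year(S, C, D, i, r, f):
    """Advance (S, C, D) by one year under constant hazards i, r, f.

    Mortality from all other causes is ignored, so S + C + D is conserved.
    """
    l = i + r + f
    # l*l - 4*i*f equals i^2 + 2ir - 2if + r^2 + 2fr + f^2 and is never negative
    q = math.sqrt(max(l * l - 4.0 * i * f, 0.0))
    total = S + C + D
    if q < 1e-12:
        # Assumed fallback: limit of the closed form for a repeated eigenvalue
        e = math.exp(-l / 2.0)
        S1 = e * (S * (1.0 - i + l / 2.0) + r * C)
        C1 = e * (i * S + C * (1.0 - r - f + l / 2.0))
        return S1, C1, total - S1 - C1
    v = math.exp(-(l - q) / 2.0)
    w = math.exp(-(l + q) / 2.0)
    S1 = (2.0 * (v - w) * (S * (f + r) + C * r)
          + S * (v * (q - l) + w * (q + l))) / (2.0 * q)
    C1 = -((v - w) * (2.0 * ((f + r) * (S + C) - l * S) - C * l)
           - C * q * (v + w)) / (2.0 * q)
    D1 = ((v - w) * (2.0 * f * C - l * (S + C))
          - q * (S + C) * (v + w) + 2.0 * q * total) / (2.0 * q)
    return S1, C1, D1

def run_cohort(incidence, remission, fatality, n0=1000.0):
    """Follow a cohort of n0 healthy people from age 0 in 1-year steps.

    The three arguments are equal-length lists of age-specific hazards.
    Returns a list of (S, C, D) tuples, one per age boundary.
    """
    states = [(n0, 0.0, 0.0)]
    for i, r, f in zip(incidence, remission, fatality):
        states.append(step_one_year(*states[-1], i, r, f))
    return states
```

For example, `run_cohort([0.02] * 5, [0.05] * 5, [0.01] * 5)` returns six `(S, C, D)` tuples whose components sum to 1000 at every age, because no one leaves the model through other causes of death.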
Prevalence and mortality
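From the resulting numbers of people in S, C and D for all ages the age specific prevalence proportion and mortality rates are calculated. First, for each age interval person years at risk ($PY_a$) are calculated:
$$PY_a = \tfrac{1}{2}\,(S_a + C_a + S_{a+1} + C_{a+1}) \tag{7}$$
The prevalence proportion $c_a$ then becomes:
$$c_a = \frac{\tfrac{1}{2}\,(C_a + C_{a+1})}{PY_a} \tag{8}$$
and the mortality rate $b_a$ is:
$$b_a = \frac{D_{a+1} - D_a}{PY_a} \tag{9}$$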
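A minimal Python sketch of this step follows. It assumes that person years in a 1-year interval are approximated by the mean of the living population at the start and end of the interval, and that prevalence is the mean number of diseased people over the same denominator; the function name `interval_rates` is hypothetical.

```python
def interval_rates(state_start, state_end):
    """Person years, prevalence proportion and disease mortality rate
    for one 1-year age interval.

    Each state is an (S, C, D) tuple: healthy, diseased, dead from the disease.
    Assumption: person years are the mean of the living (S + C) at both ends.
    """
    S0, C0, D0 = state_start
    S1, C1, D1 = state_end
    person_years = 0.5 * (S0 + C0 + S1 + C1)
    prevalence = 0.5 * (C0 + C1) / person_years
    mortality = (D1 - D0) / person_years
    return person_years, prevalence, mortality
```

With a start state of `(1000.0, 0.0, 0.0)` and an end state of `(980.0, 18.0, 2.0)`, this gives 999 person years, a prevalence proportion of 9/999 and a mortality rate of 2/999 per person year.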
The age-specific mortality rates allow derivation of mortality from all other causes (m), needed to calculate disease duration. The equations below describe the expected duration of disease for a person who became incident in the age interval [a,a + 1), while taking mortality from all other causes m into account. We define:
$\beta_a = r_a + f_a + m_a$: the total hazard to leave the diseased state $C_a$
$y_{a,d}$: probability to be in the diseased state after duration $d$, with $y_{a,0} = 1.0$
$x_{a,d}$: contribution of duration in the interval $[d - 1, d)$ to the total duration $X_a$ in the disease state after incidence in $[a, a + 1)$
Then, for incidence in the interval [a, a + 1) the equations for that first year are:
For durations in subsequent years [a + k, a + k + 1), k = 1,2,3..., the following two equations apply:
Total duration $X_a$ for incidence in $[a, a + 1)$ then becomes:
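One way to compute such a duration is sketched below in Python. The sketch rests on two assumptions that are not confirmed by the text: onset is uniformly distributed within the year of incidence, and the exit hazard (remission plus case fatality plus other-cause mortality) is constant within each year of age. The function name `expected_duration` is hypothetical, and the sum is truncated at the highest age supplied.

```python
import math

def expected_duration(a, remission, fatality, other_mortality):
    """Expected years in the diseased state for incidence in [a, a + 1).

    The three lists hold age-specific hazards with equal lengths.
    Assumptions: onset is uniform within the first year, hazards are
    constant within each year of age, follow-up ends at the last age.
    """
    def mean_survival(beta):
        # Average of exp(-beta * t) over one year: (1 - exp(-beta)) / beta
        return -math.expm1(-beta) / beta if beta > 0.0 else 1.0

    beta = remission[a] + fatality[a] + other_mortality[a]
    # Probability of still being diseased at the end of the first year
    y = mean_survival(beta)
    # Expected time diseased within the first year: (beta - 1 + exp(-beta)) / beta^2
    total = (1.0 - y) / beta if beta > 0.0 else 0.5
    for age in range(a + 1, len(remission)):
        beta = remission[age] + fatality[age] + other_mortality[age]
        total += y * mean_survival(beta)   # time spent diseased during this year
        y *= math.exp(-beta)               # still diseased at the end of this year
    return total
```

Under these assumptions a constant total exit hazard of 0.5 per year over a long age range gives a duration close to 1/0.5 = 2 years, which is a convenient sanity check.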
The equations above are implemented as a software package, designed for use by epidemiologists and public health scientists. Users combine the available data and their own expert knowledge interactively to produce best estimates of the epidemiology of the disease. To this end DisMod II comes with a graphical interface and a number of features for curve fitting and interpolation of the input data. Extensive online help is available, including a tutorial to help users get started. The software runs on Windows 95 and higher, and is available from the WHO website http://www.who.int/evidence/dismod/.
An explicit aim of the development of DisMod II was to allow for a wider range of input variables than the three transition hazards in the original DisMod. Equations 4–6 allow calculation of the numbers of people in the three states when the three transition hazards incidence, remission, and case fatality are known. Often these transition hazards are not observed, but, for example, prevalence and disease-specific mortality are. Allowing for prevalence and mortality as inputs directly would require rewriting equations 4–6 accordingly, but this may not be tractable.
In those cases the analytical solution presented here is supplemented by an iterative optimization method, the 'downhill simplex method'. This is an optimization method in multiple dimensions, which in this case are the three transition hazards. Starting at the lowest age group, values for the transition hazards are inserted in equations 4–6, and all output variables are calculated. A loss function then evaluates the difference between the input variables and the corresponding output, and based on this evaluation a different set of values of the transition hazards is inserted. This procedure is iterated until the loss function reaches a minimum, and the optimization then moves to the next age group.
Because of this combination of analytical and numerical methods DisMod II accepts, in addition to the transition hazards incidence, remission and case fatality (or its equivalent relative risk for total mortality), the following disease input variables: incidence as a population rate (with total population in the denominator instead of person years at risk), prevalence, duration, and mortality. Because of the two-step calculation procedure duration is a valid input only when it is short, preferably less than one year.
When the input variables do not consist of the three transition hazards incidence, remission and case fatality, they may be (and often are) internally inconsistent. In that case the 'downhill simplex' optimization procedure will adjust the values of the input variables such that they are internally consistent while staying as close as possible to the original values. The user can influence the outcome by applying different weights to the input variables: an input variable with a higher weight will remain closer to the original value. The same procedure applies when more than three inputs are available, i.e., when the model is overidentified.
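The role of the weights can be illustrated with a simple loss function in Python. The exact loss function used by DisMod II is not given here, so the weighted sum of squared differences below, the function name `weighted_loss` and the dictionary keys are all assumptions made for illustration.

```python
def weighted_loss(observed, modelled, weights):
    """Weighted sum of squared differences between inputs and model output.

    observed, modelled: dicts mapping a variable name to its value for one
    age group. weights: dict mapping a variable name to its weight
    (variables without an entry get weight 1.0).
    Assumed form; the actual DisMod II loss function may differ.
    """
    total = 0.0
    for name, obs in observed.items():
        diff = modelled[name] - obs
        total += weights.get(name, 1.0) * diff * diff
    return total
```

An optimizer that minimizes such a function is penalized more for deviating from a heavily weighted input, so the consistent solution it finds stays closer to that input.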
Generally, at least three disease input variables are needed to calculate the full disease epidemiology; the exception is when case fatality and relative risk for total mortality are given. Case fatality and relative risk for total mortality contain the same information, given total mortality, and therefore count as only one input when they are both included.
In addition to the disease input variables, DisMod II needs total mortality rates and population numbers for the population under study. All input variables are by age, and calculations are done separately for men and women.
Trends in disease epidemiology
Equations 1–3 describe a life table cohort, which, when interpreted as a description of a cross-sectional population, implies an assumption of steady state for the disease. However, it is possible to include past trends on the transition hazards incidence, remission, and case fatality in the calculation. DisMod II then switches to a fully dynamic calculation mode: it still uses equations 4–6, but for each age separately. In life table mode, disease variables at age a depend on variables at age a - 1; in dynamic mode, variables at age a and time t depend on variables at age a - 1 and time t - 1 (which, because of the trends, do not have the same values as those at age a - 1 and time t). DisMod II still tries to reproduce the currently observed input variables, but taking past trends into account will result in different values for the unobserved variables. Dynamic mode requires considerably longer computation time than life table mode.
Uncertainty intervals for the output can be obtained by specifying distributions (and parameters) for the disease input variables. A number of distributions are available, among them Poisson, binomial, and normal. DisMod II uses these distributions and associated parameters in a Monte Carlo simulation (or parametric bootstrapping). For each of the input variables a value is randomly chosen from its distribution, and the model is calculated. This procedure is repeated a large (user specified) number of times, resulting in distributions for all output variables. From the distributions of the output variables uncertainty intervals are derived. This too may take considerable computation time.
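The procedure can be sketched in Python as follows. The sketch uses only normal distributions, truncates negative draws at zero, and reads the interval from empirical percentiles of the sorted outputs; these choices, and the names `uncertainty_interval`, `model` and `input_dists`, are assumptions for illustration and not the DisMod II implementation.

```python
import random

def uncertainty_interval(model, input_dists, n_draws=1000, level=0.95, seed=0):
    """Monte Carlo uncertainty interval for a single model output.

    model: function taking a dict of input values and returning a number.
    input_dists: dict mapping input name to (mean, standard deviation) of an
    assumed normal distribution. Negative draws are truncated at zero.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_draws):
        draw = {name: max(0.0, rng.gauss(mean, sd))
                for name, (mean, sd) in input_dists.items()}
        outputs.append(model(draw))
    outputs.sort()
    # Empirical percentiles of the simulated output distribution
    lower = outputs[int((1.0 - level) / 2.0 * (n_draws - 1))]
    upper = outputs[int((1.0 + level) / 2.0 * (n_draws - 1))]
    return lower, upper
```

In practice `model` would run the full disease calculation for each draw, which is why the number of draws directly drives computation time.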
Even if the input variables were internally consistent, the randomly sampled values from their distributions will not be. This causes DisMod II to adjust the input variables to output values that are internally consistent, which has an impact on the width of the uncertainty intervals. Randomly sampled values from the distributions of the input variables will cause the adjustments of the output to have a distribution as well. Consequently the uncertainty intervals do not just reflect the sampling variability of the individual variables (which is why they are not called 'confidence intervals').
We illustrate DisMod II with the example of asthma from the Victorian Burden of Disease study. Deaths for Victoria in 1996 are from the Australian Bureau of Statistics. Prevalence is based on a number of Australian studies, and on male-to-female ratios for children and adults. Remission rates are based on a follow-up study from the United States.
Asthma prevalence, remission and mortality rates (per 1000 population) by age and sex, Victoria 1996*
DisMod II outputs, males: asthma incidence, prevalence, remission, case fatality and mortality rates per 1000 population (95% uncertainty interval)
| Incidence | Prevalence | Remission | Case fatality | Mortality |
|---|---|---|---|---|
| 21.86 (17.07, 26.65) | 50.57 (37.96, 63.25) | 45.92 (44.40, 47.43) | 0.02 (0.00, 0.26) | 0.00 (0.00, 0.00) |
| 14.94 (12.05, 17.85) | 124.30 (108.17, 140.62) | 81.73 (76.99, 86.50) | 0.05 (0.00, 0.56) | 0.01 (0.00, 0.03) |
| 1.62 (0.83, 2.45) | 82.96 (65.43, 100.83) | 86.18 (75.73, 96.86) | 0.10 (0.01, 0.60) | 0.01 (0.00, 0.04) |
| 0.16 (0.00, 0.67) | 54.48 (41.11, 67.98) | 28.31 (24.08, 32.51) | 0.07 (0.00, 0.57) | 0.00 (0.00, 0.06) |
| 0.51 (0.00, 1.05) | 47.82 (37.72, 57.93) | 12.13 (10.10, 14.10) | 0.19 (0.01, 0.69) | 0.01 (0.00, 0.19) |
| 0.84 (0.19, 1.49) | 48.95 (39.66, 58.27) | 13.39 (11.33, 15.44) | 0.35 (0.02, 0.88) | 0.02 (0.00, 0.37) |
| 1.14 (0.42, 1.87) | 47.34 (36.78, 58.03) | 28.89 (24.98, 32.79) | 1.38 (0.40, 2.36) | 0.07 (0.00, 0.56) |
| 0.78 (0.21, 1.37) | 40.75 (29.24, 52.30) | 35.47 (31.62, 39.30) | 4.00 (0.58, 7.44) | 0.16 (0.00, 0.67) |
| 0.89 (0.34, 1.47) | 32.76 (22.59, 43.12) | 32.07 (29.76, 34.36) | 14.70 (1.02, 28.88) | 0.48 (0.00, 0.99) |
DisMod II outputs, females: asthma incidence, prevalence, remission, case fatality and mortality rates per 1000 population (95% uncertainty interval)
| Incidence | Prevalence | Remission | Case fatality | Mortality |
|---|---|---|---|---|
| 15.17 (11.05, 19.37) | 34.79 (24.08, 45.57) | 45.87 (44.24, 47.49) | 0.01 (0.00, 0.49) | 0.00 (0.00, 0.33) |
| 10.34 (8.06, 12.63) | 90.69 (73.56, 107.92) | 80.45 (75.25, 85.70) | 0.01 (0.00, 0.52) | 0.00 (0.00, 0.50) |
| 3.12 (1.99, 4.25) | 67.31 (50.60, 84.29) | 76.07 (69.13, 83.14) | 0.08 (0.00, 0.58) | 0.01 (0.00, 0.49) |
| 1.79 (0.91, 2.66) | 61.01 (46.99, 75.23) | 27.49 (24.99, 29.98) | 0.03 (0.00, 0.54) | 0.00 (0.00, 0.50) |
| 1.76 (0.83, 2.69) | 67.17 (54.71, 79.74) | 12.00 (10.99, 12.99) | 0.21 (0.01, 0.72) | 0.01 (0.00, 0.49) |
| 2.25 (1.10, 3.41) | 76.94 (64.20, 89.74) | 13.47 (12.31, 14.62) | 0.36 (0.02, 0.87) | 0.03 (0.00, 0.36) |
| 2.44 (1.24, 3.64) | 79.69 (66.30, 93.07) | 29.12 (26.66, 31.58) | 0.53 (0.08, 1.05) | 0.04 (0.00, 0.41) |
| 1.94 (0.93, 2.96) | 72.62 (59.09, 86.25) | 35.19 (31.93, 38.44) | 1.89 (1.03, 2.75) | 0.14 (0.00, 0.64) |
| 2.11 (1.18, 3.06) | 65.00 (52.31, 77.74) | 32.00 (29.20, 34.80) | 6.45 (3.96, 8.95) | 0.42 (0.00, 0.93) |
Epidemiology is first and foremost an empirical science. But the emphasis on observations, while certainly justified, does not preclude a contribution from more theoretical tools such as DisMod II.
We see two main applications for DisMod II. The first is supplementing incomplete data, and this was illustrated by our asthma example. With data for prevalence, remission and mortality, it is possible to calculate the complete epidemiology of asthma, including back-calculating the incidence. This application is useful when data are scarce but an estimate of disease epidemiology is urgently needed. This is a common situation in burden of disease studies.
Supplementing incomplete data is not a fully automatic process, however. For example, available data tend to come in wide age intervals. From the point of view of DisMod II, the differences in value between adjacent age groups are discontinuities, which may be impossible to resolve without very extreme values of one or more variables. In particular, when back-calculating incidence from, among others, prevalence, such discontinuities may result in huge spikes in the back-calculated incidence.
It is the responsibility of the user to guard against such 'solutions'. DisMod II tries to help by showing graphs of the input and output variables, and by providing interpolation and smoothing algorithms. But in the end it is the user who has to decide whether the outcomes are acceptable, and if not, what strategy is needed to resolve the problem (smoothing the input, using different weights for the input variables, etc.). Using DisMod II for this purpose is an interactive exercise.
The second application is checking for internal consistency of existing estimates. Empirical observation of the variables that describe the epidemiology of a disease is subject to measurement error, which may affect variables differently. For example, for a disease with a gradual onset, such as dementia, it is much harder to estimate incidence than prevalence. Measurement error may go undetected, but if we know it exists it may be possible to account for it. Checking the internal consistency of the estimates with DisMod II may help to detect the existence of measurement error.
This application also carries certain difficulties. Inconsistency of cross-sectional variables describing a disease may be real or deceptive. Real inconsistency may be due to the combination of measurements from different sources, or to measurement and sampling error. Deceptive inconsistency is due to past trends in disease epidemiology. The problem is that it is not possible to distinguish between the two, unless the epidemiology of the disease (including the past) is fully known, and in that case there is little need for a tool like DisMod II.
Without complete knowledge of the disease epidemiology this dilemma can be solved only by expert judgement. When it is unlikely that trends in the past have existed, or when a sensitivity analysis shows that reasonable past trends are unable to explain the inconsistency, the expert may decide that most of the inconsistency is real, and also which variable is most likely to be in error.
The message here is that DisMod II is a tool for experts, who should carefully weigh all available evidence, of which the DisMod II output is only a part.
All this assumes that the conceptual model underlying DisMod II is applicable to a wide range of diseases with very different epidemiology. While most aspects of the conceptual model are a matter of definition, this is not true for survival in the diseased state. This survival is piecewise exponentially distributed: exponential within each year of age, with (possibly) a different hazard for each age. However, a check against survival data that were lognormally distributed at the individual level showed that on the population level DisMod II was able to reproduce the data very well.
One kind of disease for which the conceptual model may not be applicable is an infectious disease that confers immunity. DisMod II assumes that those who remit go back to the pool of susceptibles, which in the case of acquired immunity clearly is not appropriate.
Where epidemiologic data are incomplete or in doubt, DisMod II may prove to be helpful. Thanks to the analytical solution described here it is fast enough to allow interactive use. When a single disease is to be studied, customized tools may be more appropriate, but when the epidemiology of a large range of diseases must be assessed, as in burden of disease studies, the generic DisMod II disease model will be a useful tool.
This study was made possible by the Global Programme on Evidence for Health Policy of WHO (through a Global Health Leadership fellowship awarded to the first author and a software development contract).
- Murray CJ, Lopez AD: Regional patterns of disability-free life expectancy and disability-adjusted life expectancy: Global Burden of Disease Study. Lancet 1997, 349: 1347-1352. doi:10.1016/S0140-6736(96)07494-6
- Murray CJ, Lopez AD: Quantifying disability: data, methods and results. Bull World Health Organ 1994, 72: 481-494.
- Murray CJ, Lopez AD: Global and regional descriptive epidemiology of disability: incidence, prevalence, health expectancies and years lived with disability. The Global Burden of Disease (Edited by: Murray CJ, Lopez AD). Boston: Harvard School of Public Health 1996, 201-246.
- Barendregt JJ, Baan CA, Bonneux L: An indirect estimate of the incidence of non-insulin-dependent diabetes mellitus. Epidemiology 2000, 11: 274-279. doi:10.1097/00001648-200005000-00008
- Barendregt JJ, Bonneux L, van der Maas PJ: The health care costs of smoking. N Engl J Med 1997, 337: 1052-1057. doi:10.1056/NEJM199710093371506
- Barendregt JJ, van Oortmarssen GJ, van Hout BA, van den Bosch JM, Bonneux L: Coping with multiple morbidity in a life table. Mathematical Population Studies 1998, 7: 29-49.
- Char B, Geddes K, Gonnet G, Leong B, Monagan M, Watt S: First Leaves, a Tutorial Introduction to Maple V. New York: Springer-Verlag 1991.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in Fortran 77: The Art of Scientific Computing. Cambridge: Cambridge University Press 1992.
- Efron B, Tibshirani RJ: An introduction to the bootstrap. London: Chapman & Hall Inc 1993.
- Victorian Government Department of Human Services, Public Health and Development Division: The Victorian Burden of Disease Study: Mortality. Melbourne 1999.
- Victorian Government Department of Human Services, Public Health and Development Division: The Victorian Burden of Disease Study: Morbidity. Melbourne 1999.
- Kruijshaar ME, Barendregt JJ, Hoeymans N: The use of models in the estimation of disease epidemiology. Bull World Health Organ 2002, 80: 622-628.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.