Algorithms for enhancing public health utility of national causes-of-death data
© Naghavi et al. 2010
Received: 9 March 2010
Accepted: 10 May 2010
Published: 10 May 2010
Coverage and quality of cause-of-death (CoD) data vary across countries and time. Valid, reliable, and comparable assessments of trends in causes of death from even the best systems are limited by three problems: a) changes in the International Statistical Classification of Diseases and Related Health Problems (ICD) over time; b) the use of tabulation lists where substantial detail on causes of death is lost; and c) many deaths assigned to causes that cannot or should not be considered underlying causes of death, often called garbage codes (GCs). The Global Burden of Disease Study and the World Health Organization have developed various methods to enhance comparability of CoD data. In this study, we attempt to build on these approaches to enhance the utility of national cause-of-death data for public health analysis.
Based on careful consideration of 4,434 country-years of CoD data from 145 countries from 1901 to 2008, encompassing 743 million deaths in ICD versions 1 to 10 as well as country-specific cause lists, we have developed a public health-oriented cause-of-death list. These 56 causes are organized hierarchically and encompass all deaths. Each cause has been mapped from ICD-6 to ICD-10 and, where possible, they have also been mapped to the International List of Causes of Death 1-5. We developed a typology of different classes of GCs. In each ICD revision, GCs have been identified. Target causes to which these GCs should be redistributed have been identified based on certification practice and/or pathophysiology. Proportionate redistribution, statistical models, and expert algorithms have been developed to redistribute GCs to target codes for each age-sex group.
The fraction of all deaths assigned to GCs varies tremendously across countries and revisions of the ICD. In general, across all country-years of data available, GCs have declined from more than 43% in ICD-7 to 24% in ICD-10. In some regions, such as Australasia, GCs in 2005 are as low as 11%, while in some developing countries, such as Thailand, they are greater than 50%. Across different age groups, the composition of GCs varies tremendously - three classes of GCs steadily increase with age, but ambiguous codes within a particular disease chapter are also common for injuries at younger ages. The impact of redistribution is to change the number of deaths assigned to particular causes for a given age-sex group. These changes alter ranks across countries for any given year by a number of different causes, change time trends, and alter the rank order of causes within a country.
By mapping CoD through different ICD versions and redistributing GCs, we believe the public health utility of CoD data can be substantially enhanced, leading to an increased demand for higher quality CoD data from health sector decision-makers.
Timely, valid, and reliable information on causes of death by age and sex is a critical input into public health planning, program implementation, and evaluation. Most high-income and many middle-income countries have the benefit of a complete vital registration system in which the vast majority of deaths receive a death certificate completed by a physician. These information systems should in principle provide public health communities in each country with the critical information needed to guide their programs. Nevertheless, analyzing levels and trends in causes of death, even in countries with well-functioning cause-of-death registration systems, remains challenging for a number of reasons related to the process of completing death certificates and the coding of each death certificate following standardized international rules.
Even with a physician-completed death certificate, assignment of the underlying cause of death can be problematic. In the Second Annual Report of the Registrar General of Great Britain in 1840, William Farr presented the statistics of causes of death (CoD), defined as "diseases, which terminate in the extinction of existence," but Farr highlighted the concern that "...the attention of the observer was less attracted to this class of facts, and overlooking the proximate cause, that is, the internal morbid process..." In that report, he also criticized the use of vague categories like "sudden death," "natural death," "visitation of God," and "old age," but he admitted that in some cases, no particular cause of death could be identified. All these criticisms remain relevant today.
Analysis of cause-of-death data is intimately linked to the evolution of the International Statistical Classification of Diseases and Related Health Problems (ICD). Originally known as the International List of Causes of Death, the modern era for the ICD began when the World Health Assembly approved the sixth revision of the ICD in 1948. The new classification sought to establish an international standard for terminology and nosological criteria to attribute disease names and classify pathologies. Adoption of the ICD by the World Health Organization (WHO) also included a commitment by Member States of WHO to report national statistics based on the ICD. ICD-6 also included the adoption of an international medical certificate of CoD, an international agreement about the underlying cause of death (UCD) as the main cause to be tabulated and the rules for selecting UCD.
Despite the adoption of an international death certificate, the principle of identifying the UCD, and a standard list of causes codified in the revisions of the ICD, at least three problems create issues of comparability for public health analysis among participating countries. First, each time there is a change in the ICD, the set of causes and the codes assigned to each underlying cause change substantially. Producing time series of cause-of-death data requires mapping for some coherent set of causes across revisions, a practice often known as bridge coding [4, 5]. For example, to produce a time series spanning the 20th century, one would need to map across the International List of Causes of Death (ILCD 1-5) to the International Statistical Classification of Diseases and Related Health Problems (ICD 6-10). Whereas the ILCD had only been used to classify mortality, the ICD expanded to include both mortality and morbidity, thus increasing the number of causes from 179 to 20,000. Time series analyses [7–9] for selected causes have attempted to map national ICD revisions over time, but idiosyncratic national use of the ICD has limited more general approaches to bridge coding that are applicable across all countries. In addition, in the WHO database documentation, there is no mention of the ICD sixth revision, but during the period 1949-1957, at least 40 countries used this version and sent data to the Pan American Health Organization (PAHO) and WHO.
Second, due to the increase in the number of causes, tabulation lists were introduced starting with ICD-6. These lists provide a much shorter set of aggregate codes intended to facilitate cause-of-death reporting in countries with more limited capacity and for communication purposes. A substantial component of historical vital registration data is only available for these tabulation lists, including ICD-7 Tabulation A and B, ICD-8 Tabulation A and B, Basic Tabulation List (BTL) in ICD-9, and mortality tabulation in ICD-10. As with any aggregation procedure, substantial information is lost as compared to the fully disaggregated ICD data that were used to create these lists. For some causes, such as cardiomyopathy, pericarditis, endocarditis, and myocarditis (in BTL and ICD-7 Tab A), or source of burning and exposure to inanimate or mechanical forces in ICD-10 Tabulation list 1, assessing time trends requires some way of breaking down the tabulated data into component causes.
Third, with the advent of the sixth revision, the ICD has been used not only to code deaths by underlying cause of death but also to code other types of medical information, such as reasons for admission to or discharge from a hospital. The introduction of multiple purposes for the ICD has led to the addition of many codes for causes that should not be considered underlying causes of death. WHO has recognized this problem by producing lists of ICD codes under the heading "List of conditions unlikely to cause death" in the appendix of Volume 2 of the second edition of the ICD. Despite these recommendations from WHO, these codes are frequently used as underlying causes of death. More generally, some ICD codes used to assign cause of death are likely misclassifications from a public health perspective.
In 1996, Murray and Lopez introduced the term "garbage coding" for the practice of assigning deaths to causes that are not useful for public health analysis of cause-of-death data as part of the assessment of the Global Burden of Disease (GBD). While some practitioners may object to the term "garbage code" as pejorative, alternative terms have not yet caught on in the literature. We follow this practice and use the term garbage code (GC) to refer to all deaths assigned to codes that should be redistributed to enhance the validity of public health analysis. The variable use of GCs across countries and over time profoundly limits meaningful comparisons of causes of death; for this reason, WHO and other analysts have sought to reassign deaths coded to GCs to other causes following various methods [11–16].
Given the importance of cause-of-death data for public health analysis, we attempted in this paper to build on prior cause-of-death analysis work [1, 7, 17–25] and to create a more detailed approach to these problems of comparability of ICD-coded cause-of-death data. Our goal was to maximize the public health utility of cause-of-death data. To achieve this, we created a public health cause-of-death list building on the Global Burden of Disease Study, mapped this cause list across ICD revisions, and provided a comprehensive framework for identifying and redistributing deaths assigned to GCs. We illustrated this approach using an extensive database of publicly available cause-of-death data for more than 100 countries spanning 1950 to 2008.
Country-years and number of deaths (millions) in study data by ICD format, 1950-2008. ICD formats tabulated: ICD 6 and ICD 7 Tab A; ICD 8 Tab A; ICD 9 detail; ICD 9 BTL; Special Country ICD 9 Tab; ICD 10 detail; ICD 10 Tab; Special Country ICD 10 Tab.
The starting point for our analysis in this paper was the cause list for which we wanted to produce meaningful comparisons over time and across communities. We have taken advantage of the ongoing GBD Study. This large-scale collaboration with more than 800 scientists has developed a cause-of-death list meant to inform public health analysis. The cause-of-death list has 56 causes in three levels. Given the changes in the ICD and the complexities of GCs in different revisions, it is not possible to track all causes of death across multiple ICD revisions. Based both on the availability of detailed data and the evidence of consistency in time trends, we have been able to map 56 causes over most revisions of the ICD since 1950.
Additional file 2, Table S1 provides the cause list and ICD-10 codes for each cause. Four criteria were used to develop this list: a) causes that based on current knowledge (such as the GBD Study) are important causes of burden or are important for public health policy because they are major sources of health expenditures; b) causes that can be effectively traced across ICD-7 to ICD-10; c) causes that most often can be identified in tabulated versions of the ICD revisions; and d) the set of causes at the same level of the hierarchy that are mutually exclusive and collectively exhaustive. The cause list is organized hierarchically such that at the most aggregate levels, there are three broad groups of causes, and under each level of aggregation, there are more detailed causes. We organized the substructure of the list to allow for maximum comparability over time and assigned unique codes to facilitate analysis by others using our software.
For each of the 56 causes shown in the list, we mapped across the various revisions of the ICD, including back to ILCD-1 through ILCD-5 and the various national versions of ICD revisions and tabulation lists. Additional file 3 shows, for each cause on our list, whether or not it can be traced through the various revisions of the ICD. Examination of the list shows, for example, that all the causes of death can be traced through ICD-7 to ICD-10 in the detailed lists, but some causes cannot be traced in the ICD-7 and ICD-8 Tabulation list B or in the ICD-9 country-specific lists used in China and India.
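To illustrate what such a mapping looks like in practice, the following Python sketch assigns codes from two revisions to a shared cause. It is a minimal sketch under stated assumptions: the table name CAUSE_MAP, the function to_cause, and the choice of two causes are hypothetical, and the ranges are the standard three-character categories for these conditions in ICD-9 and ICD-10 rather than entries copied from the additional files, which hold the actual mappings.

```python
# Illustrative mapping only; the real cause list and code ranges are in the
# additional files. Keys and cause names here are assumptions.
CAUSE_MAP = {
    "ICD9": {
        "Ischemic heart disease": [("410", "414")],
        "Cerebrovascular disease": [("430", "438")],
    },
    "ICD10": {
        "Ischemic heart disease": [("I20", "I25")],
        "Cerebrovascular disease": [("I60", "I69")],
    },
}


def to_cause(revision, code):
    """Return the cause for a code, or None if the code is not mapped.

    Only the three-character category is compared, so 'I21.0' and 'I21'
    are treated alike. This is too coarse for ICD-9 E codes, which have
    four-character categories and are not handled in this sketch.
    """
    category = code.strip().upper()[:3]
    for cause, ranges in CAUSE_MAP[revision].items():
        if any(lo <= category <= hi for lo, hi in ranges):
            return cause
    return None


if __name__ == "__main__":
    for revision, code in [("ICD9", "410.9"), ("ICD10", "I21.0"), ("ICD9", "250.0")]:
        print(revision, code, "->", to_cause(revision, code))
```

Keeping one table per revision, keyed by the same cause names, is what allows deaths coded under different revisions to be summed into a single time series.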
Causes that cannot or should not be considered as underlying causes of death. These are codes that are included in the ICD because of its use for classifying health service encounters but that do not signify underlying cause of death. Examples of this type of GC are all the codes under chapter 18 of ICD-10 or R codes. This category also includes two special cases in the cardiovascular area: essential primary hypertension and atherosclerosis. Essential primary hypertension is included in the ICD to classify clinical encounters, but for most physicians, it should be considered a risk factor for cardiovascular disease and not the underlying cause. This distinction between what is a risk factor and what is an underlying cause is somewhat arbitrary but necessary to enhance comparability across revisions. Finally, we included in this category a number of causes that are described as the long-term sequelae of disease, such as G82, paraplegia and tetraplegia, or O94, sequelae of complication of pregnancy, childbirth, and the puerperium. In these cases, for public health purposes, it is more useful to assign these deaths to the underlying cause despite the long time lag between disease and death.
Intermediate causes of death such as heart failure, septicemia, peritonitis, osteomyelitis, or pulmonary embolism. These are clearly defined clinical entities, but each has an underlying cause that would have precipitated the chain of events leading to death. Physicians who have not been adequately trained in the principles of the ICD underlying cause of death often use these causes on death certificates.
Immediate causes of death that are the final steps in a disease pathway leading to death. Examples of this include disseminated intravascular coagulation or defibrination syndrome (D65). The pathway to death includes the final immediate cause, an intermediate cause, and the underlying cause that triggered the chain of events. Cardiac arrest (I46) and respiratory failure, not elsewhere classified (J96), are other examples.
Unspecified causes within a larger cause grouping. For many diseases, such as neoplasms, a code is included within the grouping for an unspecified site. This is an illustration of a GC that is not important for assessing aggregate deaths from neoplasms from all sites but is important when assessing site-specific death rates. Another important example is the injury category in which some injuries are coded to unspecified factors or intent.
List of garbage codes for ICD-10 based on the public health analysis cause list of 56 causes.
Causes that cannot or should not be considered underlying causes of death: A31.1, A59, A60.0, A71-A74, A63.0, B00.0, B07, B08.1, B08.8, B30, B35-B36, F32-F33.9, F40-F42.9, F45-F48.9, F51-F53.9, F60-F98.9, G43-G45.9, G47-G52.9, G54-G54.9, G56-G58.9, H00-H04.9, H05.2-H69.9, H71-H80.9, H83-H93, J30, J33, J34.2, J35, K00-K11.9, K14, L04-L08.9, L20-L25.9, L28-L87.9, L90-L92, L94, L98.0-L98.3, L98.5-L98.9, M03, M07, M09-M12, M14-M25, M35.3, M40, M43.6-M43.9, M45.9, M47-M60, M63-M71, M73-M79, M95-M99, N39.3, N40, N46, N60, N84-N93, N97, Q10-Q18, Q36, Q38.1, Q54, Q65-Q74, Q82-Q84, R00-R99, B94.8, B949.9, G80-G83, Y86, Y87.2, Y89, I10, I15, I70
Intermediate causes of death: A40-A41, A48.0, A48.3, E85.3-E85.9, E86-E87, G91.1, G91.3-G91.8, G92, G93.1-G93.6, I26, I27.1, I44-I45, I49-I50, I74, I81, J69, J80-J81, J86, J90, J93, J93.8-J93.9, J94, J98.1-J98.3, K65-K66, K71-K72 (except K71.7), K75, K76.0-K76.4, K92.0-K92.2, M86, N14, N17-N19
Immediate causes of death: D65, I45-I46, J96
Unspecified causes within a larger cause grouping: C80, C26, C39, C57.9, C64.9, C76, D00-D13, D16-D18, D20-D24, D28-D48, A49.9, B83.9, B99, E88.9, I51, I99, X59, Y10-Y34
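The first step of any redistribution, identifying which deaths carry a GC, amounts to matching each code against lists of ranges like those above. The Python sketch below shows one way to do this. It is not the authors' Stata implementation: the names GC_RANGES, parse, bounds, and gc_type are hypothetical, the type numbers 1 to 4 simply follow the order in which the four classes are described, and only a small subset of the listed ranges is included.

```python
import re

# Small illustrative subset of the ranges listed above, keyed by GC type
# (1 = not an underlying cause, 2 = intermediate, 3 = immediate,
#  4 = unspecified within a larger grouping). Numbering is an assumption.
GC_RANGES = {
    1: ["R00-R99", "I10", "I15", "I70", "G80-G83"],
    2: ["A40-A41", "I26", "I49-I50", "K65-K66", "M86"],
    3: ["D65", "I46", "J96"],
    4: ["C80", "C57.9", "X59", "Y10-Y34"],
}


def parse(code):
    """Split 'K65.0' into ('K', 65, 0); a missing fourth character is None."""
    m = re.fullmatch(r"([A-Z])(\d{2})(?:\.(\d))?", code.strip().upper())
    if m is None:
        raise ValueError(f"not an ICD-10 code: {code!r}")
    letter, category, sub = m.groups()
    return letter, int(category), None if sub is None else int(sub)


def bounds(spec):
    """Turn 'I49-I50' or 'C57.9' into inclusive (low, high) sort keys.

    A three-character bound covers all of its subcodes, so a lower bound
    starts at .0 and an upper bound ends at .9.
    """
    lo_txt, _, hi_txt = spec.partition("-")
    lo = parse(lo_txt)
    hi = parse(hi_txt or lo_txt)
    return (
        (lo[0], lo[1], 0 if lo[2] is None else lo[2]),
        (hi[0], hi[1], 9 if hi[2] is None else hi[2]),
    )


def gc_type(code):
    """Return the GC type (1-4) for an ICD-10 code, or None if not a GC."""
    letter, category, sub = parse(code)
    key = (letter, category, 0 if sub is None else sub)
    for gc, specs in GC_RANGES.items():
        for spec in specs:
            lo, hi = bounds(spec)
            if lo <= key <= hi:
                return gc
    return None


if __name__ == "__main__":
    for code in ["R54", "I50.9", "J96.0", "X59", "I21.0"]:
        print(code, "->", gc_type(code))
```

Codes that return None, such as I21.0, are left on their original cause; only codes with a GC type enter the redistribution steps.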
To enhance comparability, we followed the conceptual approach developed by Murray and Lopez in the GBD and currently applied by WHO; namely, to reassign deaths from GCs to causes in our cause list. This approach can be divided into three steps: identify GCs; identify the target causes to which the deaths assigned to a GC should in principle be reassigned (based on pathophysiology or an assessment of certification practice); and choose the fraction of deaths assigned to a GC that should be reallocated to each target cause. In the work to date, the identification of target causes for a GC has been based on very general groupings, such as all injuries or all Group I diseases, and the allocation algorithm has largely been based on proportionate distribution within an age-sex group.
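Proportionate distribution within an age-sex group can be stated compactly: each target cause receives a share of the GC deaths equal to its share of the deaths already assigned to the target causes in that group. The Python sketch below works through this arithmetic. The function name, the data layout, the cause labels, and the death counts are all hypothetical and chosen only to make the calculation easy to follow.

```python
def redistribute_proportionately(deaths, garbage, targets):
    """Move deaths coded to `garbage` onto `targets` within each age-sex group.

    deaths:  {(age, sex, cause): count}
    garbage: label of the garbage code to remove
    targets: unique labels of the causes that may receive the deaths
    Each target receives GC deaths in proportion to its existing deaths in
    the same age-sex group. Groups with no target deaths are left unchanged.
    """
    out = dict(deaths)
    groups = {(age, sex) for (age, sex, _cause) in deaths}
    for age, sex in groups:
        gc_deaths = out.get((age, sex, garbage), 0.0)
        base = {t: out.get((age, sex, t), 0.0) for t in targets}
        total = sum(base.values())
        if gc_deaths == 0 or total == 0:
            continue  # nothing to move, or nothing to scale against
        del out[(age, sex, garbage)]
        for target, n in base.items():
            out[(age, sex, target)] = n + gc_deaths * n / total
    return out


if __name__ == "__main__":
    # Invented counts for one age-sex group.
    deaths = {
        ("60-64", "M", "IHD"): 300,
        ("60-64", "M", "stroke"): 100,
        ("60-64", "M", "cancer"): 500,
        ("60-64", "M", "GC"): 40,
    }
    adjusted = redistribute_proportionately(deaths, "GC", ["IHD", "stroke"])
    for key in sorted(adjusted):
        print(key, adjusted[key])
```

In this invented example the 40 GC deaths are split 3:1 between the two targets, the cause outside the target set is untouched, and the total number of deaths in the group is preserved.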
We expanded the approach taken in the literature. First, we carefully considered pathophysiology in identifying target causes for a GC. For example, for peritonitis, our targets include digestive diseases, such as intestinal obstruction; genitourinary diseases, such as salpingitis and oophoritis; pregnancy, childbirth, and puerperium conditions, such as abortions; and some intentional and unintentional injuries. Details for some examples (exposure to unspecified factor X59; female genital organ malignant neoplasm, unspecified site C57.9; heart failure I50; peritonitis K65; septicemia A40, A41) are provided in Additional file 4 to give further illustration of this approach.
Second, we distinguished three methods for assigning GC deaths to a set of target underlying causes: proportionate redistribution within an age-sex group, statistical models, and expert judgment. We combined these approaches depending on the type of GC. For causes with little information content, we used proportionate redistribution across target causes. In the case of heart failure (HF; ICD-10 I50), we developed a statistical model that helps identify the proportion of deaths for each target code within a given age-sex group. The algorithm removes all deaths coded to HF from the database and reassigns them to the target categories according to the estimated fractions. To estimate the fractions allocated to each target code, we regressed, by age, sex, and development status and using all available ICD-10 mortality data, the fraction of heart failure deaths among all deaths related to heart failure (heart failure plus its target causes).
Finally, for many GCs, we reviewed the published literature and engaged in consultation with GBD expert groups to develop an expert-based algorithm for assigning the fraction of deaths assigned to a GC within an age-sex group to be allocated to different target causes. A further criterion used in developing these expert algorithms was to compare the time trends in a cause by country across various revisions of the ICD. For example, the distribution of GCs to target codes for heart failure is a function of local epidemiology. Redistribution of GCs should in principle generate more plausible or continuous time trends commensurate with the underlying nature of a cause without observing the major discontinuities associated with a change in ICD.
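Whether the fractions come from a statistical model or from an expert algorithm, the final step is the same: within each age-sex group, the GC deaths are removed and split across target causes according to a set of shares. The Python sketch below illustrates that step only; it does not reproduce the regression or the expert review. The function name, the target causes, and the shares are hypothetical placeholders, not values from this study.

```python
def apply_fractions(deaths, garbage, fractions):
    """Reassign GC deaths using fixed shares for each age-sex group.

    deaths:    {(age, sex, cause): count}
    garbage:   label of the garbage code to remove
    fractions: {(age, sex): {target_cause: share}}; the shares for a group
               must sum to 1 so that no deaths are lost or created
    """
    out = dict(deaths)
    for (age, sex), shares in fractions.items():
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for {(age, sex)} do not sum to 1")
        moved = out.pop((age, sex, garbage), 0.0)
        for target, share in shares.items():
            key = (age, sex, target)
            out[key] = out.get(key, 0.0) + moved * share
    return out


if __name__ == "__main__":
    # Invented counts and shares for one age-sex group.
    deaths = {
        ("75-79", "F", "HF"): 200,
        ("75-79", "F", "IHD"): 1000,
        ("75-79", "F", "hypertensive heart disease"): 150,
    }
    fractions = {
        ("75-79", "F"): {
            "IHD": 0.5,
            "hypertensive heart disease": 0.3,
            "cardiomyopathy": 0.2,
        }
    }
    adjusted = apply_fractions(deaths, "HF", fractions)
    for key in sorted(adjusted):
        print(key, round(adjusted[key], 6))
```

Unlike proportionate redistribution, the shares here do not depend on the deaths already observed on each target, which is why a target with no recorded deaths in the group can still receive some.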
The algorithms for reassigning each of the GCs have been developed in Stata. While conceptually simple, the allocation of each GC to target causes for each age-sex group is computationally intensive. We intend to make our software available to researchers or government agencies to enhance the comparability of their own data. We are currently producing a usable version of the program code for the general public. Once complete, the software will be publicly available on the Web site of the Institute for Health Metrics and Evaluation.
In this study, we have extended work undertaken as part of the GBD Study and by WHO to provide tools to enhance the public health use of cause-of-death data. For a list of 56 causes, we mapped across ICD-7 Detail through ICD-10. We have identified four types of GCs in all versions of the ICD and country-specific cause-of-death lists. For each of these GCs, we have identified likely codes for which these deaths should ideally be assigned based on pathophysiology or certification practice. Practical algorithms to redistribute these deaths have been developed and implemented using statistical software. These algorithms have been applied to a database of more than 700 million ICD-coded deaths that are available from public sources covering more than 4,000 country-years. Based on our results, we believe that these algorithms can be generally applied to country-level ICD data by analysts interested in comparability over time and place. Through the application of these approaches, we believe that the public health utility of cause-of-death data can be substantially enhanced, leading to increased demand for higher quality cause-of-death data from health sector decision-makers.
These CoD analysis algorithms affect our interpretation of trends and the relative rankings of countries for selected causes. For example, if we compare country-by-country rankings of the age-standardized death rate for ischemic heart disease (83 countries in 2005), the effect of GC redistribution is to change the rank of 19 countries by two to four ranks and 49 countries by five or more ranks. Similar findings hold true across nearly all causes. For example, for deaths due to transport injuries, 21 countries change by two to four ranks and 51 countries by five or more ranks. Perhaps even more importantly, for some major noncommunicable causes, the overall effect of mapping and GC redistribution is to change the trend over time. As noted as long ago as 1976, the timing of the epidemiological transition is substantially influenced by the correction of GCs and bridge coding.
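The rank comparison described here can be reproduced for any cause by ranking countries on the death rate before and after redistribution and tallying the absolute change in rank. The Python sketch below shows the calculation on invented data; the function names, the six placeholder countries, and the rates are hypothetical and bear no relation to the results reported above.

```python
def rank(rates):
    """Rank countries from the highest death rate (rank 1) to the lowest."""
    ordered = sorted(rates, key=rates.get, reverse=True)
    return {country: i + 1 for i, country in enumerate(ordered)}


def rank_shift_summary(before, after):
    """Count countries by the absolute change in rank after redistribution."""
    rank_before, rank_after = rank(before), rank(after)
    shifts = [abs(rank_before[c] - rank_after[c]) for c in before]
    return {
        "0-1": sum(1 for d in shifts if d <= 1),
        "2-4": sum(1 for d in shifts if 2 <= d <= 4),
        "5+": sum(1 for d in shifts if d >= 5),
    }


if __name__ == "__main__":
    # Invented age-standardized rates per 100,000 for six placeholder countries.
    before = {"A": 100, "B": 90, "C": 80, "D": 70, "E": 60, "F": 50}
    after = {"A": 100, "B": 95, "C": 120, "D": 70, "E": 60, "F": 98}
    print(rank_shift_summary(before, after))
```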
In this work, we have looked in much greater depth at the likely target causes to which GCs should be redistributed and have explored three different methods for choosing the fraction in an age-sex group that should be allocated to each target cause. There is, nevertheless, substantial scope for further research on choosing these redistribution proportions for each GC onto target underlying causes. Ideally, for validation, one would like to collect a dataset where the "true" underlying cause is known based on autopsy or extensive clinical records but the deaths have been assigned to a GC in the normal course of death registration [37–39]. This, however, is unlikely to occur because most deaths with an autopsy or extensive clinical records are not assigned to GCs on their death certificate. Ex post studies are hard to conduct because the records required to ascertain underlying cause may not have been collected or may not be available. Nevertheless, innovative methods such as matching or blinded death certification may be applicable to the challenge of putting the GC redistribution algorithms on a stronger empirical footing. An important area for research will also be to characterize the uncertainty in the redistribution algorithms so that this uncertainty can be reflected in the adjusted death rates for a cause in a particular country and year.
Figure 4 shows that the fraction of deaths assigned to GCs across countries is highly variable even in the latest year of data availability. If all countries had the resources and policy commitment to achieve the levels of quality seen in New Zealand or Australia, the quality of cause-of-death data around the world would be dramatically improved. While WHO undertakes important efforts to help countries implement ICD revisions, the global health community has invested little in helping countries more effectively implement cause-of-death certification and coding. For public health analysis, we believe that it would be useful to clearly communicate to physicians who are going to complete death certificates that certain causes of death should not be used because they either cannot be underlying causes of death or are immediate or intermediate causes of death. Application of the algorithms in this study may help national authorities to demonstrate the extent of garbage coding and therefore motivate further action at the local level to improve the quality of certification [41, 42].
While we have made substantial efforts to consistently map a limited set of important causes of death across the various revisions of the ICD and to deal with the challenge of GCs in each revision, many problems remain. Inconsistencies among the ICD eighth revision and other revisions were not totally solved. The capacity to reconstruct reasonable sequences for ICD-7- and ICD-8-coded data is more limited due to the fact that much of the data are reported using limited tabulation lists.
We distinguish mapping across revisions of the ICD to maximize comparability from formal dual coding of a set of deaths according to two different revisions of the ICD. Such formal bridge coding studies are available for a few select countries and a limited number of ICD revision changes. Comparable cause-of-death statistics, however, require the more general approach of mapping across revisions of the ICD. We recognize the problems associated with applying a universal algorithm across all countries but have designed our choice of causes in the cause list and mapping over the revisions of the ICD to facilitate comparisons wherever possible.
Despite its expansion into areas of health care beyond mortality statistics, the ICD remains the global standard reference frame for describing and analyzing major health problems in society. Efforts such as this to enhance the utility of this information for public health analysis should highlight the intrinsic value of vital registration data with standardized death certification and ICD coding. The ICD, together with the work of WHO to revise and maintain the classification, is a true international public good that requires ongoing support from the global health community.
BTL: Basic Tabulation List (ICD-9)
CoD: Cause of Death
GBD: Global Burden of Disease (Study)
ICD: International Statistical Classification of Diseases and Related Health Problems
IHME: Institute for Health Metrics and Evaluation
ILCD: International List of Causes of Death
UCD: Underlying Cause of Death
WHO: World Health Organization
ACME: Automatic Classification of Medical Entry
This research was supported by core funding from the Bill & Melinda Gates Foundation. Special thanks to Justin Ross, who started this project as an IHME Post-Bachelor Fellow, and Post-Graduate Fellow Samath Dharmaratne, who helped with the database construction. We also thank Christopher Murray and Alan Lopez for their feedback and advice, and the WHO and PAHO staff who responded to our data requests and questions, particularly Colin Mathers, Doris MaFat (WHO), and Fatima Marhino (PAHO). Finally, we thank Kate Jackson for initial assistance on the project and Catherine Claiborne for editorial assistance and, as a program officer, for logistical support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.