Random forests for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards
© Flaxman et al; licensee BioMed Central Ltd. 2011
Received: 14 April 2011
Accepted: 4 August 2011
Published: 4 August 2011
Computer-coded verbal autopsy (CCVA) is a promising alternative to the standard approach of physician-certified verbal autopsy (PCVA), because of its high speed, low cost, and reliability. This study introduces a new CCVA technique and validates its performance using defined clinical diagnostic criteria as a gold standard for a multisite sample of 12,542 verbal autopsies (VAs).
The Random Forest (RF) Method from machine learning (ML) was adapted to predict cause of death by training random forests to distinguish between each pair of causes, and then combining the results through a novel ranking technique. We assessed quality of the new method at the individual level using chance-corrected concordance and at the population level using cause-specific mortality fraction (CSMF) accuracy as well as linear regression. We also compared the quality of RF to PCVA for all of these metrics. We performed this analysis separately for adult, child, and neonatal VAs. We also assessed the variation in performance with and without household recall of health care experience (HCE).
For all metrics, for all settings, RF was as good as or better than PCVA, with the exception of a nonsignificantly lower CSMF accuracy for neonates with HCE information. With HCE, the chance-corrected concordance of RF was 3.4 percentage points higher for adults, 3.2 percentage points higher for children, and 1.6 percentage points higher for neonates. The CSMF accuracy was 0.097 higher for adults, 0.097 higher for children, and 0.007 lower for neonates. Without HCE, the chance-corrected concordance of RF was 8.1 percentage points higher than PCVA for adults, 10.2 percentage points higher for children, and 5.9 percentage points higher for neonates. The CSMF accuracy was higher for RF by 0.102 for adults, 0.131 for children, and 0.025 for neonates.
We found that our RF Method outperformed the PCVA method in terms of chance-corrected concordance and CSMF accuracy for adult and child VA with and without HCE and for neonatal VA without HCE. It is also preferable to PCVA in terms of time and cost. Therefore, we recommend it as the technique of choice for analyzing past and current verbal autopsies.
KeywordsVerbal autopsy cause of death certification validation machine learning random forests
Verbal autopsy (VA) is a technique for measuring the cause-specific mortality burden for deaths that occur outside of hospitals. In VA, a trained interviewer collects detailed information on signs and symptoms of illness from laypeople familiar with the deceased. These interviews are analyzed by experts or by computer to estimate 1) the cause of death for each individual and 2) the distribution of causes of death in a population. This information can then be used by policy developers, donors, governments, or decision-makers to choose wisely in developing, requesting, and allocating health resources. For VA to provide useful information to individuals or to society, it is essential that the results of these interviews be mapped to the underlying cause of death accurately and quickly. Physician-certified verbal autopsy (PCVA) is currently the most common approach to mapping VA interviews to underlying cause of death, but this approach is expensive and time-consuming .
Machine learning (ML) methods are computer algorithms that infer patterns from examples . In a classification task like VA analysis, an ML method processes a set of examples ("training data") that has gold standard classifications, and develops a model to classify additional data. Developing and refining ML methods is a vibrant area of research in computer science, and numerous new methods have been introduced over the past 50 years. One influential ML method, the artificial neural network (ANN), was applied to VA 10 years ago . This approach was deemed potentially useful, pending further evaluation. By casting VA analysis as an application of general ML methods, incremental advances in ML techniques can be directly applied to improve the accuracy of VA analysis.
The Random Forest (RF) is an exciting innovation in ML technology . The RF has been used extensively in many domains for classification tasks, and is consistently one of the top approaches . Examples of using ML techniques in various domains include gene selection and classification of microarray data , modeling structural activity of pharmaceutical molecules , and protein interaction prediction . For this study, we developed an application of the RF Method to VA analysis and compared the performance of RF to PCVA.
An overview of random forests
Unlike expert algorithms, however, the decision trees in Breiman's Random Forest are generated automatically from labeled examples (the training dataset), without guidance from human experts. Instead, a random resampling of the training dataset is generated by drawing examples with replacement from the training dataset, and then a decision tree is constructed sequentially from this, starting from the root. At each node, the algorithm selects a random subset of signs and symptoms to consider branching on, and then branches on the one that best distinguishes between the labels for examples relevant to that node, halting when all relevant examples have the same label. Because of the randomness in this process, running the approach repeatedly on the same training dataset yields different trees, and two such trees are depicted in Figure 1b.
Validation using the PHMRC gold standard test/train datasets
Numbers of VAs collected by site and gold standard level
Andhra Pradesh, India
Dar es Salaam, Tanzania
Mexico City, Mexico
Pemba Island, Tanzania
Uttar Pradesh, India
Murray et al. have shown that many traditional metrics of performance, such as specificity or relative and absolute error in CSMFs, are sensitive to the CSMF composition of the test dataset  and recommend that robust assessment of performance be undertaken on a range of test datasets with widely varying CSMF compositions. Further, metrics of individual concordance need to be corrected for chance to adequately capture how well a method does over random or equal assignment across causes.
The PHMRC has developed a set of 500 test/train splits of the data, which we analyzed. The splits were generated randomly, stratified by cause. Each has a random 75% of examples of each cause in the training set and 25% in the test set. For each split, we used the training data to generate random forests for each pair of causes and then we applied these forests to the test dataset. We never allowed contamination between the training data and the test data - they were kept strictly separate in all steps of the analysis. Further, the cause composition of the test dataset is based on a random draw from an uninformative Dirichlet distribution. The Dirichlet distribution specifies random fractions that sum to 1. Each test split is resampled with replacement to meet the cause fractions specified by a Dirichlet draw. Consequently, each test split has a different distribution of cause fractions, and the cause composition of the training data and test data are always different.
We assessed the performance of RF at assigning individual causes of death using median chance-corrected concordance by cause across the 500 test datasets and the median average chance-corrected concordance across causes in the 500 test datasets, following the recommendations of Murray et al . For assessing the performance of RF in estimating CSMFs, we calculated the median CSMF accuracy as well as slope, intercept, and root mean squared error (RMSE) of a linear regression for each cause as a summary of the relationship between estimated CSMFs for a cause and the true CSMF in a particular test dataset . We benchmark RF against PCVA on the same dataset using the results reported by Lozano et al .
Murray et al. analyzed data in China two ways: including all items and excluding items that reflected the decedent's health care experience (HCE) . The purpose of excluding the HCE items is to assess how RF would perform on VA for communities without access to health care. They found, for example, that a considerable component of PCVA performance was related to the household recall of hospital experience or availability of a death certificate or other records from the hospital. We assessed the performance of RF in adults, children, and neonates both with and without the free-response items and the structured questions that require contact with health care to answer (marked in Additional files 1, 2, and 3).
There are many potential variations in implementing RF. Specifically:
Continuous and categorical variables can be included as is, or can be dichotomized to reduce noise
The training data can be reweighted so that all causes are represented equally or left as is
Decision trees can compare cause j to all other causes at once, or compare cause j to each other individual cause to come up with "votes"
The signal-to-noise ratio can be improved by removing low-information items using the Tariff Method , or all items can be used
Different numbers of signs and symptoms can be used at each decision node
Different numbers of trees can be used in the forest
Cause assignment can be based on the highest scoring cause for each death or on ranking the scores and assigning to the cause with the highest rank
We conducted an extensive sensitivity analysis to understand the importance of decisions between levels of Tariff-based item reduction, the choice of number of signs and symptoms at every decision node (m), the choice of number of trees (n) in each one-versus-one cause classification, and the difference between max-score and max-rank cause assignment. To avoid overfitting the data when selecting between the model variants, we conducted our sensitivity analysis using splits 1 to 100 and repeated the analysis using splits 101 to 200 and a random subset of 50 splits. The results of the sensitivity analysis are included in Additional file 4 and show that cause assignment by rank is superior to assignment by score but that the other parameters do not affect chance-corrected concordance or CSMF accuracy. The results shown in the next section are all for the one-versus-one model, with dichotomized variables, with training data reweighted to have equal class sizes, using the 40 most important Tariff-based symptoms per cause, m = 5, n = 100, and the max-rank cause assignment, which produced the highest CSMF accuracy for seven of the first 200 splits of the child VA data with HCE and the highest chance-corrected concordance for 14.
Individual cause assignment compared to PCVA
Median chance-corrected concordance (%) for RF and PCVA, by age group with and without HCE
The differential value of HCE to RF in adult VA is more substantial than in child or neonatal VAs. Including HCE responses yields a significant relative increase of 10.3% in median chance-corrected concordance for adult VA. This could be because adults have more substantial experience with health care, and hence more relevant information is generated that aids in VA analysis, or it could be confounded by the differences between the adult, child, and neonate cause lists. In PCVA, however, including HCE responses produces a large increase in median chance-corrected concordance for all modules. In all six of these settings, the median chance-corrected concordance is significantly higher for RF than for PCVA.
Another advantage of RF over PCVA is its relatively consistent performance in the presence and absence of HCE variables. PCVA concordances vary significantly with absence of HCE variables (e.g., for 22 causes of adult deaths, without HCE, concordance decreased by more than 10 percentage points). On the other hand, RF concordance only decreases substantially in 15 adult causes. In addition, RF shows more consistency among all causes. For example, its minimum median chance-corrected concordance in adult causes is 7.9% (without HCE) and 10.7% (with HCE), while minimum median chance-corrected concordance for PCVA without HCE is negative for two causes (meaning PCVA did worse than chance). RF does benefit substantially from HCE variables for certain important causes, however. For example, for adult deaths due to tuberculosis, AIDS, diabetes, and asthma, chance-corrected concordance increased by more than 20 percentage points when HCE variables were included.
CSMF estimation compared to PCVA
Median CSMF accuracy for RF and PCVA, by age group with and without HCE
The results of performing RF with a higher number of trees in each one-versus-one cause classifier showed that the method is stable by only using 100 trees per classifier. It should be noted that, while in the literature it is suggested that increasing the number of trees increases the classification precision, as our overall RF Method includes an ensemble of one-versus-one classifiers (e.g., for adult VAs, RF has one-versus-one classifiers, each including 100 trees), the overall number of trees is high, which results in stable performance.
We found that the RF Method outperforms PCVA for all metrics and settings, with the exception of having slightly lower CSMF accuracy in neonates when HCE was available. Even in this single scenario, the difference in CSMF accuracy is not statistically significant, and furthermore, the PCVA analysis for neonates was limited to a six cause list, while the RF analysis was done on the full 11 cause list. The degree of the improvement varies among metrics, among age modules, and with the presence or absence of HCE variables. When the analysis is conducted without HCE variables, RF is particularly dominant.
The superior performance of RF compared to PCVA with respect to all of our quality metrics is excellent because this method also reduces cost, speeds up the analysis process, and increases reliability. While it may take days for a team of physicians to complete a VA survey analysis, a computer approach requires only seconds of processing on hardware that is currently affordably available. In addition, using machine learning leads to reliability, since the same interview responses will lead to the same cause assignment every time. This is an important advantage over PCVA, which can produce results of widely varying quality among different physicians, according to their training and experience .
Despite these strengths of RF, the method does have weaknesses in individual-level prediction of certain causes. For example, chance-corrected concordances for malaria and pneumonia in adults are around 25% even with HCE. Chance-corrected concordances for encephalitis, sepsis, and meningitis in children are in the 15% to 25% range. However, in many applications, it is the population-level estimates that are most important, and the linear regression of true versus estimated cause fraction shows that for these causes, RF has a RMSE of at most 0.009 for the adult causes and 0.02 for the child causes. It may be possible to use these RMSEs together with the slopes and intercepts to yield an adjusted CSMF with uncertainty.
While the ANN method used by Boulle et al. 10 years ago  showed the potential of using ML techniques, the RF Method we have validated here has proven that ML is ready to be put into practice as a VA analysis method. ML is an actively developing subdiscipline of computer science, so we expect that future advances in ML classification will be invented over the coming years, and VA analysis techniques will continue to benefit from this innovation. During the development of our approach, we considered many variants of RF. However, the possibilities are endless, and even some other variant of RF may improve on the method presented here. For example, nonuniformly increasing the number of trees in the forest to have proportionately more for select causes (in the spirit of Boosting ) is a potential direction for future exploration.
For any ML classifier to be successful, several requirements should be met. As discussed earlier, the accuracy of classification relies considerably on the quality of the training data (deaths with gold standard cause known to meet clinical diagnostic criteria). While the PHMRC study design collected VA interviews distributed among a wide array of causes from a variety of settings, certain causes were so rare that too few cases occurred to train any ML classifier to recognize them. Future studies could focus on collecting additional gold standard VAs for priority diseases to complement the PHMRC dataset. These additional data could improve the accuracy of RF and other ML models on certain selected causes. Future research should also focus on assessing VA's performance in different settings. For example, users in India may be interested specifically in how RF performs in India instead of across all of the PHRMC sites, particularly if it is possible to train the model only on validation deaths from India.
All VA validation studies depend critically on the quality of validation data, and this RF validation is no exception. A unique feature of the PHMRC validation dataset, the clinical diagnostic criteria, ensures that the validation data are very precise about the underlying cause of death. However, this clinical diagnosis also requires that the deceased have some contact with the health system. The validity of the method therefore depends critically on the assumption that the signs and symptoms observed in the deaths that occur in hospitals for a given cause are not substantially different than deaths from that cause that occur in communities without access to hospitals. We have investigated this assumption by conducting our analysis with and without HCE items, which gives some indication of the potential differences.
The machine learning technique described in this paper will be released as free open source software, both as stand-alone software to run on a PC and also as an application for Android phones and tablets, integrated into an electronic version of the VA instrument.
We presented an ML technique for assigning cause of death in VA studies. The optimization steps taken to improve the accuracy of RF classifiers in VA application were presented. We found that our RF Method outperformed PCVA in chance-corrected concordance and CSMF accuracy for adult and child VA with and without HCE and for neonatal VA without HCE. In addition, it is preferable to PCVA in terms of both cost and time. Therefore, we recommend it as the technique of choice for analyzing past and current verbal autopsies.
artificial neural network
computer-coded verbal autopsy
cause-specific mortality fraction
physician-certified verbal autopsy
Population Health Metrics Research Consortium
root mean squared error
health care experience
ischemic heart disease.
This research was conducted as part of the Population Health Metrics Research Consortium: Christopher JL Murray, Alan D Lopez, Robert Black, Ramesh Ahuja, Said Mohd Ali, Abdullah Baqui, Lalit Dandona, Emily Dantzer, Vinita Das, Usha Dhingra, Arup Dutta, Wafaie Fawzi, Abraham D Flaxman, Sara Gomez, Bernardo Hernandez, Rohina Joshi, Henry Kalter, Aarti Kumar, Vishwajeet Kumar, Rafael Lozano, Marilla Lucero, Saurabh Mehta, Bruce Neal, Summer Lockett Ohno, Rajendra Prasad, Devarsetty Praveen, Zul Premji, Dolores Ramírez-Villalobos, Hazel Remolador, Ian Riley, Minerva Romero, Mwanaidi Said, Diozele Sanvictores, Sunil Sazawal, Veronica Tallo. The authors would like to additionally thank Charles Atkinson for managing the PHMRC verbal autopsy database and Michael K Freeman, Benjamin Campbell, and Charles Atkinson for intellectual contributions to the analysis.
This work was funded by a grant from the Bill & Melinda Gates Foundation through the Grand Challenges in Global Health initiative. The funders had no role in study design, data collection and analysis, interpretation of data, decision to publish, or preparation of the manuscript. The corresponding author had full access to all data analyzed and had final responsibility for the decision to submit this original research paper for publication.
- Soleman N, Chandramohan D, Shibuya K: Verbal autopsy: current practices and challenges. Bull World Health Organ 2006, 84: 239-245. 10.2471/BLT.05.027003PubMed CentralView ArticlePubMed
- Mitchell TM: Machine Learning. 1st edition. New York, NY: McGraw-Hill Science/Engineering/Math; 1997.
- Boulle A, Chandramohan D, Weller P: A case study of using artificial neural networks for classifying cause of death from verbal autopsy. Int J Epidemiol 2001, 30: 515-520. 10.1093/ije/30.3.515View ArticlePubMed
- Breiman L: Random Forests. Machine Learning 2001, 45: 5-32. 10.1023/A:1010933404324View Article
- Caruana R, Karampatziakis N, Yessenalina A: An empirical evaluation of supervised learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning - ICML '08, Helsinki, Finland 2008, 96-103.View Article
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7: 3. 10.1186/1471-2105-7-3PubMed CentralView ArticlePubMed
- Svetnik V, Liaw A, Tong C, Wang T: Application of Breiman's Random Forest to modeling Structure-Activity Relationships of pharmaceutical molecules. In Multiple Classier Systems, Fifth International Workshop, MCS 2004, 9-11 June 2004, Proceedings, Cagliari, Italy. Volume 3077. Edited by: Roli F, Kittler J, Windeatt T. Lecture Notes in Computer Science, Berlin: Springer; 2004:334-343.
- Qi Y, Klein-Seetharaman J, Bar-Joseph Z: Random forest similarity for protein-protein interaction prediction from multiple sources. Pac Symp Biocomput 2005, 531-542.
- Breiman L, Friedman J, Olshen R, Stone C: Classification and Regression Trees. 1st edition. Wadsworth and Brooks, Monterey, CA; 1984.
- Anker M, Black RE, Coldham C, Kalter HD, Quigley MA, Ross D, Snow RW: A Standard Verbal Autopsy Method for Investigating Causes of Death in Infants and Children. Geneva: World Health Organization; 1999.
- Hastie T, Tibshirani R: Classification by pairwise coupling. Ann Statist 1998, 26: 451-471.View Article
- Murray CJL, Lopez AD, Black R, Ahuja R, Ali SM, Baqui A, Dandona L, Dantzer E, Das V, Dhingra U, Dutta A, Fawzi W, Flaxman AD, Gómez S, Hernández B, Joshi R, Kalter H, Kumar A, Kumar V, Lozano R, Lucero M, Mehta S, Neal B, Ohno SL, Prasad R, Praveen D, Premji Z, Ramírez-Villalobos D, Remolador H, Riley I, Romero M, Said M, Sanvictores D, Sazawal S, Tallo V: Population Health Metrics Research Consortium gold standard verbal autopsy validation study: design, implementation, and development of analysis datasets. Popul Health Metr 2011, 9: 27. 10.1186/1478-7954-9-27PubMed CentralView ArticlePubMed
- Murray CJL, Lozano R, Flaxman AD, Vahdatpour A, Lopez AD: Robust metrics for assessing the performance of different verbal autopsy cause assignment methods in validation studies. Popul Health Metr 2011, 9: 28. 10.1186/1478-7954-9-28PubMed CentralView ArticlePubMed
- Lozano R, Lopez AD, Atkinson C, Naghavi M, Flaxman AD, Murray CJL, the Population Health Metrics Research Consortium (PHMRC): Performance of physician-certified verbal autopsies: multisite validation study using clinical diagnostic gold standards. Popul Health Metr 2011, 9: 32. 10.1186/1478-7954-9-32PubMed CentralView ArticlePubMed
- Murray CJL, Lopez AD, Feehan DM, Peter ST, Yang G: Validation of the symptom pattern method for analyzing verbal autopsy data. PLoS Med 2007, 4: e327. 10.1371/journal.pmed.0040327PubMed CentralView ArticlePubMed
- James SL, Flaxman AD, Murray CJL, the Population Health Metrics Research Consortium (PHMRC): Performance of the Tariff Method: validation of a simple additive algorithm for analysis of verbal autopsies. Popul Health Metr 2011, 9: 31. 10.1186/1478-7954-9-31PubMed CentralView ArticlePubMed
- Schapire RE, Freund Y, Bartlett P, Lee WS: Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics 1998, 26: 1651-1686.View Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.