We found that the RF Method outperforms PCVA for all metrics and settings, with the single exception of slightly lower CSMF accuracy in neonates when HCE was available. Even in this scenario, the difference in CSMF accuracy is not statistically significant; furthermore, the PCVA analysis for neonates was limited to a six-cause list, while the RF analysis used the full 11-cause list. The degree of improvement varies among metrics, among age modules, and with the presence or absence of HCE variables. When the analysis is conducted without HCE variables, RF's advantage is particularly pronounced.
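For readers unfamiliar with the metric, CSMF accuracy compares estimated cause-specific mortality fractions against the true fractions, normalizing the total absolute error by its worst-case value. A minimal sketch of the calculation as commonly defined in the VA literature (the cause names and fractions below are invented purely for illustration):

```python
def csmf_accuracy(true_csmf, pred_csmf):
    """CSMF accuracy: 1 minus the total absolute error in the cause
    fractions, normalized by the maximum possible total error."""
    abs_error = sum(abs(true_csmf[c] - pred_csmf[c]) for c in true_csmf)
    max_error = 2.0 * (1.0 - min(true_csmf.values()))
    return 1.0 - abs_error / max_error

# Hypothetical three-cause example for illustration only.
true_fracs = {"malaria": 0.30, "pneumonia": 0.50, "other": 0.20}
pred_fracs = {"malaria": 0.25, "pneumonia": 0.55, "other": 0.20}
print(round(csmf_accuracy(true_fracs, pred_fracs), 4))  # 0.9375
```

A value of 1 indicates perfect agreement, while 0 corresponds to the worst possible set of estimated fractions.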
The superior performance of RF over PCVA on all of our quality metrics is particularly valuable because the method also reduces cost, speeds up analysis, and increases reliability. While a team of physicians may take days to complete a VA survey analysis, a computational approach requires only seconds of processing on affordable, widely available hardware. In addition, machine learning guarantees reliability: the same interview responses lead to the same cause assignment every time. This is an important advantage over PCVA, which can produce results of widely varying quality among different physicians, according to their training and experience.
Despite these strengths, RF does have weaknesses in individual-level prediction of certain causes. For example, chance-corrected concordances for malaria and pneumonia in adults are around 25% even with HCE, and those for encephalitis, sepsis, and meningitis in children are in the 15% to 25% range. However, in many applications it is the population-level estimates that matter most, and the linear regression of true versus estimated cause fraction shows that for these causes RF has an RMSE of at most 0.009 for the adult causes and 0.02 for the child causes. It may be possible to use these RMSEs, together with the slopes and intercepts, to yield an adjusted CSMF with uncertainty.
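As a sketch of how such an adjustment might work: if the validation regression takes the form estimated = intercept + slope × true, it can be inverted to back-correct a new estimated cause fraction, with the regression RMSE propagated into an approximate interval. All parameter values below are hypothetical, and this is one possible formulation rather than a validated procedure:

```python
def adjusted_csmf(estimated, slope, intercept, rmse, z=1.96):
    """Invert the validation regression estimated = intercept + slope * true
    to back-correct an estimated cause fraction, and propagate the
    regression RMSE into an approximate confidence interval."""
    point = (estimated - intercept) / slope
    half_width = z * rmse / abs(slope)
    lower = max(0.0, point - half_width)  # fractions cannot be negative
    return point, lower, point + half_width

# Hypothetical regression parameters for illustration only.
point, lo, hi = adjusted_csmf(estimated=0.12, slope=0.9,
                              intercept=0.02, rmse=0.009)
print(round(point, 4), round(lo, 4), round(hi, 4))
```

In practice the cause-specific slopes, intercepts, and RMSEs would come from the validation exercise itself, and the adjusted fractions would likely need to be renormalized to sum to one.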
While the ANN method used by Boulle et al. 10 years ago showed the potential of ML techniques, the RF Method we have validated here demonstrates that ML is ready to be put into practice as a VA analysis method. ML is an actively developing subdiscipline of computer science, so we expect further advances in classification over the coming years, and VA analysis techniques will continue to benefit from this innovation. During the development of our approach, we considered many variants of RF, but the possibilities are endless, and some other variant may yet improve on the method presented here. For example, nonuniformly increasing the number of trees in the forest so that select causes receive proportionately more (in the spirit of Boosting) is a potential direction for future exploration.
For any ML classifier to be successful, several requirements must be met. As discussed earlier, classification accuracy relies considerably on the quality of the training data (deaths with gold standard causes known to meet clinical diagnostic criteria). While the PHMRC study design collected VA interviews distributed among a wide array of causes from a variety of settings, certain causes were so rare that too few cases occurred to train any ML classifier to recognize them. Future studies could collect additional gold standard VAs for priority diseases to complement the PHMRC dataset; these additional data could improve the accuracy of RF and other ML models on the causes concerned. Future research should also assess VA performance in different settings. For example, users in India may be interested specifically in how RF performs in India rather than across all of the PHMRC sites, particularly if it is possible to train the model only on validation deaths from India.
All VA validation studies depend critically on the quality of the validation data, and this RF validation is no exception. A unique feature of the PHMRC validation dataset, the clinical diagnostic criteria, ensures that the validation data are very precise about the underlying cause of death. However, clinical diagnosis also requires that the deceased had some contact with the health system. The validity of the method therefore depends critically on the assumption that the signs and symptoms observed in hospital deaths from a given cause are not substantially different from those of deaths from the same cause occurring in communities without access to hospitals. We have investigated this assumption by conducting our analysis with and without HCE items, which gives some indication of the potential differences.
The machine learning technique described in this paper will be released as free open source software, both as stand-alone software for PCs and as an application for Android phones and tablets, integrated into an electronic version of the VA instrument.