This assessment of the performance of InterVA compared to gold standard cause of death assignment in a large multisite study shows an overall chance-corrected concordance of 24.2%, 24.9%, and 6.3% for adults, children, and neonates, respectively. At the level of estimating CSMFs, InterVA has a CSMF accuracy of 0.546 for adults, 0.504 for children, and 0.404 for neonates. Compared to PCVA, the performance of InterVA is much lower in terms of chance-corrected concordance, and it produces substantially larger errors in estimated CSMFs.
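For readers who want to see how the two metrics reported above are computed, the following is a minimal sketch. It assumes the definitions proposed by Murray et al.: chance-corrected concordance for a cause is its sensitivity rescaled by the 1/N hit rate expected from random assignment over N causes, and CSMF accuracy is one minus the summed absolute CSMF error divided by its maximum possible value. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def chance_corrected_concordance(true_causes, predicted_causes, cause, n_causes):
    """Sensitivity for `cause`, corrected for the 1/N rate expected by chance."""
    predictions = [p for t, p in zip(true_causes, predicted_causes) if t == cause]
    if not predictions:
        raise ValueError(f"no deaths with true cause {cause!r}")
    sensitivity = sum(p == cause for p in predictions) / len(predictions)
    chance = 1.0 / n_causes
    return (sensitivity - chance) / (1.0 - chance)


def csmf_accuracy(true_causes, predicted_causes, cause_list):
    """1 - sum|true CSMF - estimated CSMF| / (2 * (1 - min true CSMF)).

    Undefined (division by zero) if a single cause accounts for all deaths.
    """
    n = len(true_causes)
    true_counts = Counter(true_causes)
    pred_counts = Counter(predicted_causes)
    abs_error = sum(abs(true_counts[c] - pred_counts[c]) / n for c in cause_list)
    min_true_csmf = min(true_counts[c] / n for c in cause_list)
    return 1.0 - abs_error / (2.0 * (1.0 - min_true_csmf))


# Toy data (hypothetical): four deaths, three causes.
causes = ["a", "b", "c"]
truth = ["a", "a", "b", "c"]
assigned = ["a", "b", "b", "c"]

ccc_a = chance_corrected_concordance(truth, assigned, "a", len(causes))
accuracy = csmf_accuracy(truth, assigned, causes)
```

An overall chance-corrected concordance would then be the average of the cause-specific values across the cause list.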
Given some published studies, the poor performance of InterVA is surprising. Not all studies, however, have reported good concordance. Oti et al. compared InterVA to physician review on 1,823 deaths and found a chance-corrected concordance of 31.2% (authors' calculations), which is not much higher than reported here. One other validation study found a 33.3% chance-corrected concordance when comparing InterVA to physician review. Two factors may account for the difference between the findings here and those of the more favorable studies. First, the PHMRC database is the first VA validation study in which cause of death has been assigned using strict clinical diagnostic criteria rather than medical record review or hospital diagnosis. The distinction is critical: in medical record review, a chart may say myocardial infarction without documenting how this diagnosis was made. In the PHMRC dataset, a death from myocardial infarction requires at least one of the following: cardiac perfusion scan, electrocardiogram changes, documented history of coronary artery bypass grafting or percutaneous transluminal coronary angioplasty or stenting, coronary angiography, and/or enzyme changes in the context of myocardial ischemia. Second, it is difficult to compare across previous studies because they report different metrics, and results are given for only one CSMF composition in the test data. Murray et al. report that findings can vary widely as a function of CSMF composition, and therefore metrics based on a single CSMF composition can be highly misleading.
Reporting chance-corrected concordance and the results of regressing true CSMF on estimated CSMF for each cause provides a framework for analyzing the strengths and weaknesses of InterVA. Clearly, the program is currently better suited to identifying certain more obvious causes than other, more complex ones. The program's performance also varies with the cause fraction of each disease. This partly explains why different studies have shown different levels of accuracy for the program. InterVA could easily identify deaths with highly probable symptoms, such as road traffic injuries, but it struggled with less explicit causes such as infections. There also appeared to be some anomalous results from the program. For example, the program indicates that the probability of assigning drowning as a true cause is 0.99 if the respondent responded "yes" to the question "did s/he drown?" However, of the 117 adult deaths in which the respondent indicated that there was drowning, InterVA assigned "drowning" as the cause of death to only six (about 5%). We believe that this was the result of a coding error in the program. InterVA also tends to overpredict perinatal asphyxia in neonates. While we are less certain of the reason for this, we believe that it is a notable shortcoming of the program. We hope that the cause-specific results can be used to better inform expert priors for future Bayesian methods.
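The per-cause regression of true CSMF on estimated CSMF mentioned above can be illustrated with a short sketch. It assumes an ordinary least squares fit of a straight line across several test datasets with different CSMF compositions; the function name and the numbers are hypothetical and do not come from the study. A slope near 1 with an intercept near 0 would indicate that the estimated fraction tracks the true fraction for that cause.

```python
def fit_line(estimated_csmf, true_csmf):
    """Ordinary least squares fit of true CSMF = intercept + slope * estimated CSMF."""
    n = len(estimated_csmf)
    if n < 2 or n != len(true_csmf):
        raise ValueError("need two or more paired observations")
    mean_x = sum(estimated_csmf) / n
    mean_y = sum(true_csmf) / n
    sxx = sum((x - mean_x) ** 2 for x in estimated_csmf)
    if sxx == 0:
        raise ValueError("estimated CSMFs are all identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(estimated_csmf, true_csmf))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values for one cause across three test datasets.
estimated = [0.10, 0.20, 0.30]
true = [0.15, 0.25, 0.35]
slope, intercept = fit_line(estimated, true)
```

In this toy case the method underestimates the cause by a constant 0.05 in every dataset, which appears as a nonzero intercept rather than a slope different from 1.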
The analysis of InterVA compared to the other Bayesian automated approach, Simplified Symptom Pattern (SSP), also provides a clear indication of why InterVA is not working well. The analysis of SSP variants designed to approximate InterVA shows that four factors contribute to better results using SSP: the use of interdependencies in the symptom responses; the use of all the items in the WHO or PHMRC instrument rather than just the 106 items in InterVA; the use of empirical probabilities of symptoms conditional on the true cause rather than expert judgment; and, finally, the technical advantage of developing models for each cause relative to other causes rather than for all causes independently. Moving to empirical probabilities improved chance-corrected concordance by 4%, capturing the interdependencies of some items added another 6%, and expanding from the InterVA item list to the full item list added another 7%. The progressive improvement in the performance of the SSP variants shows how the limitations of the implementation of Bayes' theorem in InterVA contribute to its poor performance.
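To make the role of the conditional probabilities concrete, the sketch below applies Bayes' theorem under the simplifying assumption that symptoms are independent given the cause. It is an illustration only, not InterVA's or SSP's actual code: the function name, the two-cause list, the priors, and the probability tables are all hypothetical. Replacing the expert-assigned values in `symptom_given_cause` with frequencies tabulated from gold standard deaths would correspond to the move to empirical probabilities described above.

```python
def posterior(priors, symptom_given_cause, endorsed_symptoms):
    """P(cause | symptoms) proportional to P(cause) * product of P(symptom | cause).

    Only endorsed ("yes") symptoms update the prior in this sketch.
    """
    unnormalized = {}
    for cause, prior in priors.items():
        probability = prior
        for symptom in endorsed_symptoms:
            probability *= symptom_given_cause[cause][symptom]
        unnormalized[cause] = probability
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all causes have zero probability for these symptoms")
    return {cause: value / total for cause, value in unnormalized.items()}


# Hypothetical priors and conditional probabilities for a two-cause list.
priors = {"drowning": 0.1, "other": 0.9}
symptom_given_cause = {
    "drowning": {"reported_drowning": 0.99},
    "other": {"reported_drowning": 0.01},
}
result = posterior(priors, symptom_given_cause, ["reported_drowning"])
```

Because each symptom multiplies in separately, this form cannot capture interdependencies between symptom responses, which is one of the limitations the SSP variants were designed to address.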
This study has several limitations. First, because the InterVA and PHMRC cause lists had to be merged into a joint cause list, InterVA was essentially challenged to predict causes that it was not built to identify (such as specific types of injuries). Conversely, a number of causes that InterVA may predict very well were not included in the study (such as malnutrition in children). InterVA could in theory perform well for these causes, which would have increased its average chance-corrected concordance. Note that the cause list used for the assessment of PCVA performance was slightly longer, so the InterVA performance may have been slightly exaggerated. Second, a number of InterVA items were not mapped to the PHMRC survey (17 adult questions, 32 child questions, and 30 neonatal questions). Inclusion of these items would likely improve the performance of the tool. Third, InterVA predicted deaths in some age groups for causes that largely belong to other age groups. For example, it predicted preterm/small baby as a child cause and malnutrition as an adult cause. These deaths were assigned to the residual "other" category. This practice may also have exaggerated InterVA accuracy.
The contribution of this study is the use of gold standard cases for the validation of InterVA. The aforementioned studies only provide information on the relationship between InterVA and hospital-assigned or physician-reviewed cause of death. This study provides a direct comparison of InterVA to gold standard verified causes of death. It is also important to note that this study considers the performance of InterVA in a diverse cultural and epidemiological context. Further analysis from each of the sites will provide specific results about the performance of InterVA in each of the countries included in the PHMRC study.