Adapting the all-cause model to interpret VA material from deaths of women of reproductive age gave CSMFs that were broadly comparable to physician reviews, in the absence of any available "gold standard", while also offering inherent consistency of interpretation over time and place. Reliable and consistent population-level estimates of CSMFs, including the attribution of mortality to pregnancy-related causes, are the key requirements for monitoring MDG-5. Whilst statistical modelling may not capture all the subjective subtleties that reviewing physicians might apply to individual cases, it offers substantial advantages in efficiency, consistency and standardisation.
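As a point of reference, and using illustrative notation rather than that of the analysis itself, the cause-specific mortality fraction for a cause $c$ is simply the share of all deaths attributed to that cause:

\[
\mathrm{CSMF}_c = \frac{D_c}{\sum_k D_k},
\]

where $D_c$ is the number of deaths assigned to cause $c$, or the summed weight of those deaths when a model distributes each death across several likely causes.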
Rigorous validation of VA procedures is needed to establish confidence in the data collected, to understand the operational characteristics of VA in the populations under study, and to identify misclassification patterns, which may then be corrected[25, 26]. Poor validity for specific causes or cause-of-death categories raises questions not only about the utility of the specific VA tool, but also about the questionnaire used to collect data, interviewer skills and household awareness of health and disease. The extent of disagreement within the original physician diagnoses, and the further differences revealed by physician reassessment in the current study, highlight the lack of standardisation inherent in physician interpretation of VA material. It was not possible, with these data, to validate the model robustly by comparison with the original physician assessments. We accept that reassessment by a further physician, as described above, may do more to illustrate the vagaries of VA interpretation than to provide a standard for validation. It is important to recognise, however, that in many settings there is no absolute "gold standard" against which to validate the performance of alternative VA interpretative models.
What is often termed "validation of VA" includes multiple components: the validity and standardisation of VA instruments and interviews, the validity of VA interpretation(s), the validity of arbitration between interpreters, and the many validity issues surrounding candidate "gold standards" such as medical record assessments. Discussions of VA validity typically focus on sensitivity, specificity and positive predictive values (PPVs) derived by comparing VA diagnoses with a reference diagnosis. In general, two types of reference "gold standard" are used for validating VA tools: diagnoses from health facilities or medical records, and community-based physician review diagnoses[7, 19, 27]. Whilst facility-based validations allow VA findings to be compared with a relatively accurate medical diagnosis of cause of death, such studies are subject to selection and information bias and do not represent the populations for whom VA is intended, most of whom die without medical attention. Deaths from haemorrhage, for example, occur more rapidly than deaths from obstructed labour or pregnancy-related sepsis and are therefore likely to be under-represented in facility-based validations, since haemorrhaging women are less likely to reach a hospital before death, particularly in areas with poor transportation.
Ideally, therefore, the validity of VA should be assessed using a sample of community-based deaths. Physician review of VA data from community-based deaths has specific limitations, which have already been highlighted here and by others[9, 19]. The difficulties of sampling communities for VA validation studies and of tracing medical records (if they exist) to support physician diagnoses are further limitations of community-based studies.
Discussions of validity in terms of sensitivity, specificity and PPV assume that the referent diagnosis gives the right answer. This is reasonable if the objective is to assess whether alternative interpretation methods can be as accurate as the reference standard in the specific setting and time period of interest. It has been acknowledged, however, that VA diagnoses may be more accurate than the referent diagnosis in some instances[4, 7, 28]. Though not a formal validation of InterVA-M against the usual gold standards employed in VA studies, the current study compares the results of the probabilistic model with physician review of the same data, and assesses the model's performance in terms of comparability, reliability and adequacy of purpose, avoiding reference to sensitivity, specificity or PPVs, which would imply the inherent superiority of the physician review method. Efforts have been made to adjust for imperfect gold standards, including adjusting for the quality and quantity of evidence supporting the reference standard[8, 28], and such techniques may in future allow more thorough demonstrations of the validity of InterVA-M against proxy gold standards, at least for certain causes of death.
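For reference, these measures are conventionally defined from a two-by-two comparison of the VA diagnosis with the reference diagnosis for a given cause; the notation below is illustrative rather than taken from the studies cited:

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{PPV} = \frac{TP}{TP + FP},
\]

where $TP$, $FP$, $TN$ and $FN$ count deaths for which the VA and reference diagnoses agree or disagree on the cause in question. Each measure treats the reference diagnosis as correct, which is precisely the assumption questioned above.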
The proportion of deaths identified by the probabilistic model as occurring during pregnancy or within six weeks of pregnancy ending (30%) was somewhat lower than the proportion of maternal deaths among all deaths of females of reproductive age in Burkina Faso as estimated by WHO, UNICEF and UNFPA (37%). This discrepancy is difficult to account for, however, given the modelling used to arrive at the national estimates and the possibility of local differences in mortality patterns.
Categorisation of pregnancy-related deaths as direct or indirect, or as pre-, intra- or post-partum, will never be easy using VA data. The current version of the model does not attempt an "intra-partum" category, although combining the pregnant/recently delivered categorisation with specific maternal causes can reveal this to some extent. In principle the model could be adapted further around these issues, but more work is needed to arrive at consensus requirements.
The omission of free-text information from various algorithmic approaches to VA interpretation has hindered their acceptance and caused concern over validity[19, 20]. One study showed that the sensitivity of physician review of VA for neonatal causes of death was lower when only closed questions were used. Whilst InterVA-M can be used with any kind of VA data, identifying and extracting indicators from free text often requires greater medical knowledge and introduces subjectivity. Nevertheless, this study suggests little or no benefit from that process, in concordance with an earlier study. It may be that free-text information is more informative to physician reviewers than to modelling processes, but much of it duplicates the closed sections of VA interviews, possibly prolonging the interview process unnecessarily. Further investigation of the value of free-text information for the InterVA method, using existing VA data, is anticipated.
The precision achievable by VA methods continues to be debated. For public health monitoring, the precision that matters most is the ability to distinguish between causes that might be the targets of viable interventions. Generating multiple possible causes of death is likely to reflect more accurately the interactions between diseases that lead to death, and to highlight more realistically the dominant morbidity and mortality burdens at the community level. Insisting on single causes of death could distort estimates of overall mortality and of the potential gains from health interventions. Weighting single deaths among several causes could complicate analysis and comparison with other studies[20, 32]; however, previous work with InterVA data has incorporated this approach successfully[12, 13]. InterVA-M may also be well suited to public health monitoring using the established standards of the international death certificate (ICD-10), which allows for multiple causes in a causal pathway leading to death, although further consideration of how InterVA-M interprets the sequencing of events is needed.
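As a minimal illustration of the weighting approach (not the InterVA-M algorithm itself), the sketch below assumes that each death carries a small set of likely causes with model-derived likelihoods, and distributes each death fractionally across those causes before summing to CSMFs. The data structure and function name are hypothetical.

```python
from collections import defaultdict

def csmf_from_weighted_causes(deaths):
    """Aggregate fractional cause assignments into cause-specific
    mortality fractions (CSMFs).

    `deaths` is a list of dicts mapping cause name -> likelihood,
    one dict per death, each typically holding a few likely causes.
    Likelihoods are normalised per death, so every death contributes
    a total weight of 1 to the denominator.
    """
    totals = defaultdict(float)
    for causes in deaths:
        weight_sum = sum(causes.values())
        if weight_sum == 0:
            continue  # no cause assigned; could also be counted as "indeterminate"
        for cause, likelihood in causes.items():
            totals[cause] += likelihood / weight_sum
    n = sum(totals.values())  # equals the number of deaths with an assigned cause
    return {cause: weight / n for cause, weight in totals.items()}

# Hypothetical example: two deaths, each with partial cause likelihoods.
example = [
    {"haemorrhage": 0.6, "sepsis": 0.3, "anaemia": 0.1},
    {"obstructed labour": 0.7, "haemorrhage": 0.3},
]
print(csmf_from_weighted_causes(example))
```

Assigning each death wholly to its single most likely cause would instead discard the secondary causes, which is one way the distortion mentioned above can arise.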
The InterVA-M model will now be reviewed by a further expert panel, drawn from a range of diverse settings, which will examine the indicators, possible causes and associated probabilities currently used in the model, together with conceptual and contextual issues of terminology and regional variation. A similar process for the all-cause model resulted in an improvement in its overall performance. To ensure that InterVA-M becomes an acceptable tool for use across the developing world, in both research and service settings, we hope to find opportunities for more robust validation studies. A pilot version of the InterVA-M model, implemented on a handheld computer (PDA) to allow direct capture and interpretation of VA data, is also under test. Meanwhile, the preliminary version of the model, as described here, can be downloaded from the InterVA website.