The deadline for the Millennium Development Goals (MDGs) is less than five years away, and the need to quantify childhood mortality, understand its causes, and assess the effects of proposed interventions is central to MDG4. Neonatal deaths contribute about 40% of under-5 mortality globally. A recent evaluation of the INDEPTH network of Health and Demographic Surveillance Sites calls for all sites to use InterVA for coding of causes of death, since such approaches represent "the only viable strategy to produce timely and comparable cause of death statistics". Our study revised the InterVA method for verbal autopsy to improve its ability to identify causes of stillbirth and newborn death, and tested it in three populations.
In this study, physician review served as the reference standard against which InterVA was compared, because it was the only alternative source of cause of death assessment available for our study populations. This choice has limitations, however. Physicians are influenced by their experience, perception, and interpretation of local epidemiology [23, 27]. Moreover, they mostly use the open history to reach a decision and may not account consistently for all the indicators. Sensitivity and specificity of physician review compared with hospital diagnosis in neonatal populations varied between 64% and 74% in a recent study, and concerns about inter- and intrarater reliability are well described.
An alternative to physician diagnoses is the use of hospital records. Hospital diagnoses have been used to establish the sensitivity, specificity, and positive predictive value of VA diagnoses [8, 12, 20]. The main pitfall of hospital diagnoses in developing countries, particularly in rural settings, is that the CSMFs of deaths occurring in hospitals are likely to differ from those in communities. There is therefore a risk of increasing only the precision of an interpretative method, defined as its ability to reproduce hospital diagnoses in the population where it is tested. This would not necessarily produce correct results when the method is used in populations where access to hospitals and health care is limited. Moreover, the ability to recognize, recall, and report signs of illness may differ between hospital users and nonhospital users.
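The validation measures mentioned above can be illustrated with a minimal sketch. It assumes one VA-assigned cause and one reference cause per death; the function name `diagnostic_accuracy`, the cause labels, and the six example deaths are hypothetical and are not data from this study.

```python
def diagnostic_accuracy(va_causes, reference_causes, target):
    """Sensitivity, specificity, and positive predictive value of the VA
    diagnosis for one target cause, measured against a reference diagnosis."""
    if len(va_causes) != len(reference_causes):
        raise ValueError("Both lists must describe the same deaths")

    pairs = list(zip(va_causes, reference_causes))
    tp = sum(1 for va, ref in pairs if va == target and ref == target)
    fp = sum(1 for va, ref in pairs if va == target and ref != target)
    fn = sum(1 for va, ref in pairs if va != target and ref == target)
    tn = sum(1 for va, ref in pairs if va != target and ref != target)

    def ratio(numerator, denominator):
        # Undefined when the denominator is zero (e.g. no reference cases).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical example data, for illustration only.
va = ["infection", "prematurity", "infection", "asphyxia", "prematurity", "infection"]
ref = ["infection", "prematurity", "asphyxia", "asphyxia", "infection", "infection"]
result = diagnostic_accuracy(va, ref, "infection")
```

Because the positive predictive value depends on how common the target cause is among the deaths examined, a value estimated on hospital deaths need not carry over to community deaths with a different cause distribution.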
The results of InterVA as compared with physician review showed an almost identical ranking of causes of death. However, differences exist. Some of these differences can be explained by the way the model was constructed. Prematurity, for example, was over-diagnosed by InterVA in Zimbabwe and Nepal. This probably resulted from refining InterVA on a dataset in which clinicians were allowed to assign more than a single cause of death. In fact, when multiple causes of death are allowed, prematurity is more likely to be listed as a coexisting cause of death than when a single cause is selected. The model did not include "other" as a cause of death and would therefore have assigned such deaths to one of the available diagnoses.
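A comparison of cause rankings of this kind can be sketched as follows, assuming a single cause per death and taking the CSMF to be the share of deaths assigned to each cause. The helper names `csmf` and `ranking` and the example lists are hypothetical, not study data.

```python
from collections import Counter


def csmf(causes):
    """Cause-specific mortality fractions: share of deaths assigned to each cause."""
    if not causes:
        raise ValueError("At least one death is required")
    counts = Counter(causes)
    total = len(causes)
    return {cause: count / total for cause, count in counts.items()}


def ranking(fractions):
    """Causes ordered from largest to smallest fraction (ties broken by name)."""
    return sorted(fractions, key=lambda cause: (-fractions[cause], cause))


# Hypothetical assignments for the same six deaths, for illustration only.
interva = ["prematurity", "prematurity", "infection", "infection", "infection", "asphyxia"]
physician = ["prematurity", "infection", "infection", "infection", "asphyxia", "asphyxia"]

interva_rank = ranking(csmf(interva))
physician_rank = ranking(csmf(physician))
```

In this invented example both methods rank infection first but disagree on the order of prematurity and asphyxia, which is the kind of difference a population-level comparison would reveal.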
InterVA over-diagnosed neonatal infections compared with physician review in Zimbabwe, while the opposite happened in Nepal. This inconsistency could be due to the interpretation of signs by different physicians. Alternatively, it could be due to the selection of a priori probabilities. Greater understanding of the way physicians decide to value or ignore signs and symptoms may help in future refinements and evaluations of InterVA.
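The influence of a priori probabilities can be illustrated with a generic Bayes-type update. This is a minimal sketch that assumes signs are conditionally independent given the cause; it is not the InterVA probability matrix, and the function name `posterior_probabilities` and all probabilities below are hypothetical values chosen for illustration.

```python
def posterior_probabilities(priors, likelihoods, reported_signs):
    """Generic Bayes update: P(cause | signs) is proportional to
    P(cause) multiplied by P(sign | cause) for each reported sign."""
    unnormalized = {}
    for cause, prior in priors.items():
        probability = prior
        for sign in reported_signs:
            probability *= likelihoods[cause][sign]
        unnormalized[cause] = probability

    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("Reported signs are impossible under every cause")
    return {cause: p / total for cause, p in unnormalized.items()}


# Hypothetical P(sign | cause) values, not taken from InterVA.
likelihoods = {
    "infection": {"fever": 0.6},
    "prematurity": {"fever": 0.2},
}

# The same reported sign evaluated under two different sets of priors.
post_a = posterior_probabilities({"infection": 0.3, "prematurity": 0.7}, likelihoods, ["fever"])
post_b = posterior_probabilities({"infection": 0.1, "prematurity": 0.9}, likelihoods, ["fever"])

most_likely_a = max(post_a, key=post_a.get)
most_likely_b = max(post_b, key=post_b.get)
```

With identical reported signs, the most likely cause changes from infection to prematurity when the prior for infection is lowered, which shows how the choice of priors alone could shift the balance of diagnoses between settings.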
Stillbirths were included for practical and public health reasons. Although globally there are about 3.2 million stillbirths per year, reliable statistics are lacking. This information gap has to be addressed. About half of perinatal deaths are accounted for by stillbirths. The refinements including stillbirths in the model eliminate the need to differentiate between live births and stillbirths before processing VA data, making the method more suitable for use in large surveys. The separation between fresh and macerated stillbirths is relevant, as prevention strategies are different. The comparisons between InterVA and physician review in Malawi and Nepal suggest that InterVA can differentiate the two categories, although, as with neonatal deaths, there may be room for further refinement.
Case-by-case agreement was moderate in all datasets, although it was lower for Zimbabwe than for Nepal and Malawi. The new indicators and matrix probabilities were chosen and modified on the basis of the personal experience of the researchers, and subsequently tested and modeled on a subset of the Malawi data. There is a risk, therefore, that the tool may be too closely modeled on a sub-Saharan African setting (although the results from Nepal do not support this) or on a particular research setup. In addition, the modifications have so far not been put to a panel of experts and may need to be subject to a wider consensus.
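Case-by-case agreement between two methods is often summarized with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common choice; this section does not state which statistic was used, so that choice is an assumption, and the function name `cohen_kappa` and the example data are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must classify the same, non-empty set of deaths")

    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal cause distribution.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[cause] * counts_b[cause] for cause in counts_a) / (n * n)

    if expected == 1:
        return 1.0  # Both raters used a single identical category throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical cause assignments for six deaths, for illustration only.
interva_causes = ["infection", "prematurity", "infection", "asphyxia", "prematurity", "infection"]
physician_causes = ["infection", "prematurity", "asphyxia", "asphyxia", "infection", "infection"]
kappa = cohen_kappa(interva_causes, physician_causes)
```

Unlike a comparison of CSMFs, which can agree at the population level even when individual assignments differ, this statistic is sensitive to disagreement on each individual death.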
There may be important epidemiological and social explanations for the differences in the CSMFs in Malawi, Zimbabwe, and Nepal. However, even if the interpretation of verbal autopsy data by InterVA was consistent, methodological variability in other aspects of VA may have contributed to the observed cause distribution. Indeed, the close comparability of CSMFs between Malawi and Nepal may to some degree reflect common data capture processes that differ from those used in Zimbabwe. It is possible that in Nepal and Malawi, where the populations were part of research areas, respondents had been sensitized to recognize, describe, and recall signs of neonatal diseases, while in Zimbabwe the community was part of a government surveillance system and may have responded differently. Nevertheless, this is a reality of all VA studies conducted in research settings. The use of lay interviewers (in Malawi and Nepal) versus health-professional interviewers (in Zimbabwe), and the interviewers' gender, may also have had an impact on data capture. This highlights the need for further methodological research into the effects of other aspects of VA. It is likely that a number of strategies and international collaborations will be necessary to ensure the success of such investigations.
The modified version of InterVA for stillbirths and neonatal deaths produced plausible results when compared with physicians' opinions but had the advantage of being completely internally consistent, allowing standardized comparisons of data from different countries. Ultimately, standardized methods are essential, and their application and evaluation in a wide range of settings is encouraged. Through wider application, the strengths and weaknesses of InterVA, and of VA in general, will become more apparent, thereby better informing the application and public health utility of surrogate methods for measuring mortality in the absence of vital registration systems.