Assessing the repeatability of verbal autopsy for determining cause of death: two case studies among women of reproductive age in Burkina Faso and Indonesia
Population Health Metrics volume 7, Article number: 6 (2009)
Verbal autopsy (VA) is an established tool for assessing cause-specific mortality patterns in communities where deaths are not routinely medically certified, and is an important source of data on deaths among the poorer half of the world's population. However, the repeatability of the VA process has never been investigated, even though it is an important factor in its overall validity. This study analyses repeatability in terms of the overall VA process (from interview to cause-specific mortality fractions (CSMF)), as well as specifically for interview material and individual causes of death, using data from Burkina Faso and Indonesia.
Two series of repeated VA interviews relating to women of reproductive age in Burkina Faso (n = 91) and Indonesia (n = 116) were analysed for repeatability in terms of interview material, individual causes of death and CSMFs. All the VA data were interpreted using the InterVA-M model, which provides 100% intrinsic repeatability for interpretation, and thus eliminated the need to consider variations or repeatability in physician coding.
The repeatability of the overall VA process from interview to CSMFs was good in both countries. Repeatability was moderate in the interview material, and lower in terms of individual causes of death. Burkinabé data were less repeatable than Indonesian, and repeatability also declined with longer recall periods between the death and interview, particularly after two years.
While these analyses do not address the validity of the VA process in absolute terms, repeatability is a prerequisite for intrinsic validity. This study thus adds new understanding to the quest for reliable cause of death assessment in communities lacking routine medical certification of deaths, and confirms the status of VA as an important and reliable tool at the community level, but perhaps less so at the individual level.
Garenne and Faveau recently set out a brief history of verbal enquiries into cause of death, including identifying some of the possible limitations, but without mentioning any aspect of repeatability of the process. Over the past two decades, verbal autopsy (VA) has become an increasingly well-established approach for determining cause of death in populations lacking universal cause of death registration, and it is a very important source of data on deaths among the poorer half of the world's people. Following early VA work in West Africa, there have been a number of efforts towards standardisation of VA procedures. Much of this has concentrated on the standardisation of interview questionnaires, culminating in WHO's recently published standards. There has also been work on objective approaches to interpreting material from VA interviews [4–10], albeit with a possible trade-off between standardisation and subtlety of interpretation. Other studies have made comparisons between VA-derived cause of death and arguably "harder" evidence, such as hospital records for deaths occurring in institutions [9–13]. Some of these have been described as "validation" studies for VA, although they have generally only considered selected parts of the overall VA process. However, none of this work has objectively assessed the repeatability of the overall VA process (from individual interviews to aggregated mortality patterns), nor of its constituent parts (interviews, interpretations, individual causes of death, aggregated mortality patterns).
In this paper, we report the results of a study designed specifically to examine the repeatability of the VA process. The study was run in parallel in Burkina Faso and Indonesia, to allow comparison between two very different settings, both of which involved repeated VA interviews concerning deaths among women of reproductive age. We chose to exclude consideration of repeatability of the interpretation stage of the process by using the InterVA-M model, which gives an intrinsic 100% repeatability and also allows inter-country comparisons without needing to consider systematic differences between interpreters with different training and backgrounds.
Our major aim was to assess the repeatability of the overall VA process under operational conditions, from interview to aggregated mortality patterns, in the two different settings.
Subsidiary aims were:
to assess the repeatability of the interview stage of the VA process (in terms of material gathered in VA interviews);
to assess the repeatability of cause of death determination at the individual level;
to assess the repeatability of cause-specific mortality fractions (CSMF) determined at the population level.
It should be noted that this study did not aim to arrive at conclusions, other than on repeatability, about cause of death patterns in either setting. Nor did it aim to explain differences in mortality patterns between the two countries, to evaluate the clinical validity of VA, or to draw conclusions for health planning.
In both Burkina Faso and Indonesia, large-scale community-based surveys of mortality among women of reproductive age, with a particular focus on pregnancy-related mortality, were undertaken in 2005–6 (February to May 2006 in Burkina Faso and December 2005 to June 2006 in Indonesia) [15, 16]. These surveys involved identifying deaths that had occurred among women of reproductive age and then undertaking verbal autopsy interviews. These interviews were structured to include the collection of the "indicators" (a range of 75 questions with "yes" or "no/unknown" responses, covering background, pregnancy status, clinical history, signs and symptoms before death and obstetric history) that are needed as the input material for the InterVA-M model. This material from the interviews was interpreted using the InterVA-M model. For the purpose of these investigations into the repeatability of the VA process, subsamples of the originally identified cases were reselected on a purposeful basis, such that they approximately reflected the overall mortality patterns in the original surveys and were logistically feasible to re-interview. The majority of these cases either had no contact with health services around the time of death or case-notes were unavailable. These cases were then revisited in November 2007 (Burkina Faso) and January 2008 (Indonesia) and the VA interviews repeated, generally by different interviewers. Aspects of repeatability within the entire VA process, in both Burkina Faso and Indonesia, have been assessed within the conceptual framework shown in Figure 1.
The interview material thus consisted of individual sets of responses to the 75 possible InterVA-M indicators, and the repeatability of the interview stage of the overall process was assessed by calculating kappa statistics for each indicator, together with p-values assessing whether agreement between the original and follow-up interviews was significantly greater than that expected by chance.
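As a minimal sketch of the per-indicator agreement measure described above, the following computes Cohen's kappa for one binary indicator across paired interviews. The function name `cohen_kappa` and the ten example responses are hypothetical and are not study data; the p-values mentioned in the text are not computed here.

```python
from typing import Sequence


def cohen_kappa(first: Sequence[bool], second: Sequence[bool]) -> float:
    """Cohen's kappa for one yes / no-or-unknown indicator across paired interviews."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty response lists of equal length")
    n = len(first)
    # Observed agreement: share of cases answered the same way in both rounds.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement, from each round's marginal "yes" rate.
    p1 = sum(first) / n
    p2 = sum(second) / n
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    if expected == 1.0:
        # E.g. an indicator with no positive responses in either round.
        raise ValueError("kappa is undefined when both rounds give one constant answer")
    return (observed - expected) / (1 - expected)


# Hypothetical responses for one indicator in ten cases (not study data).
original = [True, True, True, False, False, False, False, False, True, False]
follow_up = [True, True, False, False, False, False, False, True, True, False]
kappa = cohen_kappa(original, follow_up)  # 8/10 observed vs 0.52 expected agreement
```

In this toy case eight of ten answers agree, but chance alone would produce 52% agreement, so kappa is about 0.58 rather than 0.8.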
The InterVA-M model was then run on all the original and follow-up interview material from both countries, to interpret likely pregnancy status and cause-of-death outcomes. The InterVA-M model generated, for each case, the most likely pregnancy status at the time of death (pregnant, delivered within 6 weeks or not pregnant within 6 weeks of death) with an associated likelihood. Then up to three likely alternative causes of death were generated, each with an associated likelihood. These likelihoods were used to ascribe fractional causes of death, as described previously. As the model provides 100% repeatability between indicator input and pregnancy status or cause-of-death output, its performance was not part of this repeatability assessment. The individual level pregnancy status and cause of death likelihoods were summed over all individuals and the proportions calculated for each status/cause according to whether the same status or cause was or was not represented in the output from both an individual's original and follow-up interviews, within each country.
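The concordance summary described above can be read in more than one way. The sketch below shows one plausible reading, not the authors' confirmed computation: likelihood attached to a cause counts as concordant when that cause appears in the output from both of a case's interviews. The function name `concordant_share`, the data layout and the cause names are all hypothetical.

```python
from typing import Dict, List, Tuple

CauseOutput = Dict[str, float]  # cause name -> likelihood for one interview


def concordant_share(pairs: List[Tuple[CauseOutput, CauseOutput]]) -> float:
    """Share of total likelihood attached to causes named in both interviews of a case."""
    concordant = 0.0
    total = 0.0
    for original, follow_up in pairs:
        shared = original.keys() & follow_up.keys()
        for output in (original, follow_up):
            for cause, likelihood in output.items():
                total += likelihood
                if cause in shared:
                    concordant += likelihood
    if total == 0:
        raise ValueError("no likelihood to summarise")
    return concordant / total


# Two invented cases: partial overlap in the first, none in the second.
pairs = [
    ({"malaria": 0.5, "tuberculosis": 0.5}, {"malaria": 1.0}),
    ({"haemorrhage": 1.0}, {"sepsis": 1.0}),
]
share = concordant_share(pairs)  # 1.5 of 4.0 units of likelihood are concordant
```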
The individual pregnancy status and cause of death outputs were then combined into overall pregnancy status fractions and CSMFs for the population samples in Burkina Faso and Indonesia respectively, which were then compared in terms of magnitude and rank order.
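The aggregation step can be sketched as follows, assuming each case is represented by a mapping from cause to fractional likelihood and that fractions are normalised by the total likelihood assigned. The function names `csmf` and `ranked`, the normalisation choice and the example cases are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def csmf(cases: List[Dict[str, float]]) -> Dict[str, float]:
    """Cause-specific mortality fractions from per-case fractional causes."""
    totals: Dict[str, float] = defaultdict(float)
    for case in cases:
        for cause, likelihood in case.items():
            totals[cause] += likelihood
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no likelihood to aggregate")
    # Assumption: normalise by total likelihood so the fractions sum to 1.
    return {cause: value / grand_total for cause, value in totals.items()}


def ranked(fractions: Dict[str, float]) -> List[str]:
    """Causes from largest to smallest fraction; ties broken alphabetically."""
    return sorted(fractions, key=lambda cause: (-fractions[cause], cause))


# Three invented cases with fractional causes (not study data).
cases = [
    {"malaria": 0.75, "sepsis": 0.25},
    {"malaria": 0.5, "haemorrhage": 0.5},
    {"haemorrhage": 1.0},
]
fractions = csmf(cases)
order = ranked(fractions)
```

Running the same two functions on the original and the follow-up outputs gives the paired magnitudes and rank orders that are then compared.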
Ethical approvals for the verbal autopsy studies in Burkina Faso and Indonesia, including follow-up interviews where needed, were granted by the Ministry of Health National Health Research Ethics Committee (Ouagadougou, Burkina Faso) and Centre MURAZ Institutional Review Board (Bobo-Dioulasso, Burkina Faso), and by the Faculty of Public Health Research Ethics Committee at the University of Indonesia (Jakarta, Indonesia).
A total of 207 VA interviews were successfully repeated, 91 in Burkina Faso and 116 in Indonesia. The basic characteristics of these cases (both the deceased women and the interview respondents) are summarised in Table 1. Educational levels among both the deceased women and the VA respondents were much lower in Burkina Faso than in Indonesia, as was the availability of amenities such as piped water and television. Respondents in Indonesia were generally younger and more likely to be female than those in Burkina Faso. In Indonesia, in 86/116 interviews (74%), the same respondent was re-interviewed at follow-up. In Burkina Faso, the identities of the original respondents were not known and so comparison was not possible. The mean recall period from the death to the original interview was 26 months in Burkina Faso and 8 months in Indonesia, and from the death to the follow-up interview it was 46 months and 29 months respectively.
Kappa statistics were calculated for each InterVA-M indicator in each country to assess the repeatability of the verbal autopsy interview process. The distribution of kappa statistics for all measurable InterVA-M indicators in relation to their positive response rate in each country is shown in Figure 2. Although the InterVA-M model captures a total of 75 indicators, some were either locally non-applicable or received no positive responses in one of the countries. Thus repeatability was measurable for 68 indicators in Burkina Faso and 63 in Indonesia. The mean κ was 0.24 (range -0.09 to 1.00) for Burkina Faso and 0.45 (range -0.03 to 0.92) for Indonesia. In Burkina Faso 30 out of 68 measurable indicators (44%) showed repeatability better than expected by chance at the p < 0.01 level. Full details are shown in Additional file 1. In Indonesia 52 out of 63 measurable indicators (83%) showed repeatability better than expected by chance at the p < 0.01 level. Full details are shown in Additional file 2.
Since recall period is obviously a potentially important factor in the repeatability of VA interviews, mean kappa statistics were also calculated separately for original interviews falling before or after the median recall time, for each country. In Burkina Faso the median time to first interview was 26 months, and for 51 indicators represented in interviews on both sides of the median, the mean κ was 0.27 for interviews up to and including the median, compared with 0.17 for interviews after the median. Of the 51 indicators, 76% had a lower κ value for interviews beyond the median recall time. In Indonesia, the median time to first interview was 7 months, and for 60 indicators the mean κ was 0.44 for interviews up to and including the median, compared with 0.45 for interviews after the median time.
Table 2 shows individual level agreement on pregnancy status and likely cause(s) of death for the Burkina Faso and Indonesian cases, after interpreting the VA interview material via the InterVA-M model. In Burkina Faso, 18.0% of the cause of death output was concordant between the original and follow-up interviews, while in Indonesia 25.0% was concordant. For pregnancy status, 67.2% and 85.0% respectively were concordant at the individual level.
Table 3 shows aggregated mortality as cause-specific mortality fractions (CSMF), together with ranked causes of death, in Burkina Faso and Indonesia, from the original and follow-up surveys. These results are presented at the population level, for each of the four series of VA interviews. Applying the Wilcoxon signed-rank test to the results from Burkina Faso and Indonesia gave results of z = 0.52 and 0.50 respectively, p > 0.6 in both cases. Thus there was no evidence of significant differences between original and follow-up CSMF patterns.
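For readers unfamiliar with the test, here is a minimal sketch of a Wilcoxon signed-rank z statistic for paired CSMFs, using the plain normal approximation with no tie or continuity correction. Whether the authors applied such corrections is not stated, so this need not reproduce the z values reported above. The function names and the five paired percentages are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_z(first: Sequence[float], second: Sequence[float]) -> float:
    """z statistic for the Wilcoxon signed-rank test (normal approximation)."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie or continuity correction
    return (w_plus - mean) / sd


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented CSMF percentages for five causes in two survey rounds (not study data).
original = [30.0, 20.0, 15.0, 10.0, 25.0]
follow_up = [28.0, 23.0, 14.0, 14.0, 21.0]
z = wilcoxon_z(original, follow_up)
p = two_sided_p(z)
```

In this invented example the positive and negative signed ranks balance exactly, so z is 0 and there is no evidence of a systematic shift between rounds.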
Table 4 shows the top five ranking causes of death and associated CSMFs from the original and follow-up surveys in both Burkina Faso and Indonesia, at the population level. In all four surveys, the top five causes of death accounted for approximately 60% of overall mortality.
The concept of VA, from the interview, through interpretation, to population-level results, is a complex one. This study has investigated the repeatability of three different stages of this overall process, which is entirely novel, and capitalised on the use of the InterVA-M interpretative model in order to eliminate any subjective or inter-country variation in the process of interpreting the VA interview material.
It is difficult to discuss these findings extensively in relation to other work, since very little attention has previously been given to the repeatability of VA. All of our interviews were conducted in real field conditions, typical of settings in which VA is an important tool, and where deaths in hospital are rare. We happened to use material relating to women of reproductive age, although it is likely that similar findings would apply to other population groups. While it is possible that we may have influenced the conduct of the VA process by, exceptionally, undertaking follow-up interviews, we believe this is unlikely. In all cases the interval between the original and follow-up interviews was between one and two years, which probably minimised any effect associated with recalling the previous interview, even where the same respondent was involved. On the other hand, the additional accumulation of recall time since the death itself may have influenced the way in which original events were remembered, and thus reduced repeatability.
VA respondents, recall and repeatability
In terms of the repeatability of the VA interview stage, it is clear that there were major differences between the interviews undertaken in Burkina Faso and Indonesia, with a markedly higher mean kappa statistic and proportion of indicators with non-chance agreement in Indonesia (Figure 2). We cannot say definitively what factors lie behind this major difference, although the marked differences observed in the characteristics of those who had died and of the interview respondents (Table 1) are putative factors. In particular, the lower educational levels, greater age and higher proportion of men among Burkinabé respondents could have resulted in a higher proportion of interviews in which the respondent did not clearly know or recall the sequence of events leading to death, with these shortcomings being reflected in inconsistencies between the original and follow-up interview material. However, the difference between the two countries was also confounded by different recall periods, and the analyses of kappa by recall time suggest that the effect of recall bias was much more pronounced in the longer recall periods experienced in Burkina Faso. From these findings it might be reasonable to conclude that VA interviews should, where possible, be undertaken within two years of death. However, since our objective here was to assess repeatability of VAs under operational conditions, rather than optimising the process, all these effects mean that our findings on reliability are possibly conservative.
When the individual-level pregnancy status and cause of death outputs were compared between the original and follow-up interviews, there was a disappointing lack of concordance, though this was perhaps not surprising given the extent of non-agreement between the interview material, particularly from Burkina Faso (Table 2). This finding supports the view that VA material may not be particularly well suited to individual-level cause of death determination, at least under the operational conditions encountered here.
However, when the output data were considered at the population level, there was a much clearer sense of agreement between the original and follow-up material (Table 3). As might be expected from the repeatability results at the first stage, there were still greater discrepancies in the Burkina Faso results compared with those from Indonesia. As would be expected, however, marked differences persisted in the overall patterns between the two countries, counteracting any suggestion that the whole VA process might amount to some kind of reduction to lowest common factors.
Implications of VA repeatability for health policy
Taking a public health perspective, Table 4 considers repeatability of the VA process in the context of the often-asked question "what are the major causes of mortality?". It is clear that, for women both in Burkina Faso and Indonesia, there was very good repeatability between the original and follow-up VAs in terms of generating summary information for health policy and planning, even though the specific country findings were, as expected, different.
The relative levels of repeatability, both between the two country settings involved and between the different stages of the VA process, are interesting. The obvious differences between the two countries make clear that the whole VA process is context-dependent, since it might reasonably be inferred that in a context where repeatability is lower, then the intrinsic validity of one-time interview material would also be lower. The effects of non-repeatability at different stages of the VA process are also interesting – our results suggest that, given moderate repeatability within the original interviews, the repeatability of cause of death at the individual level is seriously compromised. However, when a reasonable group of individual VAs are taken as aggregate entities (around 100 per setting in these results) then our findings suggest that there is some recovery of repeatability, certainly to levels that appear to be acceptable in terms of generating aggregate data for health planning. The effects behind this are not easy to quantify, but it seems likely that some of the non-repeatability of details at the interview stage may have the effect of tipping the balance between possible causes at the individual level, but with many such differences then cancelling out on aggregating causes to the population level.
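The cancelling-out effect suggested above can be illustrated with a deliberately artificial toy example: two cases swap causes between rounds, which lowers individual agreement while leaving the aggregate fractions unchanged. The cause labels, data and function names are invented, and single-cause assignment is a simplification of the fractional causes used in the study.

```python
from collections import Counter
from typing import Dict, List


def individual_agreement(first: List[str], second: List[str]) -> float:
    """Share of cases assigned the same cause in both rounds."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def fractions(causes: List[str]) -> Dict[str, float]:
    """Population-level fraction of cases assigned to each cause."""
    counts = Counter(causes)
    return {cause: count / len(causes) for cause, count in counts.items()}


# Six invented cases; cases 2 and 3 swap causes A and B at follow-up.
original = ["A", "A", "B", "B", "C", "C"]
follow_up = ["A", "B", "A", "B", "C", "C"]

agreement = individual_agreement(original, follow_up)  # four of six cases agree
same_pattern = fractions(original) == fractions(follow_up)  # aggregates are identical
```

Real discrepancies will not cancel this neatly, but the sketch shows why individual-level disagreement does not necessarily carry through to population-level fractions.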
Repeatability is often not explicitly assessed for health measurement tools – so why is it important in relation to VA and what do the results tell us? The origins of VA – as a proxy source of cause of death data in the absence of medical certification – have sometimes led to confusion about its fundamental nature. While it is not generally argued that VA should be used as a direct replacement for medical certification at the individual level, this principle has not always been made explicit in VA work. Consequently, attempts to validate VA [9–13] have tended to struggle for lack of clear, appropriate gold standards and methods, and the role of VA has to some extent been left in an uncertain position. At the same time, the rigour of medical certification of death is often not critically evaluated.
We therefore offer these analyses of VA repeatability as a fresh viewpoint on the overall process, to shed some light on the practice of VA under realistic operational conditions and the value of the ensuing results. This partly follows from our previous consideration of the question "Who needs cause of death data?", since VA cannot be considered as reducible to a single one-size-fits-all tool, and must be contextualised.
At least in contexts emphasising community-level cause-specific mortality, the findings of these repeatability analyses are encouraging. Although good repeatability does not guarantee good validity, at least it suggests that intrinsic validity is not compromised by random effects in the overall process. It also became clear that the longer recall periods associated with some of the Burkinabé interviews were detrimental to repeatability – and so presumably to validity. At the same time, overall repeatability was lower in Burkina Faso than in Indonesia, possibly because of different respondent profiles, which again emphasises the importance of considering VA material contextually, rather than simply in terms of standardised methods.
The overall process of VA, from interview to CSMF, has been shown to have good repeatability for two very different communities. However, VA outcomes were less repeatable at the individual level, and recall periods beyond two years compromised repeatability. Although repeatability does not demonstrate validity, it is a prerequisite, and so this study adds new understanding to the quest for reliable cause of death assessment in communities lacking routine medical certification of deaths.
Garenne M, Faveau V: Potential and limits of verbal autopsies. Bull World Health Organ 2006, 84: 164. 10.2471/BLT.05.029124
Garenne M, Fontaine O: Assessing probable causes of deaths using a standardized questionnaire: a study in rural Senegal. Reprinted (2006) in Bull World Health Organ 1986, 84: 248-253.
Baiden F, Bawah A, Biai S, Binka F, Boerma T, Byass P, Chandramohan D, Chatterji S, Engmann C, Greet D, Jakob R, Kahn K, Kunii O, Lopez AD, Murray CJ, Nahlen B, Rao C, Sankoh O, Setel PW, Shibuya K, Soleman N, Wright L, Yang G: Setting international standards for verbal autopsy. Bull World Health Organ 2007, 85: 569-648. 10.2471/BLT.07.043745
Reeves BC, Quigley M: A review of data-derived methods for assigning causes of death from verbal autopsy data. Int J Epidemiol 1997, 26: 1080-1089. 10.1093/ije/26.5.1080
Quigley MA, Chandramohan D, Rodrigues LC: Diagnostic accuracy of physician review, expert algorithms and data-derived algorithms in adult verbal autopsies. Int J Epidemiol 1999, 28: 1081-1087. 10.1093/ije/28.6.1081
Boulle A, Chandramohan D, Weller P: A case study of using artificial neural networks for classifying cause of death from verbal autopsy. Int J Epidemiol 2001, 30: 515-520. 10.1093/ije/30.3.515
Byass P, Huong DL, Minh HV: A probabilistic approach to interpreting verbal autopsies: methodology and preliminary validation in Vietnam. Scand J Public Health 2003, 31: 32-37. 10.1080/14034950310015086
Byass P, Fottrell E, Huong DL, Berhane Y, Corrah T, Kahn K, Muhe L, Do DV: Refining a probabilistic model for interpreting verbal autopsy data. Scand J Public Health 2006, 34: 26-31. 10.1080/14034940510032202
Fantahun M, Fottrell E, Berhane Y, Wall S, Hogberg U, Byass P: Assessing a new approach to verbal autopsy interpretation in a rural Ethiopian community: the InterVA model. Bull World Health Organ 2006, 84: 204-210. 10.2471/BLT.05.028712
Murray CJL, Lopez AD, Feehan DM, Peter ST, Yang G: Validation of the symptom pattern method for analyzing verbal autopsy data. PLoS Med 2007, 4(11): e327. 10.1371/journal.pmed.0040327
Quigley MA, Chandramohan D, Setel P, Binka F, Rodrigues LC: Validity of data-derived algorithms for ascertaining causes of adult death in two African sites using verbal autopsy. Trop Med Int Health 2000, 5: 33-39. 10.1046/j.1365-3156.2000.00517.x
Kahn K, Tollman SM, Garenne M, Gear JSS: Validation and application of verbal autopsies in a rural area of South Africa. Trop Med Int Health 2000, 5: 824-831. 10.1046/j.1365-3156.2000.00638.x
Setel PW, Rao C, Hemed Y, Whiting DR, Yang G, Chandramohan D, Alberti K, Lopez AD: Core verbal autopsy procedures with comparative validation results from two countries. PLoS Med 2006, 3: e268. 10.1371/journal.pmed.0030268
Fottrell E, Byass P, Ouedraogo TW, Tamini C, Gbangou A, Sombie I, Hogberg U, Witten KH, Bhattacharya S, Desta T, Deganus S, Tornui J, Fitzmaurice AE, Meda N, Graham WJ: Revealing the burden of maternal mortality: a probabilistic model for determining pregnancy-related causes of death from verbal autopsies. Pop Health Metrics 2007, 5: 1. 10.1186/1478-7954-5-1
Bell JS, Ouedraogo M, Ganaba R, Sombie I, Byass P, Baggaley RF, Filippi V, Fitzmaurice AE, Graham WJ: The epidemiology of pregnancy outcomes in rural Burkina Faso. Trop Med Int Health 2008, 13(suppl 1): 31-43.
Bell J, Qomariyah SN: Immpact – Tools & Methods: Selected findings on maternal mortality. Presentation at Immpact International Symposium 'Delivering Safer Motherhood: Sharing the Evidence'; London 2007. [http://www.immpact-international.org/uploads/files/Session19_section3_Immpact_JBell_MM.pdf]
Lahti RA, Penttilä A: Cause-of-death query in validation of death certification by expert panel; effects on mortality statistics in Finland, 1995. Forensic Sci Int 2003, 131: 113-124. 10.1016/S0379-0738(02)00418-8
Byass P: Who needs cause-of-death data? PLoS Med 2007, 4(11): e333. 10.1371/journal.pmed.0040333
The authors would like to thank the families who participated and the Burkina Faso and Indonesian study groups for their assistance in the conduct of this evaluation. This work was undertaken as part of Immpact, funded by the Bill & Melinda Gates Foundation, the UK Department for International Development, the European Commission and USAID. Immpact is an international research programme which also provides technical assistance through its affiliate organisation, Ipact. The funders have no responsibility for the information provided or views expressed in this paper. The views expressed herein are solely those of the authors. We are grateful to Professor Wendy Graham for comments on an earlier draft.
The authors declare that they have no competing interests.
PB devised the study; all authors participated in field work; MO and SNQ coordinated work in Burkina Faso and Indonesia respectively and prepared datasets; PB & LD carried out analysis and drafting.
Electronic supplementary material
Additional File 1: Repeatability of VA indicators in Burkina Faso. Details of repeatability for each verbal autopsy indicator from a series of 91 repeated interviews in Burkina Faso (by ascending κ values within each category). (PDF 91 KB)
Additional File 2: Repeatability of VA indicators in Indonesia. Details of repeatability for each verbal autopsy indicator from a series of 116 repeated interviews in Indonesia (by ascending κ values within each category). (PDF 89 KB)
Cite this article
Byass, P., D'Ambruoso, L., Ouédraogo, M. et al. Assessing the repeatability of verbal autopsy for determining cause of death: two case studies among women of reproductive age in Burkina Faso and Indonesia. Popul Health Metrics 7, 6 (2009). https://doi.org/10.1186/1478-7954-7-6