Cross-national agreement on disability weights: the European Disability Weights Project

Background: Disability weights represent the relative severity of disease stages to be incorporated in summary measures of population health. The level of agreement on disability weights in Western European countries was investigated with different valuation methods.

Methods: Disability weights for fifteen disease stages were elicited empirically in panels of health care professionals or non-health care professionals with an academic background following a strictly standardised procedure. Three valuation methods were used: a visual analogue scale (VAS); the time trade-off technique (TTO); and the person trade-off technique (PTO). Agreement among England, France, the Netherlands, Spain, and Sweden on the three disability weight sets was analysed by means of an intraclass correlation coefficient (ICC) in the framework of generalisability theory. Agreement between the two types of panels was similarly assessed.

Results: A total of 232 participants were included. Similar rankings of disease stages across countries were found with all valuation methods. The ICC of country agreement on disability weights ranged from 0.56 [95% CI, 0.52–0.62] with PTO to 0.72 [0.70–0.74] with VAS and 0.72 [0.69–0.75] with TTO. The ICC of agreement between health care professionals and non-health care professionals ranged from 0.64 [0.58–0.68] with PTO to 0.73 [0.71–0.75] with VAS and 0.74 [0.72–0.77] with TTO.

Conclusions: Overall, the study supports a reasonably high level of agreement on disability weights in Western European countries with VAS and TTO methods, which focus on individual preferences, but a lower level of agreement with the PTO method, which focuses more on societal values in resource allocation.


Background
Summary measures of population health combine information on mortality and non-fatal health outcomes in order to represent the health of a particular population as a single measure [1]. They are used traditionally for comparative judgements of average levels of population health between populations and over time. Summary measures of population health were recently used with an explicit link to health resource allocation, e.g. disability-adjusted life expectancies (DALE) computed among other measures for the evaluation of the performance of health systems in the World Health Report 2000 [2], or disability-adjusted life years (DALY) for burden of disease estimates and cost-effectiveness analyses [3][4][5].
All summary measures of population health are built on three critical inputs: mortality by age, sex and condition; epidemiological data on non-fatal health outcomes by age, sex and condition; and valuations of health states (disability weights) that assess the relative severity of a year lived in a particular condition. Whereas mortality and epidemiological data may be seen as objective measures, even if scarcity and heterogeneity of data may compromise their accuracy, valuations of health states are undoubtedly subjective measures.
The lack of a gold standard for health state valuation has led to the development of various valuation methods [6]. The 1996 Global Burden of Disease study (GBD) represented a milestone in the development of summary measures of population health, as it established a single set of several hundred disability weights relating to 107 conditions using the same valuation method [7,8]. The choice of the specific values of an international panel of about ten health experts was supported by high correlations of their disability weights for 22 hypothetical indicator health states with those of eight panels from National Burden of Disease teams or World Health Organization (WHO) workshops on burden of disease methods [9]. Since then the assumption of cross-national agreement on disability weights has been further supported by studies using similar valuation protocols [10,11], whereas studies of agreement between different types of informants in health have shown contradictory results [12][13][14][15].
One of the primary objectives of the European Disability Weights (EDW) project was to assess the cross-national agreement on valuations of health states when elicited using different methods [16]. In the EDW study, a visual analogue scale (VAS) measured the severity of health states relative to the anchoring endpoints of the scale (worst and best imaginable health states). The time trade-off technique (TTO) measured the extent to which respondents would be willing to give up an amount of life time to avoid a hypothetical condition and be in full health, and the person trade-off technique (PTO) elicited directly the health decision maker's trade-off between severity of illness, the size of the health gain and the number of people helped [6]. Hypothetical health states were valued in panels of two possible informants in health, i.e. health care professionals and the general public with an academic background. We report here on the agreement of disability weights from five Western European countries (England, France, the Netherlands, Spain, and Sweden) using VAS, TTO and PTO.

Methods
The valuation of health states in the participating Western European countries followed a standardised protocol with back and forth translation from English for all valuation materials [16]. Key points of the valuation procedure were fixed to limit construct-irrelevant variance:
1. The scenarios to be valued were presented consistently in the form of a disease label, a brief clinical description of the disease stage, and a generic health state profile (EQ-5D extended with a cognitive dimension) [17][18][19];
2. Three valuation methods were used: visual analogue scale (VAS), time trade-off (TTO), and person trade-off (PTO);
3. A structured protocol which allowed for discussion and deliberation was followed in all panel sessions;
4. Panel sessions in each country were led by a trained facilitator from that country.
Facilitators were trained by the Dutch group, who had previous experience in valuation in panel sessions [11].
Two sources of variance in the valuation of health states were retained in our interrater reliability study of each valuation method: 1) the country; 2) the type of panel according to medical background of participants.

Disease stages selection and description
A list of diseases accounting for almost 80% of years of life lost due to premature mortality and 80% of years lived with disability in the Established Market Economies Region (including all Western European countries) was extracted from the Global Burden of Disease study [9]. Thirteen diseases were then selected to cover: 1. The main chapters from the ninth revision of the International Classification of Diseases, 2. Different dimensions of disability, 3. Very mild to very severe health states.
External health care professionals and public health experts participated in both the subdivision of selected diseases into homogenous disease stages with respect to functional status, treatment and prognosis, and the elaboration of a brief clinical description for each disease stage [16].
Fifteen disease stages were selected for the panel valuation procedure: the stages selected covered the full range of disease severity, from the common cold to a final year of an unspecified fatal disease. All selected disease stages were described on a separate sheet with the name of the disease, the position of the selected disease stage among the other stages, a brief clinical description and a health state profile defined using the EQ-5D descriptive system extended to include a cognitive dimension, i.e. EQ-5D+C [17][18][19]. The EQ-5D+C system has six dimensions (mobility, self-care, usual activities, pain/discomfort, anxiety/depression, cognition) each with three possible levels of severity (no problem, some problems, extreme problems). Consistency of profiles was checked across disease stages within diseases and across diseases. Figure 1 shows an example of a disease stage description.
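The structure of an EQ-5D+C health state profile can be illustrated with a short sketch. The numeric coding of levels (1 to 3), the function name and the example profile are assumptions made for illustration; only the six dimensions and the three level labels come from the description above.

```python
# Six EQ-5D+C dimensions and three severity levels, as listed in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
    "cognition",
)
# Assumed coding: 1 = no problem, 2 = some problems, 3 = extreme problems.
LEVELS = {1: "no problem", 2: "some problems", 3: "extreme problems"}


def describe_profile(profile):
    """Map a tuple of six level codes to a readable health state profile."""
    if len(profile) != len(DIMENSIONS):
        raise ValueError("a profile needs one level per dimension")
    for level in profile:
        if level not in LEVELS:
            raise ValueError(f"unknown severity level: {level!r}")
    return {dim: LEVELS[level] for dim, level in zip(DIMENSIONS, profile)}


# Number of distinct profiles the descriptive system can express: 3**6.
N_PROFILES = len(LEVELS) ** len(DIMENSIONS)

# Hypothetical profile, not one of the study's disease stages.
example = describe_profile((2, 1, 2, 1, 2, 3))
```

With three levels on six dimensions, the descriptive system distinguishes 729 possible profiles, which is why consistency checks across disease stages are feasible by inspection of the level codes.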

Valuation methods
Pilot studies conducted in participating countries tested innovative societal valuation methods [20] after the GBD societal valuation protocol had been criticized at an early stage of the project on ethical grounds [21,22]. Agreement on the valuation protocol was reached by consensus, and the three valuation methods are described below in the order of their use in panels. In VAS, all fifteen disease stages were valued; in PTO and TTO the nine chronic disease stages were valued.
In the self-administered VAS, participants were asked to consider the consequences of living with the disease stage for one year. The disease stages were first ranked by decreasing severity, and then scored on a vertical thermometer graded from 0 (the worst imaginable health state) to 100 (the best imaginable health state). The best and the worst disease stages were scored first.
In the PTO, panel participants played the role of decision-makers in their country prioritising between two preventive programmes. Several assumptions about the programmes were made explicit in the panel sessions:
- Prevention means the reduction of occurrence in two to four years; programmes are of the same costs and otherwise equal (e.g. age, sex, socio-economic status of groups);
- Both programmes include people of various ages;
- Loss of production for society and burden on family or caretakers were to be disregarded in decisions.
The PTO session began with the following example: "Programme A prevents the occurrence of a rapidly fatal disease in 100 people in your country in 2 to 4 years' time. The identity of these people is unknown. With the programme they will live in normal health for a normal lifetime. Programme B prevents the occurrence of severe vision disorder in a number of people in your country in 2 to 4 years' time. The identity of these people is unknown. With the programme they will avoid the state and live in normal health for a normal lifetime." Participants determined the number of people in programme B at which they were indifferent between the two programmes with the aid of a visual prop that displayed a stepwise procedure increasing the numbers in programme B (100, 200, 1000, 10 000, etc.). Indifference numbers lower than 100 were also allowed [21]. After the example, participants had to prioritise between the prevention of a rapidly fatal disease and quadriplegia, and then between each of the eight chronic disease stages on the one hand, and quadriplegia on the other. Quadriplegia was thus used as an anchoring state, linking the valuation of chronic states to death. After initial individual valuations, discussion was structured among panel participants by the facilitator, who ensured that participants understood and were aware of the implications of their choices. Following discussion, panel members had the opportunity to change their responses if they so wished.

Figure 1. Example of a disease stage description as presented for valuation. The disease stage description included a disease label (dementia), the disease stage to be valued (marked by the arrow), a textual description (in bold) and a generic description of the functional health status (using the EQ-5D+C descriptive system, which has 3 severity levels per attribute; one dot indicates the 'moderate' level).

In TTO, panel participants had to imagine someone like themselves in full health, and choose between living their remaining 10 years of life in the chronic disease stage or less time in full health. The number of years at which the panel participants were indifferent was found using a "ping-pong" procedure, but participants were allowed not to trade off any years of life [23]. The facilitator again ensured that panel participants fully understood the task.
Finally, participants had the opportunity to reconsider their responses after discussion in the panel and were encouraged to compare individual rankings of all disease stages for all three valuation methods and make changes to any responses if they so wished.

As implied in the last two equations, the PTO used quadriplegia rather than death as the anchoring state. In cases in which the DW exceeded 1 due to the chained procedure in the PTO (i.e. when participants valued quadriplegia worse than death), the DW was truncated to 1. The proportion of participants who valued quadriplegia worse than death was recorded as well as changes in PTO and TTO numbers after panel discussions and consistency checks across valuation methods.
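The derivation of disability weights (DW) from the three valuation methods can be sketched as follows. This is a minimal illustration under assumed, conventional transformations rather than the study's own formulas: DW = 1 - score/100 for VAS; DW = 1 - x/10 for TTO, where x is the number of years in full health judged equivalent to 10 years in the disease stage; and a chained PTO in which the weight of quadriplegia is obtained against the rapidly fatal disease and the weight of each chronic stage against quadriplegia, with a reference group of 100 people assumed in both steps. Function and parameter names are hypothetical, and applying the truncation to the final weight only is also an assumption. The truncation to 1 itself is taken from the text.

```python
def dw_vas(score):
    """VAS score from 0 (worst) to 100 (best) -> DW (assumed linear rescaling)."""
    if not 0 <= score <= 100:
        raise ValueError("VAS score must lie between 0 and 100")
    return 1.0 - score / 100.0


def dw_tto(years_full_health, horizon=10.0):
    """Years in full health equivalent to `horizon` years in the stage -> DW.

    Not trading any years (years_full_health == horizon) gives DW = 0.
    """
    if not 0 <= years_full_health <= horizon:
        raise ValueError("traded time must lie within the time horizon")
    return 1.0 - years_full_health / horizon


def dw_pto_chained(n_quadriplegia, n_stage=None, reference=100.0):
    """Chained PTO weight, truncated to 1 (assumed formulation).

    n_quadriplegia: indifference number for preventing quadriplegia versus
        preventing a rapidly fatal disease in `reference` people.
    n_stage: indifference number for preventing the chronic stage versus
        preventing quadriplegia in `reference` people; None returns the
        weight of quadriplegia itself.
    """
    if n_quadriplegia <= 0 or (n_stage is not None and n_stage <= 0):
        raise ValueError("indifference numbers must be positive")
    dw = reference / n_quadriplegia          # quadriplegia relative to death
    if n_stage is not None:
        dw = dw * reference / n_stage        # stage relative to quadriplegia
    return min(dw, 1.0)                      # truncation described in the text
```

Under these assumptions, an indifference number below 100 for quadriplegia means that quadriplegia is valued as worse than death, which is the case in which the chained weight can exceed 1 before truncation.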

Statistical analyses
Rankings of the disease stages based on mean DW computed from VAS, PTO and TTO were compared across countries and health care professional status with Spearman's rank correlation coefficients. In a random effect model, variance components of disability weight were estimated for the random effects identified in this study. Maximum likelihood estimates of the variance components were used to compute the proportion of total variance accounted for by each random effect. For our interrater reliability study, two intraclass correlation coefficients were computed according to generalisability theory, which is a specific application of analysis of variance [24]. A first intraclass correlation coefficient measured the agreement between countries on disability weights deduced from VAS, TTO and PTO. Its numerator includes the variance components of all random effects on disability weights other than the country-related effects and the residual term; the country-related effects and the residual term are added in the denominator. The closer to unity the intraclass correlation coefficient, the better the agreement of countries on valuations of disease stages. With a comparable design, a second intraclass correlation coefficient was computed to measure the agreement between the two types of panel for all valuation methods. Non-parametric bootstrap resampling techniques were used to compute 95% confidence intervals [25], since the complex design of our interrater reliability study did not allow simple computations [26]. One hundred independent bootstrap samples were drawn from the individual data within each country and panel type. Significance was examined at the 5% level. All analyses were undertaken using SAS version 8.0 (SAS Institute, Cary NC).
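The intraclass correlation coefficient and its bootstrap confidence interval can be sketched as below. This is an illustration in Python, not the SAS procedure used in the study. It assumes that the variance components have already been estimated and are passed in as a dictionary; the component names in the usage example are hypothetical. The percentile interval and the reading of the resampling scheme as stratified by country and panel type are also assumptions.

```python
import random
from collections import defaultdict


def icc(components, excluded):
    """Ratio of variance components.

    components: mapping of effect name -> estimated variance component.
    excluded: names of the effects that count against agreement (e.g. the
        country-related effects and the residual); they enter the
        denominator only.
    """
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    agreeing = sum(v for name, v in components.items() if name not in excluded)
    return agreeing / total


def stratified_bootstrap_ci(records, stratum_of, statistic,
                            n_boot=100, alpha=0.05, seed=0):
    """Approximate percentile CI from resampling with replacement within strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    estimates = []
    for _ in range(n_boot):
        sample = []
        for members in strata.values():
            # Each stratum keeps its original size in every bootstrap sample.
            sample.extend(rng.choices(members, k=len(members)))
        estimates.append(statistic(sample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * (n_boot - 1))]
    upper = estimates[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return lower, upper


# Hypothetical variance components, for illustration only.
example_icc = icc(
    {"stage": 0.6, "subject": 0.1, "country": 0.1, "residual": 0.2},
    excluded={"country", "residual"},
)
```

In a full analysis the `statistic` callback would refit the random effect model on each bootstrap sample and return the resulting coefficient; here it is left generic so that the resampling logic can be read on its own.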
No general statement about the desired level of the reliability coefficient of a test can be made, because the purpose for which the test is used must always be taken into account [24]. When tests are intended for important decisions at the individual level, e.g. admission for or discontinuation of a clinical treatment, a reliability coefficient greater than or equal to 0.90 may be considered as "good." When tests are intended for less important decisions at the individual level, e.g. evaluation of treatment outcome, a reliability coefficient greater than or equal to 0.80 may be considered as "good." In our particular case, where valuation methods were intended for research at the group level, a reliability coefficient greater than or equal to 0.70 may be considered as "good," between 0.60 and 0.70 as sufficient, and less than 0.60 as insufficient [27,28].
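The thresholds for research at the group level can be written as a small helper. The function name is hypothetical; the cut-offs are those stated above, with a coefficient of exactly 0.70 counted as "good" and exactly 0.60 as "sufficient".

```python
def classify_reliability(coefficient):
    """Label a reliability coefficient for group-level research use."""
    if coefficient >= 0.70:
        return "good"
    if coefficient >= 0.60:
        return "sufficient"
    return "insufficient"


# Coefficients reported in the abstract for agreement between countries.
labels = {method: classify_reliability(value)
          for method, value in {"VAS": 0.72, "TTO": 0.72, "PTO": 0.56}.items()}
```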

Results
A total of 232 participants from England, France, the Netherlands, Spain and Sweden were included in 13 panels of health care professionals and 10 panels of non-health care professionals. Overall, 60% of subjects were females, and the mean age was 40.4 years, with a standard deviation of 15

At least one disease stage description was questioned in panels of either non-health care professionals (8 out of 10) or health care professionals (10 out of 13). Similar proportions of panels of non-health care professionals and health care professionals also reported difficulties with prognosis of some disease stages in TTO (74%), the initial example of PTO (35%), and the PTO valuation method overall (17%).

Discussion
A total of 232 participants from five western European countries valued disease stages in health care professional and non-health care professional panels. Overall we found a very similar ranking of disease stages across countries irrespective of the valuation method used. This confirms previous findings based on the valuation of seventeen health conditions, either with VAS through individual interviews of about fifteen key informants in fourteen countries from different regions [29], or with PTO in the GBD study and recent refinements [9,10]. Similar rankings of disability weights are not enough, however, to judge the appropriateness of a universal disability weight set used at a cardinal level in summary measures of population health.
We found that intraclass correlation coefficients measuring agreement between countries were good with VAS and TTO. At first glance, this finding may appear at odds with cross-national comparisons of disability weights focusing on disease conditions separately. Other studies eliciting values for EQ-5D health states with TTO from the general public in the United Kingdom and Spain [30], or in the United Kingdom and Japan [31], showed a high positive correlation of values between countries, but significant differences in values were found for a number of health states. Whereas a great variability in the valuation of health states is observed within countries [32], the previous approach does not allow one to disentangle systematic differences in valuation between subjects and between countries. As shown within the framework of generalisability theory, the subject effect accounted for more variance of disability weights than the country effect for all valuation methods.
In the case of the PTO method, the intraclass correlation coefficient measuring agreement between countries was insufficient. PTO elicited directly health decision-makers' trade-offs between preventive programmes and attempted to get societal preferences between disease stages. Whether respondents actually took a societal view in PTO questions (as opposed to an individual view in VAS and TTO) was not confirmed directly, e.g. through follow-up interviews, and is certainly worthy of further research. The PTO method demonstrated a dramatic increase in the systematic effects related to subjects and countries as compared to VAS and TTO. This might be related to different views regarding equity across European people [33].
We found that the agreement between people of similar academic background but of different medical background was good with VAS and TTO, and sufficient with PTO. This confirms results of an earlier study in the Netherlands [11]. However, agreement between possible informants in health, e.g. individuals in health states, patients' families, health care professionals and the general public, showed contradictory results in Western countries using the TTO method [12][13][14] or VAS [15]. In the absence of clear agreement between possible informants on disability weights, the United States Panel on Cost-effectiveness in Health and Medicine stated that the general public preferences on health conditions should be used to inform health care resource allocation [34]. Further research should assess differences in valuations between representative samples of the general public and the more educated and homogeneous groups used in this study. In particular, academics who volunteered to participate in such a time-consuming enterprise may represent a biased, highly literate sample of the population of the country. Academics and medical doctors are in many cases exposed to a similar global intellectual culture, which might override the national culture in intellectual matters. This may be especially true in the context of developing countries where highly educated people may have values at odds with those of the general public [35].
The design of our valuation methods may limit comparison with other studies. Framing and anchoring effects were likely to have been present with all three valuation methods. Among other framing effects, VAS scores are prone to sequencing effects (i.e. the worst and the best disease stages were scored first in this study), and to the range of health states considered [23]. The anchoring of the TTO in a ten-year time frame was fixed for all participants to ensure comparability of results. However, TTO disability weights for most disease stages decreased with the age of these relatively young participants, with older people less willing to give up an amount of life time to avoid a health condition than younger people (data not shown). This may have been of particular relevance in the cross-national comparisons focusing on disease conditions separately, since age patterns differed between participating countries.

Pilot studies resulted in a "chained PTO" to limit the "rule of rescue" encapsulated by the technique, i.e. valuations take into account the initial disease severity of the programmes' recipients, in particular in lifesaving programmes. Quadriplegia as the anchoring state had various consequences at the country level. Firstly, 43% of participants thought that the prevention of quadriplegia should receive a higher or equal priority to that of a lifesaving programme. This finding was not related to age of participant but differed significantly across countries, from 23% in Sweden to 25% in Spain, 37% in France, 56% in England, and 64% in the Netherlands. Secondly, 24% and 5% of participants thought that the prevention of severe depression and stroke, respectively, should receive a higher or equal priority to that of quadriplegia, with significant variation across countries (from 4% in Spain to 12% in France, 20% in Sweden, 28% in England, and 50% in the Netherlands in the case of severe depression).
The PTO-DW of disease stages worse than the PTO reference programme was truncated to 1 to allow face validity across different valuation metrics. This truncation means that there was no differentiation between the different very severe states, and that, as all responses were recoded as 1, the level of agreement could obviously be higher than if ranking of these states was also done. In addition, participants were encouraged to compare their individual rankings of the nine chronic disease stages for all three valuation methods at the end of panel sessions. Spearman's rank correlations between valuation methods increased significantly (in paired t-tests) at the individual level after these "consistency" checks, by 0.013 (±

Another limitation of this study is related to the validity of our valuation protocol. Despite great care being taken to ensure the face validity of disease stages, at least one disease stage was questioned in a majority of panels of both health care and non-health care professionals. For instance, discrepancies between the brief clinical description of spinal cord injuries resulting in quadriplegia and its generic health state profile were often noted. Difficulties were also encountered with TTO and PTO methods in spite of the deliberative panel process led by a facilitator and the high level of education of the participants. If we are to collect values from the general public as recommended, then we need to put more effort into ensuring that valuation methods are understood as intended by respondents. Discussion in panel sessions had a considerable impact on individual PTO valuations (see Table 2), which underlines that the collection of societal values at the individual level without discussion may hamper its face validity. Although discussion increased the level of agreement on DW computed from PTO in this study, differences across countries and health care professional status were still striking.

Conclusions
This study supports a reasonably high level of agreement on disability weights in Western European countries with VAS and TTO methods, but a lower level of agreement with the PTO method. This study showed that even within a relatively homogenous and wealthy region, and with a PTO valuation protocol that may inflate the level of agreement across countries, the agreement on disability weights was insufficient when a societal perspective was taken into account, i.e. when the summary measure of population health was considered explicitly within the context of health care resource allocation. Accordingly, this study casts some doubts on the generalisability of the disability weights computed from PTO used in the Global Burden of Disease study, although PTO protocols differed. For any valuation method, the level of agreement on disability weights requires further evidence in larger and more representative samples of the general public within and across regions, as defined by countries' location and possibly by similarities in mortality patterns and cost structures.
However, uncertainty surrounding disability weights may be considered small when compared to the lack of epidemiological data in many areas of the world to compare summary measures of population health across countries, as in the World Health Report 2000 [36]. In the European

* Two types of panel, either with health care professionals or not
** 95% bootstrap confidence intervals estimated on 100 independent random samples dependent on country and panel type of the 232 individuals