The episodic random utility model unifies time trade-off and discrete choice approaches in health state valuation

Background: To present an episodic random utility model that unifies time trade-off and discrete choice approaches in health state valuation. Methods: First, we introduce two alternative random utility models (RUMs) for health preferences: the episodic RUM and the more common instant RUM. For the interpretation of time trade-off (TTO) responses, we show that the episodic model implies a coefficient estimator, and the instant model implies a mean slope estimator. Secondly, we demonstrate these estimators and the differences between the estimates for 42 health states using TTO responses from the seminal Measurement and Valuation in Health (MVH) study conducted in the United Kingdom. Mean slopes are estimated with and without Dolan's transformation of worse-than-death (WTD) responses. Finally, we demonstrate an exploded probit estimator, an extension of the coefficient estimator for discrete choice data that accommodates both TTO and rank responses. Results: By construction, mean slopes are less than or equal to coefficients, because slopes are fractions and, therefore, magnify downward errors in WTD responses. The Dolan transformation of WTD responses causes mean slopes to increase in similarity to coefficient estimates, yet they are not equivalent (i.e., mean absolute difference = 0.179). Unlike mean slopes, coefficient estimates demonstrate strong concordance with rank-based predictions (Lin's rho = 0.91). Combining TTO and rank responses under the exploded probit model improves the identification of health state values, decreasing the average width of confidence intervals from 0.057 to 0.041 compared to TTO-only results. Conclusion: The episodic RUM expands upon the theoretical framework underlying health state valuation and contributes to health econometrics by motivating the selection of coefficient and exploded probit estimators for the analysis of TTO and rank responses. In future MVH surveys, sample size requirements may be reduced through the incorporation of multiple responses under a single estimator.


Background
Health state valuation studies using the time trade-off (TTO) approach lack a sound theoretical framework for the incorporation of worse than death (WTD) responses. Furthermore, TTO responses may be considered a form of discrete choice (i.e., expressions of a tie between two alternative scenarios); yet, no valuation study has applied discrete choice estimators to TTO data. In this paper, we introduce an episodic random utility model (RUM) and two novel estimators for health state valuation. We show that the assumption of the episodic RUM theoretically and econometrically unifies TTO and other discrete choice approaches.
Estimating the value assigned to an episode of health is the main purpose of the EuroQol Group, and the focus of this paper. In Figure 1, the solid line represents the accumulation of quality-adjusted life years (QALYs) for a person over time. The slope of the line represents the instantaneous value of health at each point in time. For example, the kink in the line suggests that at around the third year, the person's health was poor, as represented by a shallow slope; however, her health improved, and after ten years, she accumulated around seven QALYs. In other words, the value assigned to her decade-long health episode is seven.
As described by Torrance in 1982, values of better than dead (BTD) states are bounded by the values of optimal health (1.00) and dead (0), whereas values of WTD states are unbounded below and may be as low as minus infinity [1]. In Figure 1, a person's "spaghetti" line may lie anywhere between the dotted lines, but the slope of the spaghetti line must remain between one and minus infinity. The potential for an infinitely negative slope poses a fundamental challenge in the estimation of QALYs using TTO, standard gamble (SG), person trade-off (PTO), or any other discrete-choice approach. In TTO, the conventional approach to QALY estimation entails an average of positive and negative slopes (i.e., mean slope estimator); a similar process is applied in SG and PTO. An often noted problem is that the influence of negative slopes can be so massive (e.g., -39 in the MVH study) that the mean slopes appear much too low, well outside the range that has face validity within the QALY concept.
Confronted with this threat to face validity, researchers typically manipulate WTD response data, arbitrarily increasing the negative slopes and imposing an ad hoc boundary of negative one on the slopes. The boundary of negative one reduces the influence of negative slopes on the mean slope and gives an appealing mirror image of the valuation space above zero. Nevertheless, critics from early on have warned that there is no theoretical justification for the value of negative one, which means the truncated scale may not represent 'utility' [2]. Changing data to improve face validity is generally frowned upon, even in the case of outliers.
A similar health econometrics discussion has taken place on cost analyses, revealing that the transformation of positive outliers has a large effect on the mean cost per patient. At the 2008 meeting of the American Society of Health Economists, John Mullahy compared the role of a health econometrician to that of an anatomist, dissecting data in an Aristotelian fashion [3]. In his lecture, "Anatomy of Healthcare Cost Distributions," he dismantled the thick upper tail of a common cost distribution and discussed its possible interpretations. Likewise, health state valuation studies continuously re-examine the theoretical framework that guides estimator selection and the best approach to address results with poor face validity.
Figure 1. QALY Space: Accumulation of Quality-Adjusted Life Years over Time
In pursuit of a justification for this ad hoc transformation, studies report that respondents find it more difficult and make more errors estimating negative values than estimating positive values, especially in TTO tasks [4]. These psychometric complications are reflected in the high variance of negative values, the low discriminating power of negative values, and the discontinuity in the scale around the value of death, otherwise known as the 'gap-effect' [5][6][7]. While the evidence on the influence of state-specific heteroskedasticity is mounting, there is not yet a clear and coherent framework for combining BTD and WTD TTO responses.
Recently, there has been considerable interest in estimating health state values suitable for QALY calculations from ranking exercises [4,5,8,9]. Ranking is seen as a relatively easy valuation method, like the visual analogue scale (VAS), and has been shown to render predictions that are concordant with (if not identical to) VAS predictions [5]. The advantage of ranking versus VAS is a well-developed theoretical foundation in Item Response Theory without the response spreading and context effects associated with VAS [10]. Unlike VAS, ranking is a choice-based approach, which provides a basis for its merger with economically oriented choice-based methods, like TTO and SG. A drawback for both ranking and VAS is their unclear relation to health state values on the QALY scale, a relation which is better described for TTO and SG. A theoretically driven model that reduces the difference between a psychometrically strong method (e.g., ranking) and a method with a strong link to utility theory (e.g., TTO) has the potential to revolutionize the field of health state valuation. This model would increase the 'convergent validity' of related psychometric and econometric methods, and therefore, enhance the 'construct validity' of these methods [11]. In the absence of a 'gold standard' in health state valuation, such an increase in convergent validity would advance our understanding regarding the latent construct of quality of life and its assessment. Furthermore, if a model reduces dependence on arbitrary deviations from utility theory, such as by removing the need for ad hoc corrections of WTD responses in the QALY paradigm, the model would promote face validity. Lastly, such a model might further improve upon the validity of QALYs by integrating the benefits of psychometric and econometric methods under a single statistical estimator.
In this paper, we introduce an episodic random utility model (RUM) as such a theoretical framework. This model not only allows comparisons between rank and TTO predictions within a common estimator, but also resolves key econometric and psychometric issues that inhibit TTO-based valuation. In introducing this model, the difficulties with the face validity of WTD responses are addressed in a way that is theoretically coherent for the fields of economics and psychometrics and that improves the convergent validity between TTO and rank-based predictions. For purposes of illustration, the conventional and episodic RUMs are estimated using the Measurement and Valuation of Health (MVH) study data from the United Kingdom (UK) [12][13][14].

Episodic and Instant Random Utility Models (RUMs)
The utility of a health state, j, over time, t, for an individual, i, is random and may be represented by either:
$$U_{ij}(t) = \mu_j t + \varepsilon_{ij} \quad \text{(episodic RUM)} \qquad \text{or} \qquad U_{ij}(t) = (\mu_j + \varepsilon_{ij})\, t \quad \text{(instant RUM)} \qquad (1)$$
In the episodic RUM, the error, ε_ij, represents variability in the value of an episode. For example, Figure 1 has time on the x-axis and utility on the y-axis, so the error would be distributed vertically along the y-axis. The second model is an instant RUM, which suggests a random slope. Its error, ε_ij, represents variability in the value of an instantaneous state, not the episode. The instant RUM is the theoretical basis underlying the mean slope estimator, the conventional approach to health state valuation studies.
The instant RUM would be equivalent to an episodic RUM if we were to assume that the magnitude of error is proportional to the duration of the episode. In other words, more time in state j coincides with more error in the valuation. However, each model assumes that its own errors have equal variances, so the two models are not equivalent. This difference is subtle, but highly influential in cases where there are WTD TTO responses. In WTD responses, the respondent's choice of time in optimal health changes the amount of time in state j, thereby changing the amount of error under the instant RUM. For example, if the respondent equates the state to "immediate death" (t = 0), according to the instant RUM, there is no error in this response.
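The contrast between the two error structures can be sketched in a few lines of Python. This is only an illustration under an assumed functional form: the episodic RUM is taken to add the error once to the value of the episode (mu_j * t + eps), and the instant RUM is taken to add it to the slope ((mu_j + eps) * t). The function names and the numeric values are hypothetical.

```python
def episodic_utility(mu_j, t, eps):
    """Episodic RUM (assumed form): the error is added once to the whole episode."""
    return mu_j * t + eps


def instant_utility(mu_j, t, eps):
    """Instant RUM (assumed form): the error perturbs the slope, so it scales with t."""
    return (mu_j + eps) * t


mu_j, eps = 0.6, 0.1  # hypothetical state value and error draw
for t in (0.0, 2.5, 10.0):
    episodic_error = episodic_utility(mu_j, t, eps) - mu_j * t
    instant_error = instant_utility(mu_j, t, eps) - mu_j * t
    print(f"t={t:>4}: episodic error={episodic_error:.3f}, instant error={instant_error:.3f}")
```

With zero time in state j, the instant form leaves no room for error at all, while the episodic form keeps the same error regardless of duration.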
Both models assume that the utility of dead for any duration is zero (i.e., U_dead(t) = 0), and the utility of optimal health for any duration equals the duration (i.e., U_optimal(t) = t) [5]. Both models assume constant proportionality: the expected utility of a health state is proportional to its duration, t, and the expected error is zero. State-specific components and errors may depend on the duration (e.g., μ_j(t)) [15,16]; however, questions concerning duration effects in health state valuation are outside the scope of this paper and left to be examined in future work.

Interpretations of TTO responses
As part of the TTO task in the MVH study, respondents provide either a BTD or WTD response for each hypothetical health state, j; however, the interpretation of the responses depends on the RUM. If ten years in a health state, j, is better than "immediate death," the respondent determines the duration in optimal health, t_1, such that:
$$U_{ij}(10) = U_{\text{optimal}}(t_1) = t_1 \qquad (2)$$
The interpretation of the BTD response, t_1, is for all intents and purposes equivalent under episodic and instant RUMs, because the amount of time in state j is equal regardless of response (i.e., ten years).
Population Health Metrics 2009, 7:3 http://www.pophealthmetrics.com/content/7/1/3
On the other hand, if ten years in the health state, j, is worse than dead (WTD), the respondent determines the duration in optimal health, t_2, such that:
$$U_{ij}(10 - t_2) + U_{\text{optimal}}(t_2) = U_{\text{dead}} = 0 \qquad (3)$$
The interpretation of the WTD response, t_2, differs greatly between the episodic and instant RUM estimates.
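As a hedged illustration of how the two response types can be read, the sketch below converts a TTO response into a pair (y, x) with y = μ_j · x at indifference, following the definitions used later for the exploded probit (y = t_1 and x = 10 if BTD; y = -t_2 and x = 10 - t_2 if WTD). The function names are hypothetical, and the WTD reading assumes that t_2 years in optimal health combined with (10 - t_2) years in state j is valued the same as immediate death.

```python
def tto_to_xy(response_type, t, horizon=10.0):
    """Map a TTO response to (y, x) so that y = mu_j * x at indifference.

    BTD: t years in optimal health equals `horizon` years in state j.
    WTD (assumed task): t years in optimal health plus (horizon - t) years
    in state j is equivalent to immediate death.
    """
    if response_type == "BTD":
        if not 0.0 <= t <= horizon:
            raise ValueError("BTD response must lie in [0, horizon]")
        return t, horizon
    if response_type == "WTD":
        if not 0.0 <= t < horizon:
            raise ValueError("WTD response must lie in [0, horizon)")
        return -t, horizon - t
    raise ValueError("response_type must be 'BTD' or 'WTD'")


def implied_slope(response_type, t, horizon=10.0):
    """Slope y / x implied by a single response (the instant RUM reading)."""
    y, x = tto_to_xy(response_type, t, horizon)
    return y / x


print(implied_slope("BTD", 7.0))   # 7 of 10 years: slope 0.7
print(implied_slope("WTD", 9.75))  # a quarter year left in state j: slope -39.0
```

Under this reading, a WTD response of 9.75 years is one way to arrive at a slope of -39, because only 0.25 years in state j remain in the denominator.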

RUM Estimators
The purpose of a RUM estimator is to find the estimate for the state-specific component, μ_j, that best fits the sample of responses (i.e., minimizes the error). RUMs do not imply a specific error distribution. Errors have expectation zero, are uncorrelated, and have equal variances. Under these three assumptions, the Gauss-Markov theorem states that the best (i.e., minimum-variance) linear unbiased estimator of the state-specific component, μ_j, is a mean slope under the instant RUM approach and a coefficient under the episodic RUM [17].
Both estimators are non-parametric, and they are equivalent if the sample only includes BTD responses, t 1 . The instant RUM estimator is a mean slope (Figure 1), and because slopes can be exceptionally negative (e.g., -39), the mean slope estimator is not robust to small changes in the error term. The episodic RUM estimator is a fraction of weighted sums, creating additional stability.
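A minimal sketch of the two estimators follows, assuming responses are coded as (y, x) pairs with y = t_1, x = 10 for BTD and y = -t_2, x = 10 - t_2 for WTD. The coefficient is taken here to be the least-squares slope of a regression through the origin, Σxy / Σx², which is one reading of "a fraction of weighted sums"; the data are invented for illustration.

```python
def mean_slope(pairs):
    """Instant RUM estimator: the average of the individual slopes y / x."""
    return sum(y / x for y, x in pairs) / len(pairs)


def coefficient(pairs):
    """Episodic RUM estimator (assumed form): least squares through the origin."""
    return sum(x * y for y, x in pairs) / sum(x * x for y, x in pairs)


# Hypothetical responses for one health state, coded as (y, x).
btd_only = [(8.0, 10.0), (6.0, 10.0), (7.0, 10.0)]
with_wtd = btd_only + [(-9.75, 0.25)]  # one extreme WTD response

print(mean_slope(btd_only), coefficient(btd_only))  # both close to 0.7
print(mean_slope(with_wtd), coefficient(with_wtd))  # roughly -9.2 versus 0.69
```

In the BTD-only sample every x equals 10, so the two estimators coincide. A single WTD response with little time left in state j drags the mean slope far below zero, while its small x gives it little weight in the coefficient.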
Beginning in the mid-1990s, the field of economic evaluations faced a similar choice between estimators [18,19]. The emergence of patient-level data led to the question of whether to use the mean ratio (i.e., mean slope) or the mean cost over the mean effectiveness as the estimator of the incremental cost-effectiveness ratio (ICER). As in our case, if incremental effectiveness approaches zero for any patient, the patient's ICER blows up together with the mean. As such, ratio statistics are not widely used in cost-effectiveness research.
A parallel argument in favor of the coefficient estimator comes from psychometrics. The coefficient estimator is motivated by economic theory (i.e., episodic RUM). However, measurement theory also implies the same estimator with a slightly different interpretation: when respondents provide the amount of time in optimal health, they may respond with some error (t + ε) [19]. The coefficient estimator accommodates such response error.
Nevertheless, the mean slope estimator is the conventional approach to health state valuation studies using discrete choice methods (i.e., TTO, SG, PTO, etc.). In an effort to improve the face validity of instant RUM predictions, Dolan replaced the negative slopes with -t_2/10, while Shaw and colleagues divided the negative slopes by a constant (i.e., 39) [12,20]. Each transformation attenuates the magnifying effects in the slopes by bounding them to be greater than negative one. In the economic evaluation analogy, Dolan's transformation is like changing the incremental effectiveness to the maximum, 10 years, when the patient's ICER is negative. By construction, the Dolan approach will produce estimates greater than the unadjusted mean slope, but less than the coefficient if there are any WTD responses (mathematical proof available upon request). These arbitrary manipulations are not nested within either the instant or episodic RUMs, or within any other utility or psychometric theory [2].
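The two transformations described above can be sketched as follows. The code assumes that the unadjusted WTD slope is -t_2/(10 - t_2), that Dolan's version replaces it with -t_2/10, and that the Shaw version divides negative slopes by 39; the function names and the four responses are hypothetical.

```python
def unadjusted_slope(kind, t, horizon=10.0):
    """Raw slope: t/10 for BTD, -t/(10 - t) for WTD (assumed coding)."""
    return t / horizon if kind == "BTD" else -t / (horizon - t)


def dolan_slope(kind, t, horizon=10.0):
    """Dolan-style transformation: WTD slopes become -t/10."""
    return t / horizon if kind == "BTD" else -t / horizon


def shaw_slope(kind, t, horizon=10.0, divisor=39.0):
    """Shaw-style transformation: negative slopes are divided by a constant."""
    slope = unadjusted_slope(kind, t, horizon)
    return slope / divisor if slope < 0 else slope


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical responses for one state: (response type, years in optimal health).
responses = [("BTD", 8.0), ("BTD", 6.0), ("BTD", 7.0), ("WTD", 9.75)]
for name, fn in (("unadjusted", unadjusted_slope),
                 ("Dolan", dolan_slope),
                 ("Shaw", shaw_slope)):
    print(name, round(mean(fn(kind, t) for kind, t in responses), 3))
```

Both transformations keep every individual slope at or above negative one, which is what pulls the adjusted means back toward a plausible range.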

Mixing TTO, Rank, and RUM
While TTO estimation does not require further specification to produce consistent results, it may be more efficient to assume the errors are normally distributed. This assumption allows for maximum likelihood estimation and, more importantly, the merger of rank and TTO responses under a single estimator.
Craig, Busschbach, and Salomon demonstrated that ranks can be decomposed into a series of pair-wise comparisons for rank-based health state valuation using an exploded probit model [5,17]. Because hypothetical states continue for ten years (i.e., t does not vary) in all EQ-5D rank responses within the UK MVH-protocol, their model estimates agree with either the episodic or instant RUMs. Under the assumption of independent, normally distributed errors, the probability of dominance for each pair-wise comparison in the rank responses is represented by:
$$P\big(U_{ij}(10) > U_{ik}(10)\big) = P\big(10\mu_j + \varepsilon_{ij} > 10\mu_k + \varepsilon_{ik}\big) = \Phi\!\left(\frac{10(\mu_j - \mu_k)}{\sqrt{\sigma_j^2 + \sigma_k^2}}\right) \qquad (5)$$
where Φ is the standard normal cumulative distribution function and σ_j and σ_k are the state-specific standard deviations of the errors. The exploded probit estimation can predict the state-specific components and variances. While health states clearly have different expected utilities, differences in variances (i.e., σ_j ≠ σ_k) have little effect on the predicted values as demonstrated by Craig, Busschbach, and Salomon [5]. Therefore, in this article, we estimated a homoskedastic probit model using rank responses and predicted values for 42 health states on the QALY scale with fixed anchors for comparison. TTO responses were used to predict the OLS episodic RUM.
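The decomposition of a ranking into pair-wise comparisons, and the probit probability attached to each pair, can be sketched as below. This is an illustration rather than the authors' implementation: it assumes a ranking without ties, independent normal errors with state-specific standard deviations, and a common ten-year duration, so that the probability of preferring state j over state k is Φ(10(μ_j - μ_k)/√(σ_j² + σ_k²)). All names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def explode_ranking(ranking):
    """Turn a ranking (best first, no ties) into (preferred, other) pairs."""
    # combinations() keeps list order, so the first element of each pair
    # is always the state ranked higher.
    return list(combinations(ranking, 2))


def prob_preferred(mu_j, mu_k, sigma_j=1.0, sigma_k=1.0, duration=10.0):
    """P(state j is preferred to state k) under independent normal errors."""
    z = duration * (mu_j - mu_k) / sqrt(sigma_j ** 2 + sigma_k ** 2)
    return NormalDist().cdf(z)


pairs = explode_ranking(["11111", "11211", "22222", "33333"])
print(len(pairs))                # 4 states give 6 pair-wise comparisons
print(prob_preferred(0.8, 0.6))  # the higher-valued state is more likely preferred
```

A likelihood for the full ranking would multiply (or, in logs, sum) these pair-wise probabilities over the exploded pairs.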
The rank-based estimator (equation 5) required slight modification (equation 6) to incorporate both the episodic RUM and TTO responses. In rank responses, ties occur when a respondent considers two or more states to be equivalent. TTO responses can also be described as equivalences between two hypothetical scenarios (equations 2 and 3). Therefore, TTO responses can be incorporated into the same exploded probit model using Efron's method for ties in rank responses [21]. Specifically, the probability of a TTO response is:
$$f(y \mid x) = \frac{1}{\sigma_j}\,\phi\!\left(\frac{y - \mu_j x}{\sigma_j}\right) \qquad (6)$$
where φ is the standard normal density, y equals t_1 if BTD, or -t_2 if WTD, and x equals 10 if BTD, or (10 - t_2) if WTD. This is equivalent to a simple linear regression with no constant and an assumption of normally distributed errors with state-specific variances. A central advantage of the exploded probit is that the estimator can accommodate both TTO and rank responses.
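Because the TTO part of the model is described as a regression through the origin with normal errors, its log-likelihood can be sketched directly. The code below is an assumption-laden illustration: each response contributes the log of a normal density for y with mean μ_j · x and standard deviation σ_j, and the helper names and data are hypothetical.

```python
from math import log, pi


def tto_loglik(pairs, mu_j, sigma_j):
    """Log-likelihood of (y, x) TTO pairs with y ~ Normal(mu_j * x, sigma_j^2)."""
    total = 0.0
    for y, x in pairs:
        z = (y - mu_j * x) / sigma_j
        total += -0.5 * log(2.0 * pi) - log(sigma_j) - 0.5 * z * z
    return total


def coefficient(pairs):
    """Least-squares slope through the origin."""
    return sum(x * y for y, x in pairs) / sum(x * x for y, x in pairs)


# Hypothetical (y, x) pairs: three BTD responses and one WTD response.
pairs = [(8.0, 10.0), (6.0, 10.0), (7.0, 10.0), (-9.75, 0.25)]
mu_hat = coefficient(pairs)
for mu in (mu_hat - 0.05, mu_hat, mu_hat + 0.05):
    print(round(mu, 3), round(tto_loglik(pairs, mu, sigma_j=1.0), 3))
```

For a fixed σ_j, the log-likelihood peaks at the least-squares coefficient, which is why the maximum likelihood version can be viewed as an extension of the coefficient estimator.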
Caution is warranted when merging responses from different valuation techniques into a single estimator. While the estimation of state-specific components, μ j , may benefit greatly from the added information, it remains unclear whether the TTO variance is equal to the variance found in rank responses. Completion of the TTO task entails a greater cognitive burden for respondents, which may result in greater errors. In the combined estimator, a separate variance parameter describing the difference between the method-specific variances is included for rank responses.
In combining TTO and rank responses within a single estimation, we increase the power of valuation studies that explore preferences of respondents using both TTO and rank responses. In most valuation studies done on the basis of the MVH protocol, both TTO and rank were administered. A problem might be that there are more ranked pairs than TTO responses. To impose balance across methods, we assigned the pair-wise comparisons a reduced weight equal to the respondent's number of hypothesized non-anchor states over the respondent's number of pair-wise comparisons. As a result, each respondent's set of decomposed rank responses received the same weight in the maximum likelihood estimation as their set of TTO responses. The estimator accounts for both sources of information equitably.
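The weighting rule can be illustrated with a short calculation. The sketch assumes that every pair of ranked cards counts as one pair-wise comparison (ties and any excluded pairs are ignored) and that two of the cards are anchors; the function name is hypothetical.

```python
from math import comb


def pair_weight(n_cards, n_anchors=2):
    """Weight per pair-wise comparison: non-anchor states / number of pairs."""
    n_pairs = comb(n_cards, 2)  # assumes all pairs of ranked cards are compared
    return (n_cards - n_anchors) / n_pairs


weight = pair_weight(15)         # 15 ranked cards, 2 of them anchors
print(weight)                    # 13 / 105
print(weight * comb(15, 2))      # total rank weight, matching 13 TTO responses
```

Under these assumptions a respondent's 105 comparisons together carry a weight of 13, the same as one unit of weight for each of 13 TTO responses.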

United Kingdom Measurement and Valuation of Health (MVH) Study
In 1993, the University of York administered 3395 interviews with a response rate of 64%, and collected values of 42 EQ-5D health states and the state of unconsciousness [12][13][14]. The MVH protocol, developed for the aforementioned study, describes a face-to-face interview that can be separated into several sections. First, the respondents are asked to describe their own health using the EQ-5D descriptive system. Then, the respondents rank 15 cards each describing a health state. This set of 15 health state cards always includes the anchor states, optimal health (11111) and immediate death. The respondents are instructed to assume that the duration of the health state is 10 years and followed by death. After the ranking exercise, the subjects are asked to place each card on the EQ-VAS, often referred to as the EuroQol "thermometer." After the EQ-VAS valuation section, the deck of health state cards is reshuffled, and 13 health states are valued using the TTO method. The two missing states are 11111 and 'immediate death' as these states cannot be valued directly using the standard TTO, because they anchor the TTO scale. The TTO-interview is complemented by a visual aid, specifically a TTO-probe board that graphically displays the difference in life years between health states. As previously described, the TTO task produces either t 1 or t 2 responses, each of which describes a compensating amount of time in the optimal health state.
For the TTO and rank analytical samples (N = 3,333 and 3,355, respectively), respondents were excluded for a particular method (1) if only one or two states were valued (other than 11111, "immediate death," and "unconscious"); (2) if all states were given the same value; or (3) if all states were valued worse than "immediate death." In addition, respondents were excluded from the rank sample if they ranked death equivalent to optimal health. These four criteria motivated the exclusion of 1.8% of the rank respondents and 1.2% of the TTO respondents.

Comparison between Instant and Episodic RUM
Figure 2 illustrates the relationship between the episodic and instant RUM predictions using TTO responses for the 42 EQ-5D hypothetical health states included in the UK MVH study. As described in equation 4, the instant RUM estimates are mean slopes. These means are presented with and without Dolan's transformation of WTD responses to a bound of -1.00. The unadjusted means are substantially less than the episodic RUM predictions, depending on the quantity of WTD responses. This pattern intuitively illustrates the effect of summing fractions, instead of taking a coefficient (i.e., a fraction of sums).
As shown in Table 1, episodic and instant predictions are correlated above 0.97 under both Pearson's and Spearman's rho. However, unadjusted mean slopes agree poorly with coefficient estimates (Lin's rho = 0.135). On the other hand, the adjusted mean slopes moderately agree (Lin's rho = 0.841). The predictions rendered through the arbitrary correction of WTD responses are still substantially different from the episodic RUM predictions (average absolute difference = 0.179). While we recognize that the Dolan transformation improves concordance between the instant and episodic predictions, improved face validity is not a sufficient motivation for data manipulation.
Unlike the mean slopes, the coefficient estimates demonstrate strong concordance with the rank-based predictions (Lin's rho = 0.91). This suggests that rank responses provide information that is more similar to TTO responses under the episodic RUM than under the instant RUM with Dolan's transformation of WTD responses. Convergent validity between the two methods is improved more by a theoretically coherent model than by an ad hoc boundary of -1.00. This, in turn, increases the construct validity of both the ranking and TTO estimates for health state valuation.
Figure 3. Comparison of Episodic RUM Estimates using Single and Both Responses for 42 EQ-5D Health States.
Figure 3 illustrates the relationship between episodic predictions. Compared to the rank-based predictions, coefficient estimates are slightly higher for mild health states and lower for states near death, which suggests the potential for duration dependence in health state valuation [15,16]. Future analysis may parameterize the duration effect and estimate the extent of adaptation. Overall, the pattern between episodic predictions suggests strong concordance.
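The difference between correlation and concordance in the comparisons above can be made concrete with Lin's concordance correlation coefficient. The sketch below uses the standard definition, 2·s_xy / (s_x² + s_y² + (mean_x - mean_y)²), with population (1/n) moments; the two value sets are invented to show how a systematic shift leaves Pearson's correlation at one while lowering concordance.

```python
from math import sqrt
from statistics import fmean


def _moments(a, b):
    """Means, population variances, and population covariance of two lists."""
    n = len(a)
    mean_a, mean_b = fmean(a), fmean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return mean_a, mean_b, var_a, var_b, cov


def pearson(a, b):
    _, _, var_a, var_b, cov = _moments(a, b)
    return cov / sqrt(var_a * var_b)


def lins_ccc(a, b):
    """Lin's concordance: penalizes shifts in location and scale, not only scatter."""
    mean_a, mean_b, var_a, var_b, cov = _moments(a, b)
    return 2.0 * cov / (var_a + var_b + (mean_a - mean_b) ** 2)


values = [0.1, 0.5, 0.9]             # hypothetical predictions from one estimator
shifted = [v - 0.5 for v in values]  # same ordering and spacing, lower level
print(pearson(values, shifted))      # perfectly correlated
print(lins_ccc(values, shifted))     # concordance well below one
```

This is the pattern to look for when two sets of predictions rank the states almost identically but sit at different levels of the QALY scale.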

Episodic RUMs using TTO, Rankings, and Both Responses
Lastly, we compared the 95% confidence intervals of the episodic RUM predictions using TTO responses alone with the intervals using both TTO and rank responses (see Table 2). Among the 42 states, three states have dual response intervals nested within their TTO intervals, and 12 states have intervals that overlap. This discordance suggests some systematic differences between the TTO and rank-based values. In terms of interval width, the average width of the TTO interval is 0.057 with a range of 0.021 to 0.094. The average width of the dual response interval is 0.041 with a range of 0.027 to 0.047. For eleven mild health states, the TTO-based interval is narrower than the dual response interval; for the remaining, more severe states, the dual response interval is narrower. The interval widths of the TTO and dual responses suggest that the use of both responses decreases the standard error in health state value predictions by roughly 28% on average (mean interval width of 0.041 versus 0.057) and allows for greater variability in the value of mild states.
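The relative change in average interval width can be checked directly from the two reported averages. The sketch assumes symmetric normal-theory intervals, so that the width equals 2 × 1.96 × the standard error and a proportional change in width is the same proportional change in standard error; the function names are hypothetical.

```python
def relative_reduction(before, after):
    """Proportional decrease from `before` to `after`."""
    return (before - after) / before


def implied_standard_error(width, z=1.96):
    """Standard error implied by a symmetric 95% interval of the given width."""
    return width / (2.0 * z)


tto_width, dual_width = 0.057, 0.041  # average widths reported in the text
print(round(relative_reduction(tto_width, dual_width), 3))
print(round(implied_standard_error(tto_width), 4),
      round(implied_standard_error(dual_width), 4))
```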

Discussion
In this paper, we introduce the episodic RUM and its coefficient estimator, which together provide a framework for health state valuation that is theoretically and econometrically consistent. The findings suggest a re-analysis of current health state valuation data and the potential merger of TTO and rank responses under a unified QALY estimator, specifically the exploded probit. To better understand this conclusion, we delineate the three major contributions of the episodic RUM.
The first contribution is the theoretical realization that under the conventional TTO approach, known as the instant RUM, the error scale in WTD and BTD responses is different by construction. As shown in equation 1, BTD error is divided by ten, and WTD error is divided by a number less than ten. Therefore, the instant RUM inflates the error of WTD responses, causing them to become more influential on the estimator and pulling the estimates down. Dolan's transformation of WTD responses (-t_2/10) inadvertently causes the error scale to be equivalent, but the predictions lose internal consistency [12]. In contrast, the episodic RUM assigns the same error scale, regardless of response type, and produces consistent results.
The second contribution is in convergent validity [11]. The episodic RUM predictions from the TTO responses strongly agree with predictions from the rank responses. In fact, this strength of agreement is larger than the agreement between rank predictions and instant RUM predictions with the Dolan transformation of WTD responses. The results confirm ranking and TTO to be closely related, suggesting the combination of both methods' strengths: the sound psychometric foundations and feasibility of ranking, and the face validity of TTO as it relates closely to the QALY paradigm. In a previous paper, Craig, Busschbach and Salomon show that rank predictions are essentially equivalent to VAS predictions (Lin's rho = 0.98); therefore, the results of this paper complementarily demonstrate convergent validity in the predictions for rank, VAS and TTO under the episodic RUM [5]. Furthermore, this evidence on the promise of the episodic RUM demonstrates that Dolan's arbitrary correction of negative responses is outmoded.
The third contribution is more practical. Under the assumption of normal errors, the episodic RUM implies an exploded probit estimator that integrates rank and TTO responses. This exploded probit estimator increases the power of valuation studies considerably by combining responses from two forms of discrete choice experiments: TTO and ranking. We demonstrate that the integration of rank and TTO responses is feasible and decreases the standard errors of the state value predictions. By merging a psychometrically strong instrument (i.e., ranks) with discrete choice data based on utility theory (i.e., TTO), predictions are more robust. However, we recognize the appeal of the nonparametric episodic RUM estimator (equation 5).

Conclusion
The episodic RUM may replace the current paradigm in health state valuation, given that the instant RUM changes the error scale by response type; arbitrary corrections of WTD responses produce aberrant results; and the exploded probit allows the integration of TTO, rank, SG, and other discrete choice responses in a theoretically and econometrically consistent manner. In more practical terms, future valuation studies (e.g., EQ-5D five level version) may be statistically powered using a variety of discrete choice responses. The next step might be to re-estimate each country-specific valuation set using the episodic RUM and further examine duration effects in components and errors.