The influence of measurement error on calibration, discrimination, and overall estimation of a risk prediction model
© Rosella et al.; licensee BioMed Central Ltd. 2012
Received: 26 January 2012
Accepted: 18 October 2012
Published: 1 November 2012
Self-reported height and weight are commonly collected at the population level; however, they can be subject to measurement error. The impact of this error on predicted risk, discrimination, and calibration of a model that uses body mass index (BMI) to predict risk of diabetes incidence is not known. The objective of this study is to use simulation to quantify and describe the effect of random and systematic error in self-reported height and weight on the performance of a model for predicting diabetes.
Two general categories of error were examined: random (nondirectional) error and systematic (directional) error on an algorithm relating BMI in kg/m2 to probability of developing diabetes. The cohort used to develop the risk algorithm was derived from 23,403 Ontario residents who responded to the 1996/1997 National Population Health Survey linked to a population-based diabetes registry. The data and algorithm were then simulated to estimate the impact of these errors on predicted risk, on calibration (Hosmer-Lemeshow goodness-of-fit χ2), and on discrimination (C-statistic). Simulations were done 500 times with sample sizes of 9,177 for males and 10,618 for females.
Simulation data successfully reproduced discrimination and calibration generated from population data. Increasing levels of random error in height and weight reduced the calibration and discrimination of the model. Random error biased the predicted risk upwards whereas systematic error biased predicted risk in the direction of the bias and reduced calibration; however, it did not affect discrimination.
This study demonstrates that random and systematic errors in self-reported health data have the potential to influence the performance of risk algorithms. Further research that quantifies the amount and direction of error can improve model performance by allowing for adjustments in exposure measurements.
In medicine, prediction tools are used to calculate the probability of developing a disease or state in a given time period. Within the clinical setting, predictive algorithms such as the Framingham Heart Score, which is used to calculate the probability that a patient will develop coronary heart disease, have contributed important advances in individual patient treatment and disease prevention. Similarly, applying predictive risk tools to populations can provide insight into the influence of risk factors, the future burden of disease in an entire region or nation, and the value of interventions at the population level. Risk prediction is a key aspect of clinical work and has recently been applied to population health through the Diabetes Population Risk Tool (DPoRT). The prediction of disease risk using risk algorithms is based on a set of baseline variables that may contain measurement error that could affect the prediction, discrimination, and accuracy of the tool.
Increasingly, prediction tools have incorporated self-reported patient information to facilitate their use [3–5]. These self-reported responses can contain random error due to imperfect recall or misunderstanding of the question. They can also contain systematic error or bias (over- or underreporting) as a result of psychosocial factors such as social desirability. The influence of error contained in self-reported risk factor data on disease prediction has not been systematically studied. In particular, evidence on the influence that measurement error has on predictive accuracy is lacking. By understanding the consequences of measurement error for risk algorithms, efforts could be made to correct for these errors and thus improve the accuracy and validity of risk algorithms. Furthermore, developers of risk tools can use this information to better weigh the pros and cons of using different types of data (i.e., self-reported or measured).
Measurement error has mainly been examined with respect to its effect on risk estimates, such as risk ratios or hazard ratios [6–8]. This research has led to improvements in the critical appraisal and interpretation of epidemiologic findings. While useful for understanding the effects of error on etiological estimates of disease, the findings from these studies do not directly apply to risk algorithms. The objective of this study is to use simulation to understand the effect of measurement error in self-reported risk factors on the performance of a simple risk algorithm to predict diabetes. This study will focus on the measurement of body mass index (BMI), which is defined as an individual’s body mass in kilograms (kg) divided by the square of the individual’s height in meters (m) (kg/m2). This measure is the focus because it has the greatest influence on diabetes risk [9–13].
Two general categories of error were examined in this study: random (nondirectional) error and systematic (directional) error. Data were simulated to allow for estimation of the impact of hypothetical values of random and systematic error on predicted risk and two measures of predictive accuracy: calibration and discrimination. Calibration is achieved in a prediction model if the predicted probabilities closely agree with observed outcomes. A model that does not have good calibration will result in a significant over- or underestimation of risk. Calibration is not an issue if the purpose of the model is only to rank-order subjects. In this study, calibration was measured using a modified version of the Hosmer-Lemeshow (H-L) goodness-of-fit statistic (χ2 H-L), calculated by dividing the cohort into deciles of predicted risk and comparing observed with predicted risk within each decile [15–17]. Consistent with D’Agostino’s approach for evaluating observed and predicted values using risk algorithms, the value 20 (approximately the 99th percentile of a chi-square distribution with 8 degrees of freedom) was used as a cutoff to mark sufficient calibration. Discrimination is the ability to differentiate between those who are high risk and those who are low risk or, in this case, those who will and will not develop diabetes given a fixed set of variables. The receiver operating characteristic (ROC) curve is the accepted way to measure discrimination. The area under the ROC curve considers all possible pairings of a subject who exhibits the outcome with a subject who does not, and gives the proportion of pairs in which the subject with the outcome has the higher predicted risk, thereby resulting in an index of resolution. This area is equal to the C-statistic, where 1.0 implies perfect discrimination and 0.5 implies no discrimination [14, 19, 20].
A perfect prediction model would perfectly resolve the population into those who develop diabetes and those who do not. Accuracy is unaffected by discrimination, meaning a model can possess good discrimination yet poor calibration.
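The two performance measures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's SAS code: `hosmer_lemeshow` implements the standard grouped chi-square, which may differ in detail from the modified version used in the study, and `chi2_upper_tail_even_df` is a hypothetical helper included only to check the upper-tail probability at the cutoff of 20 with 8 degrees of freedom.

```python
import math
from typing import Sequence


def c_statistic(risk: Sequence[float], event: Sequence[int]) -> float:
    """Share of case/non-case pairs in which the case has the higher risk.

    Tied pairs count as half.
    """
    cases = [r for r, e in zip(risk, event) if e == 1]
    non_cases = [r for r, e in zip(risk, event) if e == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    score = 0.0
    for rc in cases:
        for rn in non_cases:
            if rc > rn:
                score += 1.0
            elif rc == rn:
                score += 0.5
    return score / (len(cases) * len(non_cases))


def hosmer_lemeshow(risk: Sequence[float], event: Sequence[int],
                    groups: int = 10) -> float:
    """Chi-square comparing observed and expected events across risk groups."""
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    n = len(order)
    chi2 = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        observed = sum(event[i] for i in idx)
        expected = sum(risk[i] for i in idx)
        mean_risk = expected / len(idx)
        denom = len(idx) * mean_risk * (1.0 - mean_risk)
        if denom > 0:
            chi2 += (observed - expected) ** 2 / denom
    return chi2


def chi2_upper_tail_even_df(x: float, df: int) -> float:
    """P(chi-square with an even df exceeds x), using the closed-form sum."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Upper-tail probability at the calibration cutoff used in the study.
print(round(chi2_upper_tail_even_df(20.0, 8), 4))
```

The pairwise loop in `c_statistic` is quadratic in sample size; for cohorts the size of the one simulated here, a rank-based calculation would be the more practical choice.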
The simulation was initiated using parameters taken from the same population-level data used to develop DPoRT. These data represent 23,403 Ontario residents who responded to the 1996/1997 National Population Health Survey (NPHS) conducted by Statistics Canada and were linkable to health administrative databases in Ontario. In the NPHS, households were selected through stratified, multilevel cluster sampling of private residences using provinces and/or local planning regions as the primary sampling unit. The survey was conducted by telephone and all responses were self-reported (83% response rate). Persons under the age of 20 (n = 2,407) and those who had self-reported diabetes (n = 894) were excluded. Those who were pregnant at the time of the survey were also excluded (n = 241) because baseline BMI could not be accurately ascertained, leaving a total of 9,177 males and 10,618 females. The diabetes status of all respondents in Ontario was established by linking persons to the Ontario Diabetes Database (ODD), which contains all physician-diagnosed diabetes patients in Ontario identified since 1991. The database was created using hospital discharge abstracts and physician service claims. The ODD has been validated against primary care health records and demonstrated excellent accuracy for determining incidence and prevalence of diabetes in Ontario (sensitivity of 86%, specificity of 97%) [23, 24].
Starting values taken from the 1996–1997 National Population Health Survey (NPHS) used in the simulation
Values are mean (standard deviation), shown for males (N = 9,177) and females (N = 10,618)
Correlation for height and weight (rhw): males, rhw = 0.475; females, rhw = 0.311
10-year DM incidence
Using the generated values for the regression coefficients β0, β1, and β2, the probability of developing diabetes was calculated for each individual. The coefficients in the algorithm remain constant for each calculation in order to replicate current practice, in which the same risk equation is applied to all individuals.
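As a rough illustration of applying one fixed risk equation to every individual, the sketch below evaluates a logistic model at several BMI values. Which terms β1 and β2 multiply is not stated in the text above, so the BMI and BMI-squared terms used here are an assumption, and the coefficient values are hypothetical placeholders rather than the study's estimates.

```python
import math
from typing import Sequence


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m2."""
    return weight_kg / height_m ** 2


def predicted_risk(intercept: float, slopes: Sequence[float],
                   terms: Sequence[float]) -> float:
    """Inverse logit of intercept + sum(slope * term)."""
    if len(slopes) != len(terms):
        raise ValueError("one slope per term is required")
    linear_predictor = intercept + sum(b * x for b, x in zip(slopes, terms))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients (NOT the study's values); assumed terms: BMI, BMI^2.
B0 = -6.0
B1, B2 = 0.12, 0.0005

# The same equation is applied to every person; only the inputs change.
for weight_kg, height_m in [(60.0, 1.70), (80.0, 1.70), (100.0, 1.70)]:
    x = bmi(weight_kg, height_m)
    p = predicted_risk(B0, (B1, B2), (x, x ** 2))
    print(f"BMI {x:5.1f} -> predicted risk {p:.3f}")
```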
We assumed that the observed variance of height and weight contains some level of error, and therefore that the observed variance can be separated into the true variance of the measurement in the population (σ2 true) and the variance that can be attributed to measurement error (σ2 error). The amount of random measurement error was expressed through the intraclass correlation coefficient (ICC), an estimate of the fraction of the total measurement variance associated with the true variation among individuals [25, 26]. Systematic error, which we refer to as bias in our study, is defined as the difference between the observed height and weight and their true values (without measurement error). In our study, the bias was defined as an overestimation of height (0 to 3.0 cm) and an underestimation of weight (0 to −3.0 kg), each varied in increments of 0.5 units. The magnitudes of bias in height and weight were taken from a recent systematic review that summarized the empirical evidence regarding the concordance of objective and subjective measures of height and weight, and are consistent with those found in the Canadian population. True BMI is defined as the BMI calculated from height and weight when measurement error is equal to zero.
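The variance decomposition and the bias definition can be made concrete with a short sketch. It assumes that error is additive and normally distributed, which the text does not state explicitly; the function names and the example observed standard deviations are hypothetical, while the 3.0 cm and −3.0 kg biases are the largest values examined in the study.

```python
import math
import random


def error_sd(observed_sd: float, icc: float) -> float:
    """SD of random error when ICC = true variance / observed variance."""
    if not 0.0 < icc <= 1.0:
        raise ValueError("icc must be in (0, 1]")
    return observed_sd * math.sqrt(1.0 - icc)


def true_sd(observed_sd: float, icc: float) -> float:
    """SD of the true values implied by the same ICC."""
    if not 0.0 < icc <= 1.0:
        raise ValueError("icc must be in (0, 1]")
    return observed_sd * math.sqrt(icc)


def reported(true_value: float, err_sd: float, bias: float,
             rng: random.Random) -> float:
    """Self-reported value = true value + systematic bias + random error."""
    return true_value + bias + rng.gauss(0.0, err_sd)


rng = random.Random(2012)
height_cm, weight_kg = 175.0, 80.0      # one person's true values (hypothetical)
obs_sd_h, obs_sd_w = 7.0, 13.0          # hypothetical observed SDs
icc = 0.8                               # 20% of observed variance is error

h_rep = reported(height_cm, error_sd(obs_sd_h, icc), 3.0, rng)    # overreports height
w_rep = reported(weight_kg, error_sd(obs_sd_w, icc), -3.0, rng)   # underreports weight

true_bmi = weight_kg / (height_cm / 100.0) ** 2
observed_bmi = w_rep / (h_rep / 100.0) ** 2
print(f"true BMI {true_bmi:.1f}, observed BMI {observed_bmi:.1f}")
```

With no random error (ICC of 1), the maximum biases alone move this person's BMI from about 26.1 to about 24.3 kg/m2.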
The simulation of each sample population was run 500 times. For each simulation, the probability P_i was calculated twice for each individual: first using the observed height and weight values, and second using the true BMI value (in the absence of the specified measurement error). The H-L statistic and C-statistic were likewise calculated twice, using both true and observed BMI values, to allow for comparison. All simulations were done using SAS statistical software (version 9.1, SAS Institute Inc., Cary, NC) and random numbers were generated using the RAN family of functions (RANUNI and RANNOR).
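The replication scheme can be outlined as follows. This is a simplified sketch, not the study's SAS program: it tracks only the difference in mean predicted risk (observed minus true), uses far fewer and smaller replications than the study, and applies the reported height-weight correlation for males (0.475) to the true values. The means, standard deviations, and logistic coefficients are hypothetical placeholders.

```python
import math
import random
from statistics import mean

RHO_HW = 0.475                 # height-weight correlation reported for males
# Hypothetical placeholders below (not study values).
MEAN_H, SD_H = 175.0, 7.0      # observed height, cm
MEAN_W, SD_W = 80.0, 13.0      # observed weight, kg
B0, B1 = -7.0, 0.15            # assumed model: logit(P) = B0 + B1 * BMI


def risk(weight_kg: float, height_cm: float) -> float:
    """Predicted probability from BMI under the assumed logistic model."""
    bmi = weight_kg / (height_cm / 100.0) ** 2
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * bmi)))


def one_replication(n: int, icc: float, bias_h: float, bias_w: float,
                    rng: random.Random) -> float:
    """Mean risk from reported values minus mean risk from true values."""
    true_risk, observed_risk = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # True values carry the share `icc` of the observed variance.
        h = MEAN_H + SD_H * math.sqrt(icc) * z1
        w = MEAN_W + SD_W * math.sqrt(icc) * (
            RHO_HW * z1 + math.sqrt(1 - RHO_HW ** 2) * z2)
        # Reported values add systematic bias and random error.
        h_rep = h + bias_h + rng.gauss(0, SD_H * math.sqrt(1 - icc))
        w_rep = w + bias_w + rng.gauss(0, SD_W * math.sqrt(1 - icc))
        true_risk.append(risk(w, h))
        observed_risk.append(risk(w_rep, h_rep))
    return mean(observed_risk) - mean(true_risk)


def simulate(reps: int, n: int, icc: float, bias_h: float, bias_w: float,
             seed: int = 1) -> float:
    """Average the risk difference over `reps` replications."""
    rng = random.Random(seed)
    return mean(one_replication(n, icc, bias_h, bias_w, rng)
                for _ in range(reps))


print("bias only   :", simulate(20, 500, icc=1.0, bias_h=3.0, bias_w=-3.0))
print("random only :", simulate(20, 500, icc=0.6, bias_h=0.0, bias_w=0.0))
```

Calculating the H-L statistic and C-statistic in each replication would additionally require simulating each person's diabetes outcome from the true risk, which this sketch omits.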
We had three a priori hypotheses. First, we hypothesized that random measurement error would affect both discrimination and calibration of a model due to the increase in observed variance in BMI and misclassification. Second, we hypothesized that systematic error would have minimal effects on discrimination (the ability to rank-order subjects) but significant effects on calibration of a model. Third, we hypothesized that random error would not affect the overall predicted risk value and that systematic error would influence the predicted risk in the direction of the systematic error.
Values of actual risk equation relating BMI to probabilities of developing diabetes using logistic regression values from the National Population Health Survey (NPHS) 10-year follow-up cohort and values achieved from the simulation model
Males (N = 9,177)
Calibration (χ2 HL): NPHS data, χ2 HL = 5.67, p-value = 0.6841; simulation, χ2 HL = 9.951, p-value = 0.3689
Discrimination (C-statistic): NPHS data, C = 0.677; simulation, C = 0.686
Females (N = 10,618)
Calibration (χ2 HL): NPHS data, χ2 HL = 9.33, p-value = 0.3153; simulation, χ2 HL = 10.466, p-value = 0.3356
Discrimination (C-statistic): NPHS data, C = 0.726; simulation, C = 0.718
Difference in overall diabetes risk (observed – true) and percent that achieved calibration (H-L χ2 < 20) in 500 replications under systematic reporting error (bias) in height and weight for males (N = 9,177) and females (N = 10,618)
Columns, reported separately for males and females: difference in diabetes risk (observed – true); number of diabetes cases; percent that achieved calibration
Rows: underestimate of weight by a given amount; overestimate of height by a given amount
This study systematically examined the impact of measurement error in the context of a prediction algorithm. This simulation study reveals several interesting aspects of the influence of measurement error (systematic and random) on the performance of a risk algorithm.
As hypothesized, random error reduced calibration and discrimination of the algorithm because the observed variance is greater than the true variance in the presence of measurement error. The observed BMI distribution was wider than the true distribution due to this increased variation. Because the error is random, it affects both diabetics and nondiabetics, resulting in greater overlap between their BMI distributions. Ultimately, this makes assigning risk according to BMI levels more difficult. Even though random error in height and weight should, on average, correctly estimate the true BMI in the population (since it does not skew the mean in a particular direction), it can still influence the performance of a prediction model due to decreased precision, which leads to greater dispersion in the BMI distribution.
In this study, systematic error in height and weight biased the predicted risk estimates in the direction of the error. This affects calibration, which is not surprising since the concordance of observed and predicted events would be influenced by the under- or overreporting of the BMI level. In other words, persons who over- or underreport their weight will have their risk over- or underestimated by the risk model, which results in disagreement with observed events. Systematic error did not influence the ability to rank-order subjects. Therefore, the ability to discriminate between those who will and will not develop diabetes was not affected by systematic error when variance due to random error is held constant. This was reflected by the stability of the C-statistic under varying degrees of systematic error. Systematic error as examined in this study shifted the distribution of BMI to the left (as a result of underestimating weight, overestimating height, or both) compared to the true distribution. This is an overall effect, and the decreased precision or increased variability seen with random error is therefore not observed. Even though the distribution is shifted to the left, those with higher BMI still have a higher probability of developing diabetes than those with lower BMI, despite the fact that the absolute levels of risk will be underestimated in both groups. This is a classic example of how discrimination and calibration are often discordant. Due to the nature of probability, it is possible for a prediction algorithm to exhibit perfect discrimination, i.e., it can perfectly resolve a population into those who will and will not experience the event, and at the same time have deficient accuracy (meaning that the predicted probability of developing diabetes does not agree with the true probability).
This study did not impose systematic error with respect to disease status, but it could be hypothesized that if the systematic error were differential between diabetics and nondiabetics, this could indeed affect discrimination.
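The contrast between discrimination and calibration under a nondifferential shift can be shown with a small deterministic example. It simplifies the study's setup by subtracting a constant from every person's BMI instead of biasing height and weight separately, and the BMI values, outcomes, and coefficients are hypothetical.

```python
import math
from statistics import mean


def risk(bmi: float, b0: float = -7.0, b1: float = 0.15) -> float:
    """Predicted probability under a hypothetical logistic model in BMI."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * bmi)))


def c_statistic(scores, events) -> float:
    """Share of case/non-case pairs ranked correctly (ties count half)."""
    cases = [s for s, e in zip(scores, events) if e]
    non_cases = [s for s, e in zip(scores, events) if not e]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


true_bmi = [21.0, 23.5, 26.0, 28.0, 31.0, 35.0]
events = [0, 0, 1, 0, 1, 1]                     # 1 = developed diabetes
shifted_bmi = [b - 1.0 for b in true_bmi]       # everyone underreports equally

true_risk = [risk(b) for b in true_bmi]
shifted_risk = [risk(b) for b in shifted_bmi]

# Rank order is preserved, so discrimination is identical ...
print(c_statistic(true_risk, events), c_statistic(shifted_risk, events))
# ... but every predicted risk is lower, so calibration suffers.
print(mean(true_risk), mean(shifted_risk))
```

If the shift differed between those who do and do not develop diabetes, the ordering of pairs could change and the C-statistic would no longer be guaranteed to stay the same.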
The finding that random error biased the overall predicted risk upwards contradicted the hypothesis that only systematic error would bias the risk estimate. Random error increases the variability of a measurement and widens the range of predicted risk, which is bounded below by 0 in the logistic model. Because the outcome probabilities here are low, predictions have more room to move upward than downward, so the added variability raises the average predicted risk. In a situation where the outcome probabilities are very high, the skew would be expected to be in the opposite direction. Not surprisingly, the error in predicted risk resulting from underreporting weight or overreporting height is in the anticipated direction (i.e., if weight is underreported, the risk calculated from observed values will be underestimated). Furthermore, the addition of random error to this type of systematic error slightly reduced the amount of underestimation because the random and systematic errors worked in opposite directions. In another situation, random error could augment the error in predicted risk. Such would be the case if systematic error tended to result in an overestimate of risk.
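The direction of this effect follows from the curvature of the logistic function, which the following sketch illustrates. It replaces random error with the simplest symmetric perturbation, moving the linear predictor up or down by the same amount with equal probability; the function names and the chosen values are illustrative assumptions.

```python
import math


def inv_logit(x: float) -> float:
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_risk_with_symmetric_error(lp: float, spread: float) -> float:
    """Average risk when the linear predictor is lp - spread or lp + spread
    with equal probability, i.e. zero-mean error on the logit scale."""
    return 0.5 * (inv_logit(lp - spread) + inv_logit(lp + spread))


for lp in (-3.0, 0.0, 3.0):
    without_error = inv_logit(lp)
    with_error = mean_risk_with_symmetric_error(lp, 1.0)
    print(f"lp={lp:+.1f}  no error={without_error:.4f}  with error={with_error:.4f}")
```

When the baseline risk is below 0.5 the averaged risk is higher than the risk without error, when it is above 0.5 the averaged risk is lower, and at exactly 0.5 the two coincide.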
This study shows that random error accounting for up to 20% of the total observed variance (ICC of 0.8 or higher) is unlikely to affect the performance or validation of a prediction model. Research shows that the random error in height and weight reporting is unlikely to exceed that amount. Interestingly, the effects on the predicted diabetes risk were relatively minor, even in situations of high under- or overreporting of weight and height. This is likely because BMI has such a strong relationship with diabetes that increased risk is apparent even with significant underestimation. The true distributions of BMI in diabetics and nondiabetics are so distinct that even in the presence of underreporting these populations have dissimilar risk for developing diabetes. Had this misclassification affected a variable that did not have such a strong relationship with the outcome, the effect on predicted risk may have been more severe. Furthermore, in this study systematic error in self-reported height and weight was taken as an overall effect in the population. If self-reporting error were significantly more likely to occur in those who were more likely to develop diabetes, then the impact of this bias could be augmented.
This study focused on the overall trend of self-reporting error seen in several validation studies, that is, an underestimation of weight and an overestimation of height; however, these patterns may also vary across subpopulations, such as by gender and socioeconomic status. Generally, women tend to underestimate weight more so than men, and men tend to overestimate height more so than women [31, 32]. Socioeconomic status has been shown to modify these associations such that those of lower socioeconomic status may actually overestimate their weight and/or underestimate their height [33, 34]. These subgroups may also have differential diabetes risk, and the extent to which this error influences population risk prediction is a topic for future research. In addition, values from a given individual in the population may exceed the maximum values included in this study; however, the influence of this would be more relevant for individual risk prediction tools than for population prediction.
There are several limitations to consider in the context of this study. Conclusions drawn from this simulation study relate only to the scenarios simulated and may not apply to all risk algorithm situations. To draw conclusions for another setting, a simulation program would need to be created that reflects the specific conditions of that setting. Another caution in interpreting the findings of this study is that the model examined in this exercise is simpler than the complicated multivariate risk algorithms encountered in practice. This simpler model allows us to focus on the height and weight error, which is the greatest potential source of error in DPoRT. It should be noted that one of the assumptions of this study is that the only sources of error are in self-reported height and weight. Other sources of error, including error in diabetes status and selection bias in the survey or in sampling, are assumed to be absent.
This study provides novel information about the influence of measurement error in a risk prediction model. By understanding the consequences of measurement error on prediction and algorithm performance, efforts can be made to correct for these errors and thus improve the accuracy and validity of a risk algorithm. Further, efforts must be made to understand the nature of error in self-reporting measurements. Ongoing work to improve the quality of measurements used in risk algorithms will improve model performance. Researchers developing and validating risk tools must be aware of the presence of measurement error and its impact on the performance of their risk tools.
The study was approved by the Research Ethics Board of Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.
The authors thank Lennon Li, Karin Hohenadel, and Gillian Lim for reviewing the final manuscript.
The study is funded by the Canadian Institutes of Health Research. The views expressed here are those of the authors and not necessarily those of the funding agency. The funding agency had no role in the data collection or in the writing of this paper. The guarantors accept full responsibility for the conduct of the study, had access to the data, and controlled the decision to publish. This study was supported by the Institute for Clinical Evaluative Sciences (ICES), which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results, and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred.
- Anderson KM, Wilson PWF, Odell PM, et al.: An updated coronary risk profile - a statement for health-professionals. Circulation 1991, 83: 356-362. 10.1161/01.CIR.83.1.356
- Hippisley-Cox J, Coupland C, Vinogradova Y, et al.: Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. Br Med J 2007, 335: 136-141. 10.1136/bmj.39261.471806.55
- Rosella LC, Manuel DG, Burchill C, et al.: A population-based risk algorithm for the development of diabetes: development and validation of the diabetes population risk tool (DPoRT). J Epidemiol Commun Health 2011, 65: 613-620. 10.1136/jech.2009.102244
- Lindstrom J, Tuomilehto J: The diabetes risk score: a practical tool to predict type 2 diabetes risk. Diabetes Care 2007, 26: 725-731.
- Mainous AG, Koopman RJ, Diaz VA, et al.: A coronary heart disease risk score based on patient-reported information. Am J Cardiol 2007, 99(9): 1236-1241. 10.1016/j.amjcard.2006.12.035
- Flegal KM, Keyl PM, Nieto FJ: The effects of exposure misclassification on estimates of relative risk. Epidemiology 1986, 123: 736-751.
- Fuller WA: Estimation in the presence of measurement error. Int Stat Rev 1995, 63: 121-141. 10.2307/1403606
- Weinstock MA, Colditz GA, Willett WC: Recall (report) bias and reliability in the retrospective assessment of melanoma risk. Am J Epidemiol 1991, 133: 240-245.
- Colditz G, Willett WC, Rotnitzky A, et al.: Weight gain as a risk factor for clinical diabetes mellitus in women. Ann Intern Med 1995, 122: 481-486.
- Colditz G, Willett WC, Stampfer MJ, et al.: Weight as a risk factor for clinical diabetes in women. Am J Epidemiol 1990, 132: 501-513.
- Perry IJ, Wannamethee SG, Walker MJ, et al.: Prospective study of risk factors for development of non-insulin dependent diabetes in middle aged British men. Br Med J 1995, 310: 555-559. 10.1136/bmj.310.6979.555
- Vanderpump MPJ, Tunbridge WM, French JM, et al.: The incidence of diabetes mellitus in an English community: A 20-year follow-up of the Whickham Survey. Diabet Med 1996, 13: 741-747. 10.1002/(SICI)1096-9136(199608)13:8<741::AID-DIA173>3.0.CO;2-4
- Wilson P, Meigs JB, Sullivan LM, et al.: Prediction of incident diabetes mellitus in middle-aged adults. Arch Intern Med 2007, 167: 1068-1074. 10.1001/archinte.167.10.1068
- Harrell FE: Regression modeling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer; 2001.
- Hosmer DW, Hosmer T, Cessie LE, et al.: A comparison of goodness-of-fit tests for the logistic regression model. Stat Med 1997, 16: 965-980. 10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O
- Hosmer DW, Lemeshow S: Applied logistic regression. New York: Wiley; 1989.
- Hosmer DW, Lid Hjort N: Goodness-of-fit processes for logistic regression: simulation results. Stat Med 2002, 21: 2723-2738. 10.1002/sim.1200
- D’Agostino RB, Grundy S, Sullivan LM, et al.: Validation of the Framingham coronary disease prediction scores. JAMA 2001, 286: 180-187. 10.1001/jama.286.2.180
- Harrell FE, Lee KL, Mark DB: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996, 15: 361-387. 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
- Pencina M, D’Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med 2004, 23: 2109-2123. 10.1002/sim.1802
- Campbell G: General Methodology I: Advances in statistic methodology for the evaluation of diagnostic and laboratory tests. Stat Med 2004, 13: 499-508.
- Statistics Canada: 1996–97 NPHS public use microdata documentation. Ottawa; 1999.
- Hux JE, Ivis F: Diabetes in Ontario. Diabetes Care 2005, 25: 512-516.
- Lipscombe LL, Hux JE: Trends in diabetes prevalence, incidence, and mortality in Ontario, Canada 1995–2005: a population-based study. Lancet 2007, 369: 750-756. 10.1016/S0140-6736(07)60361-4
- Deyo RA, Diehr P, Patrick DL: Reproducibility and responsiveness of health status measures: Statistics and strategies for evaluation. Control Clin Trials 2008, 12: 142S-158S.
- Fleiss J: Statistical methods for rates and proportions. New Jersey: John Wiley & Sons; 2003.
- Gorber SC, Shields M, Tremblay M, et al.: The feasibility of establishing correction factors to adjust self-reported estimates of obesity. Health Rep 2009, 19.
- Shields M, Gorber SC, Tremblay MS: Estimates of obesity based on self-report versus direct measures. Health Rep 2008, 19: 1-16.
- Gorber SC, Tremblay M, Moher D, et al.: A comparison of direct vs. self-report measures for assessing height, weight, and body mass index: a systematic review. Obes Rev 2007, 8: 307-326. 10.1111/j.1467-789X.2007.00347.x
- Diamond GA: What price perfection? Calibration and discrimination of clinical prediction models. J Clin Epidemiol 1992, 45: 85-89. 10.1016/0895-4356(92)90192-P
- Nawaz H, Chan W, Abdulrahman M, et al.: Self-reported weight and height: implications for obesity research. J Prevent Med 2001, 20: 294-298. 10.1016/S0749-3797(01)00293-8
- Niedhammer I, Bugel I, Bonenfant S, et al.: Validity of self-reported weight and height in the French GAZEL cohort. Int J Obes 2000, 24: 1111-1118. 10.1038/sj.ijo.0801375
- Bostrom G, Diderichsen F: Socioeconomic differentials in misclassification of height, weight and body mass index based on questionnaire data. Int J Epidemiol 1997, 26: 860-866. 10.1093/ije/26.4.860
- Wardle K, Johnson F: Sex differences in the association of socioeconomic status with obesity. Int J Obes Relat Metab Disord 2002, 26: 1144-1149. 10.1038/sj.ijo.0802046
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.