Determining the best population-level alcohol consumption model and its impact on estimates of alcohol-attributable harms

Background The goals of our study are to determine the most appropriate model for alcohol consumption as an exposure for burden of disease, to analyze the effect of the chosen alcohol consumption distribution on the estimation of the alcohol Population-Attributable Fractions (PAFs), and to characterize the chosen alcohol consumption distribution by exploring if there is a global relationship within the distribution.
Methods To identify the best model, the Log-Normal, Gamma, and Weibull prevalence distributions were examined using data from 41 surveys from Gender, Alcohol and Culture: An International Study (GENACIS) and from the European Comparative Alcohol Study. To assess the effect of these distributions on the estimated alcohol PAFs, we calculated the alcohol PAF for diabetes, breast cancer, and pancreatitis using the three above-named distributions and using the more traditional approach based on categories. The relationship between the mean and the standard deviation from the Gamma distribution was estimated using data from 851 datasets for 66 countries from GENACIS and from the STEPwise approach to Surveillance from the World Health Organization.
Results The Log-Normal distribution provided a poor fit for the survey data, with Gamma and Weibull distributions providing better fits. Additionally, our analyses showed that there were no marked differences for the alcohol PAF estimates based on the Gamma or Weibull distributions compared to PAFs based on categorical alcohol consumption estimates. The standard deviation of the alcohol distribution was highly dependent on the mean, with a unit increase in mean alcohol consumption associated with an increase in the standard deviation of 1.258 (95% CI: 1.223 to 1.293) (R2 = 0.9207) for women and 1.171 (95% CI: 1.144 to 1.197) (R2 = 0.9474) for men.
Conclusions Although the Gamma distribution and the Weibull distribution provided similar results, the Gamma distribution is recommended to model alcohol consumption from population surveys due to its fit, flexibility, and the ease with which it can be modified. The results showed that a large degree of variance of the standard deviation of the alcohol consumption Gamma distribution was explained by the mean alcohol consumption, allowing for alcohol consumption to be modeled through a Gamma distribution using only average consumption.


Introduction
Alcohol consumption is a component cause [1] for over 200 International Classification of Diseases (ICD-10) three-digit codes [2,3]. In other words, a fraction of the incidence of these diseases, usually called the Population-Attributable Fraction (PAF), would disappear if exposure to one of the causal components was eliminated [4][5][6][7] (in the case of alcohol, under the counterfactual scenario of every person being a lifetime abstainer). The proportion of the diseases caused by alcohol consumption in a component cause model for a population is determined by both the patterns and volume of alcohol consumption and by the relative risks associated with each exposure level [3,8]. For most major diseases where alcohol plays a role (for example, alcohol-attributable cancers, pancreatitis, and cirrhosis of the liver), the average volume of alcohol consumption alone was found to be an adequate predictor of the risk [3,8,9,10]; however, some diseases and injuries (for example, ischemic heart disease, unintentional injuries, and intentional injuries) were found to be also dependent on drinking patterns [11][12][13][14].
The calculation of an alcohol PAF involves a three-stage process: 1) estimation of an exposure distribution of alcohol, 2) establishment of the relative risk function, and 3) the solving of the equation for the PAF [15]. Since the distribution of alcohol consumption on an international level has not been agreed upon, the common approach is to estimate the PAF using categorical measurements rather than modeling it in a more mathematically appropriate continuous manner [16,17]. The mathematical expression is as follows:

(Formula 1)
$$\mathrm{PAF} = \frac{\sum_{i=1}^{k} P_i (RR_i - 1)}{\sum_{i=0}^{k} P_i (RR_i - 1) + 1}$$

where i is the exposure category, with i = 0 denoting baseline exposure or no exposure, k is the highest exposure category, $RR_i$ is the relative risk at exposure level i compared to no consumption, and $P_i$ is the prevalence of the i-th category of exposure.
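The categorical calculation can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: it assumes the standard multi-category form PAF = Σ P_i (RR_i - 1) / (Σ P_i (RR_i - 1) + 1) with the reference category at index 0, and the function name `categorical_paf` as well as all prevalences and relative risks below are hypothetical.

```python
def categorical_paf(prevalences, relative_risks):
    """Categorical PAF; index 0 is the reference category (RR = 1)."""
    if len(prevalences) != len(relative_risks):
        raise ValueError("need one relative risk per exposure category")
    if abs(sum(prevalences) - 1.0) > 1e-9:
        raise ValueError("category prevalences must sum to 1")
    # Excess risk summed over categories; a category with RR = 1 adds nothing.
    excess = sum(p * (rr - 1.0) for p, rr in zip(prevalences, relative_risks))
    return excess / (excess + 1.0)


# Hypothetical inputs (not taken from the study):
# abstainers followed by three drinking categories.
prevalences = [0.40, 0.30, 0.20, 0.10]
relative_risks = [1.00, 1.20, 1.50, 2.00]
paf = categorical_paf(prevalences, relative_risks)
print(f"categorical PAF = {paf:.4f}")
```

Because each category is summarized by a single relative risk, the result depends on where the category boundaries are drawn, which is the limitation the continuous formulation is meant to address.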
When a continuous distribution for the volume of alcohol consumption is used, this calculation can be represented by the following formula:

(Formula 2)
$$\mathrm{PAF} = \frac{P_{a} RR_{a} + P_{ex} RR_{ex} + \int_{0}^{150} P(x)\,RR(x)\,dx - 1}{P_{a} RR_{a} + P_{ex} RR_{ex} + \int_{0}^{150} P(x)\,RR(x)\,dx}$$

where $P_{a}$ is the prevalence of lifetime abstainers, $RR_{a}$ is the relative risk of lifetime abstainers, $P_{ex}$ is the prevalence of former drinkers, $RR_{ex}$ is the relative risk of former drinkers, x is the average volume of alcohol consumption per day, P(x) is the prevalence of drinkers consuming an average volume of x, and RR(x) is the relative risk of drinkers at that volume [15]. Although this is the most accurate way to calculate a PAF, it requires that the distribution of alcohol consumption be known. Previous attempts at modeling alcohol consumption using a Log-Normal distribution have been criticized for various reasons [18,19]; however, the Log-Normal distribution has provided adequate approximations for most applications [20,21]. Recently, more adaptable distributions such as the Gamma distribution have been favored over the Log-Normal distribution [15,22], and it has been suggested that a mixing of distributions is needed to separately model the frequency of drinking and the quantity of alcohol consumed [23].
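A rough sketch of the continuous calculation follows, assuming a Gamma-shaped exposure among current drinkers and trapezoidal integration. Everything specific here is an assumption made for illustration: the function names, the exponential risk curve, the parameter values, and the small positive lower integration bound (used because a Gamma density with shape below 1 is unbounded at zero). The density is rescaled so that P(x) integrates to the prevalence of current drinkers on the truncated range.

```python
import math


def gamma_pdf(x, shape, scale):
    """Gamma density with shape kappa and scale theta, evaluated at x > 0."""
    log_density = ((shape - 1.0) * math.log(x) - x / scale
                   - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_density)


def trapezoid(values, step):
    """Trapezoidal rule for equally spaced function values."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def continuous_paf(p_abstainer, p_former, rr_former, shape, scale, rr_drinker,
                   rr_abstainer=1.0, lower=0.1, upper=150.0, steps=3000):
    """Continuous PAF with a Gamma-shaped exposure among current drinkers."""
    step = (upper - lower) / steps
    grid = [lower + k * step for k in range(steps + 1)]
    density = [gamma_pdf(x, shape, scale) for x in grid]
    # Rescale so that P(x) integrates to the prevalence of current drinkers
    # on the truncated range [lower, upper] (an assumption of this sketch).
    p_drinker = 1.0 - p_abstainer - p_former
    norm = trapezoid(density, step)
    weighted = [p_drinker * d / norm * rr_drinker(x)
                for d, x in zip(density, grid)]
    total = (p_abstainer * rr_abstainer + p_former * rr_former
             + trapezoid(weighted, step))
    return (total - 1.0) / total


def rr_exponential(x):
    # Hypothetical risk curve, chosen only for illustration.
    return math.exp(0.01 * x)


paf = continuous_paf(p_abstainer=0.30, p_former=0.10, rr_former=1.0,
                     shape=0.73, scale=25.0, rr_drinker=rr_exponential)
# With every relative risk equal to 1 the PAF should vanish.
null_paf = continuous_paf(0.30, 0.10, 1.0, 0.73, 25.0, lambda x: 1.0)
print(f"continuous PAF = {paf:.3f}, null PAF = {null_paf:.1e}")
```

The null case is a useful sanity check: when no exposure level carries excess risk, numerator and denominator differ by exactly one and the attributable fraction is zero.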
There are two main instruments to monitor alcohol exposure currently used by countries and international organizations: 1) general population surveys and 2) estimates of per capita consumption, where per capita consumption is an aggregate measure of recorded, unrecorded, and tourist per capita consumption of alcohol (derived from sales, production, and other economic statistics) [9,24,25]. These instruments, however, have limitations [26].
There are no available surveys for many countries, and in some cases where they do exist they do not allow for the accurate estimation of the volume of consumption, as these surveys only ask about the absence or presence of drinking [27]. Existing surveys often considerably underestimate real consumption levels [28][29][30] by typically covering only 30% to 60% of alcohol sales [26]. As a result, per capita consumption figures are considered to be a best estimate of overall volume of consumption in a country [31]; however, per capita consumption does not provide any disaggregated statistic and, thus, does not provide age- and gender-specific consumption estimates. Since in some instances the risk relationship between alcohol consumption and disease-specific mortality is dependent on gender as well as on age, alcohol exposure by gender and age is required to estimate the PAF and to calculate the alcohol-attributable burden of disease in a population [3].
The problems noted above with respect to surveys lead to an underestimated burden of disease attributable to alcohol consumption when PAFs are calculated from population data without adjustment. As a consequence, methods have been developed to triangulate both average alcohol consumption derived from population surveys and from per capita consumption information [15,26]. However, current PAF calculation methods are based on categorical estimates of consumption with alcohol consumption being corrected by multiplying the two top alcohol consumption categories by the inverse of the estimated undercoverage (per capita consumption/the estimated per capita consumption from the survey) [17]. For most categories of disease where there is an association with volume of alcohol consumption, the dose-response relationship is nonlinear and, thus, distribution estimates of alcohol consumption by age and gender are required for accurate estimates of alcohol PAFs [3].
Given the recent recognition of the need to strengthen and disseminate information about alcohol as outlined in the World Health Organization's strategy to reduce harmful consumption of alcohol [32], there is a need to find an appropriate model for exposure, prevalence, and distribution of alcohol consumption that can easily be modeled to make the fit more compatible with per capita consumption data and that also has properties that make it possible to estimate the exposure distribution for countries that lack survey data except for estimates of prevalence of abstention. Thus, the first aim of this study is to assess internationally if alcohol consistently follows one of the three well-known right-skewed distributions, Log-Normal, Gamma, or Weibull, and to determine if the chosen exposure distribution has a significant effect on the estimation of a PAF, using the PAFs for pancreatitis, diabetes, and breast cancer as examples. The second aim of this study is to investigate if a global relationship between parameters exists so that a distribution of alcohol consumption can be estimated based on mean alcohol consumption.
In the GENACIS surveys, alcohol consumption was measured by a global quantity-frequency measure. In the STEPS surveys, alcohol consumption was measured in standard drinks consumed in the seven days preceding the survey.

Methods for fitting the distributions
As alcohol consumption distributions have been shown to have a unimodal shape [19,37,38], we evaluated the fit of the Log-Normal, Gamma, and Weibull distributions (unimodal distributions commonly used to fit right-skewed empirical data) to determine the most appropriate distribution to model alcohol consumption from national survey data. The Log-Normal, Gamma, and Weibull probability densities are similar in shape, but have significantly different tail behaviors. In the past, alcohol consumption has been more commonly modeled by the Log-Normal distribution as it is used to model continuous random quantities that are right-skewed and is based on the normal distribution, making it easy to fit, test, and modify [20,21]. Although alcohol consumption is frequently modeled using the Log-Normal distribution, empirical distributions often deviate considerably from the Log-Normal model. In comparison, the Gamma and Weibull distributions have a scale parameter and a shape parameter, making them more adaptable since the scale parameter can stretch or compress the distribution.
The Log-Normal distribution is a function of the mean (μ) and standard deviation (σ) parameters, and describes a random variable x where log(x) is normally distributed with mean μ and standard deviation σ. The probability density function of the Log-Normal distribution can be expressed as follows:

$$f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$

where x > 0, -∞ < μ < ∞, and σ > 0.
The Gamma distribution is characterized by a shape parameter (κ) and a scale parameter (θ), and has a mean of κθ and a standard deviation of $\sqrt{\kappa\theta^2}$. The probability density function of the Gamma distribution can be expressed as follows:

$$f(x; \kappa, \theta) = \frac{x^{\kappa - 1} e^{-x/\theta}}{\theta^{\kappa}\,\Gamma(\kappa)}$$

where x > 0, κ > 0, and θ > 0.
The probability density function of the Weibull distribution is expressed as follows:

$$f(x; k, \lambda) = \frac{k}{\lambda} \left(\frac{x}{\lambda}\right)^{k - 1} e^{-(x/\lambda)^{k}}$$

where x > 0, k > 0 is the shape parameter, and λ > 0 is the scale parameter.
Maximum likelihood estimation was used to fit all three distribution models to the drinking population data obtained from GENACIS and ECAS. All missing values were excluded from the fitted models. The Newton-Raphson algorithm was used to optimize the likelihood equations, solving for the maximum likelihood estimates of the unknown parameters [39]. Data values of alcohol consumption over 300 g/day were truncated to 300 g/day. Numerical integration utilizing the trapezoidal rule was used to characterize each distribution.
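The fitting step can be illustrated with a small standard-library sketch. It is not the authors' R implementation: the data are simulated (an assumed Gamma sample with shape 0.8 and scale 20, capped at 300 g/day), the helper names are hypothetical, and the digamma and trigamma functions are approximated by a recurrence plus an asymptotic series. The Gamma shape is obtained by Newton-Raphson on the likelihood equation ln κ - ψ(κ) = ln(mean) - mean(ln x), and the Log-Normal fit uses its closed-form maximum likelihood estimates.

```python
import math
import random


def digamma(x):
    """Digamma via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def trigamma(x):
    """Trigamma via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0)))


def fit_gamma_mle(sample, tol=1e-10, max_iter=50):
    """Maximum likelihood shape and scale of a Gamma distribution."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.log(mean) - sum(math.log(v) for v in sample) / n
    # Closed-form starting value, then Newton-Raphson on the shape equation.
    shape = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    for _ in range(max_iter):
        score = math.log(shape) - digamma(shape) - s
        slope = 1.0 / shape - trigamma(shape)
        new_shape = shape - score / slope
        if new_shape <= 0.0:
            new_shape = shape / 2.0  # keep the iterate inside the domain
        converged = abs(new_shape - shape) < tol
        shape = new_shape
        if converged:
            break
    return shape, mean / shape


def fit_lognormal_mle(sample):
    """Closed-form maximum likelihood estimates of mu and sigma."""
    logs = [math.log(v) for v in sample]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


# Simulated consumption among drinkers in g/day (not survey data).
random.seed(0)
draws = (random.gammavariate(0.8, 20.0) for _ in range(5000))
sample = [min(v, 300.0) for v in draws if v > 0.0]

sample_mean = sum(sample) / len(sample)
shape, scale = fit_gamma_mle(sample)
mu, sigma = fit_lognormal_mle(sample)
lognormal_mean = math.exp(mu + 0.5 * sigma ** 2)
print(f"empirical mean {sample_mean:.2f}, Gamma mean {shape * scale:.2f}, "
      f"Log-Normal mean {lognormal_mean:.2f}")
```

In this sketch the fitted Gamma mean reproduces the empirical mean by construction, because the maximum likelihood scale is the sample mean divided by the shape, whereas the Log-Normal mean is free to drift away from it.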

Method for deriving the alcohol PAF
We performed a sensitivity analysis where the alcohol PAFs for pancreatitis, diabetes, and breast cancer were calculated using a continuous model (Log-Normal, Gamma, and Weibull) and using a categorical model in order to see if the chosen exposure distribution had an effect on the estimation of the alcohol PAF. All PAFs were calculated with zero alcohol consumption as the counterfactual scenario, similarly to the Comparative Risk Analysis for alcohol. Under certain circumstances (light average alcohol consumption without heavy drinking occasions), this counterfactual scenario may not reflect the theoretical minimum risk, depending on the distribution of diseases and causes of death in a society. However, these considerations are not relevant for this paper. The relative risks of lifetime abstainers and former drinkers for pancreatitis, diabetes, and breast cancer were obtained from meta-analyses [40][41][42].
In order to illustrate that the alcohol PAF estimates based on the Gamma distribution model deviated only slightly from the PAF derived from the categorical model, we calculated the difference between the PAFs calculated for both models.

Methods for characterizing the gamma distributions
The Gamma distribution can be characterized by a shape parameter (κ) and a scale parameter (θ), where the mean and the standard deviation of the Gamma distribution can be obtained directly from the parameter estimates as follows:

$$\text{mean} = \kappa\theta, \qquad \text{standard deviation} = \sqrt{\kappa\theta^{2}} = \theta\sqrt{\kappa}$$

Since the mean of the Gamma distribution is equal to the mean of the empirical distribution, the mean of the Gamma distribution does not need to be estimated from the shape and scale parameters.
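A maximum likelihood algorithm (see description above) was used to obtain the shape and scale parameters using the likelihood function for the shape (κ) and scale (θ) parameters of the Gamma distribution:

$$L(\kappa, \theta) = \prod_{i=1}^{N} \frac{x_i^{\kappa - 1} e^{-x_i/\theta}}{\theta^{\kappa}\,\Gamma(\kappa)}$$

where $x_1, \ldots, x_N$ are the observed average daily volumes of alcohol consumption among drinkers.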

Regression analysis
The maximum likelihood method was used to fit a Gamma model in order to summarize the alcohol consumption of 66 countries by gender and age (in total 851 datasets [422 for women; 429 for men]). After the data were fitted by a Gamma model, the relationship between the Gamma mean and the Gamma standard deviation was examined using various general linear models. The performance of the general linear models was then assessed based on how well the assumption of homoscedasticity was upheld and on the distribution of the residuals. All data analyses were performed in R version 2.13.0 [43].
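As an illustration of the regression step, the following sketch fits an ordinary least-squares line of the Gamma standard deviation on the Gamma mean and reports the share of variance explained. It is a simplified stand-in for the analysis performed in R: the function name `fit_line` is hypothetical, the means and standard deviations are invented, and the sketch includes an intercept, which may differ from the model actually fitted.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical Gamma means and standard deviations (g/day), one pair per dataset.
gamma_means = [2.0, 5.0, 8.0, 12.0, 20.0, 30.0]
gamma_sds = [2.6, 5.9, 9.9, 14.2, 23.1, 35.8]

slope, intercept, r_squared = fit_line(gamma_means, gamma_sds)
# Residuals can be inspected against the fitted values to judge homoscedasticity.
residuals = [sd - (intercept + slope * m)
             for m, sd in zip(gamma_means, gamma_sds)]
print(f"slope {slope:.3f}, intercept {intercept:.3f}, R^2 {r_squared:.4f}")
```

Plotting `residuals` against the fitted values is the usual informal check: a funnel shape would suggest that the spread grows with the mean and that the constant-variance assumption is doubtful.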

Modeling alcohol consumption as a distribution
The three distributions, Log-Normal, Gamma, and Weibull, were fit to 41 datasets; parameter estimates are outlined in Table 1 for women and in Table 2 for men. The mean and standard deviation estimates from the empirical data and the estimates from each fitted model are summarized in Table 3 for women and in Table 4 for men. When comparing the empirical mean to each distribution's mean, we observed that the mean estimates from the Weibull distribution were much closer to the empirical mean than were the Log-Normal distribution mean estimates, while the mean estimates from the Gamma distribution were equal to the empirical mean. When comparing the standard deviation estimates, the estimates from the Log-Normal distribution deviated furthest from the empirical data, while there was no statistically significant difference between the empirical standard deviation estimate and the standard deviation estimates from either of the Weibull or the Gamma distributions.
Three countries with diverse economic conditions and drinking patterns, namely Germany, Sri Lanka, and Uganda, were selected to display their density curves (Log-Normal, Gamma, and Weibull) superimposed on the population-based data histograms; see Figures 1, 2, 3, 4, 5, and 6 for both women and men. We observed a common trend among men in Figures 2, 4, and 6: the Log-Normal distribution tended to underestimate the number of men who drank 25 g/day to 50 g/day, whereas the Gamma and Weibull distributions accurately estimated alcohol consumption for these populations. A similar trend was observed with respect to women from Germany and Uganda who drank between 10 g/day and 30 g/day and for Sri Lankan women who drank between 0.5 g/day and 2.0 g/day.
Alcohol PAF estimates modeled using the Log-Normal, Gamma, and Weibull distributions, together with the proportion estimates for lifetime abstainers and former drinkers, are listed in Table 5 for breast cancer (women), Tables 6 and 7 for diabetes (women and men, respectively), and Tables 8 and 9 for pancreatitis (women and men, respectively).
The alcohol PAF estimates that incorporated the Gamma and Weibull distributions are very similar and, for the most part, are within 1% of one another. Since the Log-Normal distribution is known to have a heavy tail, and this study includes data values for alcohol consumption up to 300 g/day, the alcohol PAF estimates from the Log-Normal distribution tend to be much larger and unrealistic when compared to the estimates from the Gamma and Weibull distributions.
Overall, the PAF estimates from the categorical model, Gamma model, and Weibull model are relatively similar when the survey data are more compact, but for those countries where data are more spread out, PAF estimates are more susceptible to sampling bias for diseases with a relatively linear or exponential risk relationship with alcohol, such as pancreatitis and breast cancer. For example, for Brazilian men the alcohol consumption prevalence data tend to be very spread out when compared to men from France, leading to a small difference in the PAFs for pancreatitis. However, this trend does not apply when we look at a disease, such as diabetes, that has a J-shaped relative risk function. If we look at the same example, we find that the alcohol PAFs for diabetes provide similar estimates from the categorical model, Gamma model, Log-Normal model, and Weibull model for men from both Brazil and France. This is due to the fact that the relative risk functions are exponential for pancreatitis and are J-shaped for diabetes and thus have different properties. The J-shaped curve in some cases leads to a negative PAF (which represents the fraction of deaths prevented) as the risk of diabetes at the population level is less under current levels of alcohol consumption than under the counterfactual scenario of no alcohol consumption.
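How a J-shaped risk function can produce a negative PAF is easiest to see in a small worked example. The prevalences and relative risks below are hypothetical and are not taken from the study; the calculation assumes the standard categorical form PAF = Σ P_i (RR_i - 1) / (Σ P_i (RR_i - 1) + 1) with abstainers as the reference category, followed by a light-drinking and a heavy-drinking category.

```latex
\begin{align*}
P &= (0.30,\ 0.50,\ 0.20), \qquad RR = (1.00,\ 0.80,\ 1.30) \\
\sum_{i} P_i (RR_i - 1) &= 0.30(0) + 0.50(-0.20) + 0.20(0.30) = -0.04 \\
\mathrm{PAF} &= \frac{-0.04}{-0.04 + 1} \approx -0.042
\end{align*}
```

Here the protective effect in the large light-drinking category outweighs the excess risk in the smaller heavy-drinking category, so the net attributable fraction is negative.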

Characterizing the alcohol consumption gamma distribution
Based on data from GENACIS and STEPS, the mean daily average per capita alcohol consumption among drinkers was estimated to be 7.549 grams for women (the Gamma standard deviation was 9.862) and 18.292 grams for men (the Gamma standard deviation was 22.015) (see Table 10). After analyzing the association between the Gamma mean and the Gamma standard deviation, a strong linear relationship was established. Analysis of the residuals of various general linear models led to the conclusion that a general linear model with a normal distribution and an identity link (i.e., a linear regression model) is the best possible model to characterize the relationship between the standard deviation of the Gamma distribution and the mean of the Gamma distribution. As a statistical interaction was determined to be present by gender for the relationship between the Gamma mean and the Gamma standard deviation, this linear relationship was modeled separately for men and for women. Figures 7 and 8 illustrate the linear fit for women and men, respectively. The linear regressions indicate that a unit increase in mean alcohol consumption is associated with an increase of 1.258 (95% CI: 1.223 to 1.293) in the standard deviation of the Gamma alcohol consumption distribution for women and 1.171 (95% CI: 1.144 to 1.197) in the standard deviation of the Gamma alcohol consumption distribution for men. Additionally, for women the linear regression indicated that 92.07% of the variation of the standard deviation of the Gamma distribution was explained by the mean, while for men 94.74% of the variation of the standard deviation of the Gamma distribution was explained by the mean.
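The practical consequence of this relationship is that the Gamma parameters can be recovered from the mean alone. The sketch below assumes a regression through the origin, so that the standard deviation equals the reported slope times the mean; the text reports only the slopes, so the zero intercept is an assumption of this illustration, and the function name is hypothetical. Under that assumption κ = mean²/SD² and θ = SD²/mean follow from mean = κθ and SD = θ√κ.

```python
import math


def gamma_params_from_mean(mean, sd_per_unit_mean):
    """Shape and scale implied by a mean and an SD proportional to that mean."""
    sd = sd_per_unit_mean * mean  # assumes a zero-intercept relationship
    shape = mean ** 2 / sd ** 2
    scale = sd ** 2 / mean
    return shape, scale


# Slopes and mean consumption among drinkers as reported in the text.
shape_men, scale_men = gamma_params_from_mean(18.292, 1.171)
shape_women, scale_women = gamma_params_from_mean(7.549, 1.258)
print(f"men:   shape {shape_men:.3f}, scale {scale_men:.3f}")
print(f"women: shape {shape_women:.3f}, scale {scale_women:.3f}")
```

Under this assumption the shape depends only on the slope (κ = 1/slope²), so it is the same for every country within a gender, and differences in mean consumption are absorbed entirely by the scale parameter.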
Regression diagnostics indicated that there were some outliers. For women, two data points from Nigeria and one from Uganda were identified as influential observations, while for men, two observations in Germany and one in Nigeria were identified as influential observations. There was no indication of a lack of homoscedasticity for any of the regression models (Additional file 1).

Discussion
Both the Gamma and the Weibull distributions summarized the population distribution of average volume of alcohol consumption more accurately than did the Log-Normal distribution. Moreover, for the Gamma and Weibull distributions the ratio of mean to standard deviation was comparable across all countries, irrespective of drinking patterns and the survey measure used to measure alcohol consumption. Overall, both the Gamma and Weibull distributions yield similar PAFs and could be used in descriptive alcohol epidemiology. Although not examined specifically, these outcomes would also apply to PAFs that are calculated when using a counterfactual scenario where alcohol consumption is decreased due to a policy or intervention such as taxation. Since the Weibull distribution is a more complicated distribution and less flexible than the Gamma distribution, and since it is possible to shift the Gamma distribution upwards (necessary in modeling the burden of disease attributable to alcohol consumption), the Gamma distribution is the best distribution for modeling alcohol consumption.
Modeling survey alcohol consumption data alone without correcting the distribution for undercoverage will lead to inaccurate alcohol PAFs, as self-reported survey data typically underestimate alcohol consumption relative to estimates based on sales or taxation data (e.g., [26]). In other words, alcohol surveys often do not accurately represent the population due to undercoverage where some members of the population are inadequately represented (or excluded) or due to response bias [30]. Accordingly, a method must be developed that will shift the exposure distribution so that it is consistent with per capita consumption data in order to correct for survey bias and allow for a more accurate estimation of the true alcohol consumption distribution and for an accurate comparison of the alcohol-attributable burden of disease across countries.
Given the relationship between the mean and the standard deviation of alcohol consumption [15], modeling alcohol consumption using the Gamma distribution, upshifting this distribution using the relationship between the mean and the standard deviation, and using per capita consumption data allows us to correct for the biases that lead to undercoverage (for specifics on the upshifting methods see [15]) and allows for the estimation of the distribution of alcohol consumption in a country as if it were measured by a survey with a much higher coverage rate. Additionally, based on the relationship between the mean and the standard deviation of the alcohol consumption Gamma distribution, we can use the mean alcohol consumption from sales and taxation data to obtain the κ and θ parameters for the alcohol exposure distribution for those countries where no survey data exist.
Given the great variations in the populations surveyed, and in the sampling frame, response rate, and coverage rate for each of the individual surveys within the main survey groups of GENACIS, ECAS, and STEPS, our observations that alcohol consumption can best be modeled through a Gamma distribution and that the mean is highly correlated with the standard deviation of the alcohol consumption Gamma distribution indicate that these results are applicable to a wide range of countries and are valid for population surveys that use different methodologies.
An interesting finding from our study was the identification as outliers of some of the observations from Nigeria. This could be due to multiple factors. The number of observations from Nigeria upon which the mean and the standard deviation of the alcohol consumption Gamma distribution are based is smaller than the number of observations from other countries. A further factor is that the relationship between the mean and standard deviation of the alcohol consumption Gamma distribution for Nigeria may be different when compared to other countries.
Given that only some age groups in Nigeria were identified by the regression diagnostics as outliers, it is very likely that these outliers were due to the low number of individuals surveyed in Nigeria. Future research will focus on modeling alcohol consumption by global region (such as by using the 2005 Comparative Risk Assessment regions [44]) to see if there are regional differences in the relationship between the mean and the standard deviation of the alcohol consumption Gamma distribution.