A novel method for estimating distributions of body mass index
- Marie Ng^{1, 2},
- Patrick Liu^{1},
- Blake Thomson^{1, 3} and
- Christopher J. L. Murray^{1}
https://doi.org/10.1186/s12963-016-0076-2
© Ng et al. 2016
- Received: 1 April 2015
- Accepted: 7 March 2016
- Published: 12 March 2016
Abstract
Background
Understanding trends in the distribution of body mass index (BMI) is a critical aspect of monitoring the global overweight and obesity epidemic. Conventional population health metrics often only focus on estimating and reporting the mean BMI and the prevalence of overweight and obesity, which do not fully characterize the distribution of BMI. In this study, we propose a novel method which allows for the estimation of the entire distribution.
Methods
The proposed method uses the L-BFGS-B optimization algorithm to derive the distribution of BMI from three commonly available population health statistics: mean BMI, prevalence of overweight, and prevalence of obesity. We conducted a series of simulations to examine the properties, accuracy, and robustness of the method. We then illustrated its practical application using the 2011–2012 US National Health and Nutrition Examination Survey (NHANES).
Results
Our method performed satisfactorily across various simulation scenarios, yielding empirical (estimated) distributions which aligned closely with the true distributions. Application of the method to the NHANES data also showed a high level of consistency between the empirical and true distributions. In situations where there were considerable outliers, the method was less satisfactory at capturing the extreme values. Nevertheless, it remained accurate at estimating the central tendency and quantiles.
Conclusion
The proposed method offers a tool that can efficiently estimate the entire distribution of BMI. The ability to track the distributions of BMI will improve our capacity to capture changes in the severity of overweight and obesity and enable us to better monitor the epidemic.
Keywords
- BMI distribution
- Overweight
- Obesity
- L-BFGS-B optimization
- Beta distribution
Introduction
Overweight and obesity are growing health problems worldwide. In 2013, nearly one third of the world’s population was either overweight or obese [1]. Concern regarding the rising disease burden associated with obesity has become nearly universal, and widespread calls have been made for more consistent and accurate monitoring in all populations [2].
Conventional strategies for monitoring population-level overweight and obesity rely on obtaining point estimates, including mean body mass index (BMI) or prevalence of overweight (BMI ≥ 25) and obesity (BMI ≥ 30) [3, 4]. Mean and prevalence are succinct metrics which provide useful insight into distinct aspects of a population’s distribution of BMI. In addition, these measures are easily interpreted by the general public. However, to rigorously monitor the rapidly evolving obesity epidemic, simply observing measures of mean and prevalence is not adequate. Specifically, as the proportion of overweight and obesity increases, the distribution of BMI will become skewed. This, in turn, affects the ability of the mean to accurately reflect the central tendency of the distribution [5–7]. If the goal is simply to obtain a more accurate estimate of central tendency, it may be sufficient to replace the mean with the median. However, as the epidemic intensifies, there is a growing interest in understanding the shift in the BMI distribution and in tracking changes across subclasses of obesity, which include class I (BMI: 30–34.9), class II (BMI: 35–39.9), and class III obesity (BMI ≥ 40) [8]. Furthermore, understanding the population distribution of BMI is critical to estimating the associated disease burden. To calculate the population attributable fraction of diseases related to high BMI, for instance, one would need an accurate measure of exposure, represented by the population BMI distribution [9]. Therefore, there is a practical need to look beyond measures of mean and prevalence and to monitor the distribution of BMI as a whole.
Monitoring the population distribution of BMI is a challenging task. Existing national surveillance systems do not always include a sample size sufficient for precise approximations of BMI distributions by subpopulation, such as by sex and age [10]. A direct solution would be to increase survey sample sizes. However, given the need for regular and timely monitoring, increasing sample sizes will be costly and may not be sustainable in the long run. It would, therefore, be highly desirable to develop a strategy that can effectively use available point estimates from surveys to infer the underlying distribution.
In this study, we propose a novel method that utilizes an optimization algorithm to approximate the distribution of BMI using the three commonly available population-level metrics: mean BMI, prevalence of overweight, and prevalence of obesity. The paper is organized as follows: We first provide a brief description of the proposed method. We then describe the simulation experiment used to validate the method and present the results. To illustrate the utility of the method, we apply it to the 2011–2012 US National Health and Nutrition Examination Survey (NHANES) and compare our estimate with the true distribution of BMI. We conclude by discussing the potential extension, limitations, and implications of the method.
Methods
Rationale
The characteristics of a continuous distribution are defined by its probability density function (pdf). The parameters involved in the pdf vary by distribution. For instance, a normal distribution has a pdf defined by a measure of central tendency (μ) and a measure of dispersion (σ^{2}). A beta distribution, on the other hand, has a pdf defined by two shape parameters, namely α and β. Although estimates of these parameters are not always immediately available, for a two-parameter distribution they can generally be derived from two known statistics, such as a mean and a quantile.
In the case of BMI, three statistics which are commonly available from existing surveys are mean BMI, prevalence of overweight, and prevalence of obesity. They respectively provide information on central tendency and on specific quantiles (the proportions of the population at or above BMI 25 and BMI 30). Based on this information and assumptions about the potential family of distributions, parameters can be obtained analytically. For example, if a normal distribution is assumed, μ can be immediately inferred from the sample mean. σ^{2} (assuming that sample variance information is not immediately available) can be calculated from the mean and a quantile using inverse z scores. Suppose the prevalence of obesity is 0.025; if BMI is normally distributed, the z-value corresponding to the obesity threshold would be 1.96. Using the standard z-score formula, \( z=\frac{X-\mu }{\sigma } \), with z = 1.96, X = 30, and μ being the sample mean, σ can be obtained as \( \sigma =\frac{X-\mu }{z} \), and hence σ^{2}. Once μ and σ^{2} are defined, the shape of the distribution is fully determined.
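As a worked illustration of the normal-distribution case described above, the following Python sketch recovers σ from a mean and an obesity prevalence. The function name `sigma_from_prevalence` and the example mean of 24 are assumptions chosen for illustration; they are not taken from the study, whose analyses were run in R.

```python
from statistics import NormalDist


def sigma_from_prevalence(mean_bmi, prevalence, threshold=30.0):
    """Solve z = (X - mu) / sigma for sigma under a normal assumption.

    `prevalence` is the share of the population at or above `threshold`,
    so the matching z-value is the (1 - prevalence) quantile of N(0, 1).
    """
    z = NormalDist().inv_cdf(1.0 - prevalence)
    if z == 0:
        raise ValueError("a prevalence of 0.5 places the threshold at the mean")
    sigma = (threshold - mean_bmi) / z
    if sigma <= 0:
        raise ValueError("mean and prevalence are inconsistent under normality")
    return sigma


# Hypothetical inputs: mean BMI of 24 and obesity prevalence of 0.025.
sigma = sigma_from_prevalence(24.0, 0.025)
print(round(sigma, 2))  # 3.06
```

With a prevalence of 0.025 the quantile is 0.975, which gives z ≈ 1.96 as in the text, so σ ≈ (30 − 24) / 1.96.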
Skewness and kurtosis of BMI distributions from six countries
ISO3 | Survey | Year | Sex | Skewness | Kurtosis |
---|---|---|---|---|---|
UGA | DHS | 2011 | Male | 0.82 | 4.87 |
UGA | DHS | 2011 | Female | 1.37 | 8.96 |
IND | DHS | 2005 | Male | 1.26 | 7.17 |
IND | DHS | 2005 | Female | 1.43 | 7.29 |
SAU | Saudi Arabia HIS | 2013 | Male | 0.77 | 4.22 |
SAU | Saudi Arabia HIS | 2013 | Female | 0.68 | 3.94 |
DOM | DHS | 2013 | Male | 1.10 | 5.72 |
DOM | DHS | 2013 | Female | 0.90 | 4.37 |
GBR | Health Survey for England | 2011 | Male | 0.94 | 5.56 |
GBR | Health Survey for England | 2011 | Female | 1.02 | 4.47 |
USA | NHANES | 2011 | Male | 1.13 | 5.59 |
USA | NHANES | 2011 | Female | 1.27 | 5.84 |
Estimation of BMI distribution
To estimate the distribution of BMI, we assume that:
BMI = C_{1}u + C_{2}, C_{1} > 0, C_{2} ≥ 10
where C_{1} is a positive scaling constant and C_{2} is a shifting constant. Note that a constraint of greater than or equal to 10 was imposed on C_{2}. Because the lower limit of a population BMI distribution rarely falls below 10, imposing this constraint enhances the accuracy of the optimization results. Here, u is a random variable following the beta distribution, with values ranging from zero to one:
u ∼ Beta (α, β), α > 1, β > 1
where α and β are the shape parameters. When α > 1, β > 1, and α = β, the beta distribution is unimodal and symmetric. When α > 1, β > 1, and α < β , the distribution is unimodal and positively skewed. In contrast with other distributions such as log normal and gamma, a beta distribution is relatively light-tailed and provides more stable estimation at the tails of the distribution.
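To make the role of the four parameters concrete, the sketch below computes the mean, standard deviation, and skewness implied by BMI = C_{1}u + C_{2} using the standard closed-form moments of the beta distribution. The function name and the parameter values are hypothetical and serve only to illustrate the symmetric (α = β) and positively skewed (α < β) cases.

```python
import math


def scaled_beta_moments(alpha, beta, c1, c2):
    """Mean, SD and skewness of BMI = c1 * u + c2 with u ~ Beta(alpha, beta)."""
    total = alpha + beta
    mean_u = alpha / total
    var_u = alpha * beta / (total ** 2 * (total + 1))
    # Skewness is unchanged by a positive scale (c1 > 0) and a shift (c2).
    skew = (2 * (beta - alpha) * math.sqrt(total + 1)
            / ((total + 2) * math.sqrt(alpha * beta)))
    return c1 * mean_u + c2, c1 * math.sqrt(var_u), skew


# Illustrative (hypothetical) parameter values.
symmetric = scaled_beta_moments(4.0, 4.0, 40.0, 10.0)   # alpha = beta
right_skewed = scaled_beta_moments(3.0, 9.0, 60.0, 10.0)  # alpha < beta
print(symmetric)
print(right_skewed)
```

Because C_{1} only rescales and C_{2} only shifts, the skewness of the BMI distribution is governed entirely by α and β.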
The parameters α, β, C_{1}, and C_{2} are estimated by minimizing the distance between the observed and predicted statistics:
\( \underset{\alpha, \beta, {C}_1, {C}_2}{\arg \min}\ D\left(s,t\right),\quad D\left(s,t\right)=\sqrt{\sum_{i=1}^3{\left({s}_i-{t}_i\right)}^2} \)
where D refers to the Euclidean distance, which is the straight-line distance between two points, in this case the two vectors s and t. s is a vector consisting of the observed mean BMI (\( {\overline{X}}_{bmi} \)) and prevalence of overweight and obesity (\( {\overline{p}}_{bmi\ge 25},{\overline{p}}_{bmi\ge 30} \)); t is a vector consisting of the predicted mean BMI (\( {\tilde{X}}_{bmi} \)) and prevalence (\( {\tilde{p}}_{bmi\ge 25},{\tilde{p}}_{bmi\ge 30} \)) for a given set of α, β, C_{1}, and C_{2}. The large-scale bound-constrained optimization algorithm (L-BFGS-B) was used for the optimization [11]. Other optimization algorithms, including the conjugate gradient, Nelder-Mead, and Broyden-Fletcher-Goldfarb-Shanno algorithms, were considered. However, L-BFGS-B was chosen as it provided a much more efficient optimization process and more stable results.
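The distance between the observed vector s and the predicted vector t can be sketched in Python as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names are hypothetical, the upper-tail probabilities of the beta distribution are approximated by composite Simpson integration (an assumption made to avoid external libraries), and the example inputs are invented.

```python
import math


def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x, computed on the log scale."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # correct at the endpoints because alpha > 1 and beta > 1
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(x)
                    + (beta - 1) * math.log(1 - x))


def beta_upper_tail(x, alpha, beta, n=2000):
    """P(u >= x) by composite Simpson integration over [x, 1] (n must be even)."""
    x = min(max(x, 0.0), 1.0)
    if x == 1.0:
        return 0.0
    h = (1.0 - x) / n
    total = beta_pdf(x, alpha, beta) + beta_pdf(1.0, alpha, beta)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * beta_pdf(x + i * h, alpha, beta)
    return total * h / 3


def predicted_stats(alpha, beta, c1, c2):
    """Vector t: predicted mean BMI, P(BMI >= 25), P(BMI >= 30)."""
    mean = c1 * alpha / (alpha + beta) + c2
    return (mean,
            beta_upper_tail((25.0 - c2) / c1, alpha, beta),
            beta_upper_tail((30.0 - c2) / c1, alpha, beta))


def objective(params, observed):
    """Euclidean distance D(s, t) for params = (alpha, beta, c1, c2)."""
    return math.dist(observed, predicted_stats(*params))


# Hypothetical observed statistics s and a hypothetical starting point.
observed = (27.0, 0.60, 0.25)
start = (2.0, 3.0, 40.0, 12.0)
print(objective(start, observed))
```

To complete the fit, `objective` would be handed to a bound-constrained optimizer with bounds reflecting α > 1, β > 1, C_{1} > 0, and C_{2} ≥ 10; in Python, for example, SciPy's `scipy.optimize.minimize` accepts `method="L-BFGS-B"` together with a `bounds` argument. That step is left out here to keep the sketch free of third-party dependencies.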
Simulation
Descriptive statistics of the distributions considered for simulation
 | Normal | Log normal^{a} | Gamma^{a} |
---|---|---|---|
Parameters | μ = 24 (mean); σ = 4 (standard deviation) | Log(μ) = 0 (log mean); Log(σ) = 0.1 (log standard deviation) | κ = 1 (shape); θ = 2 (scale) |
Mean | 24 | | |
Standard deviation | 4 | | |
Skewness | 0.007 | 0.31 | 1.95 |
Kurtosis | 2.95 | 3.12 | 8.47 |
A random sample of 500 observations was drawn from each of the three distributions. The sample mean and the prevalence of overweight and obesity were calculated. Based on these three statistics, we applied the proposed method to approximate the distribution of BMI. The process was repeated 1000 times for each of the distributions.
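The sampling step for the normal scenario can be sketched as below. Only the normal case (μ = 24, σ = 4) is shown; the function names and the fixed random seed are illustrative assumptions, and the step that fits the beta-based distribution to each replicate is not included.

```python
import random
from statistics import fmean


def summarize(sample):
    """Return the three inputs of the method: mean, P(BMI >= 25), P(BMI >= 30)."""
    n = len(sample)
    return (fmean(sample),
            sum(x >= 25 for x in sample) / n,
            sum(x >= 30 for x in sample) / n)


def simulate_normal(replications=1000, n=500, mu=24.0, sigma=4.0, seed=0):
    """Draw `replications` samples of size n from N(mu, sigma) and summarize each."""
    rng = random.Random(seed)  # fixed seed is an illustrative choice
    return [summarize([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(replications)]


results = simulate_normal()
avg_mean = fmean(r[0] for r in results)
avg_overweight = fmean(r[1] for r in results)
avg_obese = fmean(r[2] for r in results)
print(round(avg_mean, 2), round(avg_overweight, 3), round(avg_obese, 3))
```

Each element of `results` is one (mean, overweight prevalence, obesity prevalence) triple, which is the only information the proposed method would receive from that replicate.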
In addition to calculating bias and mean squared errors, the Kolmogorov-Smirnov test was performed to examine how well the predicted distributions matched the actual distributions of the sample. We computed the proportion of replications in which the test falsely rejected the null hypothesis that the empirical and true distributions are equal, at a significance level of 0.05.
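The evaluation metrics named above can be sketched in a few lines of Python. This is an assumption-laden illustration: it uses the one-sample form of the Kolmogorov-Smirnov statistic and the large-sample critical value 1.358/√n for a 0.05 significance level, whereas the text does not say which variant of the test was run in R. All function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def rejects_at_5_percent(sample, cdf):
    """Large-sample approximation: reject when D > 1.358 / sqrt(n)."""
    return ks_statistic(sample, cdf) > 1.358 / math.sqrt(len(sample))


def bias_and_mse(estimates, truth):
    """Mean error and mean squared error of estimates around a true value."""
    errors = [e - truth for e in estimates]
    return fmean(errors), fmean(e * e for e in errors)


# Hypothetical BMI values compared against a N(24, 4) reference distribution.
example = [20.1, 23.5, 24.2, 27.8, 31.0]
print(ks_statistic(example, NormalDist(24, 4).cdf))
print(bias_and_mse([23.9, 24.2, 24.1], 24.0))
```

Averaging `rejects_at_5_percent` over all replications would give a false rejection rate comparable in spirit to the one reported for the simulations.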
Applied example
Further validation was performed using data from the 2011–2012 NHANES. Specifically, based on mean and prevalence information by age and sex, we estimated the distributions of BMI for males and females for each 10-year age group from 20 to 70+ years old. The empirical distributions were compared against the distribution of actual data using the Kolmogorov-Smirnov test.
All analyses were conducted in R 3.0.1.
Results and discussion
Simulation
Biases and mean squared errors (in parentheses) in estimated parameters, and Kolmogorov-Smirnov test results
Statistic | Normal | Log normal | Gamma |
---|---|---|---|
Bias (MSE) of \( \overline{X} \) | −0.006 (0.058) | −0.010 (0.062) | −0.036 (0.068) |
Bias (MSE) of SD | −0.121 (0.035) | −0.070 (0.035) | −0.026 (0.146) |
Bias (MSE) of \( {\widehat{p}}_{\ge 25} \) | 0.001 (0.001) | 0.001 (0) | 0.016 (0.001) |
Bias (MSE) of \( {\widehat{p}}_{\ge 30} \) | 0 (0) | −0.002 (0) | 0.003 (0) |
Kolmogorov-Smirnov test false rejection rate | 2.5 % | 2.3 % | 24.9 % |
Applied example
Kolmogorov-Smirnov test statistics and p-values (in parentheses) comparing the empirical distribution to the NHANES data distribution
Age group | Male | Female |
---|---|---|
20-29 | 0.043 (p = 0.755) | 0.069 (p = 0.218) |
30-39 | 0.064 (p = 0.314) | 0.066 (p = 0.280) |
40-49 | 0.046 (p = 0.772) | 0.056 (p = 0.490) |
50-59 | 0.048 (p = 0.721) | 0.081 (p = 0.099) |
60-69 | 0.068 (p = 0.475) | 0.043 (p = 0.833) |
70+ | 0.036 (p = 0.958) | 0.063 (p = 0.381) |
Conclusions
In this study, we proposed a novel method to approximate the entire distribution of BMI using three commonly available statistics, namely mean BMI and prevalence of overweight and obesity. We assessed the method using a series of simulations, and the results indicated that the method performed well in approximating distributions with a wide range of skewness and kurtosis. We illustrated the application of the method using data from NHANES, which similarly demonstrated the accuracy of the approach.
A major appeal of the proposed method lies in its use of readily available health statistics. Distributions of BMI can be approximated without the need to collect a large amount of data. Moreover, past BMI distributions can be retrospectively constructed using historical information on mean BMI and prevalence of overweight and obesity. In addition, the current method is robust and can adequately estimate distributions which do not conform with the underlying distribution (beta distribution) assumed by the method. As part of the Global Burden of Disease Study 2013, we applied the proposed method to historical data to reconstruct the BMI distributions by age and sex for 192 countries from 1980 to 2013. Without utilizing the new approach, obtaining precise BMI distributions would have been impossible as in many countries historical individual-level BMI data were unavailable [13].
One limitation of this method, however, is the reduction in accuracy when the true distribution contains outliers. Specifically, our method may be inadequate at capturing outliers at the tails of a distribution. This limitation may be due to the assumption of the beta distribution in our approximation strategy. Although the beta distribution offers the flexibility to model a wide variety of distributional shapes, it is relatively weak at handling extreme kurtosis. Alternative distributions such as the log normal and the gamma are better able to capture long, heavy-tailed distributions. Nevertheless, the lack of finite upper bounds for these distributions posed challenges in the optimization process, which led to instability in estimation.
Despite this limitation, results from our simulations showed that the prevalence estimates for overweight and obesity derived from the empirical distributions were nearly unbiased (absolute bias of at most 0.016). This implies that, although the method may be limited in identifying the precise BMI value of outliers, it is able to offer an accurate approximation of the proportion of extreme values. Additionally, it is worth emphasizing that the design of the method is very flexible. For this simulation, three values were utilized in the optimization function. Additional statistics, such as prevalence of underweight and prevalence of different obesity classes, could easily be incorporated into the method to improve the accuracy of the distribution approximation.
In summary, the algorithm proposed in this paper serves as an efficient method to approximate BMI distributions. In fact, this algorithm can be applied to estimating the distribution of other continuous risk factors such as blood pressure and glucose level and facilitate more accurate assessment of associated disease burden. While the method performed well in various situations, some aspects can be improved. Future studies can explore non-parametric density approximation techniques to expand the flexibility of the method.
Declarations
Acknowledgment
This work was funded by a grant from the Bill & Melinda Gates Foundation. The funders had no role in study design, data collection and analysis, interpretation of data, decision to publish, or preparation of the manuscript. The corresponding author had full access to all data analyzed and had final responsibility for the decision to submit this original research paper for publication.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Ng M, Fleming T, Robinson M, Thomson B, Graetz N, Margono C, Mullany EC, Biryukov S, Abbafati C, Abera SF, Abraham JP, Abu-Rmeileh NME, Achoki T, AlBuhairan FS, Alemu ZA, Alfonso R, Ali MK, Ali R, Guzman NA, Ammar W, Anwari P, Banerjee A, Barquera S, Basu S, Bennett DA, Bhutta Z, Blore J, Cabral N, Nonato IC, Chang J-C, et al. Global, regional, and national prevalence of overweight and obesity in children and adults during 1980–2013: a systematic analysis for the Global Burden of Disease Study 2013. The Lancet. 2014;384:766–81.
- Ogden CL, Carroll MD, Kit BK, Flegal KM. Prevalence of childhood and adult obesity in the United States, 2011–2012. JAMA. 2014;311:806–14.
- WHO | Physical status: the use and interpretation of anthropometry [http://www.who.int/childgrowth/publications/physical_status/en/]
- World Health Organization. Obesity: preventing and managing the global epidemic. Report of WHO consultation. World Health Organ Tech Rep Ser. 2000;894:i–xii, 1–253.
- Penman AD, Johnson WD. The changing shape of the body mass index distribution curve in the population: implications for public health policy to reduce the prevalence of adult obesity. Prev Chronic Dis. 2006;3.
- Razak F, Corsi DJ, Subramanian SV. Change in the Body Mass Index Distribution for Women: Analysis of Surveys from 37 Low- and Middle-Income Countries. PLoS Med. 2013;10:e1001367.
- Flegal KM, Troiano RP. Changes in the distribution of body mass index of adults and children in the US population. Int J Obes Relat Metab Disord. 2000;24:807.
- Katzmarzyk PT. Prevalence of class I, II and III obesity in Canada. Can Med Assoc J. 2006;174:156–7.
- Vander Hoorn S, Ezzati M, Rodgers A, Lopez AD, Murray CJ. Estimating attributable burden of disease from exposure and hazard data. Comp Quantif Health Risks Glob Reg Burd Dis Attrib Sel Major Risk Factors Geneva World Health Organ. 2004;2129–40.
- Wijnhoven TM, van Raaij JM, Spinelli A, Starc G, Hassapidou M, Spiroski I, Rutter H, Martos É, Rito AI, Hovengen R, et al. WHO European Childhood Obesity Surveillance Initiative: body mass index and level of overweight among 6-9-year-old children from school year 2007/2008 to school year 2009/2010. BMC Public Health. 2014;14:806.
- Byrd R, Lu P, Nocedal J, Zhu C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J Sci Comput. 1995;16:1190–208.
- Robey RR, Barcikowski RS. Type I error and the number of iterations in Monte Carlo studies of robustness. Br J Math Stat Psychol. 1992;45:283–8.
- Forouzanfar MH, Alexander L, Anderson HR, Bachman VF, Biryukov S, Brauer M, Burnett R, Casey D, Coates MM, Cohen A, Delwiche K, Estep K, Frostad JJ, Kc A, Kyu HH, Moradi-Lakeh M, Ng M, Slepak EL, Thomas BA, Wagner J, Aasvang GM, Abbafati C, Ozgoren AA, Abd-Allah F, Abera SF, Aboyans V, Abraham B, Abraham JP, Abubakar I, Abu-Rmeileh NME, et al. Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks in 188 countries, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013. The Lancet. 2015;386:2287–323.