The use of income information of census enumeration area as a proxy for the household income in a household survey
© Gomes et al; licensee BioMed Central Ltd. 2009
Received: 12 December 2008
Accepted: 22 September 2009
Published: 22 September 2009
Some of the Census Enumeration Areas' (CEA) information may help planning the sample of population studies but it can also be used for some analyses that require information that is more difficult to obtain at the individual or household level, such as income. This paper verifies if the income information of CEA can be used as a proxy for household income in a household survey.
A population-based survey conducted from January to December 2003 obtained data from a probabilistic sample of 1,734 households of Niterói, Rio de Janeiro, Brazil. Uniform semi-association models were adjusted in order to obtain information about the agreement/disagreement structure of data. The distribution of nutritional status categories of the population of Niterói according to income quintiles was performed using both CEA- and household-level income measures and then compared using Wald statistics for homogeneity. Body mass index was calculated using body mass and stature data measured in the households and then used to define nutritional status categories according to the World Health Organization. All estimates and statistics were calculated accounting for the structural information of the sample design and a significance level lower than 5% was adopted.
The classification of households in the quintiles of household income was associated with the classification of these households in the quintiles of CEA income. The distribution of the nutritional status categories in all income quintiles did not differ significantly according to the source of income information (household or CEA) used in the definition of quintiles.
The structure of agreement/disagreement between quintiles of the household's monthly per capita income and quintiles of the head-of-household's mean nominal monthly income of the CEA, as well as the results produced by these measures when they were associated with the nutritional status of the population, showed that the CEA's income information can be used when income information at the individual or household levels is not available.
The place of health on the international agenda for development has been broadened and health inequalities between and within countries have become a topic of great interest [2–5]. The concept of health inequalities includes the presence of unfair, avoidable, or remediable health differences among populations or specific groups defined according to social, economic, demographic or geographic criteria. It implies a failure in avoiding or overcoming these differences that overlooks basic human rights.
For these reasons, population surveys commonly collect socioeconomic information, either when the purpose is exploratory or descriptive (and this information becomes the main focus) or when socioeconomic information is to be associated with outcomes or other variables of interest.
Income and education are the most commonly used variables to characterize and/or discriminate among socioeconomic groups. However, the collection of this information, particularly income, is sometimes difficult and can be influenced by other factors in population-based studies. These interferences may result in either total failure to obtain it or misreporting (under- or overestimation).
In Brazil, the Census Enumeration Areas (CEA) are used to collect the data of the Brazilian Demographic Census but they are also used as clusters of households for other population-based surveys. They are defined as contiguous groups of approximately 300 households respecting administrative and political boundaries and identified by stable and easily located reference points. Some of the CEA's information may help in planning the sample of such studies but it can also be used for some analyses that require information that is difficult to obtain at the individual or household level, such as income.
Although the use of this kind of information would be especially useful in developing countries, the few available studies in this area found in the literature were conducted exclusively in high-income countries in North America or Europe and in Australia [10–19]. There are remarkable differences in the methods of these studies, such as the definition of area levels, the independent and outcome variables adopted, and the statistical analysis, all of which hinder detailed comparisons. Most studies are interested in substituting individual-level [10, 12, 13, 15–17, 19], and more rarely household-level, information with the area-level information available in the census. Variables used to describe socioeconomic status vary from self-reported income data to socioeconomic scales. The variety of health outcomes analyzed makes it difficult to generalize. Furthermore, the studies analyze the data using different procedures such as factor analysis, log-linear models, and estimation of correlation, agreement and reliability indexes such as intra-class correlation, Cohen's kappa or Kendall's coefficient of concordance. As a result, it is difficult to explain or predict the situations for which the different levels of socioeconomic information (e.g. CEA, household, individual) would produce similar results.
Additionally, no study has empirically compared the trade-offs in terms of cost savings, potential bias or loss of accuracy due to the use of area-level instead of individual- or household-level information. It has been suggested that the census-aggregated information is complementary because it may have a different construct meaning, depending on how it is defined in association with the health outcomes . In middle- and low-income countries, savings would probably surpass bias and accuracy costs.
The gap between the year of the census (every 10 years in Brazil) and the year of a given survey may play a crucial role in the socioeconomic characteristics of the population. Additionally, the fact that some countries' economic growth may be stationary, or that social mobility is very limited, may facilitate the comparisons because there may not be substantial changes in family or individual income or socioeconomic status between the year of the census and the survey. On the other hand, if the country's economic growth is reflected in individual and family income, one may not be able to use the census information.
The purpose of the present study was to assess the validity of household income data from CEA to represent household income obtained in a household survey. In practical terms, it sought to verify if the CEA income information could be used as a proxy for household income in a household survey conducted to assess the nutritional status of the population of Niterói, a city in the state of Rio de Janeiro, Brazil.
The Nutrition, Physical Activity and Health Survey (Pesquisa de Nutrição, Atividade Física e Saúde - PNAFS) was the first household survey conducted to assess the nutritional status and health conditions of adolescents and adults living in Niterói, Rio de Janeiro, Brazil. Data collection was carried out between January and December 2003. Niterói, located in the metropolitan region of Rio de Janeiro, had 459,451 inhabitants in 2000, according to the last Brazilian census.
To guarantee the representativeness of the population of Niterói, a probabilistic sample of households was designed. The households were selected from the 2000 population CEA listing (there are 696 CEA with an average of 216 households per CEA in Niterói)  in two stages (CEA and household).
In the first stage, 110 CEA were selected, systematically, with probabilities proportional to the number of permanent private households. Prior to selection, the CEA were ordered from lowest to highest according to the head-of-household's mean nominal monthly income, thus implicitly stratifying the CEA by mean income and ensuring the selection of CEA from all income levels.
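The first-stage selection described above can be illustrated with a minimal sketch, assuming a plain systematic draw with probability proportional to size over a list sorted by mean income. The function name `systematic_pps`, the seed, and the simulated CEA list are hypothetical and are not taken from the survey; the sketch only shows how sorting before a systematic draw produces implicit stratification.

```python
import random

def systematic_pps(units, n, seed=0):
    """Systematic selection with probability proportional to size.

    units: list of (unit_id, mean_income, n_households) tuples (hypothetical layout).
    Sorting by mean income before the draw spreads the sample over all
    income levels (implicit stratification).
    """
    ordered = sorted(units, key=lambda u: u[1])
    total = sum(u[2] for u in ordered)
    step = total / n                       # sampling interval
    rng = random.Random(seed)
    start = rng.random() * step            # random start in [0, step)
    targets = [start + k * step for k in range(n)]

    selected, cumulative, t = [], 0.0, 0
    for unit_id, _, size in ordered:
        cumulative += size
        # A unit is selected once for each target falling in its size range.
        while t < n and targets[t] < cumulative:
            selected.append(unit_id)
            t += 1
    return selected

# Simulated frame: 696 CEA with made-up incomes and household counts.
rng = random.Random(1)
ceas = [(i, rng.uniform(200, 5000), rng.randint(100, 400)) for i in range(696)]
sample = systematic_pps(ceas, n=110)
print(len(sample))
```

Because every simulated CEA is smaller than the sampling interval, no CEA can be drawn twice in this example; with very large units a repeated selection would be possible and would need separate handling.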
In the second stage, 16 households were selected in each CEA with equal probability, using an inverse sampling procedure, analogous to that applied in the World Health Survey in Brazil, leading to a sample size of 1,734 households (3,619 subjects) after the refusal of 26 households to participate in the study.
The sample weights were calculated as the product of inverse selection probabilities in each stage, using the estimator proposed by Haldane adapted to be used in household surveys. To reduce selection biases, common in household surveys, the natural sampling weights were calibrated to provide estimates that coincide with known population totals.
The calibration post-strata were defined using the variables age and sex. The combination of the two categories of sex (male and female) and age -- categorized as seven brackets: 10-19.9; 20-29.9; 30-39.9; 40-49.9; 50-59.9; 60-69.9; and 70 years or more -- resulted in 14 post-strata (2 sexes × 7 age brackets). For the calibration of sampling weights, the household natural weight ($W_{ij}$) was multiplied by a calibration factor ($g_{ij}$), providing the household calibrated weight $W^{*}_{ij} = W_{ij}\, g_{ij}$, where i represents the index of the selected CEA, j the index of the selected household and d = 1, ..., 14 indexes the post-strata domains, as indicated above. The Generalized Regression Estimator proposed by Deville & Särndal was adopted to estimate the calibration factor $g_{ij}$ as $g_{ij} = 1 + q_{ij}\,(t_x - \hat{t}_x)^{\top} \left( \sum_{i}\sum_{j} W_{ij}\, q_{ij}\, x_{ij} x_{ij}^{\top} \right)^{-1} x_{ij}$, where $q_{ij}$ is a constant usually defined as 1, $x_{ij}$ represents the vector of auxiliary variables (i.e., sex and age), $t_x$ denotes the vector of known population totals, and $\hat{t}_x$ the vector with the estimates of the auxiliary variables calculated using natural sample weights.
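When the auxiliary variables are indicators of mutually exclusive post-strata and the constant q is 1, linear calibration of this kind reduces to multiplying each weight by the ratio between the known total and the weighted estimate of its post-stratum. The sketch below illustrates only that special case; the function name, the two post-stratum labels, and all numbers are hypothetical and do not come from the survey.

```python
from collections import defaultdict

def poststratify(weights, strata, pop_totals):
    """Calibrate weights so that weighted counts match known post-stratum totals.

    weights:    natural (design) weights, one per sampled unit
    strata:     post-stratum label of each unit (e.g. a sex x age bracket)
    pop_totals: dict mapping each label to its known population total
    """
    estimated = defaultdict(float)
    for w, d in zip(weights, strata):
        estimated[d] += w                  # estimate using natural weights
    # Calibration factor of each post-stratum: known total / estimated total.
    factors = {d: pop_totals[d] / estimated[d] for d in estimated}
    return [w * factors[d] for w, d in zip(weights, strata)]

# Made-up example with two post-strata instead of the 14 used in the survey.
natural = [10, 10, 20, 20, 30, 30]
labels = ["a", "a", "a", "b", "b", "b"]
known = {"a": 50, "b": 60}
calibrated = poststratify(natural, labels, known)
print(calibrated)
```

After calibration, the sum of the weights within each post-stratum equals the corresponding known total, which is the property the calibrated estimates are required to have.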
Despite its extensive use, Cohen's kappa index (κ) does not provide information about the agreement/disagreement structure of the data and it cannot be used to analyze ordinal scale categories, such as education or income strata. For this reason, uniform semi-association models were adjusted in the present analysis. This is a generalized linear model of the Poisson family with log link function that takes the ordering of the variable's categories into account. Three components of the structure of agreement and disagreement compose this model: (1) the agreement at random; (2) the agreement due to the association between classifications; and (3) the agreement after eliminating the effects of the agreement at random and the association between variables [25, 26]. Beyond the combination of the effects of agreement and disagreement, the semi-association model also allows the agreement to vary by category along the main diagonal, unlike other models that assume the agreement is the same for each cell in the main diagonal [25, 26].
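For orientation, a model with these three components is often written in the log-linear literature in the form below. This is a sketch under assumed notation, not the exact specification used in the study: $m_{ij}$ denotes the expected count of households classified in quintile $i$ by one source and quintile $j$ by the other, $u_i = i$ are assumed integer scores, and $I(\cdot)$ is the indicator function.

```latex
\[
\log m_{ij} \;=\; \lambda \;+\; \lambda^{A}_{i} \;+\; \lambda^{B}_{j}
\;+\; \beta\, u_i u_j \;+\; \delta_i\, I(i = j),
\qquad i, j = 1, \dots, 5 .
\]
```

In this form the main effects $\lambda^{A}_{i}$ and $\lambda^{B}_{j}$ capture agreement at random, $\beta$ captures the association between the two ordinal classifications, and the diagonal terms $\delta_i$ capture the remaining agreement, which is allowed to differ by category.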
Besides the adjustment of the model that assesses the agreement/disagreement between the classification of income categories defined according to the CEA or the field-obtained information on household income, two other models were adjusted using only the household income information from (1) male-headed and (2) female-headed households. This was motivated by the hypothesis that when the woman is the head of the household, she may not know exactly her spouse's income and vice versa. Therefore, household income might be estimated with different errors depending on whether the head of the household knows the spouse's income.
The adjusted model agreement grades were estimated for each cell in terms of odds ratio (OR), using the measure $\tau_{ij}$ (where i indicates the row and j the column of the cell) as proposed by Darroch & McCloud.
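A minimal sketch of how such cell-level odds ratios can be computed from a square table is given below. It assumes the usual odds-ratio form, $\tau_{ij} = (m_{ii}\, m_{jj}) / (m_{ij}\, m_{ji})$, applied to expected counts; that form, the function name, and the 3 × 3 table of counts are assumptions made for illustration and are not results from the study.

```python
def agreement_odds(table):
    """Odds ratio tau_ij = (m_ii * m_jj) / (m_ij * m_ji) for each pair i != j.

    table: square list of lists with strictly positive (fitted) counts.
    Returns a matrix with None on the main diagonal.
    """
    k = len(table)
    tau = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i != j:
                tau[i][j] = (table[i][i] * table[j][j]) / (table[i][j] * table[j][i])
    return tau

# Hypothetical counts standing in for fitted expected frequencies.
counts = [
    [40, 10, 2],
    [12, 30, 9],
    [3, 8, 36],
]
tau = agreement_odds(counts)
print(tau[0][1], tau[0][2], tau[1][2])
```

In this made-up table the odds ratio for categories 1 and 3 is much larger than for adjacent categories, which is the pattern expected when categories that are farther apart are easier to distinguish.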
In addition to the adjustment of the model, Cohen's weighted Kappa, Kendall's coefficient of concordance, Krippendorff's alpha reliability coefficient and Spearman's correlation ρ were also estimated in order to check the robustness of the model and to allow comparisons to other studies.
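As an illustration of one of these indexes, the sketch below computes a weighted kappa from a square table of counts. The choice of linear disagreement weights is an assumption, since the weighting scheme is not stated here, and the sketch ignores the sampling weights and design that the study accounted for; the function name and the small tables are hypothetical.

```python
def weighted_kappa(table, power=1):
    """Weighted kappa for a square table of counts.

    power=1 gives linear disagreement weights, power=2 quadratic ones.
    Sampling weights and design effects are not considered in this sketch.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            d = (abs(i - j) / (k - 1)) ** power      # disagreement weight
            observed += d * table[i][j] / n
            expected += d * row_tot[i] * col_tot[j] / n ** 2
    return 1.0 - observed / expected

perfect = [[10, 0], [0, 10]]       # all units on the main diagonal
independent = [[5, 5], [5, 5]]     # agreement no better than chance
print(weighted_kappa(perfect), weighted_kappa(independent))
```

For a 2 × 2 table the weighted and unweighted kappa coincide, so the two toy tables return 1 and 0, the values expected for perfect and chance-level agreement.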
To illustrate the comparison between the two sources of income information applied to an epidemiologic study, the distribution of nutritional status categories of the population of Niterói (≥ 10 years of age) was estimated according to income quintiles (CEA and household).
To test the hypothesis that the distribution of the population by nutritional status categories according to the household income quintiles is equal to the distribution according to the quintiles constructed with the income of CEA, the Wald statistic for homogeneity based on the sampling design was used.
Body mass and stature data were collected in the households and used to calculate the body mass index (BMI = body mass in kilograms divided by the square of stature in meters) as described elsewhere. BMI for age and sex was used to define the nutritional status of adolescents (10-20 years of age) using the cut-off points presently recommended by the World Health Organization (WHO): low-BMI-for-age/thinness (< -2 Standard Deviations), overweight (≥ 1 Standard Deviation) and obesity (≥ 2 Standard Deviations). For adults (≥ 20 years of age), the BMI cut-off points of < 18.5 kg/m², ≥ 25 kg/m² and ≥ 30 kg/m² were used to define the categories of low-BMI/underweight, overweight and obesity, respectively.
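The adult classification above can be sketched as follows. The function name is hypothetical, the label "normal range" for values between 18.5 and 25 kg/m² is an assumption (that category is not named in the text), and treating overweight and obesity as mutually exclusive groups is a simplification made for the example. Adolescents would require the WHO BMI-for-age reference tables, which are not reproduced here.

```python
def adult_nutritional_status(body_mass_kg, stature_m):
    """Return (BMI, category) for an adult using the cut-offs quoted above."""
    bmi = body_mass_kg / stature_m ** 2    # kg divided by squared meters
    if bmi < 18.5:
        category = "low-BMI/underweight"
    elif bmi >= 30:
        category = "obesity"
    elif bmi >= 25:
        category = "overweight"            # here: 25 <= BMI < 30 (assumption)
    else:
        category = "normal range"          # label assumed, not from the text
    return bmi, category

for mass in (50, 70, 80, 100):
    print(mass, adult_nutritional_status(mass, 1.75))
```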
The Institutional Review Board of the Sergio Arouca National School of Public Health of the Oswaldo Cruz Foundation approved all research procedures.
All estimates and statistics were calculated using the calibrated weights based on the structural information of the sample design, and a significance level lower than 5% was adopted. The analyses were conducted in R language and environment, version 2.6.1 .
Results and Discussion
Table 1. Uniform semi-association model (table not reproduced).
Table 2. Uniform semi-association models adjusted using only household income information of households in which the head-of-household was male or female, by sex of the head-of-household (table not reproduced).
The parameters $\beta$ and $\delta_i$ (i = j; i = 1, ..., I), estimated by $\hat{\beta}$ and $\hat{\delta}_i$, measure the association and agreement, respectively, between the measures of ordinal classification of income. The estimates of these parameters are statistically different from zero (p < 0.001) (Table 1), which indicates that the assessments of income quintiles made by means of CEA and household information are not different and that the classification of households in the quintiles of household income tends to be associated with the classification of these households in the quintiles of CEA income. The agreement between measures ranges from 0 (no agreement) to 1 (perfect agreement). All agreement estimates ($\hat{\delta}_i$) are significantly different from zero, which means that there is enough statistical evidence to reject the hypothesis that there is no agreement between the pairs of quintiles defined by household- and CEA-level information (Tables 1 and 2).
Table 2 shows the estimates of the cited parameters for male-headed and female-headed households. The uniform semi-association models adjusted separately for each sex led to the same conclusion, suggesting that the sex of the head of the household does not influence the structure of agreement/disagreement between the CEA and household income information.
The values of $\hat{\tau}_{ij}$ indicate the odds ratio (OR) of the assessment measures being concordant rather than discordant. Observing the values in the first row (fixing row 1 and varying columns) or in the fifth column (fixing column 5 and varying rows) of Table 3, it is possible to note that the agreement increases as the quintiles get more distant from each other.
According to the agreement classifications more widely used [37, 38], the estimated indexes indicate fair to moderate levels of agreement and are statistically significant: Kappa w = 0.49 (p < 0.001); Kendall's coefficient of concordance = 0.41 (p < 0.001); Krippendorff's alpha = 0.48; Spearman's ρ = 0.49 (p < 0.001).
The results of studies that investigate the use of area-level socioeconomic information as proxy of household or individual information are still controversial regarding the agreement of income measures as well as the results produced by each measure when related with an outcome. This is expected because the analysed outcomes, methods employed in the definition of socioeconomic strata, and the partitioning criteria used to define the territories vary between studies [12, 16, 39].
On one hand, the literature indicates that the information of both levels can be used without jeopardizing the analyses on health inequities because they produce similar results [15, 16, 18]. On the other hand, there is also evidence that the use of area-level information could result in substantial errors in the classification of socioeconomic conditions and, therefore, could not predict certain health outcomes as well as individual-level information [10–14, 17, 19].
In the present study, the structure of agreement/disagreement between quintiles of household monthly per capita income and quintiles of the head-of-household's mean nominal monthly income of the CEA, as well as the results produced by these measures when they were associated with the nutritional status of the population of Niterói, showed that the CEA's income information can be used when income information at the individual or household level is not available.
The hypothesis that the sex of the head of the household would not influence the structure of agreement/disagreement of income categories could not be rejected. Other factors, such as race [10, 39], that could influence this structure were not analyzed in the present study. Another limitation of this analysis is the partitioning of income strata into quintiles. This procedure was adopted based on an applied criterion, with an analytic purpose, since partitions with large intervals are generally used only for planning the sample and/or study design. However, it is important to note that partitions into a larger number of intervals could have led to different conclusions because the distance between categories would be reduced, which could result in greater classification errors. On the other hand, partitions into fewer categories -- in thirds, for example -- would increase the distance between categories, diminishing the chances of misclassification.
It is also important to pay attention to the time elapsed between the collection of the two sources of information. This is particularly important in countries undergoing fast economic growth or greater socioeconomic mobility. The survey used in the present analysis was conducted only three years after the 2000 Brazilian census. During this three-year period, the Brazilian economy remained stable, with weak and unsustained economic growth, and socioeconomic mobility was also compromised by declining gross domestic product, wage-share and continuous reduction of the formal employment sector.
Additionally, it is also important to note that the aggregated census information comes from individually collected information, which may raise the question whether the individual information collected in the census is reliable. The census information cannot be regarded as gold standard but it is expected to constitute more robust information than that collected in surveys because there are many more quality control mechanisms, proportionally fewer missing values, higher trust in the interviewer as an employee of a known institution (the Census Bureau), and no variance due to sample design.
Another issue that could be raised when dealing with aggregated data is that the income distribution within a given aggregated level (e.g., CEA, city) may vary according to the distance from a predetermined center. However, the adjusted models have not taken into account the modifiable areal unit problem and ecological fallacies due to aggregation [40–42], a limitation of the present study due to the absence of information on the distance between households and a predetermined center.
Furthermore, the inference and conclusions of this study may not apply to different variables of interest, countries, sizes and boundaries of enumeration areas, and possibly survey designs. Therefore, comparisons by other studies should be carefully made, taking this limitation into account.
It is remarkable that until this paper, the few studies on this theme had been solely derived from high-income countries (United States of America [10–12], Canada [13–16], Italy, Spain and Australia). For this reason, it would be important to encourage other analyses using different levels of aggregation and territories in low- and middle-income countries such as Brazil, which would make intra- and international comparison possible, contributing to the collection of evidence about the use of socioeconomic information aggregated by CEA in the absence of individual information.
This is perhaps the first study conducted in a developing country that compares the use of area- versus household-level income measures in association with a health outcome (nutritional status). The study indicates that CEA's income information may be used as a proxy for household income in the absence of individual- or household-level information. The sex of the source of household income information did not influence the structure of agreement/disagreement of income categories. Additionally, the association between income quintiles and nutritional status is similar whether CEA- or household-level income measures were used.
The Nutrition, Physical Activity, and Health Survey was partially funded by the Brazilian National Research Council (Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq; grants nos. 471172/2001-4 and 475122/2003-8) and by the Oswaldo Cruz Foundation (PAPES III - Program to Support Strategic Projects in Health, no. 250.139). LAA received a research productivity grant from CNPq (grant no. 301076/89-8 and 311801/06-4). MTLV received a research productivity grant from CNPq (grant no. 302992/2003-0).
1. World Health Organization (WHO), Secretariat of the Commission on Social Determinants of Health: Action on the social determinants of health: learning from previous experiences. Geneva 2005.
2. WHO: The World Health Report 2003. Shaping the future. Geneva 2003.
3. Evans T, Whitehead M, Diderichsen F, Bhuiya A, Wirth M, Eds: Challenging inequities in health: from ethics to action. New York: Oxford University Press; 2001.
4. Leon DA, Walt G, Eds: Poverty, inequality and health: an international perspective. New York: Oxford University Press; 2001.
5. Kim JY, Millen JV, Irwin A, Gershman J, Eds: Dying for growth: global inequality and the health of the poor. Monroe: Common Courage Press; 2000.
6. WHO, Executive Board EB115/35: Note by the Secretariat. In 115th Session: 25 November 2004; Geneva. the Commission on Social Determinants of Health: WHO; 2004:1-3.
7. Graham H, Kelly MP: Health inequalities: concepts, frameworks and policy. NHS Health Development Agency; 2004. [http://www.nice.org.uk/niceMedia/pdf/health_inequalities_policy_graham.pdf]
8. Barros RP, Cury S, Ulyssea G: A desigualdade de renda no Brasil encontra-se subestimada? Uma análise comparativa com base na PNAD, na POF e nas Contas Nacionais. Rio de Janeiro: Instituto de Pesquisa Econômica Aplicada; 2007.
9. Instituto Brasileiro de Geografia e Estatística: Censo Demográfico 2000: agregado por setores censitários dos resultados do universo. Rio de Janeiro; 2003.
10. Diez Roux AV, Kiefe CI, Jacobs DR Jr, Haan M, Jackson SA, Nieto FJ, Paton CC, Schulz R: Area characteristics and individual-level socioeconomic position indicators in three population-based epidemiologic studies. Ann Epidemiol 2001, 11: 395-405. 10.1016/S1047-2797(01)00221-6
11. Diez Roux AV, Merkin SS, Hannan P, Jacobs DR, Kiefe CI: Area characteristics, individual-level socioeconomic indicators, and smoking in young adults. Am J Epidemiol 2003, 157: 315-326. 10.1093/aje/kwf207
12. Geronimus AT, Bound J: Use of census-based aggregate variables to proxy for socioeconomic group: evidence from national samples. Am J Epidemiol 1998, 148: 475-486.
13. Demissie K, Hanley JA, Menzies D, Joseph L, Ernst P: Agreement in measuring socio-economic status: area-based versus individual measures. Chronic Dis Can 2000, 21: 1-7.
14. Hanley GE, Morgan S: On the validity of area-based income measures to proxy household income. BMC Health Services Research 2008, 8: 79. 10.1186/1472-6963-8-79
15. Janssen I, Boyce WF, Simpson K, Pickett W: Influence of individual- and area-level measures of socioeconomic status on obesity, unhealthy eating, and physical inactivity in Canadian adolescents. Am J Clin Nutr 2006, 83: 139-145.
16. Southern DA, McLaren L, Hawe P, Knudtson ML, Ghali WA: Individual-level and neighborhood-level income measures: agreement and association with outcomes in a cardiac disease cohort. Med Care 2005, 43: 1116-1122. 10.1097/01.mlr.0000182517.57235.6d
17. Cesaroni G, Farchi S, Davoli M, Forastiere F, Perucci CA: Individual and area-based indicators of socioeconomic status and childhood asthma. Eur Respir J 2003, 22: 619-624. 10.1183/09031936.03.00091202
18. Domínguez-Berjón F, Borrell C, Rodríguez-Sanz M, Pastor V: The usefulness of area-based socioeconomic measures to monitor social inequalities in health in Southern Europe. Eur J Public Health 2006, 16: 54-61.
19. Walker AE, Becker NG: Health inequalities across socio-economic groups: comparing geographic-area-based and individual-based indicators. Public Health 2005, 119: 1097-1104. 10.1016/j.puhe.2005.02.008
20. Haldane JBS: On the method of estimating frequencies. Biometrika 1945, 33: 222-225. 10.1093/biomet/33.3.222
21. Vasconcellos MT, Silva PL, Szwarcwald CL: Sampling design for the World Health Survey in Brazil. Cad Saude Publica 2005, 21: 89-99. 10.1590/S0102-311X2005000700010
22. Silva PLN: Calibration estimation: when and why, how much and how. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2004.
23. Deville JC, Särndal CE: Calibration estimators in survey sampling. J Am Stat Assoc 1992, 87: 376-382. 10.2307/2290268
24. Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas 1960, 20: 37-46. 10.1177/001316446002000104
25. Goodman LA: Simple models for the analysis of association in cross-classifications having ordered categories. J Am Stat Assoc 1979, 74: 537-552. 10.2307/2286971
26. Silva EF, Pereira MG: Rating of the structures of agreement and disagreement in reliability studies. Rev Saude Publica 1998, 32: 383-393. 10.1590/S0034-89101998000400012
27. Darroch J, McCloud PI: Category of distinguishability and observer agreement. Aust J Stat 1986, 28: 371-388. 10.1111/j.1467-842X.1986.tb00709.x
28. Cohen J: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968, 70: 213-220. 10.1037/h0026256
29. Kendall MG: Rank correlation methods. 4th edition. London: Charles Griffin & Company; 1970.
30. Krippendorff K: Content analysis: an introduction to its methodology. Beverly Hills: Sage; 1980.
31. Spearman C: The proof and measurement of association between two things. Amer J Psychol 1904, 15: 72-101. 10.2307/1412159
32. Pessoa DGC, Silva PLN, Eds: Análise de dados amostrais complexos. São Paulo: Associação Brasileira de Estatística; 1998.
33. Bossan FM, Anjos LA, Vasconcellos MTL, Wahrlich V: Nutritional status of the adult population in Niterói, Rio de Janeiro, Brazil: the Nutrition, Physical Activity, and Health Survey. Cad Saude Publica 2007, 23: 1867-1876.
34. De Onis M, Onyango AW, Borghi E, Siyam A, Nishida C, Siekmann J: Development of a WHO growth reference for school-aged children and adolescents. Bull World Health Organ 2007, 85: 660-667. 10.2471/BLT.07.043497
35. WHO: Physical status: The Use and Interpretation of Anthropometry. Report of a WHO Expert Committee. Technical Report Series 854. Geneva 1995.
36. R Development Core Team: R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2007.
37. Shrout PE: Measurement reliability and agreement in psychiatry. Stat Methods Med Res 1998, 7: 301-317. 10.1191/096228098672090967
38. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 1977, 33: 159-174. 10.2307/2529310
39. Braveman PA, Cubbin C, Egerter S, Chideya S, Marchi KS, Metzler M, Posner S: Socioeconomic status in health research: one size does not fit all. JAMA 2005, 294: 2879-2888. 10.1001/jama.294.22.2879
40. Openshaw S: Ecological fallacies and the analysis of areal census data. Environ Plan A 1984, 16: 17-31. 10.1068/a160017
41. Openshaw S: The modifiable areal unit problem. Concepts and techniques in modern geography. Norwich: Geo Books; 1984.
42. Tagashira N, Okabe A: The modifiable areal unit problem in a regression model whose independent variable is a distance from a predetermined point. Geogr Anal 2002, 34: 1-20. 10.1353/geo.2002.0006
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.