The link between a vegetable-rich diet and its health benefits is well established (Slavin & Lloyd, 2012). Indeed, regular consumption of vegetables has been associated with a reduced risk of chronic diseases such as cardiovascular diseases (Blekkenhorst et al., 2018), certain cancers (Terry, Terry, & Wolk, 2001), and type 2 diabetes (Carter, Gray, Troughton, Khunti, & Davies, 2010). Numerous studies have identified several predictors of healthy eating habits. Among other things, socioeconomic factors, such as higher education and income (Mello Rodrigues, Bray, Fernandes, & Luci Bernardo, 2019), and environmental factors, including access to fresh produce, have been highlighted (Mello Rodrigues et al., 2019). The New York Community Health Survey (NYC CHS) is an annual survey conducted by the New York City Department of Health and Mental Hygiene (DOHMH). The survey aims to collect detailed information on the health status, health behaviors, and socioeconomic conditions of adult New York City residents (aged 18 and older). The NYC CHS provides essential data to monitor health trends, inform public health policy, and assess local health service needs. Here, we propose to use data from the 2019 survey to investigate the relationship between selected predictors and vegetable consumption.
# loading libraries of interest
library(haven)
library(dplyr)
library(pscl)
library(tidyverse)
library(pscl)
library(MASS)
library(car)
library(svydiags)
library(sjstats)
library(pheatmap)
# Loading original dataset
chs19 <- read_sas("https://www1.nyc.gov/assets/doh/downloads/sas/episrv/chs2019_public.sas7bdat")
keepcols <- c("strata", "wt20_dual", "nutrition47")
# require a data use agreement
donotuse <-
c(
"industry_groupnew",
"uhf34",
"borough",
"dphonew06",
"indpho",
"lowinchousing19",
"howlong19new",
"everevicted_q1",
"newgenderid19",
"nativeindig",
"demog183",
"demog135r",
"asianancestry",
"countrybirth19",
"bthregion2",
"demog192r",
"numadults2",
"occupation_groupnew"
)
# administrative variables
admvars <-
c("survey",
"strata_q1",
"cid",
"qxvers",
"wt20_dual_q1",
"wt_compare")
# requiring different weights
q1 <- grep("q1$", colnames(chs19), value = TRUE)
# Generating 4 random variables from a seed value generated from my name
utf8ToInt("Andre Nana") |>
sum() |>
set.seed()
samplecols <- colnames(chs19)[!colnames(chs19) %in% c(keepcols, admvars, donotuse, q1)]
(myrandomcols <- sample(samplecols, 4))
## [1] "pcp19" "mood5" "usborn" "maritalstatus19"
# Dataset retained for the assignment
nutri0 <- dplyr::select(chs19, keepcols, myrandomcols)
We retain the following variables:
nutrition47 (outcome of interest): number of cups of vegetables eaten the day before the interview
usborn (main predictor): coded as “US born”, “Foreign born”, “Don’t know” and “refused”
maritalstatus19 coded as “Married”, “Divorced”, “Widowed”, “Separated”, “Never married”, “A member of an unmarried couple living together”, “Don’t know” and “refused”
mood5 which assesses the mood of the participant during the past 30 days (how often did you feel… That everything was an effort?) and coded as “All of the time”, “Most of the time”, “Some of the time”, “A little of the time”, and “None of the time”
pcp19 which assesses having one or more personal doctor or healthcare provider coded as “Yes”, “No”, “Don’t know” and “refused”
## nutri1 is coded to include missing values of predictors as "Missing value" in Epi Table 1. "Don't know" and "refused" are coded as missing values.
nutri1 <- nutri0 %>%
mutate(
usborn_cat = factor(case_when(
usborn == 1 ~ "1.US Born",
usborn == 2 ~ "2.Foreign Born",
TRUE ~ "3.Missing value")),
marital_cat = factor(
case_when(
maritalstatus19 == 1 ~ "1.Married",
maritalstatus19 == 2 ~ "2.Divorced",
maritalstatus19 == 3 ~ "3.Widowed",
maritalstatus19 == 4 ~ "4.Separated",
maritalstatus19 == 5 ~ "5.Never married",
maritalstatus19 == 6 ~ "6.Living together",
TRUE ~ "7.Missing value")),
mood_cat = factor(
case_when(
mood5 == 1 ~ "1.All of the time",
mood5 == 2 ~ "2.Most of the time",
mood5 == 3 ~ "3.Some of the time",
mood5 == 4 ~ "4.A little of the time",
mood5 == 5 ~ "5.None of the time",
TRUE ~ "6.Missing value")),
personal_doc = factor(case_when(
pcp19 == 1 ~ "1.Yes",
pcp19 == 2 ~ "2.No",
TRUE ~ "3.Missing value"))
)
## nutri2 is coded to include missing values of predictors as NAs for the modeling
nutri2 <- nutri0 %>%
mutate(
usborn_cat = factor(case_when(
usborn == 1 ~ "1.US Born",
usborn == 2 ~ "2.Foreign Born",
TRUE ~ NA)),
marital_cat = factor(
case_when(
maritalstatus19 == 1 ~ "1.Married",
maritalstatus19 == 2 ~ "2.Divorced",
maritalstatus19 == 3 ~ "3.Widowed",
maritalstatus19 == 4 ~ "4.Separated",
maritalstatus19 == 5 ~ "5.Never married",
maritalstatus19 == 6 ~ "6.Living together",
TRUE ~ NA)),
mood_cat = factor(
case_when(
mood5 == 1 ~ "1.All of the time",
mood5 == 2 ~ "2.Most of the time",
mood5 == 3 ~ "3.Some of the time",
mood5 == 4 ~ "4.A little of the time",
mood5 == 5 ~ "5.None of the time",
TRUE ~ NA)),
personal_doc = factor(case_when(
pcp19 == 1 ~ "1.Yes",
pcp19 == 2 ~ "2.No",
TRUE ~ NA))
)
mm <- model.matrix( ~ ., data = nutri2[, c("marital_cat", "mood_cat", "usborn_cat", "personal_doc")])
pheatmap(cor(mm[, -1]), color = colorRampPalette(c("#f0f0f0", "#bdbdbd", "#636363"))(100),
display_numbers = TRUE)
Visually inspecting the heatmap, it appears that multicollinearity among the predictors does not seem to be an issue. This is further confirmed by the Generalized Variance Inflation Factor GVIF of the negative binomial regression with 1.054 for usborn, 1.08 for marital status, 1.02 for mood and 1.03 for personal care provider, which are largely below the value of 4 frequently used as a rule thumb for further investigations 1.
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1(~ marital_cat + mood_cat + personal_doc | usborn_cat, data = nutri1,
caption = "Characteristics of the study population, stratified by place of birth",
footnote = "Data from The 2019 New York Community Health Survey")
1.US Born (N=4951) |
2.Foreign Born (N=3812) |
3.Missing value (N=40) |
Overall (N=8803) |
|
---|---|---|---|---|
Data from The 2019 New York Community Health Survey | ||||
marital_cat | ||||
1.Married | 1563 (31.6%) | 1749 (45.9%) | 15 (37.5%) | 3327 (37.8%) |
2.Divorced | 543 (11.0%) | 497 (13.0%) | 4 (10.0%) | 1044 (11.9%) |
3.Widowed | 447 (9.0%) | 294 (7.7%) | 1 (2.5%) | 742 (8.4%) |
4.Separated | 172 (3.5%) | 235 (6.2%) | 3 (7.5%) | 410 (4.7%) |
5.Never married | 1878 (37.9%) | 767 (20.1%) | 8 (20.0%) | 2653 (30.1%) |
6.Living together | 314 (6.3%) | 239 (6.3%) | 2 (5.0%) | 555 (6.3%) |
7.Missing value | 34 (0.7%) | 31 (0.8%) | 7 (17.5%) | 72 (0.8%) |
mood_cat | ||||
1.All of the time | 347 (7.0%) | 268 (7.0%) | 5 (12.5%) | 620 (7.0%) |
2.Most of the time | 323 (6.5%) | 253 (6.6%) | 4 (10.0%) | 580 (6.6%) |
3.Some of the time | 944 (19.1%) | 569 (14.9%) | 2 (5.0%) | 1515 (17.2%) |
4.A little of the time | 933 (18.8%) | 794 (20.8%) | 7 (17.5%) | 1734 (19.7%) |
5.None of the time | 2404 (48.6%) | 1928 (50.6%) | 22 (55.0%) | 4354 (49.5%) |
personal_doc | ||||
1.Yes | 4361 (88.1%) | 3147 (82.6%) | 33 (82.5%) | 7541 (85.7%) |
2.No | 564 (11.4%) | 643 (16.9%) | 6 (15.0%) | 1213 (13.8%) |
3.Missing value | 26 (0.5%) | 22 (0.6%) | 1 (2.5%) | 49 (0.6%) |
The majority of US-born participants reported never being married 1878 (37.9%), not feeling low anytime during the past 30 days 2404 (48.6%), and having a personal care provider 4361 (88.1%)
The majority of foreign born participants reported being married 1749 (45.9%), not feeling low anytime during the past 30 days 1928 (50.6%), and having a personal care provider as well 3147 (82.6%)
# histogram to visually assess the frequency of values
ggplot(nutri1, aes(x=nutrition47)) +
geom_histogram(binwidth = 0.5) +
theme_classic()
## Warning: Removed 150 rows containing non-finite outside the scale range
## (`stat_bin()`).
# assessing the proportion of zeros
summarise(nutri1,
zeros = nrow(filter(nutri1, nutrition47 == 0)),
nonzeros = nrow(filter(nutri1, nutrition47 > 0)),
proportion_zero = zeros/(zeros + nonzeros),
mean = mean(nutrition47, na.rm=TRUE),
variance = var(nutrition47, na.rm=TRUE))
## # A tibble: 1 × 5
## zeros nonzeros proportion_zero mean variance
## <int> <int> <dbl> <dbl> <dbl>
## 1 1711 6942 0.198 1.45 2.78
Our outcome variable has less than 20% of zero. Therefore, we predict that zero inflation may not be an issue. However, the variance (2.78) is greater than the mean (0.1977), suggesting overdispersion. We can anticipate that a negative binomial regression may be suitable in our case. But we will proceed with the analysis to find out.
# Poisson regression
fit.pois <- glm(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc,
data = nutri2,
family = poisson(link = "log"))
# Negative binomial
fit.negbin <- MASS::glm.nb(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc, data = nutri2)
# Intercept-only Zero-inflated Poisson regression
fit.ZIpois <-
pscl::zeroinfl(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc|1,
dist = "poisson",
data = nutri2)
# Intercept-only Zero-inflated negative binomial regression
fit.ZInegbin <-
pscl::zeroinfl(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc|1,
dist = "negbin",
data = nutri2)
# loglikelihoods and degrees of freedom
ll.pois <- logLik(fit.pois)
ll.negbin <- logLik(fit.negbin)
ll.ZIpois <- logLik(fit.ZIpois)
ll.ZInegbin <- logLik(fit.ZInegbin)
# Data frame to store and display the loglikelihoods and dfs
log_likelihoods <- data.frame(
Model = c("Poisson", "Negative Binomial", "Zero-Inflated Poisson", "Zero-Inflated Negative Binomial"),
LogLikelihood = c(as.numeric(ll.pois), as.numeric(ll.negbin), as.numeric(ll.ZIpois), as.numeric(ll.ZInegbin)),
DegreesOfFreedom = c(attr(ll.pois, "df"), attr(ll.negbin, "df"), attr(ll.ZIpois, "df"), attr(ll.ZInegbin, "df"))
)
# printing outputs in descending order of likelihoods
print(log_likelihoods |> arrange(desc(LogLikelihood)))
## Model LogLikelihood DegreesOfFreedom
## 1 Negative Binomial -13057.72 13
## 2 Zero-Inflated Negative Binomial -13057.72 14
## 3 Poisson -13283.88 12
## 4 Zero-Inflated Poisson -13283.90 13
Remarks: Here, the poisson and the intercept-only zero-inflated poisson models have quasi-identical loglikelihoods with respectively -13283.88 and -13283.90. The negative binomial and the zero-inflated negative binomial also have identical loglikelihoods (-13057.72) which is greater than the Poisson models. The zero-inflated models do not perform better than the non-zero-inflated one. These results suggest that zero-inflation may not be a concern, but there is overdispersion.
# Pairwise-chi square test
# negative binomial vs poisson
negbin_poisson <- pchisq(2 * (ll.negbin - ll.pois), df=1,
lower.tail=FALSE)
# zero inflated poisson model vs poisson model
zeropoisson_poisson <- pchisq(2 * (ll.ZIpois - ll.pois), df=1,
lower.tail=FALSE)
# negative binomial vs zero inflated poisson model
negbin_zeropoisson <- pchisq(2 * (ll.negbin - ll.ZIpois), df=0,
lower.tail=FALSE)
# zero inflated negative binomial vs zero inflated poisson model
zeronegbin_zeropoisson <- pchisq(2 * (ll.ZInegbin - ll.ZIpois), df=1,
lower.tail=FALSE)
# zero inflated negative binomial vs negative binomial
zeronegbin_negbin <- pchisq(2 * (ll.ZInegbin - ll.negbin), df=1,
lower.tail=FALSE)
chi_square_results <- data.frame(
Comparison = c(
"Negative Binomial vs Poisson",
"Zero-Inflated Poisson vs Poisson",
"Negative Binomial vs Zero-Inflated Poisson",
"Zero-Inflated Negative Binomial vs Zero-Inflated Poisson",
"Zero-Inflated Negative Binomial vs Negative Binomial"
),
p_value = c(
negbin_poisson,
zeropoisson_poisson,
negbin_zeropoisson,
zeronegbin_zeropoisson,
zeronegbin_negbin
)
)
# Print the data frame
print(chi_square_results)
## Comparison p_value
## 1 Negative Binomial vs Poisson 2.251998e-100
## 2 Zero-Inflated Poisson vs Poisson 1.000000e+00
## 3 Negative Binomial vs Zero-Inflated Poisson 0.000000e+00
## 4 Zero-Inflated Negative Binomial vs Zero-Inflated Poisson 2.211031e-100
## 5 Zero-Inflated Negative Binomial vs Negative Binomial 1.000000e+00
The Negative Binomial (NB) model and the intercept-only Zero-Inflated Negative Binomial (ZINB) model fit the data significantly better than both the Poisson and intercept-only Zero-Inflated Poisson models
Between NB and ZINB, the p-value indicates no significant difference. This suggests that the excess zeros in the data may not be prominent enough to justify the added complexity of the ZINB model.
Therefore we will retain the negative binomial as best fit.
fit.negbin <- MASS::glm.nb(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc, data = nutri2)
summary(fit.negbin)
##
## Call:
## MASS::glm.nb(formula = nutrition47 ~ usborn_cat + marital_cat +
## mood_cat + personal_doc, data = nutri2, init.theta = 6.708326213,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.41241 0.04359 9.462 < 2e-16 ***
## usborn_cat2.Foreign Born -0.18850 0.02087 -9.033 < 2e-16 ***
## marital_cat2.Divorced -0.15334 0.03379 -4.538 5.69e-06 ***
## marital_cat3.Widowed -0.15995 0.03902 -4.099 4.15e-05 ***
## marital_cat4.Separated -0.07150 0.04954 -1.443 0.148966
## marital_cat5.Never married -0.10012 0.02471 -4.053 5.07e-05 ***
## marital_cat6.Living together -0.02871 0.04210 -0.682 0.495240
## mood_cat2.Most of the time -0.12882 0.05844 -2.204 0.027499 *
## mood_cat3.Some of the time 0.07839 0.04644 1.688 0.091394 .
## mood_cat4.A little of the time 0.12349 0.04547 2.716 0.006614 **
## mood_cat5.None of the time 0.15614 0.04209 3.710 0.000208 ***
## personal_doc2.No -0.06411 0.03002 -2.136 0.032693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(6.7083) family taken to be 1)
##
## Null deviance: 8419.8 on 8512 degrees of freedom
## Residual deviance: 8241.2 on 8501 degrees of freedom
## (290 observations deleted due to missingness)
## AIC: 26141
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 6.708
## Std. Err.: 0.455
##
## 2 x log-likelihood: -26115.443
coefficients <- coef(fit.negbin)
# rate ratios
RR <- exp(coefficients)
sd <- summary(fit.negbin)$coefficients[, "Std. Error"]
ci_lower <- round(exp(coefficients - 1.96 * sd), 3)
ci_upper <- round(exp(coefficients + 1.96 * sd), 3)
RR_results <- data.frame(
RateRatio = round(RR, 3),
CI_Lower = ci_lower,
CI_Upper = ci_upper
)
print(RR_results)
## RateRatio CI_Lower CI_Upper
## (Intercept) 1.510 1.387 1.645
## usborn_cat2.Foreign Born 0.828 0.795 0.863
## marital_cat2.Divorced 0.858 0.803 0.917
## marital_cat3.Widowed 0.852 0.789 0.920
## marital_cat4.Separated 0.931 0.845 1.026
## marital_cat5.Never married 0.905 0.862 0.950
## marital_cat6.Living together 0.972 0.895 1.055
## mood_cat2.Most of the time 0.879 0.784 0.986
## mood_cat3.Some of the time 1.082 0.987 1.185
## mood_cat4.A little of the time 1.131 1.035 1.237
## mood_cat5.None of the time 1.169 1.076 1.270
## personal_doc2.No 0.938 0.884 0.995
vif(fit.negbin)
## GVIF Df GVIF^(1/(2*Df))
## usborn_cat 1.054147 1 1.026717
## marital_cat 1.080901 5 1.007810
## mood_cat 1.022751 4 1.002816
## personal_doc 1.025477 1 1.012658
Interpretation of the results
The rate of consomption of cups of vegetables among foreign-born participants was 0.828 (95%CI = 0.795, 0.863) times the rate among United States-born participants, adjusted for marital status, mood and having a personal care provider. Therefore, US-born participants were more likely to have vegetables than foreign born participants.
All Generalized Variance Inflation Factors are around 1, well below 4, which is the rule of thumb to start investigating a multicollinearity issue.
par(mfrow = c(2, 2))
plot(fit.negbin)
library(survey)
# transforming variables into categorical variables
chs19 <- chs19 %>%
mutate(
usborn_cat2 = factor(case_when(
usborn == 1 ~ "1.US Born",
usborn == 2 ~ "2.Foreign Born",
TRUE ~ NA)))
chs19[c("maritalstatus19", "mood5", "pcp19")] <- lapply(chs19[c("maritalstatus19", "mood5", "pcp19")], as.factor)
# Setting up the survey design
chs.dsgn <-
svydesign(
ids = ~ 1,
strata = ~ strata,
weights = ~ wt20_dual,
data = chs19,
nest = TRUE,
na.rm = TRUE
)
# fitting the model
fit.negbin.weighted <- svyglm.nb(nutrition47 ~ usborn_cat2 + maritalstatus19 + mood5 + pcp19, design = chs.dsgn)
summary(fit.negbin.weighted)
## Survey-sampled mle:
## survey::svymle(loglike = sjstats_loglik, gradient = sjstats_score,
## design = design, formulas = list(theta = ~1, eta = formula),
## start = c(mod$theta, stats::coef(mod)), na.action = "na.omit")
## Coef SE p.value
## theta.(Intercept) 5.95749311 2.11602034 0.00487
## eta.(Intercept) 0.39486659 0.11059408 < 0.001
## eta.usborn_cat22.Foreign Born -0.22548899 0.04014150 < 0.001
## eta.maritalstatus192 -0.16421861 0.05521656 0.00294
## eta.maritalstatus193 -0.21084150 0.05947075 < 0.001
## eta.maritalstatus194 0.07096567 0.36079913 0.84407
## eta.maritalstatus195 -0.13441756 0.03972232 < 0.001
## eta.maritalstatus196 -0.08971871 0.06323739 0.15597
## eta.mood52 -0.16662615 0.12583635 0.18545
## eta.mood53 0.07921717 0.11812847 0.50247
## eta.mood54 0.18747265 0.11452730 0.10165
## eta.mood55 0.17890966 0.11374857 0.11575
## eta.pcp192 -0.06346046 0.04506488 0.15907
## Stratified Independent Sampling design (with replacement)
## stats::update(design, scaled.weights = dw/mean(dw, na.rm = TRUE))
# Rate ratio of foreign born vs US born
rate_ratio = round(exp(-0.22548899), 3)
lower_bound = round(exp(-0.22548899 - 1.96*0.04014150), 3)
upper_bound = round(exp(-0.22548899 + 1.96*0.04014150), 3)
c(rate_ratio, lower_bound, upper_bound)
## [1] 0.798 0.738 0.863
Here again, the rate of consomption of cups of vegetables among foreign-born participants was 0.798 (95%CI = 0.738, 0.863) times the rate among United States-born participants, adjusted for marital status, mood and having a personal care provider. Therefore, US-born participants were more likely to have vegetables
Both the unweighted (rate ratio = 0.828 [95%CI = 0.795, 0.863]) and the weighted negative binomial regressions (0.798 [95%CI = 0.738, 0.863]) suggested that foreign-born participants were less likely to have had cups of vegetables compared to U.S-born participants, adjusted for marital status, mood, and having a healthcare provider. However, these results should be interpreted with caution since our analysis included only 4 predictors for the purpose of this exercise.
Blekkenhorst, L. C., Sim, M., Bondonno, C. P., Bondonno, N. P., Ward, N. C., Prince, R. L., . . . Hodgson, J. M. (2018). Cardiovascular Health Benefits of Specific Vegetable Types: A Narrative Review. Nutrients, 10(5), 595.
Carter, P., Gray, L. J., Troughton, J., Khunti, K., & Davies, M. J. (2010). Fruit and vegetable intake and incidence of type 2 diabetes mellitus: systematic review and meta-analysis. Bmj, 341.
Mello Rodrigues, V., Bray, J., Fernandes, A. C., & Luci Bernardo, G. (2019). Vegetable Consumption and Factors Associated with Increased Intake among College Students: A Scoping Review of the Last 10 Years. 11(7). doi: 10.3390/nu11071634
Slavin, J. L., & Lloyd, B. (2012). Health Benefits of Fruits and Vegetables. Advances in Nutrition, 3(4), 506-516. doi: https://doi.org/10.3945/an.112.002154
Terry, P., Terry, J. B., & Wolk, A. (2001). Fruit and vegetable consumption in the prevention of cancer: an update. Journal of Internal Medicine, 250(4), 280-290. doi: https://doi.org/10.1111/j.1365-2796.2001.00886.x