Introduction and rationale

The link between a vegetable-rich diet and its health benefits is well established (Slavin & Lloyd, 2012). Indeed, regular consumption of vegetables has been associated with a reduced risk of chronic diseases such as cardiovascular diseases (Blekkenhorst et al., 2018), certain cancers (Terry, Terry, & Wolk, 2001), and type 2 diabetes (Carter, Gray, Troughton, Khunti, & Davies, 2010). Numerous studies have identified several predictors of healthy eating habits. Among other things, socioeconomic factors, such as higher education and income (Mello Rodrigues, Bray, Fernandes, & Luci Bernardo, 2019), and environmental factors, including access to fresh produce, have been highlighted (Mello Rodrigues et al., 2019). The New York Community Health Survey (NYC CHS) is an annual survey conducted by the New York City Department of Health and Mental Hygiene (DOHMH). The survey aims to collect detailed information on the health status, health behaviors, and socioeconomic conditions of adult New York City residents (aged 18 and older). The NYC CHS provides essential data to monitor health trends, inform public health policy, and assess local health service needs. Here, we propose to use data from the 2019 survey to investigate the relationship between selected predictors and vegetable consumption.

Importing dataset, variable selection, and manipulation

# loading libraries of interest
library(haven)
library(dplyr)
library(pscl)
library(tidyverse)
library(pscl)
library(MASS)
library(car)
library(svydiags)
library(sjstats)
library(pheatmap)

# Loading original dataset
chs19 <- read_sas("https://www1.nyc.gov/assets/doh/downloads/sas/episrv/chs2019_public.sas7bdat")

keepcols <- c("strata", "wt20_dual", "nutrition47")

# require a data use agreement
donotuse <-
  c(
    "industry_groupnew",
    "uhf34",
    "borough",
    "dphonew06",
    "indpho",
    "lowinchousing19",
    "howlong19new",
    "everevicted_q1",
    "newgenderid19",
    "nativeindig",
    "demog183",
    "demog135r",
    "asianancestry",
    "countrybirth19",
    "bthregion2",
    "demog192r",
    "numadults2",
    "occupation_groupnew"
  )
# administrative variables
admvars <-
  c("survey",
    "strata_q1",
    "cid",
    "qxvers",
    "wt20_dual_q1",
    "wt_compare")
# requiring different weights
q1 <- grep("q1$", colnames(chs19), value = TRUE)

# Generating 4 random variables from a seed value generated from my name 

utf8ToInt("Andre Nana") |>  
  sum() |>  
  set.seed() 

samplecols <- colnames(chs19)[!colnames(chs19) %in% c(keepcols, admvars, donotuse, q1)]
(myrandomcols <- sample(samplecols, 4))

## [1] "pcp19"           "mood5"           "usborn"          "maritalstatus19"

# Dataset retained for the assignment 
nutri0 <- dplyr::select(chs19, keepcols, myrandomcols)

We retain the following variables:

nutrition47 (outcome of interest): number of cups of vegetables eaten the day before the interview
usborn (main predictor): coded as “US born”, “Foreign born”, “Don’t know” and “refused”
maritalstatus19 coded as “Married”, “Divorced”, “Widowed”, “Separated”, “Never married”, “A member of an unmarried couple living together”, “Don’t know” and “refused”
mood5 which assesses the mood of the participant during the past 30 days (how often did you feel… That everything was an effort?) and coded as “All of the time”, “Most of the time”, “Some of the time”, “A little of the time”, and “None of the time”
pcp19 which assesses having one or more personal doctor or healthcare provider coded as “Yes”, “No”, “Don’t know” and “refused”

Recoding variables of interest

## nutri1 is coded to include missing values of predictors as "Missing value" in Epi Table 1. "Don't know" and "refused" are coded as missing values.
nutri1 <- nutri0 %>%
  mutate(
      usborn_cat = factor(case_when(
      usborn == 1 ~ "1.US Born",
      usborn == 2 ~ "2.Foreign Born",
      TRUE ~ "3.Missing value")),
    marital_cat = factor(
      case_when(
      maritalstatus19 == 1 ~ "1.Married",
      maritalstatus19 == 2 ~ "2.Divorced",
      maritalstatus19 == 3 ~ "3.Widowed",
      maritalstatus19 == 4 ~ "4.Separated",
      maritalstatus19 == 5 ~ "5.Never married",
      maritalstatus19 == 6 ~ "6.Living together",
      TRUE ~ "7.Missing value")),
    mood_cat = factor(
      case_when(
      mood5 == 1 ~ "1.All of the time",
      mood5 == 2 ~ "2.Most of the time",
      mood5 == 3 ~ "3.Some of the time",
      mood5 == 4 ~ "4.A little of the time",
      mood5 == 5 ~ "5.None of the time",
      TRUE ~ "6.Missing value")),
    personal_doc = factor(case_when(
      pcp19 == 1 ~ "1.Yes",
      pcp19 == 2 ~ "2.No",
      TRUE ~ "3.Missing value"))
    )

## nutri2 is coded to include missing values of predictors as NAs for the modeling
nutri2 <- nutri0 %>%
  mutate(
      usborn_cat = factor(case_when(
      usborn == 1 ~ "1.US Born",
      usborn == 2 ~ "2.Foreign Born",
      TRUE ~ NA)),
    marital_cat = factor(
      case_when(
      maritalstatus19 == 1 ~ "1.Married",
      maritalstatus19 == 2 ~ "2.Divorced",
      maritalstatus19 == 3 ~ "3.Widowed",
      maritalstatus19 == 4 ~ "4.Separated",
      maritalstatus19 == 5 ~ "5.Never married",
      maritalstatus19 == 6 ~ "6.Living together",
      TRUE ~ NA)),
    mood_cat = factor(
      case_when(
      mood5 == 1 ~ "1.All of the time",
      mood5 == 2 ~ "2.Most of the time",
      mood5 == 3 ~ "3.Some of the time",
      mood5 == 4 ~ "4.A little of the time",
      mood5 == 5 ~ "5.None of the time",
      TRUE ~ NA)),
    personal_doc = factor(case_when(
      pcp19 == 1 ~ "1.Yes",
      pcp19 == 2 ~ "2.No",
      TRUE ~ NA))
    )

Assessing multicollinearity among predictors

mm <- model.matrix( ~ ., data = nutri2[, c("marital_cat", "mood_cat", "usborn_cat", "personal_doc")])

pheatmap(cor(mm[, -1]), color = colorRampPalette(c("#f0f0f0", "#bdbdbd", "#636363"))(100),
  display_numbers = TRUE)

Visually inspecting the heatmap, it appears that multicollinearity among the predictors does not seem to be an issue. This is further confirmed by the Generalized Variance Inflation Factor GVIF of the negative binomial regression with 1.054 for usborn, 1.08 for marital status, 1.02 for mood and 1.03 for personal care provider, which are largely below the value of 4 frequently used as a rule thumb for further investigations ¹.

Demographic table

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~  marital_cat + mood_cat + personal_doc | usborn_cat, data = nutri1,
       caption = "Characteristics of the study population, stratified by place of birth",
       footnote = "Data from The 2019 New York Community Health Survey")

Characteristics of the study population, stratified by place of birth
	1.US Born (N=4951)	2.Foreign Born (N=3812)	3.Missing value (N=40)	Overall (N=8803)
Data from The 2019 New York Community Health Survey
marital_cat
1.Married	1563 (31.6%)	1749 (45.9%)	15 (37.5%)	3327 (37.8%)
2.Divorced	543 (11.0%)	497 (13.0%)	4 (10.0%)	1044 (11.9%)
3.Widowed	447 (9.0%)	294 (7.7%)	1 (2.5%)	742 (8.4%)
4.Separated	172 (3.5%)	235 (6.2%)	3 (7.5%)	410 (4.7%)
5.Never married	1878 (37.9%)	767 (20.1%)	8 (20.0%)	2653 (30.1%)
6.Living together	314 (6.3%)	239 (6.3%)	2 (5.0%)	555 (6.3%)
7.Missing value	34 (0.7%)	31 (0.8%)	7 (17.5%)	72 (0.8%)
mood_cat
1.All of the time	347 (7.0%)	268 (7.0%)	5 (12.5%)	620 (7.0%)
2.Most of the time	323 (6.5%)	253 (6.6%)	4 (10.0%)	580 (6.6%)
3.Some of the time	944 (19.1%)	569 (14.9%)	2 (5.0%)	1515 (17.2%)
4.A little of the time	933 (18.8%)	794 (20.8%)	7 (17.5%)	1734 (19.7%)
5.None of the time	2404 (48.6%)	1928 (50.6%)	22 (55.0%)	4354 (49.5%)
personal_doc
1.Yes	4361 (88.1%)	3147 (82.6%)	33 (82.5%)	7541 (85.7%)
2.No	564 (11.4%)	643 (16.9%)	6 (15.0%)	1213 (13.8%)
3.Missing value	26 (0.5%)	22 (0.6%)	1 (2.5%)	49 (0.6%)

The majority of US-born participants reported never being married 1878 (37.9%), not feeling low anytime during the past 30 days 2404 (48.6%), and having a personal care provider 4361 (88.1%)
The majority of foreign born participants reported being married 1749 (45.9%), not feeling low anytime during the past 30 days 1928 (50.6%), and having a personal care provider as well 3147 (82.6%)

Regression models

Exploratory data analytics

# histogram to visually assess the frequency of values 
ggplot(nutri1, aes(x=nutrition47)) +
  geom_histogram(binwidth = 0.5) +
  theme_classic()

## Warning: Removed 150 rows containing non-finite outside the scale range
## (`stat_bin()`).

# assessing the proportion of zeros
summarise(nutri1, 
          zeros = nrow(filter(nutri1, nutrition47 == 0)),
          nonzeros = nrow(filter(nutri1, nutrition47 > 0)),
          proportion_zero = zeros/(zeros + nonzeros),
          mean = mean(nutrition47, na.rm=TRUE),
          variance = var(nutrition47, na.rm=TRUE))

## # A tibble: 1 × 5
##   zeros nonzeros proportion_zero  mean variance
##   <int>    <int>           <dbl> <dbl>    <dbl>
## 1  1711     6942           0.198  1.45     2.78

Our outcome variable has less than 20% of zero. Therefore, we predict that zero inflation may not be an issue. However, the variance (2.78) is greater than the mean (0.1977), suggesting overdispersion. We can anticipate that a negative binomial regression may be suitable in our case. But we will proceed with the analysis to find out.

Fitting different models

# Poisson regression
fit.pois <- glm(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc,
                data = nutri2,
                family = poisson(link = "log"))

# Negative binomial
fit.negbin <- MASS::glm.nb(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc, data = nutri2)

# Intercept-only Zero-inflated Poisson regression
fit.ZIpois <- 
  pscl::zeroinfl(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc|1,
                 dist = "poisson",
                 data = nutri2)

# Intercept-only Zero-inflated negative binomial regression
fit.ZInegbin <- 
  pscl::zeroinfl(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc|1,
                         dist = "negbin",
                         data = nutri2)

Nested deviance test for model selection

# loglikelihoods  and degrees of freedom
ll.pois <- logLik(fit.pois)
ll.negbin <- logLik(fit.negbin)
ll.ZIpois <- logLik(fit.ZIpois)
ll.ZInegbin <- logLik(fit.ZInegbin)

# Data frame to store and display the loglikelihoods  and dfs
log_likelihoods <- data.frame(
  Model = c("Poisson", "Negative Binomial", "Zero-Inflated Poisson", "Zero-Inflated Negative Binomial"),
  LogLikelihood = c(as.numeric(ll.pois), as.numeric(ll.negbin), as.numeric(ll.ZIpois), as.numeric(ll.ZInegbin)),
  DegreesOfFreedom = c(attr(ll.pois, "df"), attr(ll.negbin, "df"), attr(ll.ZIpois, "df"), attr(ll.ZInegbin, "df"))
)

# printing outputs in descending order of likelihoods
print(log_likelihoods |> arrange(desc(LogLikelihood)))

##                             Model LogLikelihood DegreesOfFreedom
## 1               Negative Binomial     -13057.72               13
## 2 Zero-Inflated Negative Binomial     -13057.72               14
## 3                         Poisson     -13283.88               12
## 4           Zero-Inflated Poisson     -13283.90               13

Remarks: Here, the poisson and the intercept-only zero-inflated poisson models have quasi-identical loglikelihoods with respectively -13283.88 and -13283.90. The negative binomial and the zero-inflated negative binomial also have identical loglikelihoods (-13057.72) which is greater than the Poisson models. The zero-inflated models do not perform better than the non-zero-inflated one. These results suggest that zero-inflation may not be a concern, but there is overdispersion.

# Pairwise-chi square test

# negative binomial vs poisson
negbin_poisson <- pchisq(2 * (ll.negbin - ll.pois), df=1, 
       lower.tail=FALSE)

# zero inflated poisson model vs poisson model
zeropoisson_poisson <- pchisq(2 * (ll.ZIpois - ll.pois), df=1, 
       lower.tail=FALSE)

# negative binomial vs zero inflated poisson model 
negbin_zeropoisson <- pchisq(2 * (ll.negbin - ll.ZIpois), df=0, 
       lower.tail=FALSE)

# zero inflated negative binomial vs zero inflated poisson model
zeronegbin_zeropoisson <- pchisq(2 * (ll.ZInegbin - ll.ZIpois), df=1, 
       lower.tail=FALSE)

# zero inflated negative binomial vs negative binomial
zeronegbin_negbin <- pchisq(2 * (ll.ZInegbin - ll.negbin), df=1, 
       lower.tail=FALSE)


chi_square_results <- data.frame(
  Comparison = c(
    "Negative Binomial vs Poisson",
    "Zero-Inflated Poisson vs Poisson",
    "Negative Binomial vs Zero-Inflated Poisson",
    "Zero-Inflated Negative Binomial vs Zero-Inflated Poisson",
    "Zero-Inflated Negative Binomial vs Negative Binomial"
  ),
  p_value = c(
    negbin_poisson,
    zeropoisson_poisson,
    negbin_zeropoisson,
    zeronegbin_zeropoisson,
    zeronegbin_negbin
  )
)

# Print the data frame
print(chi_square_results)

##                                                 Comparison       p_value
## 1                             Negative Binomial vs Poisson 2.251998e-100
## 2                         Zero-Inflated Poisson vs Poisson  1.000000e+00
## 3               Negative Binomial vs Zero-Inflated Poisson  0.000000e+00
## 4 Zero-Inflated Negative Binomial vs Zero-Inflated Poisson 2.211031e-100
## 5     Zero-Inflated Negative Binomial vs Negative Binomial  1.000000e+00

The Negative Binomial (NB) model and the intercept-only Zero-Inflated Negative Binomial (ZINB) model fit the data significantly better than both the Poisson and intercept-only Zero-Inflated Poisson models
Between NB and ZINB, the p-value indicates no significant difference. This suggests that the excess zeros in the data may not be prominent enough to justify the added complexity of the ZINB model.

Therefore we will retain the negative binomial as best fit.

fit.negbin <- MASS::glm.nb(nutrition47 ~ usborn_cat + marital_cat + mood_cat + personal_doc, data = nutri2)

summary(fit.negbin)

## 
## Call:
## MASS::glm.nb(formula = nutrition47 ~ usborn_cat + marital_cat + 
##     mood_cat + personal_doc, data = nutri2, init.theta = 6.708326213, 
##     link = log)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.41241    0.04359   9.462  < 2e-16 ***
## usborn_cat2.Foreign Born       -0.18850    0.02087  -9.033  < 2e-16 ***
## marital_cat2.Divorced          -0.15334    0.03379  -4.538 5.69e-06 ***
## marital_cat3.Widowed           -0.15995    0.03902  -4.099 4.15e-05 ***
## marital_cat4.Separated         -0.07150    0.04954  -1.443 0.148966    
## marital_cat5.Never married     -0.10012    0.02471  -4.053 5.07e-05 ***
## marital_cat6.Living together   -0.02871    0.04210  -0.682 0.495240    
## mood_cat2.Most of the time     -0.12882    0.05844  -2.204 0.027499 *  
## mood_cat3.Some of the time      0.07839    0.04644   1.688 0.091394 .  
## mood_cat4.A little of the time  0.12349    0.04547   2.716 0.006614 ** 
## mood_cat5.None of the time      0.15614    0.04209   3.710 0.000208 ***
## personal_doc2.No               -0.06411    0.03002  -2.136 0.032693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(6.7083) family taken to be 1)
## 
##     Null deviance: 8419.8  on 8512  degrees of freedom
## Residual deviance: 8241.2  on 8501  degrees of freedom
##   (290 observations deleted due to missingness)
## AIC: 26141
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  6.708 
##           Std. Err.:  0.455 
## 
##  2 x log-likelihood:  -26115.443

coefficients <- coef(fit.negbin)

# rate ratios
RR <- exp(coefficients)
sd <- summary(fit.negbin)$coefficients[, "Std. Error"]
ci_lower <- round(exp(coefficients - 1.96 * sd), 3)
ci_upper <- round(exp(coefficients + 1.96 * sd), 3)

RR_results <- data.frame(
  RateRatio = round(RR, 3),
  CI_Lower = ci_lower,
  CI_Upper = ci_upper
)

print(RR_results)

##                                RateRatio CI_Lower CI_Upper
## (Intercept)                        1.510    1.387    1.645
## usborn_cat2.Foreign Born           0.828    0.795    0.863
## marital_cat2.Divorced              0.858    0.803    0.917
## marital_cat3.Widowed               0.852    0.789    0.920
## marital_cat4.Separated             0.931    0.845    1.026
## marital_cat5.Never married         0.905    0.862    0.950
## marital_cat6.Living together       0.972    0.895    1.055
## mood_cat2.Most of the time         0.879    0.784    0.986
## mood_cat3.Some of the time         1.082    0.987    1.185
## mood_cat4.A little of the time     1.131    1.035    1.237
## mood_cat5.None of the time         1.169    1.076    1.270
## personal_doc2.No                   0.938    0.884    0.995

vif(fit.negbin)

##                  GVIF Df GVIF^(1/(2*Df))
## usborn_cat   1.054147  1        1.026717
## marital_cat  1.080901  5        1.007810
## mood_cat     1.022751  4        1.002816
## personal_doc 1.025477  1        1.012658

Interpretation of the results

The rate of consomption of cups of vegetables among foreign-born participants was 0.828 (95%CI = 0.795, 0.863) times the rate among United States-born participants, adjusted for marital status, mood and having a personal care provider. Therefore, US-born participants were more likely to have vegetables than foreign born participants.

All Generalized Variance Inflation Factors are around 1, well below 4, which is the rule of thumb to start investigating a multicollinearity issue.

Plotting the model

par(mfrow = c(2, 2))
plot(fit.negbin)

The residuals are scattered arround 0, with no specific pattern,
In the Q-Q plot, they line with the 45 degree reference line, except for high values
We do have a few high leverage observations 7049, 6147 and 3879
Overall, the negative binomial model shows a pretty good fit

Weighted Negative Binomial Regression

library(survey)
# transforming variables into categorical variables
chs19 <- chs19 %>%
  mutate(
      usborn_cat2 = factor(case_when(
      usborn == 1 ~ "1.US Born",
      usborn == 2 ~ "2.Foreign Born",
      TRUE ~ NA)))

chs19[c("maritalstatus19", "mood5", "pcp19")] <- lapply(chs19[c("maritalstatus19", "mood5", "pcp19")], as.factor)

# Setting up the survey design
chs.dsgn <-
  svydesign(
    ids = ~ 1,
    strata = ~ strata,
    weights =  ~ wt20_dual,
    data = chs19, 
    nest = TRUE,
    na.rm = TRUE
  )

# fitting the model
fit.negbin.weighted <- svyglm.nb(nutrition47 ~ usborn_cat2 + maritalstatus19 + mood5 + pcp19, design = chs.dsgn)

summary(fit.negbin.weighted)

## Survey-sampled mle: 
## survey::svymle(loglike = sjstats_loglik, gradient = sjstats_score, 
##     design = design, formulas = list(theta = ~1, eta = formula), 
##     start = c(mod$theta, stats::coef(mod)), na.action = "na.omit")
##                                      Coef         SE p.value
## theta.(Intercept)              5.95749311 2.11602034 0.00487
## eta.(Intercept)                0.39486659 0.11059408 < 0.001
## eta.usborn_cat22.Foreign Born -0.22548899 0.04014150 < 0.001
## eta.maritalstatus192          -0.16421861 0.05521656 0.00294
## eta.maritalstatus193          -0.21084150 0.05947075 < 0.001
## eta.maritalstatus194           0.07096567 0.36079913 0.84407
## eta.maritalstatus195          -0.13441756 0.03972232 < 0.001
## eta.maritalstatus196          -0.08971871 0.06323739 0.15597
## eta.mood52                    -0.16662615 0.12583635 0.18545
## eta.mood53                     0.07921717 0.11812847 0.50247
## eta.mood54                     0.18747265 0.11452730 0.10165
## eta.mood55                     0.17890966 0.11374857 0.11575
## eta.pcp192                    -0.06346046 0.04506488 0.15907
## Stratified Independent Sampling design (with replacement)
## stats::update(design, scaled.weights = dw/mean(dw, na.rm = TRUE))

# Rate ratio of foreign born vs US born
rate_ratio = round(exp(-0.22548899), 3)
lower_bound = round(exp(-0.22548899 - 1.96*0.04014150), 3)
upper_bound = round(exp(-0.22548899 + 1.96*0.04014150), 3)

c(rate_ratio, lower_bound, upper_bound)

## [1] 0.798 0.738 0.863

Here again, the rate of consomption of cups of vegetables among foreign-born participants was 0.798 (95%CI = 0.738, 0.863) times the rate among United States-born participants, adjusted for marital status, mood and having a personal care provider. Therefore, US-born participants were more likely to have vegetables

Conclusion

Both the unweighted (rate ratio = 0.828 [95%CI = 0.795, 0.863]) and the weighted negative binomial regressions (0.798 [95%CI = 0.738, 0.863]) suggested that foreign-born participants were less likely to have had cups of vegetables compared to U.S-born participants, adjusted for marital status, mood, and having a healthcare provider. However, these results should be interpreted with caution since our analysis included only 4 predictors for the purpose of this exercise.

References

Blekkenhorst, L. C., Sim, M., Bondonno, C. P., Bondonno, N. P., Ward, N. C., Prince, R. L., . . . Hodgson, J. M. (2018). Cardiovascular Health Benefits of Specific Vegetable Types: A Narrative Review. Nutrients, 10(5), 595.

Carter, P., Gray, L. J., Troughton, J., Khunti, K., & Davies, M. J. (2010). Fruit and vegetable intake and incidence of type 2 diabetes mellitus: systematic review and meta-analysis. Bmj, 341.

Mello Rodrigues, V., Bray, J., Fernandes, A. C., & Luci Bernardo, G. (2019). Vegetable Consumption and Factors Associated with Increased Intake among College Students: A Scoping Review of the Last 10 Years. 11(7). doi: 10.3390/nu11071634

Slavin, J. L., & Lloyd, B. (2012). Health Benefits of Fruits and Vegetables. Advances in Nutrition, 3(4), 506-516. doi: https://doi.org/10.3945/an.112.002154

Terry, P., Terry, J. B., & Wolk, A. (2001). Fruit and vegetable consumption in the prevention of cancer: an update. Journal of Internal Medicine, 250(4), 280-290. doi: https://doi.org/10.1111/j.1365-2796.2001.00886.x

https://online.stat.psu.edu/stat462/node/180/↩︎

Assessing the association between place of birth and consumption of vegetables | evidence from the 2019 New York Community Health Survey

Andre Nana