Effect of smoking history on head and neck cancer survival

1. Introduction and rationale
2. Data exploration and manipulation
3. Demographic table
4. Kaplan-Meier plots
5. Directed Acyclic Graph (DAG)
6. Cox proportional hazard model
7. Conclusion
Multivariate Imputation by Chained Equations using the package mice

1. Introduction and rationale

Head and neck cancers represent a major public health problem worldwide. Alcohol consumption, smoking, and infection by the human papillomavirus(HPV)) are the main causes of these diseases (Leemans et al., 2011). Treatment relies on different strategies, including chemotherapy, radiation therapy, and surgery, based on specific criteria such as the tumor stage and grade, the patient’s age, and response to initial treatment, among other factors. Using data collected from the medical records of 215 patients treated definitively for HNSCC at the University of Texas MD Anderson Cancer Center, we propose to evaluate the impact of smoking history on head and neck cancer patient’s survival.

2. Data exploration and manipulation

a. Loading relevant libraries

library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=80),tidy=TRUE, echo = TRUE, message = FALSE)

# loading packages
library(survival)
library(dplyr)
library(table1)
library(survminer)
library(readxl)
library(dagitty)
library(stargazer)
library(pheatmap)
library(gtsummary)
library(mice)

b. Loading data and selecting variables of interest

hnscc_raw <- read_excel("HNSCC-MDA-Data_update_20240514.xlsx")
hnscc0 <- hnscc_raw %>%
    select(Sex, Age, Diag, Grade, Stage, `Survival  (months)`, `Alive or Dead`, `Smoking History`,
        `Current Smoker`, `BMI start treat (kg/m2)`)

c. Assessing missing data and recoding variables of interest

na_count <- hnscc0 %>%
    summarize_all(~sum(is.na(.)))
print(t(na_count))

##                         [,1]
## Sex                        0
## Age                        0
## Diag                       0
## Grade                      0
## Stage                      0
## Survival  (months)         0
## Alive or Dead              0
## Smoking History            0
## Current Smoker             0
## BMI start treat (kg/m2)    0

We do not have any missing values for the selected variables

hnscc1 <- hnscc0 %>%
  mutate(
    
    # Outcome of interest
    
      # for the demographic table
    Outcome = `Alive or Dead`, 
    
      # for the survival analysis
    Cens = case_when( 
     `Alive or Dead` == "Alive" ~ 0,
     `Alive or Dead` == "Dead" ~ 1), 
    
    # Time to event data
     Time = `Survival  (months)`, 
    
     # Main exposure
    `Smoking history` = factor(case_when(
      `Smoking History`== 0 ~ "Never",
      `Smoking History`== 1 ~ "< 10 pack",
      `Smoking History`== 2 ~ ">= 10 pack"),
      levels = c("Never", "< 10 pack", ">= 10 pack")),
    
    
    # Other predictors
       # transforming age into categories
    `Age group` = factor(cut(Age, breaks = c(20, 50, 60, 70, 95))), 
    
    `Current smoker` = factor(case_when(
      `Current Smoker` == 0 ~ "No",
      `Current Smoker` == 1 ~ "Yes")),
   
    `Stage` = factor(case_when(
      Stage %in% c("IVA", "IVB") ~ "IV",
      Stage %in% c("I", "II", "III") ~ Stage),
      levels = c("I", "II", "III", "IV") ),
    
    BMI = factor(case_when(
      `BMI start treat (kg/m2)` < 18.5 ~ "Underweight",
      `BMI start treat (kg/m2)` >= 18.5 & `BMI start treat (kg/m2)`< 24.9 ~ "Normal weight",
      `BMI start treat (kg/m2)` >= 24.9 & `BMI start treat (kg/m2)`< 30 ~ "Overweight",
      `BMI start treat (kg/m2)` >= 30 ~ "Obese"),
      levels = c("Underweight", "Normal weight","Overweight","Obese" )),
    
    Diagnostic = factor(case_when(
      Diag %in% c("CA alveolar ridge", "CA buccal mucosa", "CA oral tongue", 
              "CA soft palate", "CA retromolar trigone", "CA BOT",  
              "recurrence CA retromolar trigone") ~ "Oral",
      Diag %in% c("CA oropharynx", "CA posterior pharyngeal wall", "CA pharyngeal", 
              "CA hypopharynx", "CA pyriform sinus","CA glossopharyngeal sulcus",
              "CA tonsil") ~ "Pharyngeal",
      Diag %in% c("CA larynx", "CA supraglottic") ~ "Laryngeal",
      Diag %in% c("CA maxillary sinus", "NPC") ~ "Sinonasal/pharyngeal",
      Diag == "CUP" ~ "Unknown Primary",
      TRUE ~ "Other"  )),
    
    Grade = as.factor(Grade)
    )

d. Assessing multicollinearity among predictors

The heatmap shows that smoking history is correlated with current smoking, with a Pearson correlation coefficient of 0.6. Therefore we will include only smoking history in our prediction models. All remaining pairwise correlation coefficients were less than or equal to 0.25.

# We renamed these variables to make the heatmap readable in the knitted pdf
# file
hnscc1$ag = hnscc1$`Age group`
hnscc1$st = hnscc1$Stage
hnscc1$hist = hnscc1$`Smoking history`
hnscc1$sm = hnscc1$`Current smoker`
hnscc1$dia = hnscc1$Diagnostic

mm <- model.matrix(~., data = hnscc1[, c("ag", "st", "hist", "sm", "dia", "BMI")])

# Create the heatmap without dendrograms
pheatmap(cor(mm[, -1]), color = colorRampPalette(c("#ebf5fb", "#aed6f1", "#3498db"))(100),
    display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE)

3. Demographic table

library(table1)
table1(~Sex + Age + `Age group` + Outcome + `Current smoker` + Diagnostic + Grade +
    Stage + BMI | `Smoking history`, data = hnscc1, caption = "Characteristics of the study population, stratified by 
       current smoking status",
    footnote = "Data source: Imaging and clinical data archive for 
       head and neck squamous cell carcinoma patients treated with radiotherapy")

Characteristics of the study population, stratified by current smoking status
	Never (N=79)	< 10 pack (N=23)	>= 10 pack (N=113)	Overall (N=215)
Data source: Imaging and clinical data archive for head and neck squamous cell carcinoma patients treated with radiotherapy
Sex
Female	15 (19.0%)	3 (13.0%)	15 (13.3%)	33 (15.3%)
Male	64 (81.0%)	20 (87.0%)	98 (86.7%)	182 (84.7%)
Age
Mean (SD)	55.2 (11.2)	57.8 (8.85)	58.5 (8.70)	57.2 (9.79)
Median [Min, Max]	54.0 [24.0, 91.0]	56.0 [45.0, 77.0]	59.0 [35.0, 81.0]	57.0 [24.0, 91.0]
Age group
(20,50]	26 (32.9%)	6 (26.1%)	21 (18.6%)	53 (24.7%)
(50,60]	31 (39.2%)	9 (39.1%)	41 (36.3%)	81 (37.7%)
(60,70]	15 (19.0%)	5 (21.7%)	43 (38.1%)	63 (29.3%)
(70,95]	7 (8.9%)	3 (13.0%)	8 (7.1%)	18 (8.4%)
Outcome
Alive	54 (68.4%)	16 (69.6%)	68 (60.2%)	138 (64.2%)
Dead	25 (31.6%)	7 (30.4%)	45 (39.8%)	77 (35.8%)
Current smoker
No	79 (100%)	20 (87.0%)	46 (40.7%)	145 (67.4%)
Yes	0 (0%)	3 (13.0%)	67 (59.3%)	70 (32.6%)
Diagnostic
Laryngeal	11 (13.9%)	3 (13.0%)	10 (8.8%)	24 (11.2%)
Oral	27 (34.2%)	11 (47.8%)	54 (47.8%)	92 (42.8%)
Pharyngeal	36 (45.6%)	7 (30.4%)	41 (36.3%)	84 (39.1%)
Sinonasal/pharyngeal	3 (3.8%)	1 (4.3%)	5 (4.4%)	9 (4.2%)
Unknown Primary	2 (2.5%)	1 (4.3%)	3 (2.7%)	6 (2.8%)
Grade
moderately diff.	26 (32.9%)	11 (47.8%)	52 (46.0%)	89 (41.4%)
moderately to poorly diff.	2 (2.5%)	1 (4.3%)	9 (8.0%)	12 (5.6%)
poorly diff.	41 (51.9%)	9 (39.1%)	39 (34.5%)	89 (41.4%)
undiff.	1 (1.3%)	0 (0%)	2 (1.8%)	3 (1.4%)
well diff.	8 (10.1%)	2 (8.7%)	9 (8.0%)	19 (8.8%)
Well to moderately diff.	1 (1.3%)	0 (0%)	2 (1.8%)	3 (1.4%)
Stage
I	0 (0%)	0 (0%)	4 (3.5%)	4 (1.9%)
II	1 (1.3%)	0 (0%)	4 (3.5%)	5 (2.3%)
III	9 (11.4%)	5 (21.7%)	17 (15.0%)	31 (14.4%)
IV	69 (87.3%)	18 (78.3%)	88 (77.9%)	175 (81.4%)
BMI
Underweight	2 (2.5%)	1 (4.3%)	2 (1.8%)	5 (2.3%)
Normal weight	16 (20.3%)	5 (21.7%)	37 (32.7%)	58 (27.0%)
Overweight	30 (38.0%)	6 (26.1%)	43 (38.1%)	79 (36.7%)
Obese	31 (39.2%)	11 (47.8%)	31 (27.4%)	73 (34.0%)

There is an over representation of male participants, with 84.7%(n=182)
The majority of participants were in the (50,60] years age range, with 81 participants or 37.7%.
The majority - 67.4%- of participants reporting not being current smokers (n=145)
The majority of participants could be classified as overweight (36.7%) or obese (34.0%)
Most cancers could be anatomically classified as oral cancers with 42.8% (n=92)

4. Kaplan-Meier plots

# Kaplan meier estimates
kmsmoker <- survminer::surv_fit(Surv(Time, Cens) ~ as.factor(`Current smoker`), data = hnscc1)
kmsmoking_history <- survminer::surv_fit(Surv(Time, Cens) ~ as.factor(`Smoking history`),
    data = hnscc1)
kmAge <- survminer::surv_fit(Surv(Time, Cens) ~ as.factor(`Age group`), data = hnscc1)
kmStage <- survminer::surv_fit(Surv(Time, Cens) ~ as.factor(Stage), data = hnscc1)
kmSex <- survminer::surv_fit(Surv(Time, Cens) ~ Sex, data = hnscc1)

# survival plot with follow-up table
survminer::ggsurvplot(kmsmoker, xlab = "Time (Months)", ylab = "Survival Probability",
    pval = TRUE, title = "Survival by current smoking status")

survminer::ggsurvplot(kmsmoking_history, xlab = "Time (Months)", ylab = "Survival Probability",
    pval = TRUE, title = "Survival by history of smoking")

survminer::ggsurvplot(kmAge, xlab = "Time (Months)", ylab = "Survival Probability",
    pval = TRUE, title = "Survival by age group")

survminer::ggsurvplot(kmStage, xlab = "Time (Months)", ylab = "Survival Probability",
    pval = TRUE, title = "Survival by cancer stage")

survminer::ggsurvplot(kmSex, xlab = "Time (Monthss)", ylab = "Survival Probability",
    pval = TRUE, title = "Survival by sex")

# Here we create an other object (kmsmoker2 for instance), because <- NULL
# suppresses the call output and simplify the reading, but kmsmoker cannot be
# used in the log-minus log plot in later sections

kmsmoker2 = kmsmoker
kmsmoking_history2 = kmsmoking_history
kmStage2 = kmStage
kmAge2 = kmAge
kmSex2 = kmSex


kmsmoker2$call <- NULL
kmsmoking_history2$call <- NULL
kmStage2$call <- NULL
kmAge2$call <- NULL
kmSex2$call <- NULL

kmsmoker2

##                                   n events median 0.95LCL 0.95UCL
## as.factor(`Current smoker`)=No  145     45  105.9    97.6      NA
## as.factor(`Current smoker`)=Yes  70     32   86.8    75.4      NA

kmsmoking_history2

##                                           n events median 0.95LCL 0.95UCL
## as.factor(`Smoking history`)=Never       79     25    106    97.6      NA
## as.factor(`Smoking history`)=< 10 pack   23      7     NA    62.6      NA
## as.factor(`Smoking history`)=>= 10 pack 113     45    119    86.8      NA

kmStage2

##                        n events median 0.95LCL 0.95UCL
## as.factor(Stage)=I     4      4   30.5    16.0      NA
## as.factor(Stage)=II    5      2     NA    58.0      NA
## as.factor(Stage)=III  31     12  105.9    79.7      NA
## as.factor(Stage)=IV  175     59  118.9    93.9      NA

kmAge2

##                                 n events median 0.95LCL 0.95UCL
## as.factor(`Age group`)=(20,50] 53     14     NA      NA      NA
## as.factor(`Age group`)=(50,60] 81     24     NA      NA      NA
## as.factor(`Age group`)=(60,70] 63     31   88.4    75.4      NA
## as.factor(`Age group`)=(70,95] 18      8   80.0    42.5      NA

kmSex2

##              n events median 0.95LCL 0.95UCL
## Sex=Female  33     11     NA    79.5      NA
## Sex=Male   182     66    106    93.9      NA

We can see from the figure that the survival curves look different for current smoker (blue) and non current smoker (orange). However, the log rank test p-value of 0.08 is not significant. Based on these data, were can conclude that there was no difference in survival based on current smoking status. Nonetheless, the median survival time was 105.9 months for non-current smoker, versus 86.8 months for current smokers.
For smoking history, the curves of the different comparison groups are crossing and sometimes overlaping. The log-rank test p-value is also not significant (0.53), suggesting that survival did not differ by smoking history.
Survival did not differ by sex either, as the survival curves for Males and Females are crossing and the log-rank test p-value was not significant (0.82)
For age, there are crossing, and overlapping between the curves of the different age groups, and the log-rank test p-value was significant also not significant,(0.08), suggesting that survival did not differ by age group. However, the results could have been different, had we retained a different segmentation of age.
With regard to stage, there is also a lot of overlap between the different stages. Nonetheless, the log-rank p-value is 0.0037, suggesting that survival differed by the stage of the cancer. Surprisingly, in this study, the lower stages had the poorest survival with the lowest median survival of 30.5 months. The last patient in the stage 1 group died before the 46th month. All the lower stage cancer patients had a history of smoking more than 10 packs but were no current smokers.Two of them had poorly differentiated tumors, 1 had a moderately to poor differentiation and the remaining patient had moderately differented tumor. There is a relationship between tumor differentiation and survival, with poor differentiation associated with poor outcome.

quantile(kmStage, probs = 1)

## $quantile
##                            100
## as.factor(Stage)=I    45.93333
## as.factor(Stage)=II         NA
## as.factor(Stage)=III        NA
## as.factor(Stage)=IV  118.93333
## 
## $lower
##                      100
## as.factor(Stage)=I    NA
## as.factor(Stage)=II   NA
## as.factor(Stage)=III  NA
## as.factor(Stage)=IV   NA
## 
## $upper
##                      100
## as.factor(Stage)=I    NA
## as.factor(Stage)=II   NA
## as.factor(Stage)=III  NA
## as.factor(Stage)=IV   NA

hnscc1 |>
    filter(Stage == "I") |>
    select(`Current smoker`, `Smoking history`, Grade, BMI, `Age group`, Diagnostic) |>
    summary()

##  Current smoker   Smoking history                        Grade  
##  No :4          Never     :0      moderately diff.          :1  
##  Yes:0          < 10 pack :0      moderately to poorly diff.:1  
##                 >= 10 pack:4      poorly diff.              :2  
##                                   undiff.                   :0  
##                                   well diff.                :0  
##                                   Well to moderately diff.  :0  
##             BMI      Age group                Diagnostic
##  Underweight  :0   (20,50]:0   Laryngeal           :0   
##  Normal weight:1   (50,60]:2   Oral                :3   
##  Overweight   :2   (60,70]:2   Pharyngeal          :1   
##  Obese        :1   (70,95]:0   Sinonasal/pharyngeal:0   
##                                Unknown Primary     :0   
##

5. Directed Acyclic Graph (DAG)

DAG <- dagitty("dag {
bb=\"0,0,1,1\"
\"Smoking history\" [exposure,pos=\"0.184,0.451\"]
\"overall survival\" [outcome,pos=\"0.827,0.439\"]
\"Unknown HPV status\" [pos=\"0.5, 0.3\"]
Sex [pos=\"0.7,0.13\"]
BMI [pos=\"0.5,0.561\"]
Grade [pos=\"0.5, 0.897\"]
Diagnostic [pos=\"0.5, 0.63\"]
Age [pos=\"0.35,0.1\"]
Stage [pos=\"0.5,0.729\"]
\"Smoking history\" -> \"overall survival\"
\"Smoking history\" -> BMI
\"Smoking history\" -> Grade
\"Smoking history\" -> Stage
\"Smoking history\" -> Diagnostic
Age -> \"Smoking history\"
Age -> \"overall survival\"
Age -> \"Unknown HPV status\"
\"Unknown HPV status\" -> \"overall survival\"
BMI -> \"overall survival\"
Grade -> \"overall survival\"
Diagnostic -> \"overall survival\"
Sex -> \"Smoking history\"
Sex -> \"overall survival\"
Stage -> \"overall survival\"
}")

plot(DAG)

In this model, our exposure of interest is smoking history, and our outcome is overall survival. In this DAG, we hypothesis that Age, Sex and the unmeasured HPV status are confounders. Therefore, we will adjust for Age and Sex. BMI, Diagnostic, Stage and Grade are mediators in the causal pathway between history of smoking and survival. Therefore, we will not adjust for these variables, since we want to estimate the total effect of smoking history on survival.

6. Cox proportional hazard model

a. Univariate regression

for each of your predictor variables to estimate their crude associations with survival.

fitcoxph_smoker <- coxph(Surv(Time, Cens) ~ `Smoking history`, hnscc1)
fitcoxph_Age <- coxph(Surv(Time, Cens) ~ `Age group`, hnscc1)
fitcoxph_Sex <- coxph(Surv(Time, Cens) ~ Sex, hnscc1)

smoker <- tbl_regression(fitcoxph_smoker, exponentiate = TRUE)
Age <- tbl_regression(fitcoxph_Age, exponentiate = TRUE)
Sex <- tbl_regression(fitcoxph_Sex, exponentiate = TRUE)

smoker

Characteristic	HR	95% CI	p-value
Smoking history
Never	—	—
< 10 pack	1.07	0.46, 2.48	0.9
>= 10 pack	1.31	0.80, 2.15	0.3
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Age

Characteristic	HR	95% CI	p-value
Age group
(20,50]	—	—
(50,60]	1.06	0.55, 2.05	0.9
(60,70]	1.87	0.99, 3.54	0.053
(70,95]	1.86	0.78, 4.45	0.2
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Sex

Characteristic	HR	95% CI	p-value
Sex
Female	—	—
Male	1.08	0.57, 2.04	0.8
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

In this univariate analysis:

Participants a smoking history of less than 10 pack a year, had 1.07 (95% CI = 0.46, 2.48) times the hazard of dying from head and neck cancer, compared to those who never smoked.
Participants with a smoking history of at least 10 pack a year, had 1.31 (95% CI = 0.80, 2.15) times the hazard of dying from head and neck cancer, compared to those who never smoked.
Participants aged (50,60] had 1.06 (95% CI = 0.55, 2.05) times the hazard of dying from head and neck cancer, compared to participants aged (20,50].
Participants aged (60,70] had 1.87(95% CI = 0.99, 3.54) times the hazard of dying from head and neck cancer, compared to participants aged (20,50].
Participants aged (70,95] had 1.86 (95% CI = 0.78, 4.45) times the hazard of dying from head and neck cancer, compared to participants aged (20,50].
Male participants had 1.08 (95% CI = 0.57, 2.04) times the hazard of dying from head and neck cancer, compared to female participants

In these univariate models, survival from head and neck cancer did not differ by smoking history, age group or sex

b. Multivariate regression

Based on our theory displayed in the DAG, we will estimate the total effect of smoking history on survival, controlled for sex and age categories, which we believe, are confounders.

fitcoxph <- coxph(Surv(Time, Cens) ~ `Smoking history` + Sex + `Age group`, hnscc1)

results <- tbl_regression(fitcoxph, exponentiate = TRUE)
results

Characteristic	HR	95% CI	p-value
Smoking history
Never	—	—
< 10 pack	1.04	0.45, 2.43	>0.9
>= 10 pack	1.19	0.72, 1.98	0.5
Sex
Female	—	—
Male	1.06	0.56, 2.03	0.9
Age group
(20,50]	—	—
(50,60]	1.04	0.53, 2.01	>0.9
(60,70]	1.78	0.93, 3.41	0.083
(70,95]	1.85	0.77, 4.42	0.2
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

In this multivariate analysis:

Participants with a smoking history of less than 10 pack a year, had 1.04 (95% CI = 0.45, 2.43) times the hazard of dying from head and neck cancer, compared to those who never smoked, adjusted for sex and age group
Participants had a smoking history of at least 10 pack a year, had 1.19 (95% CI = 0.72, 1.98) times the hazard of dying from head and neck cancer, compared to those who never smoked, adjusted for sex and age group
Participants aged (50,60] had 1.04 (95% CI = 0.53, 2.01) times the hazard of dying from head and neck cancer, compared to participants aged (20,50], adjusted for smoking history and sex.
Participants aged (60,70] had 1.78(95% CI = 0.93, 3.41) times the hazard of dying from head and neck cancer, compared to participants aged (20,50], adjusted for smoking history and sex.
Participants aged (70,95] had 1.85(95% CI = 0.77, 4.42) times the hazard of dying from head and neck cancer, compared to participants aged (20,50], adjusted for smoking history and sex.
Male participants had 1.06 (95% CI = 0.56, 2.02) times the hazard of dying from head and neck cancer, compared to female participants, adjusted for smoking history and age

The results of this multivariate analysis are very similar to the univariate models and suggest that survival from head and neck cancer did not differ by smoking history, age group or sex

c. Diagnostics of proportional hazards assumptions

Log minus log plot

ggsurvplot(kmsmoking_history, fun = "cloglog", xlab = "Time (in months) using log",
    ylab = "log-log survival", main = "log-log curves by smoking history")

ggsurvplot(kmAge, fun = "cloglog", xlab = "Time (in months) using log", ylab = "log-log survival",
    main = "log-log curves by age group")

ggsurvplot(kmSex, fun = "cloglog", xlab = "Time (in months) using log", ylab = "log-log survival",
    main = "log-log curves by sex")

It seems like the proportional hazards assumption holds for smoking history, to a certain extend. For age groups, the assumptions may hold between (20,50](orange curve) and (60,70](cyan blue curve), but there are overlapping between the (50,60] group and the others. The proportional hazards assumption does not hold for sex.

Schoenfeld test for proportional hazards

fitzph <- cox.zph(fitcoxph)
fitzph

##                   chisq df    p
## `Smoking history`  2.50  2 0.29
## Sex                1.57  1 0.21
## `Age group`        2.31  3 0.51
## GLOBAL             7.20  6 0.30

plot(fitzph)

The Schoenfeld residuals test shows non-significant p-values for all individual covariates and the global test. The residuals plot looks unsystematic for smoking history and age group, but there is a pattern for the sex variable. We conclude that the proportional hazards assumption does not hold for sex, and decide to proceed with a stratified cox proportional hazard model, with sex as the stratified variable

d. Stratification

fitcoxph_strat <- coxph(Surv(Time, Cens) ~ `Smoking history` + strata(Sex) + `Age group`,
    hnscc1)

results_strat <- tbl_regression(fitcoxph_strat, exponentiate = TRUE)
results_strat

Characteristic	HR	95% CI	p-value
Smoking history
Never	—	—
< 10 pack	1.02	0.44, 2.37	>0.9
>= 10 pack	1.21	0.73, 1.99	0.5
Age group
(20,50]	—	—
(50,60]	1.02	0.52, 1.97	>0.9
(60,70]	1.73	0.91, 3.32	0.10
(70,95]	1.87	0.78, 4.48	0.2
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

The stratified model yields similar results and survival did not differ by smoking history or age group.

e. Interpretation and discussion

Results

We wanted to assess the effect of smoking history on survival from head and neck cancer, adjusted for age and sex that we identified as confounders, based on our DAG.
In our final stratified model, participant who had smoking history of less than 10 pack a year, had 1.02 (95%CI = 0.44, 2.37) times the hazard of dying from head and neck cancer, compared to those who never smoked, adjusted for sex and age group, and participant who had smoking history of at least 10 pack a year, had 1.21 (95%CI = 0.73, 1.99) times the hazard of dying from head and neck cancer, compared to those who never smoked, adjusted for sex and age group. With regard to age groups, all estimates remained non-significant.
These results suggest that head and neck cancer survival was not associated wit history of smoking or age

Discussion

From the Kaplan Mieir plots, we found that the survival did not differ by current smoking status or history of smoking, and that lower cancer stages had poorer survival. From our cox proportional hazard model, we found also that survival did not differ by smoking status both at the univariate, multivariate analysis and the stratified cox proportional hazards model. This is against our expectations, as it is well documented that smoking, whether current or not, has a negative impact on cancer survival, including head and neck cancer. It is also obvious that the higher the stage, the more severe the prognosis.

Potential reasons for these curious findings:
The source of the data and the nature of the study design. This is a retrospective cohort study, as data were collected from existing medical records from October 2003 to August 2013. Therefore, we may assume that the data were not expressly designed for a survival analysis.
Survival was defined as the time from diagnosis to the date of death due to any cause. We are not even sure if patients died from head and neck cancer or something else. Therefore we have competing risks in this case, since patients could have died from head and neck cancer OR any other disease. Subsequently, we cannot confirm that the non-informative censoring assumption is not violated. A competing risk analysis may be more suitable for this data.
The assessment of smoking history is subject to information bias, with possible misclassification of exposure (differential or non-differential).
We did not measure the effect of major confounders such as HPV infection and race.
We did not take treatment into consideration in this analysis (it would require a thorough assessment and classification, given the complexity of treatment methods and regimen)
It is worth noting that our study included only 215 patients, which may be insufficient to achieve the 80% power needed to detect a meaningful effect size. For example, the unexpected finding that patients with lower grades had poorer survival could be related to their specific characteristics (such as poorly differentiated tumors or a history of heavy smoking). However, this result might also stem from the fact that there were only 4 patients in that subgroup.

7. Conclusion

In this analysis, we aimed to evaluate the impact of smoking history on survival in head and neck cancer, while accounting for selected confounders - sex and age - identified through our theoretical framework as depicted in the DAG. Our findings showed no significant predictors of survival in the univariate, multivariate, or stratified models, and survival probabilities did not differ across any of the selected predictors. Interestingly, we observed an unexpected result: patients with lower-stage disease had the lowest median survival. These findings contradict our initial hypothesis that smoking history would negatively affect survival. Potential explanations for these results are discussed in the interpretation and discussion sections. To address this research question more effectively, a prospective cohort study with sufficient statistical power would be more appropriate.

Multivariate Imputation by Chained Equations using the package mice

hpv<- read.csv("HNSCC-Clinical-Data.csv")
hpv1 <- hpv |> mutate(
  Status = factor(case_when(
    `HPV.Status` == "N" ~ "Negative",
    `HPV.Status` == "P" ~ "Positive",
    `HPV.Status` == "Unknown" ~ "Unknown",
    TRUE ~ NA_character_),
    levels = c("Negative", "Positive", "Unknown")),
  
  Cens1 = case_when(
    `Vital.status` == "Alive" ~ 0,
    `Vital.status` == "Dead" ~ 1),
  
  Time = `Overall.survival_duration`)

# Setting the number of imputation to 100, default predictor matrix and methods

hpv_imp <- mice(hpv1, m = 100, seed = 12345)

## Warning: Number of logged events: 20

hpv_fit <- with(hpv_imp, coxph(Surv(Time, Cens1) ~ Status))

# Pool the results from the multiple imputations
pooled <- pool(hpv_fit)

# Get the summary of the pooled model
summary_pooled <- summary(pooled)

# Calculate the odds ratios and confidence intervals
hazards_ratios <- round(exp(summary_pooled$estimate), 2)
conf_int_lower <- round(exp(summary_pooled$estimate - 1.96 * summary_pooled$std.error),
    2)
conf_int_upper <- round(exp(summary_pooled$estimate + 1.96 * summary_pooled$std.error),
    2)

# Combine the results into a data frame
results <- data.frame(Predictor = summary_pooled[, 1], hazards_ratios = hazards_ratios,
    CI_Lower = conf_int_lower, CI_Upper = conf_int_upper)

# Display the results
results

Remark:

Patients with a positive HPV status, had 0.37 (95% CI = 0.22, 0.64) times the hazard of dying from head and neck cancer, compared to those with negative HPV status.
Patients with an Unknown HPV status, had 0.39 (95% CI = 0.23 0.68) times the hazard of dying from head and neck cancer, compared to those with negative HPV status.

This is consistent with the literature, since a positive HPV infection is associated with a better survival (Li et al. 2018)

References

Leemans, C. R., Braakhuis, B. J., & Brakenhoff, R. H. (2011). The molecular biology of head and neck cancer. Nature Reviews Cancer, 11(1), 9–22.
Li, H., Torabi, S. J., Yarbrough, W. G., Mehra, S., Osborn, H. A., & Judson, B. (2018). Association of human papillomavirus status at head and neck carcinoma subsites with overall survival. JAMA Otolaryngology–Head & Neck Surgery, 144(6), 519–525. https://doi.org/10.1001/jamaoto.2018.0395