Andre Nana
This project investigates recurrence in thyroid cancer patients using both statistical logistic regression and machine learning methods. With the logistic regression, we explore the relationship between ever smoking and the likelihood of recurrence. We then shift to machine learning techniques for risk stratification and evaluate the predictive accuracy of the following algorithms:
These algorithms are evaluated based on the following performance metrics:
The dataset in this study is publicly accessible and the University of California Machine Learning Repository, and was generously provided by Borzooei et al., who published their original findings in 2024.
Disclaimer: This analysis is intended for educational and research purposes only and has not been peer-reviewed. While efforts have been made to ensure the accuracy of the methods and results, the author does not guarantee the correctness or completeness of the analysis. The author bears no responsibility or liability for any errors, omissions, or outcomes resulting from the use of this material. Use at your own discretion.
# Core data handling & wrangling
library(tidyverse)
library(readxl)
# Data exploration & summary
library(naniar)
library(table1)
library(gtsummary)
# Modeling frameworks
library(lme4)
library(reticulate)
# Model evaluation
library(pROC)
library(precrec)
# Machine learning
library(caret)
library(caretEnsemble)
library(nnet)
library(randomForest)
library(e1071)
library(rpart)
library(ModelMetrics)
library(iml)
# Loading Data
tcr_original <- read.csv("Thyroid_Diff.csv")
colnames(tcr_original)
## [1] "Age" "Gender" "Smoking"
## [4] "Hx.Smoking" "Hx.Radiothreapy" "Thyroid.Function"
## [7] "Physical.Examination" "Adenopathy" "Pathology"
## [10] "Focality" "Risk" "T"
## [13] "N" "M" "Stage"
## [16] "Response" "Recurred"
glimpse(tcr_original)
## Rows: 383
## Columns: 17
## $ Age <int> 27, 34, 30, 62, 62, 52, 41, 46, 51, 40, 75, 59, 4…
## $ Gender <chr> "F", "F", "F", "F", "F", "M", "F", "F", "F", "F",…
## $ Smoking <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", …
## $ Hx.Smoking <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No",…
## $ Hx.Radiothreapy <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ Thyroid.Function <chr> "Euthyroid", "Euthyroid", "Euthyroid", "Euthyroid…
## $ Physical.Examination <chr> "Single nodular goiter-left", "Multinodular goite…
## $ Adenopathy <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ Pathology <chr> "Micropapillary", "Micropapillary", "Micropapilla…
## $ Focality <chr> "Uni-Focal", "Uni-Focal", "Uni-Focal", "Uni-Focal…
## $ Risk <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", …
## $ T <chr> "T1a", "T1a", "T1a", "T1a", "T1a", "T1a", "T1a", …
## $ N <chr> "N0", "N0", "N0", "N0", "N0", "N0", "N0", "N0", "…
## $ M <chr> "M0", "M0", "M0", "M0", "M0", "M0", "M0", "M0", "…
## $ Stage <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I",…
## $ Response <chr> "Indeterminate", "Excellent", "Excellent", "Excel…
## $ Recurred <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
vis_miss(tcr_original, sort_miss = TRUE)
383 observation and 17 variables, no missing values
table1(~. |Recurred,
data = tcr_original,
caption = "Description of case by recurrence status",
footnote = "Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.")
No (N=275) |
Yes (N=108) |
Overall (N=383) |
|
---|---|---|---|
Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J. | |||
Age | |||
Mean (SD) | 38.4 (12.9) | 47.1 (18.3) | 40.9 (15.1) |
Median [Min, Max] | 36.0 [17.0, 81.0] | 44.5 [15.0, 82.0] | 37.0 [15.0, 82.0] |
Gender | |||
F | 246 (89.5%) | 66 (61.1%) | 312 (81.5%) |
M | 29 (10.5%) | 42 (38.9%) | 71 (18.5%) |
Smoking | |||
No | 259 (94.2%) | 75 (69.4%) | 334 (87.2%) |
Yes | 16 (5.8%) | 33 (30.6%) | 49 (12.8%) |
Hx.Smoking | |||
No | 261 (94.9%) | 94 (87.0%) | 355 (92.7%) |
Yes | 14 (5.1%) | 14 (13.0%) | 28 (7.3%) |
Hx.Radiothreapy | |||
No | 274 (99.6%) | 102 (94.4%) | 376 (98.2%) |
Yes | 1 (0.4%) | 6 (5.6%) | 7 (1.8%) |
Thyroid.Function | |||
Clinical Hyperthyroidism | 17 (6.2%) | 3 (2.8%) | 20 (5.2%) |
Clinical Hypothyroidism | 10 (3.6%) | 2 (1.9%) | 12 (3.1%) |
Euthyroid | 234 (85.1%) | 98 (90.7%) | 332 (86.7%) |
Subclinical Hyperthyroidism | 5 (1.8%) | 0 (0%) | 5 (1.3%) |
Subclinical Hypothyroidism | 9 (3.3%) | 5 (4.6%) | 14 (3.7%) |
Physical.Examination | |||
Diffuse goiter | 7 (2.5%) | 0 (0%) | 7 (1.8%) |
Multinodular goiter | 88 (32.0%) | 52 (48.1%) | 140 (36.6%) |
Normal | 5 (1.8%) | 2 (1.9%) | 7 (1.8%) |
Single nodular goiter-left | 63 (22.9%) | 26 (24.1%) | 89 (23.2%) |
Single nodular goiter-right | 112 (40.7%) | 28 (25.9%) | 140 (36.6%) |
Adenopathy | |||
Bilateral | 5 (1.8%) | 27 (25.0%) | 32 (8.4%) |
Left | 5 (1.8%) | 12 (11.1%) | 17 (4.4%) |
No | 247 (89.8%) | 30 (27.8%) | 277 (72.3%) |
Right | 18 (6.5%) | 30 (27.8%) | 48 (12.5%) |
Extensive | 0 (0%) | 7 (6.5%) | 7 (1.8%) |
Posterior | 0 (0%) | 2 (1.9%) | 2 (0.5%) |
Pathology | |||
Follicular | 16 (5.8%) | 12 (11.1%) | 28 (7.3%) |
Hurthel cell | 14 (5.1%) | 6 (5.6%) | 20 (5.2%) |
Micropapillary | 48 (17.5%) | 0 (0%) | 48 (12.5%) |
Papillary | 197 (71.6%) | 90 (83.3%) | 287 (74.9%) |
Focality | |||
Multi-Focal | 66 (24.0%) | 70 (64.8%) | 136 (35.5%) |
Uni-Focal | 209 (76.0%) | 38 (35.2%) | 247 (64.5%) |
Risk | |||
Intermediate | 38 (13.8%) | 64 (59.3%) | 102 (26.6%) |
Low | 237 (86.2%) | 12 (11.1%) | 249 (65.0%) |
High | 0 (0%) | 32 (29.6%) | 32 (8.4%) |
T | |||
T1a | 48 (17.5%) | 1 (0.9%) | 49 (12.8%) |
T1b | 38 (13.8%) | 5 (4.6%) | 43 (11.2%) |
T2 | 131 (47.6%) | 20 (18.5%) | 151 (39.4%) |
T3a | 55 (20.0%) | 41 (38.0%) | 96 (25.1%) |
T3b | 2 (0.7%) | 14 (13.0%) | 16 (4.2%) |
T4a | 1 (0.4%) | 19 (17.6%) | 20 (5.2%) |
T4b | 0 (0%) | 8 (7.4%) | 8 (2.1%) |
N | |||
N0 | 241 (87.6%) | 27 (25.0%) | 268 (70.0%) |
N1a | 12 (4.4%) | 10 (9.3%) | 22 (5.7%) |
N1b | 22 (8.0%) | 71 (65.7%) | 93 (24.3%) |
M | |||
M0 | 275 (100%) | 90 (83.3%) | 365 (95.3%) |
M1 | 0 (0%) | 18 (16.7%) | 18 (4.7%) |
Stage | |||
I | 268 (97.5%) | 65 (60.2%) | 333 (86.9%) |
II | 7 (2.5%) | 25 (23.1%) | 32 (8.4%) |
III | 0 (0%) | 4 (3.7%) | 4 (1.0%) |
IVA | 0 (0%) | 3 (2.8%) | 3 (0.8%) |
IVB | 0 (0%) | 11 (10.2%) | 11 (2.9%) |
Response | |||
Biochemical Incomplete | 12 (4.4%) | 11 (10.2%) | 23 (6.0%) |
Excellent | 207 (75.3%) | 1 (0.9%) | 208 (54.3%) |
Indeterminate | 54 (19.6%) | 7 (6.5%) | 61 (15.9%) |
Structural Incomplete | 2 (0.7%) | 89 (82.4%) | 91 (23.8%) |
Those who recurred tend to be older and of higher stage of cancer among other attributes.
Our hypothesis is that ever smoking (whether in the past of currently), is associated with thyroid cancer recurrence. We will assume that only age and gender fit the criteria for confounders in our data set.
tcr_log = tcr_original |>
mutate( Recurred01 = case_when(
Recurred=="Yes"~1,
Recurred=="No"~0),
Age_cat = case_when(
Age < 55 ~ "0-55",
Age >= 55 ~ "55+"),
Smoke = case_when(
Smoking == "Yes" | Hx.Smoking == "Yes" ~ "Yes",
Smoking == "No" & Hx.Smoking == "No" ~ "No",
TRUE~NA_character_
)
)
table1(~ Recurred + Age_cat + Gender | Smoke,
data = tcr_log,
caption = "Description of patients by smoking status",
footnote = "Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.")
No (N=318) |
Yes (N=65) |
Overall (N=383) |
|
---|---|---|---|
Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J. | |||
Recurred | |||
No | 247 (77.7%) | 28 (43.1%) | 275 (71.8%) |
Yes | 71 (22.3%) | 37 (56.9%) | 108 (28.2%) |
Age_cat | |||
0-55 | 265 (83.3%) | 40 (61.5%) | 305 (79.6%) |
55+ | 53 (16.7%) | 25 (38.5%) | 78 (20.4%) |
Gender | |||
F | 288 (90.6%) | 24 (36.9%) | 312 (81.5%) |
M | 30 (9.4%) | 41 (63.1%) | 71 (18.5%) |
model.log <- glm(Recurred01 ~ Smoke + Age_cat + Gender,
data = tcr_log,
family="binomial")
summary(model.log)
##
## Call:
## glm(formula = Recurred01 ~ Smoke + Age_cat + Gender, family = "binomial",
## data = tcr_log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6854 0.1662 -10.142 < 2e-16 ***
## SmokeYes 0.7109 0.3485 2.040 0.04138 *
## Age_cat55+ 1.3430 0.2857 4.701 2.59e-06 ***
## GenderM 1.2633 0.3318 3.807 0.00014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.63 on 382 degrees of freedom
## Residual deviance: 389.09 on 379 degrees of freedom
## AIC: 397.09
##
## Number of Fisher Scoring iterations: 3
tbl_regression(model.log, exponentiate = TRUE)
Characteristic | OR | 95% CI | p-value |
---|---|---|---|
Smoke | |||
No | — | — | |
Yes | 2.04 | 1.02, 4.02 | 0.041 |
Age_cat | |||
0-55 | — | — | |
55+ | 3.83 | 2.19, 6.73 | <0.001 |
Gender | |||
F | — | — | |
M | 3.54 | 1.84, 6.81 | <0.001 |
Abbreviations: CI = Confidence Interval, OR = Odds Ratio |
We may conclude that ever smoking is indeed a “cause” for thyroid cancer recurrence, adjusting for age and gender, based on our causal theory (adjusted odds ratio = 2.04 [1.02, 4.02]). Note that we cannot interpret the coefficient for age and gender as it would lead to table two fallacy.
Now we develop different machine learning models using the original data set
# data
tcr_ml = tcr_original
# common pre-processing steps
tcr_ml$Recurred <- factor(tcr_ml$Recurred, levels = c("No", "Yes"))
# Defining predictors
predictors <- c("Age", "Gender", "Smoking", "Hx.Smoking", "Hx.Radiothreapy",
"Thyroid.Function", "Physical.Examination", "Adenopathy",
"Pathology", "Focality", "Risk", "T", "N", "M", "Stage",
"Response")
outcome <- "Recurred"
# Splitting data
set.seed(123)
trainIndex <- createDataPartition(tcr_ml[[outcome]], p = 0.8, list = FALSE)
train_all <- tcr_ml[trainIndex, ]
test_all <- tcr_ml[-trainIndex, ]
# 5-fold cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = "final"
)
# Fit logistic regression with caret
set.seed(123)
fit_glm <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "glm",
family = binomial,
metric = "ROC",
trControl = ctrl
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_glm <- predict(fit_glm, newdata = test_all, type = "prob")
# Class predictions
pred_class_glm <- predict(fit_glm, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_glm <- factor(pred_class_glm, levels = c("No","Yes"))
cm_glm <- caret::confusionMatrix(
data = pred_class_glm,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_glm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 53 1
## Yes 2 20
##
## Accuracy : 0.9605
## 95% CI : (0.8889, 0.9918)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 9.227e-08
##
## Kappa : 0.9027
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9524
## Specificity : 0.9636
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9815
## Prevalence : 0.2763
## Detection Rate : 0.2632
## Detection Prevalence : 0.2895
## Balanced Accuracy : 0.9580
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table <- as.data.frame(cm_glm$table)
# Rename columns for clarity
colnames(cm_table) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - Logistic Regression",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_glm[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for Logistic Regression"
)
# Print AUC
auc_val_glm <- pROC::auc(roc_obj)
print(auc_val_glm)
## Area under the curve: 0.9667
brier_val_glm <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # converts "No"=0, "Yes"=1
pred_prob_glm[,"Yes"]
)
print(brier_val_glm)
## [1] 0.03542186
vi_glm <- varImp(fit_glm, scale = TRUE)
print(vi_glm)
## glm variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## PathologyPapillary 100.0000
## `FocalityUni-Focal` 17.5102
## GenderM 8.1102
## Age 0.7684
## AdenopathyLeft 0.7161
## ResponseExcellent 0.7030
## NN1b 0.7002
## NN1a 0.6859
## `ResponseStructural Incomplete` 0.6764
## AdenopathyRight 0.6756
## Thyroid.FunctionEuthyroid 0.6104
## `Thyroid.FunctionClinical Hypothyroidism` 0.5513
## StageII 0.5354
## ResponseIndeterminate 0.5230
## AdenopathyNo 0.4369
## `Thyroid.FunctionSubclinical Hypothyroidism` 0.4032
## `Thyroid.FunctionSubclinical Hyperthyroidism` 0.3960
## `PathologyHurthel cell` 0.2788
## SmokingYes 0.2731
## `Physical.ExaminationSingle nodular goiter-left` 0.2707
plot(vi_glm, top = 20)
The logistic regression model demonstrates excellent performance, with an accuracy of 0.96 (95% CI: 0.89–0.99) and a high kappa value (0.90), indicating strong agreement beyond chance. It achieves balanced sensitivity (0.95) and specificity (0.96), along with a strong AUC of 0.97, reflecting excellent discrimination. The low Brier score (0.035) indicates good calibration, meaning predicted probabilities closely match observed outcomes.
# Fit KNN with caret
set.seed(123)
fit_knn <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "knn",
metric = "ROC",
trControl = ctrl,
tuneLength = 10 # automatically tunes 'k'
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_knn <- predict(fit_knn, newdata = test_all, type = "prob")
# Class predictions
pred_class_knn <- predict(fit_knn, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_knn <- factor(pred_class_knn, levels = c("No","Yes"))
cm_knn <- caret::confusionMatrix(
data = pred_class_knn,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 55 7
## Yes 0 14
##
## Accuracy : 0.9079
## 95% CI : (0.8194, 0.9622)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 7.294e-05
##
## Kappa : 0.7432
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.6667
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8871
## Prevalence : 0.2763
## Detection Rate : 0.1842
## Detection Prevalence : 0.1842
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table <- as.data.frame(cm_knn$table)
# Rename columns for clarity
colnames(cm_table) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - KNN",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj_knn <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_knn[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj_knn,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for KNN"
)
# Print AUC
auc_val_knn <- pROC::auc(roc_obj_knn)
print(auc_val_knn)
## Area under the curve: 0.9576
brier_val_knn <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # convert "No"=0, "Yes"=1
pred_prob_knn[,"Yes"]
)
print(brier_val_knn)
## [1] 0.07307638
Unlike parametric models such as logistic regression, KNN does not provide interpretable variable importance. Attempts to compute importance using caret’s varImp() may fail when predictors are non-numeric. Alternative methods such as permutation importance can be used.
# library(iml) requires
permutation_importance <- function(model, train_data, predictors, outcome,
loss = "ce", n_repetitions = 5) {
# model : a trained caret model (fit_knn, fit_glm, etc.)
# train_data : the training dataset
# predictors : vector of predictor column names
# outcome : outcome column name
# loss : loss function ("ce"=cross-entropy, "auc", "accuracy", etc.)
# n_repetitions: how many times to shuffle each variable
# Create predictor wrapper
predictor_obj <- Predictor$new(
model = model,
data = train_data[, predictors],
y = train_data[[outcome]]
)
# Compute feature importance
imp <- FeatureImp$new(
predictor_obj,
loss = loss,
n.repetitions = n_repetitions
)
return(imp)
}
imp_knn <- permutation_importance(
model = fit_knn,
train_data = train_all,
predictors = predictors,
outcome = outcome,
loss = "ce", # cross-entropy
n_repetitions = 5
)
# Print results
print(imp_knn)
## Interpretation method: FeatureImp
## error function: ce
##
## Analysed predictor:
## Prediction task: classification
## Classes:
##
## Analysed data:
## Sampling from data.frame with 307 rows and 16 columns.
##
##
## Head of results:
## feature importance.05 importance importance.95 permutation.error
## 1 Response 1.733333 1.888889 1.955556 0.16612378
## 2 Age 1.666667 1.777778 1.844444 0.15635179
## 3 Risk 1.259259 1.444444 1.659259 0.12703583
## 4 Adenopathy 1.229630 1.296296 1.414815 0.11400651
## 5 T 1.192593 1.259259 1.318519 0.11074919
## 6 Gender 1.022222 1.111111 1.207407 0.09771987
# Plot importance
plot(imp_knn) + theme_classic()
The KNN model shows a noticeably different performance profile compared to logistic regression. While logistic regression achieves both high sensitivity (0.95) and specificity (0.96) with an overall accuracy of 0.96, KNN attains perfect specificity (1.00) and PPV (1.00) but at the cost of much lower sensitivity (0.67), meaning it misses a substantial number of true recurrence cases. Both models demonstrate strong discrimination (AUC ≈ 0.96–0.97), but KNN’s higher Brier score (0.073 vs. 0.035) suggests poorer probability calibration. Overall, logistic regression provides a more balanced and reliable performance, whereas KNN is highly conservative—excellent at ruling out non-recurrence but less effective at detecting recurrence.
# Fit Decision Tree with caret
set.seed(123)
fit_dt <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "rpart",
metric = "ROC",
trControl = ctrl,
tuneLength = 10 # try different complexity parameters
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_dt <- predict(fit_dt, newdata = test_all, type = "prob")
# Class predictions
pred_class_dt <- predict(fit_dt, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_dt <- factor(pred_class_dt, levels = c("No","Yes"))
cm_dt <- caret::confusionMatrix(
data = pred_class_dt,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_dt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 54 2
## Yes 1 19
##
## Accuracy : 0.9605
## 95% CI : (0.8889, 0.9918)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 9.227e-08
##
## Kappa : 0.8998
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9048
## Specificity : 0.9818
## Pos Pred Value : 0.9500
## Neg Pred Value : 0.9643
## Prevalence : 0.2763
## Detection Rate : 0.2500
## Detection Prevalence : 0.2632
## Balanced Accuracy : 0.9433
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table_dt <- as.data.frame(cm_dt$table)
# Rename columns for clarity
colnames(cm_table_dt) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table_dt, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - Decision Tree",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj_dt <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_dt[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj_dt,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for Decision Tree"
)
# Print AUC
auc_val_dt <- pROC::auc(roc_obj_dt)
print(auc_val_dt)
## Area under the curve: 0.9117
brier_val_dt <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # convert factor ("No","Yes") → 0/1
pred_prob_dt[,"Yes"] # probability for "Yes"
)
print(brier_val_dt)
## [1] 0.03088535
vi_dt <- varImp(fit_dt, scale = TRUE)
print(vi_dt)
## rpart variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## ResponseStructural Incomplete 100.000
## RiskLow 81.926
## AdenopathyNo 71.480
## ResponseExcellent 70.247
## NN1b 55.835
## Age 15.508
## RiskIntermediate 4.979
## TT3a 2.722
## ResponseIndeterminate 2.591
## TT2 2.329
## Physical.ExaminationMultinodular goiter 2.163
## Physical.ExaminationSingle nodular goiter-left 1.961
## PathologyMicropapillary 0.000
## TT1b 0.000
## PathologyPapillary 0.000
## Thyroid.FunctionEuthyroid 0.000
## `PathologyHurthel cell` 0.000
## `Thyroid.FunctionSubclinical Hyperthyroidism` 0.000
## Physical.ExaminationNormal 0.000
## SmokingYes 0.000
plot(vi_dt, top = 20)
The decision tree model delivers a balanced performance between logistic regression and KNN. Like logistic regression, it achieves high overall accuracy (0.96) and strong predictive balance, with sensitivity (0.90) and specificity (0.98) both remaining high. Compared to logistic regression (AUC = 0.97, Brier = 0.035), the decision tree shows slightly lower discrimination (AUC = 0.91) but slightly better calibration (Brier = 0.031). In contrast, KNN sacrifices sensitivity (0.67) in favor of perfect specificity (1.00) and PPV (1.00), making it overly conservative and prone to missing true recurrences. Thus, while logistic regression remains the most discriminative model and KNN the most cautious, the decision tree provides a strong compromise—offering high accuracy and interpretability with reliable performance across both positive and negative predictions.
# Fit Random Forest with caret
set.seed(123)
fit_rf <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "rf",
metric = "ROC",
trControl = ctrl,
tuneLength = 5 # tries multiple mtry values
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_rf <- predict(fit_rf, newdata = test_all, type = "prob")
# Class predictions
pred_class_rf <- predict(fit_rf, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_rf <- factor(pred_class_rf, levels = c("No","Yes"))
cm_rf <- caret::confusionMatrix(
data = pred_class_rf,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 54 5
## Yes 1 16
##
## Accuracy : 0.9211
## 95% CI : (0.836, 0.9705)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 1.818e-05
##
## Kappa : 0.7902
##
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.7619
## Specificity : 0.9818
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.9153
## Prevalence : 0.2763
## Detection Rate : 0.2105
## Detection Prevalence : 0.2237
## Balanced Accuracy : 0.8719
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table_rf <- as.data.frame(cm_rf$table)
# Rename columns for clarity
colnames(cm_table_rf) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table_rf, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - Random Forest",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj_rf <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_rf[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj_rf,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for Random Forest"
)
# Print AUC
auc_val_rf <- pROC::auc(roc_obj_rf)
print(auc_val_rf)
## Area under the curve: 0.9697
brier_val_rf <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # convert "No"=0, "Yes"=1
pred_prob_rf[,"Yes"] # probability for "Yes"
)
print(brier_val_rf)
## [1] 0.05987721
vi_rf <- varImp(fit_rf, scale = TRUE)
print(vi_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ResponseStructural Incomplete 100.000
## RiskLow 54.006
## ResponseExcellent 48.486
## AdenopathyNo 46.965
## NN1b 40.925
## RiskIntermediate 28.758
## Age 18.675
## GenderM 13.067
## FocalityUni-Focal 12.241
## AdenopathyRight 11.741
## ResponseIndeterminate 11.260
## SmokingYes 10.432
## TT4a 9.718
## MM1 8.624
## StageII 8.462
## TT3b 7.815
## TT2 6.697
## StageIVB 5.789
## TT3a 4.774
## Physical.ExaminationMultinodular goiter 4.199
plot(vi_rf, top = 20)
The random forest model achieves solid performance, with high specificity (0.98) and good PPV (0.94), but its sensitivity drops to 0.76, meaning it misses more true recurrences than logistic regression (0.95) or decision trees (0.90). While its overall accuracy (0.92) is lower than both logistic regression and decision trees (0.96), its AUC (0.97) is among the highest, showing excellent ability to discriminate between recurrent and non-recurrent cases. Compared to KNN, random forest provides a better balance between sensitivity and specificity, though with slightly less calibration (Brier = 0.060 vs. 0.073 for KNN). In summary, random forest offers strong discriminative power and reliability, but logistic regression and decision trees maintain a better balance between sensitivity and specificity, whereas KNN remains more conservative but less sensitive.
# Fit SVM with radial basis kernel using caret
set.seed(123)
fit_svm <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "svmRadial",
metric = "ROC",
trControl = ctrl,
tuneLength = 10 # tune over cost & sigma
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_svm <- predict(fit_svm, newdata = test_all, type = "prob")
# Class predictions
pred_class_svm <- predict(fit_svm, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_svm <- factor(pred_class_svm, levels = c("No","Yes"))
cm_svm <- caret::confusionMatrix(
data = pred_class_svm,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_svm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 54 2
## Yes 1 19
##
## Accuracy : 0.9605
## 95% CI : (0.8889, 0.9918)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 9.227e-08
##
## Kappa : 0.8998
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9048
## Specificity : 0.9818
## Pos Pred Value : 0.9500
## Neg Pred Value : 0.9643
## Prevalence : 0.2763
## Detection Rate : 0.2500
## Detection Prevalence : 0.2632
## Balanced Accuracy : 0.9433
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table_svm <- as.data.frame(cm_svm$table)
# Rename columns for clarity
colnames(cm_table_svm) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table_svm, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - SVM",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj_svm <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_svm[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj_svm,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for SVM"
)
# Print AUC
auc_val_svm <- pROC::auc(roc_obj_svm)
print(auc_val_svm)
## Area under the curve: 0.9697
brier_val_svm <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # convert factor to 0/1
pred_prob_svm[,"Yes"] # predicted probabilities for "Yes"
)
print(brier_val_svm)
## [1] 0.03739232
Not interpretable
The SVM model performs on par with logistic regression, achieving the same high accuracy (0.96) and nearly identical balance between sensitivity (0.90) and specificity (0.98). Both models also demonstrate excellent discrimination (AUC ≈ 0.97) and strong calibration (Brier ≈ 0.037 for SVM vs. 0.035 for logistic regression), making them equally reliable for recurrence prediction. In contrast, the random forest shows strong discriminative ability (AUC = 0.97) but lower accuracy (0.92) and reduced sensitivity (0.76), indicating it tends to under-detect recurrences despite high specificity. Overall, logistic regression and SVM stand out as the most balanced and dependable models, while random forest offers strong discrimination but at the cost of missing more true positive cases.
# Fit ANN with caret (nnet)
set.seed(123)
fit_ann <- caret::train(
Recurred ~ .,
data = train_all[, c(predictors, outcome)],
method = "nnet",
metric = "ROC",
trControl = ctrl,
tuneLength = 5, # search over different hidden units/decay
trace = FALSE # suppress training output
)
# Predictions
# Probabilities for positive class "Yes"
pred_prob_ann <- predict(fit_ann, newdata = test_all, type = "prob")
# Class predictions
pred_class_ann <- predict(fit_ann, newdata = test_all)
# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_ann <- factor(pred_class_ann, levels = c("No","Yes"))
cm_ann <- caret::confusionMatrix(
data = pred_class_ann,
reference = test_all$Recurred,
positive = "Yes"
)
print(cm_ann)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 52 2
## Yes 3 19
##
## Accuracy : 0.9342
## 95% CI : (0.8531, 0.9783)
## No Information Rate : 0.7237
## P-Value [Acc > NIR] : 3.843e-06
##
## Kappa : 0.8379
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9048
## Specificity : 0.9455
## Pos Pred Value : 0.8636
## Neg Pred Value : 0.9630
## Prevalence : 0.2763
## Detection Rate : 0.2500
## Detection Prevalence : 0.2895
## Balanced Accuracy : 0.9251
##
## 'Positive' Class : Yes
##
# Convert confusion matrix table to a data frame for plotting
cm_table_ann <- as.data.frame(cm_ann$table)
# Rename columns for clarity
colnames(cm_table_ann) <- c("Prediction", "Reference", "Freq")
# Plot heatmap
ggplot(cm_table_ann, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
labs(title = "Confusion Matrix - ANN",
x = "Actual",
y = "Predicted") +
theme_minimal(base_size = 14)
roc_obj_ann <- pROC::roc(
response = test_all$Recurred,
predictor = pred_prob_ann[,"Yes"],
levels = c("No", "Yes"), # first = controls, second = cases
direction = "<"
)
# Plot ROC curve
plot(
roc_obj_ann,
col = "#4978C9",
lwd = 2,
main = "ROC Curve for ANN"
)
# Print AUC
auc_val_ann <- pROC::auc(roc_obj_ann)
print(auc_val_ann)
## Area under the curve: 0.9732
brier_val_ann <- ModelMetrics::brier(
as.numeric(test_all$Recurred) - 1, # convert factor ("No","Yes") → 0/1
pred_prob_ann[,"Yes"] # probability for positive class
)
print(brier_val_ann)
## [1] 0.04275516
vi_ann <- varImp(fit_ann, scale = TRUE)
print(vi_ann)
## nnet variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ResponseExcellent 100.00
## Thyroid.FunctionClinical Hypothyroidism 79.24
## ResponseIndeterminate 76.51
## RiskLow 52.74
## ResponseStructural Incomplete 48.05
## Thyroid.FunctionEuthyroid 45.73
## StageII 45.35
## NN1b 43.45
## PathologyHurthel cell 42.00
## FocalityUni-Focal 39.57
## TT2 27.32
## AdenopathyLeft 26.43
## TT4a 24.90
## StageIVA 24.48
## RiskIntermediate 23.60
## Hx.SmokingYes 23.54
## TT1b 21.52
## Thyroid.FunctionSubclinical Hypothyroidism 21.49
## Physical.ExaminationNormal 20.93
## GenderM 20.56
plot(vi_ann, top = 20)
The artificial neural network (ANN) achieves strong performance with accuracy of 0.93, sensitivity of 0.90, and specificity of 0.95, supported by an excellent AUC of 0.97. Compared to logistic regression and SVM, which both reach higher accuracy (0.96), slightly better calibration (Brier ≈ 0.035–0.037 vs. 0.043), and equally high AUC, the ANN is competitive but somewhat less precise in positive predictions (PPV = 0.86 vs. ~0.91–0.95). The decision tree matches SVM and logistic regression in accuracy (0.96) but lags in discrimination (AUC = 0.91), though its calibration is very strong (Brier ≈ 0.031). The random forest delivers excellent discrimination (AUC = 0.97) but sacrifices sensitivity (0.76) and accuracy (0.92), tending to under-detect recurrences despite high specificity. Finally, KNN stands out as the most unbalanced: it achieves perfect specificity (1.00) and PPV (1.00), but at the cost of poor sensitivity (0.67), missing many true recurrences, and a higher Brier score (0.073).
# Helper function to extract metrics
extract_metrics <- function(cm, auc_val) {
out <- data.frame(
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
PPV = cm$byClass["Pos Pred Value"],
NPV = cm$byClass["Neg Pred Value"],
AUC = as.numeric(auc_val)
)
return(out)
}
# Collect metrics for all models
results <- list(
"Logistic Regression" = extract_metrics(cm_glm, auc_val_glm),
"KNN" = extract_metrics(cm_knn, auc_val_knn),
"Decision Tree" = extract_metrics(cm_dt, auc_val_dt),
"Random Forest" = extract_metrics(cm_rf, auc_val_rf),
"SVM" = extract_metrics(cm_svm, auc_val_svm),
"ANN" = extract_metrics(cm_ann, auc_val_ann)
)
# Combine into one data frame
results_df <- do.call(rbind, results)
# Round for readability
results_df <- round(results_df, 3)
# Display
knitr::kable(results_df, caption = "Model Performance Summary")
Accuracy | Sensitivity | Specificity | PPV | NPV | AUC | |
---|---|---|---|---|---|---|
Logistic Regression | 0.961 | 0.952 | 0.964 | 0.909 | 0.981 | 0.967 |
KNN | 0.908 | 0.667 | 1.000 | 1.000 | 0.887 | 0.958 |
Decision Tree | 0.961 | 0.905 | 0.982 | 0.950 | 0.964 | 0.912 |
Random Forest | 0.921 | 0.762 | 0.982 | 0.941 | 0.915 | 0.970 |
SVM | 0.961 | 0.905 | 0.982 | 0.950 | 0.964 | 0.970 |
ANN | 0.934 | 0.905 | 0.945 | 0.864 | 0.963 | 0.973 |
Across all six models, performance was generally strong, with accuracies above 90% and AUC values consistently high, indicating excellent discrimination between recurrent and non-recurrent cases.
Logistic regression and SVM provide the most balanced performance, with high accuracy (~0.96), strong sensitivity and specificity, excellent AUC (~0.97), and reliable calibration, making them the most dependable overall.
Decision trees achieve similar accuracy (0.96) and very high specificity (0.98), but their discrimination (AUC = 0.91) is weaker, though they remain interpretable and practical.
Random forest delivers excellent discrimination (AUC = 0.97) but lower sensitivity (0.76) and overall accuracy (0.92), showing a tendency to under-detect recurrences despite high specificity.
KNN and ANN sit at opposite ends: KNN is overly conservative with perfect specificity and PPV but very low sensitivity (0.67), while ANN achieves the highest AUC (0.97) but lower PPV (0.86), reflecting a higher false-positive rate.
Thank you for reading!