Thyroid cancer recurrence prediction

Andre Nana

Thyroid gland (credit: Cleveland Clinic)

1. Introduction

This project investigates recurrence in thyroid cancer patients using both statistical logistic regression and machine learning methods. With the logistic regression, we explore the relationship between ever smoking and the likelihood of recurrence. We then shift to machine learning techniques for risk stratification and evaluate the predictive accuracy of the following algorithms:

logistic regression,
K-Nearest Neighbors (KNN),
Decision Tree,
Random Forest,
Support Vector Machines (SVM),and,
Artificial Neural Networks (ANN)

These algorithms are evaluated based on the following performance metrics:

Sensitivity (ability to correctly identify patients who experience recurrence),
Specificity (ability to correctly identify those who do not),
Positive Predictive Value (PPV) (the probability that predicted recurrences are true),
Negative Predictive Value (NPV) (the probability that predicted non-recurrences are true),
Area Under the ROC curve (AUC) (overall discriminatory ability),
Accuracy (the proportion of correct classifications across all cases).

The dataset in this study is publicly accessible and the University of California Machine Learning Repository, and was generously provided by Borzooei et al., who published their original findings in 2024.

Disclaimer: This analysis is intended for educational and research purposes only and has not been peer-reviewed. While efforts have been made to ensure the accuracy of the methods and results, the author does not guarantee the correctness or completeness of the analysis. The author bears no responsibility or liability for any errors, omissions, or outcomes resulting from the use of this material. Use at your own discretion.

2. Loading libraries and data

# Core data handling & wrangling
library(tidyverse)
library(readxl)

# Data exploration & summary
library(naniar)
library(table1)
library(gtsummary)

# Modeling frameworks
library(lme4)
library(reticulate)

# Model evaluation
library(pROC)
library(precrec)

# Machine learning
library(caret)
library(caretEnsemble)
library(nnet)
library(randomForest)
library(e1071)
library(rpart)
library(ModelMetrics)
library(iml)

# Loading Data
tcr_original <- read.csv("Thyroid_Diff.csv")

3. Exploratory data analysis

3.1. General information

colnames(tcr_original)

##  [1] "Age"                  "Gender"               "Smoking"             
##  [4] "Hx.Smoking"           "Hx.Radiothreapy"      "Thyroid.Function"    
##  [7] "Physical.Examination" "Adenopathy"           "Pathology"           
## [10] "Focality"             "Risk"                 "T"                   
## [13] "N"                    "M"                    "Stage"               
## [16] "Response"             "Recurred"

glimpse(tcr_original)

## Rows: 383
## Columns: 17
## $ Age                  <int> 27, 34, 30, 62, 62, 52, 41, 46, 51, 40, 75, 59, 4…
## $ Gender               <chr> "F", "F", "F", "F", "F", "M", "F", "F", "F", "F",…
## $ Smoking              <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", …
## $ Hx.Smoking           <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No",…
## $ Hx.Radiothreapy      <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ Thyroid.Function     <chr> "Euthyroid", "Euthyroid", "Euthyroid", "Euthyroid…
## $ Physical.Examination <chr> "Single nodular goiter-left", "Multinodular goite…
## $ Adenopathy           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ Pathology            <chr> "Micropapillary", "Micropapillary", "Micropapilla…
## $ Focality             <chr> "Uni-Focal", "Uni-Focal", "Uni-Focal", "Uni-Focal…
## $ Risk                 <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", …
## $ T                    <chr> "T1a", "T1a", "T1a", "T1a", "T1a", "T1a", "T1a", …
## $ N                    <chr> "N0", "N0", "N0", "N0", "N0", "N0", "N0", "N0", "…
## $ M                    <chr> "M0", "M0", "M0", "M0", "M0", "M0", "M0", "M0", "…
## $ Stage                <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I",…
## $ Response             <chr> "Indeterminate", "Excellent", "Excellent", "Excel…
## $ Recurred             <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…

vis_miss(tcr_original, sort_miss = TRUE)

383 observation and 17 variables, no missing values

3.2. Table 1

table1(~. |Recurred,
       data = tcr_original,
       caption = "Description of case by recurrence status",
       footnote = "Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.")

Description of case by recurrence status
	No (N=275)	Yes (N=108)	Overall (N=383)
Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.
Age
Mean (SD)	38.4 (12.9)	47.1 (18.3)	40.9 (15.1)
Median [Min, Max]	36.0 [17.0, 81.0]	44.5 [15.0, 82.0]	37.0 [15.0, 82.0]
Gender
F	246 (89.5%)	66 (61.1%)	312 (81.5%)
M	29 (10.5%)	42 (38.9%)	71 (18.5%)
Smoking
No	259 (94.2%)	75 (69.4%)	334 (87.2%)
Yes	16 (5.8%)	33 (30.6%)	49 (12.8%)
Hx.Smoking
No	261 (94.9%)	94 (87.0%)	355 (92.7%)
Yes	14 (5.1%)	14 (13.0%)	28 (7.3%)
Hx.Radiothreapy
No	274 (99.6%)	102 (94.4%)	376 (98.2%)
Yes	1 (0.4%)	6 (5.6%)	7 (1.8%)
Thyroid.Function
Clinical Hyperthyroidism	17 (6.2%)	3 (2.8%)	20 (5.2%)
Clinical Hypothyroidism	10 (3.6%)	2 (1.9%)	12 (3.1%)
Euthyroid	234 (85.1%)	98 (90.7%)	332 (86.7%)
Subclinical Hyperthyroidism	5 (1.8%)	0 (0%)	5 (1.3%)
Subclinical Hypothyroidism	9 (3.3%)	5 (4.6%)	14 (3.7%)
Physical.Examination
Diffuse goiter	7 (2.5%)	0 (0%)	7 (1.8%)
Multinodular goiter	88 (32.0%)	52 (48.1%)	140 (36.6%)
Normal	5 (1.8%)	2 (1.9%)	7 (1.8%)
Single nodular goiter-left	63 (22.9%)	26 (24.1%)	89 (23.2%)
Single nodular goiter-right	112 (40.7%)	28 (25.9%)	140 (36.6%)
Adenopathy
Bilateral	5 (1.8%)	27 (25.0%)	32 (8.4%)
Left	5 (1.8%)	12 (11.1%)	17 (4.4%)
No	247 (89.8%)	30 (27.8%)	277 (72.3%)
Right	18 (6.5%)	30 (27.8%)	48 (12.5%)
Extensive	0 (0%)	7 (6.5%)	7 (1.8%)
Posterior	0 (0%)	2 (1.9%)	2 (0.5%)
Pathology
Follicular	16 (5.8%)	12 (11.1%)	28 (7.3%)
Hurthel cell	14 (5.1%)	6 (5.6%)	20 (5.2%)
Micropapillary	48 (17.5%)	0 (0%)	48 (12.5%)
Papillary	197 (71.6%)	90 (83.3%)	287 (74.9%)
Focality
Multi-Focal	66 (24.0%)	70 (64.8%)	136 (35.5%)
Uni-Focal	209 (76.0%)	38 (35.2%)	247 (64.5%)
Risk
Intermediate	38 (13.8%)	64 (59.3%)	102 (26.6%)
Low	237 (86.2%)	12 (11.1%)	249 (65.0%)
High	0 (0%)	32 (29.6%)	32 (8.4%)
T
T1a	48 (17.5%)	1 (0.9%)	49 (12.8%)
T1b	38 (13.8%)	5 (4.6%)	43 (11.2%)
T2	131 (47.6%)	20 (18.5%)	151 (39.4%)
T3a	55 (20.0%)	41 (38.0%)	96 (25.1%)
T3b	2 (0.7%)	14 (13.0%)	16 (4.2%)
T4a	1 (0.4%)	19 (17.6%)	20 (5.2%)
T4b	0 (0%)	8 (7.4%)	8 (2.1%)
N
N0	241 (87.6%)	27 (25.0%)	268 (70.0%)
N1a	12 (4.4%)	10 (9.3%)	22 (5.7%)
N1b	22 (8.0%)	71 (65.7%)	93 (24.3%)
M
M0	275 (100%)	90 (83.3%)	365 (95.3%)
M1	0 (0%)	18 (16.7%)	18 (4.7%)
Stage
I	268 (97.5%)	65 (60.2%)	333 (86.9%)
II	7 (2.5%)	25 (23.1%)	32 (8.4%)
III	0 (0%)	4 (3.7%)	4 (1.0%)
IVA	0 (0%)	3 (2.8%)	3 (0.8%)
IVB	0 (0%)	11 (10.2%)	11 (2.9%)
Response
Biochemical Incomplete	12 (4.4%)	11 (10.2%)	23 (6.0%)
Excellent	207 (75.3%)	1 (0.9%)	208 (54.3%)
Indeterminate	54 (19.6%)	7 (6.5%)	61 (15.9%)
Structural Incomplete	2 (0.7%)	89 (82.4%)	91 (23.8%)

Those who recurred tend to be older and of higher stage of cancer among other attributes.

4. Logistic regression (from a causal perspective)

Our hypothesis is that ever smoking (whether in the past of currently), is associated with thyroid cancer recurrence. We will assume that only age and gender fit the criteria for confounders in our data set.

4.1 Building new variables

tcr_log = tcr_original |>
  mutate( Recurred01 = case_when(
              Recurred=="Yes"~1,
              Recurred=="No"~0),
          Age_cat = case_when(
            Age < 55 ~ "0-55",
            Age >= 55 ~ "55+"),
          Smoke = case_when(
            Smoking == "Yes" | Hx.Smoking == "Yes" ~ "Yes",
            Smoking == "No" & Hx.Smoking == "No" ~ "No",
            TRUE~NA_character_
          )
    
  )

4.2 Stratifying by exposure

table1(~ Recurred + Age_cat + Gender | Smoke,
       data = tcr_log,
       caption = "Description of patients by smoking status",
       footnote = "Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.")

Description of patients by smoking status
	No (N=318)	Yes (N=65)	Overall (N=383)
Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5632J.
Recurred
No	247 (77.7%)	28 (43.1%)	275 (71.8%)
Yes	71 (22.3%)	37 (56.9%)	108 (28.2%)
Age_cat
0-55	265 (83.3%)	40 (61.5%)	305 (79.6%)
55+	53 (16.7%)	25 (38.5%)	78 (20.4%)
Gender
F	288 (90.6%)	24 (36.9%)	312 (81.5%)
M	30 (9.4%)	41 (63.1%)	71 (18.5%)

4.3 Model

model.log <- glm(Recurred01 ~ Smoke + Age_cat + Gender, 
                 data = tcr_log,
                 family="binomial")
summary(model.log)

## 
## Call:
## glm(formula = Recurred01 ~ Smoke + Age_cat + Gender, family = "binomial", 
##     data = tcr_log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6854     0.1662 -10.142  < 2e-16 ***
## SmokeYes      0.7109     0.3485   2.040  0.04138 *  
## Age_cat55+    1.3430     0.2857   4.701 2.59e-06 ***
## GenderM       1.2633     0.3318   3.807  0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.63  on 382  degrees of freedom
## Residual deviance: 389.09  on 379  degrees of freedom
## AIC: 397.09
## 
## Number of Fisher Scoring iterations: 3

tbl_regression(model.log, exponentiate = TRUE)

Characteristic	OR	95% CI	p-value
Smoke
No	—	—
Yes	2.04	1.02, 4.02	0.041
Age_cat
0-55	—	—
55+	3.83	2.19, 6.73	<0.001
Gender
F	—	—
M	3.54	1.84, 6.81	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

We may conclude that ever smoking is indeed a “cause” for thyroid cancer recurrence, adjusting for age and gender, based on our causal theory (adjusted odds ratio = 2.04 [1.02, 4.02]). Note that we cannot interpret the coefficient for age and gender as it would lead to table two fallacy.

5. Machine learning

Now we develop different machine learning models using the original data set

5.1 Pre-processing steps

# data
tcr_ml = tcr_original

# common pre-processing steps

tcr_ml$Recurred <- factor(tcr_ml$Recurred, levels = c("No", "Yes"))

# Defining predictors
predictors <- c("Age", "Gender", "Smoking", "Hx.Smoking", "Hx.Radiothreapy",
                "Thyroid.Function", "Physical.Examination", "Adenopathy",
                "Pathology", "Focality", "Risk", "T", "N", "M", "Stage",
                "Response")

outcome <- "Recurred"

# Splitting data
set.seed(123)
trainIndex <- createDataPartition(tcr_ml[[outcome]], p = 0.8, list = FALSE)
train_all <- tcr_ml[trainIndex, ]
test_all  <- tcr_ml[-trainIndex, ]

# 5-fold cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = "final"
)

5.2 Logistic regression

5.2.1 Training model

# Fit logistic regression with caret
set.seed(123)
fit_glm <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "glm",
  family = binomial,
  metric = "ROC",
  trControl = ctrl
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_glm <- predict(fit_glm, newdata = test_all, type = "prob")

# Class predictions
pred_class_glm <- predict(fit_glm, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_glm <- factor(pred_class_glm, levels = c("No","Yes"))

5.2.2 Confusion matrix

cm_glm <- caret::confusionMatrix(
  data      = pred_class_glm,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_glm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  53   1
##        Yes  2  20
##                                           
##                Accuracy : 0.9605          
##                  95% CI : (0.8889, 0.9918)
##     No Information Rate : 0.7237          
##     P-Value [Acc > NIR] : 9.227e-08       
##                                           
##                   Kappa : 0.9027          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9524          
##             Specificity : 0.9636          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.9815          
##              Prevalence : 0.2763          
##          Detection Rate : 0.2632          
##    Detection Prevalence : 0.2895          
##       Balanced Accuracy : 0.9580          
##                                           
##        'Positive' Class : Yes             
##

# Convert confusion matrix table to a data frame for plotting
cm_table <- as.data.frame(cm_glm$table)

# Rename columns for clarity
colnames(cm_table) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - Logistic Regression",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.2.3 ROC Curve & AUC

roc_obj <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_glm[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for Logistic Regression"
)

# Print AUC
auc_val_glm <- pROC::auc(roc_obj)
print(auc_val_glm)

## Area under the curve: 0.9667

5.2.4 Brier value

brier_val_glm <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # converts "No"=0, "Yes"=1
  pred_prob_glm[,"Yes"]
)
print(brier_val_glm)

## [1] 0.03542186

5.2.5 Variable Importance

vi_glm <- varImp(fit_glm, scale = TRUE)
print(vi_glm)

## glm variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                                                   Overall
## PathologyPapillary                               100.0000
## `FocalityUni-Focal`                               17.5102
## GenderM                                            8.1102
## Age                                                0.7684
## AdenopathyLeft                                     0.7161
## ResponseExcellent                                  0.7030
## NN1b                                               0.7002
## NN1a                                               0.6859
## `ResponseStructural Incomplete`                    0.6764
## AdenopathyRight                                    0.6756
## Thyroid.FunctionEuthyroid                          0.6104
## `Thyroid.FunctionClinical Hypothyroidism`          0.5513
## StageII                                            0.5354
## ResponseIndeterminate                              0.5230
## AdenopathyNo                                       0.4369
## `Thyroid.FunctionSubclinical Hypothyroidism`       0.4032
## `Thyroid.FunctionSubclinical Hyperthyroidism`      0.3960
## `PathologyHurthel cell`                            0.2788
## SmokingYes                                         0.2731
## `Physical.ExaminationSingle nodular goiter-left`   0.2707

plot(vi_glm, top = 20)

5.2.6 Conclusion

The logistic regression model demonstrates excellent performance, with an accuracy of 0.96 (95% CI: 0.89–0.99) and a high kappa value (0.90), indicating strong agreement beyond chance. It achieves balanced sensitivity (0.95) and specificity (0.96), along with a strong AUC of 0.97, reflecting excellent discrimination. The low Brier score (0.035) indicates good calibration, meaning predicted probabilities closely match observed outcomes.

5.3 K-Nearest Neighbors

5.3.1 Training model

# Fit KNN with caret
set.seed(123)
fit_knn <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "knn",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 10   # automatically tunes 'k'
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_knn <- predict(fit_knn, newdata = test_all, type = "prob")

# Class predictions
pred_class_knn <- predict(fit_knn, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_knn    <- factor(pred_class_knn,    levels = c("No","Yes"))

5.3.2 Confusion matrix

cm_knn <- caret::confusionMatrix(
  data      = pred_class_knn,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_knn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  55   7
##        Yes  0  14
##                                           
##                Accuracy : 0.9079          
##                  95% CI : (0.8194, 0.9622)
##     No Information Rate : 0.7237          
##     P-Value [Acc > NIR] : 7.294e-05       
##                                           
##                   Kappa : 0.7432          
##                                           
##  Mcnemar's Test P-Value : 0.02334         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8871          
##              Prevalence : 0.2763          
##          Detection Rate : 0.1842          
##    Detection Prevalence : 0.1842          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : Yes             
##

# Convert confusion matrix table to a data frame for plotting
cm_table <- as.data.frame(cm_knn$table)

# Rename columns for clarity
colnames(cm_table) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - KNN",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.3.3 ROC Curve & AUC

roc_obj_knn <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_knn[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj_knn,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for KNN"
)

# Print AUC
auc_val_knn <- pROC::auc(roc_obj_knn)
print(auc_val_knn)

## Area under the curve: 0.9576

5.3.4 Brier value

brier_val_knn <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # convert "No"=0, "Yes"=1
  pred_prob_knn[,"Yes"]
)
print(brier_val_knn)

## [1] 0.07307638

5.3.5 Variable Importance

Unlike parametric models such as logistic regression, KNN does not provide interpretable variable importance. Attempts to compute importance using caret’s varImp() may fail when predictors are non-numeric. Alternative methods such as permutation importance can be used.

# library(iml) requires

permutation_importance <- function(model, train_data, predictors, outcome,
                                   loss = "ce", n_repetitions = 5) {
  # model        : a trained caret model (fit_knn, fit_glm, etc.)
  # train_data   : the training dataset
  # predictors   : vector of predictor column names
  # outcome      : outcome column name
  # loss         : loss function ("ce"=cross-entropy, "auc", "accuracy", etc.)
  # n_repetitions: how many times to shuffle each variable

  # Create predictor wrapper
  predictor_obj <- Predictor$new(
    model = model,
    data  = train_data[, predictors],
    y     = train_data[[outcome]]
  )

  # Compute feature importance
  imp <- FeatureImp$new(
    predictor_obj,
    loss = loss,
    n.repetitions = n_repetitions
  )

  return(imp)
}


imp_knn <- permutation_importance(
  model        = fit_knn,
  train_data   = train_all,
  predictors   = predictors,
  outcome      = outcome,
  loss         = "ce",     # cross-entropy
  n_repetitions = 5
)

# Print results
print(imp_knn)

## Interpretation method:  FeatureImp 
## error function: ce
## 
## Analysed predictor: 
## Prediction task: classification 
## Classes:  
## 
## Analysed data:
## Sampling from data.frame with 307 rows and 16 columns.
## 
## 
## Head of results:
##      feature importance.05 importance importance.95 permutation.error
## 1   Response      1.733333   1.888889      1.955556        0.16612378
## 2        Age      1.666667   1.777778      1.844444        0.15635179
## 3       Risk      1.259259   1.444444      1.659259        0.12703583
## 4 Adenopathy      1.229630   1.296296      1.414815        0.11400651
## 5          T      1.192593   1.259259      1.318519        0.11074919
## 6     Gender      1.022222   1.111111      1.207407        0.09771987

# Plot importance
plot(imp_knn) + theme_classic()

5.3.6 Conclusion

The KNN model shows a noticeably different performance profile compared to logistic regression. While logistic regression achieves both high sensitivity (0.95) and specificity (0.96) with an overall accuracy of 0.96, KNN attains perfect specificity (1.00) and PPV (1.00) but at the cost of much lower sensitivity (0.67), meaning it misses a substantial number of true recurrence cases. Both models demonstrate strong discrimination (AUC ≈ 0.96–0.97), but KNN’s higher Brier score (0.073 vs. 0.035) suggests poorer probability calibration. Overall, logistic regression provides a more balanced and reliable performance, whereas KNN is highly conservative—excellent at ruling out non-recurrence but less effective at detecting recurrence.

5.4 Decision Tree

5.4.1 Training model

# Fit Decision Tree with caret
set.seed(123)
fit_dt <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "rpart",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 10   # try different complexity parameters
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_dt <- predict(fit_dt, newdata = test_all, type = "prob")

# Class predictions
pred_class_dt <- predict(fit_dt, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_dt <- factor(pred_class_dt, levels = c("No","Yes"))

5.4.2 Confusion matrix

cm_dt <- caret::confusionMatrix(
  data      = pred_class_dt,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_dt)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  54   2
##        Yes  1  19
##                                           
##                Accuracy : 0.9605          
##                  95% CI : (0.8889, 0.9918)
##     No Information Rate : 0.7237          
##     P-Value [Acc > NIR] : 9.227e-08       
##                                           
##                   Kappa : 0.8998          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.9818          
##          Pos Pred Value : 0.9500          
##          Neg Pred Value : 0.9643          
##              Prevalence : 0.2763          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2632          
##       Balanced Accuracy : 0.9433          
##                                           
##        'Positive' Class : Yes             
##

# Convert confusion matrix table to a data frame for plotting
cm_table_dt <- as.data.frame(cm_dt$table)

# Rename columns for clarity
colnames(cm_table_dt) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table_dt, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - Decision Tree",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.4.3 ROC Curve & AUC

roc_obj_dt <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_dt[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj_dt,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for Decision Tree"
)

# Print AUC
auc_val_dt <- pROC::auc(roc_obj_dt)
print(auc_val_dt)

## Area under the curve: 0.9117

5.4.4 Brier value

brier_val_dt <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # convert factor ("No","Yes") → 0/1
  pred_prob_dt[,"Yes"]                 # probability for "Yes"
)
print(brier_val_dt)

## [1] 0.03088535

5.4.5 Variable Importance

vi_dt <- varImp(fit_dt, scale = TRUE)
print(vi_dt)

## rpart variable importance
## 
##   only 20 most important variables shown (out of 43)
## 
##                                                Overall
## ResponseStructural Incomplete                  100.000
## RiskLow                                         81.926
## AdenopathyNo                                    71.480
## ResponseExcellent                               70.247
## NN1b                                            55.835
## Age                                             15.508
## RiskIntermediate                                 4.979
## TT3a                                             2.722
## ResponseIndeterminate                            2.591
## TT2                                              2.329
## Physical.ExaminationMultinodular goiter          2.163
## Physical.ExaminationSingle nodular goiter-left   1.961
## PathologyMicropapillary                          0.000
## TT1b                                             0.000
## PathologyPapillary                               0.000
## Thyroid.FunctionEuthyroid                        0.000
## `PathologyHurthel cell`                          0.000
## `Thyroid.FunctionSubclinical Hyperthyroidism`    0.000
## Physical.ExaminationNormal                       0.000
## SmokingYes                                       0.000

plot(vi_dt, top = 20)

5.4.6 Conclusion

The decision tree model delivers a balanced performance between logistic regression and KNN. Like logistic regression, it achieves high overall accuracy (0.96) and strong predictive balance, with sensitivity (0.90) and specificity (0.98) both remaining high. Compared to logistic regression (AUC = 0.97, Brier = 0.035), the decision tree shows slightly lower discrimination (AUC = 0.91) but slightly better calibration (Brier = 0.031). In contrast, KNN sacrifices sensitivity (0.67) in favor of perfect specificity (1.00) and PPV (1.00), making it overly conservative and prone to missing true recurrences. Thus, while logistic regression remains the most discriminative model and KNN the most cautious, the decision tree provides a strong compromise—offering high accuracy and interpretability with reliable performance across both positive and negative predictions.

5.5 Random Forest

5.5.1 Training model

# Fit Random Forest with caret
set.seed(123)
fit_rf <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "rf",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 5    # tries multiple mtry values
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_rf <- predict(fit_rf, newdata = test_all, type = "prob")

# Class predictions
pred_class_rf <- predict(fit_rf, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_rf <- factor(pred_class_rf, levels = c("No","Yes"))

5.5.2 Confusion matrix

cm_rf <- caret::confusionMatrix(
  data      = pred_class_rf,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  54   5
##        Yes  1  16
##                                          
##                Accuracy : 0.9211         
##                  95% CI : (0.836, 0.9705)
##     No Information Rate : 0.7237         
##     P-Value [Acc > NIR] : 1.818e-05      
##                                          
##                   Kappa : 0.7902         
##                                          
##  Mcnemar's Test P-Value : 0.2207         
##                                          
##             Sensitivity : 0.7619         
##             Specificity : 0.9818         
##          Pos Pred Value : 0.9412         
##          Neg Pred Value : 0.9153         
##              Prevalence : 0.2763         
##          Detection Rate : 0.2105         
##    Detection Prevalence : 0.2237         
##       Balanced Accuracy : 0.8719         
##                                          
##        'Positive' Class : Yes            
##

# Convert confusion matrix table to a data frame for plotting
cm_table_rf <- as.data.frame(cm_rf$table)

# Rename columns for clarity
colnames(cm_table_rf) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table_rf, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - Random Forest",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.5.3 ROC Curve & AUC

roc_obj_rf <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_rf[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj_rf,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for Random Forest"
)

# Print AUC
auc_val_rf <- pROC::auc(roc_obj_rf)
print(auc_val_rf)

## Area under the curve: 0.9697

5.5.4 Brier value

brier_val_rf <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # convert "No"=0, "Yes"=1
  pred_prob_rf[,"Yes"]                 # probability for "Yes"
)
print(brier_val_rf)

## [1] 0.05987721

5.5.5 Variable Importance

vi_rf <- varImp(fit_rf, scale = TRUE)
print(vi_rf)

## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                                         Overall
## ResponseStructural Incomplete           100.000
## RiskLow                                  54.006
## ResponseExcellent                        48.486
## AdenopathyNo                             46.965
## NN1b                                     40.925
## RiskIntermediate                         28.758
## Age                                      18.675
## GenderM                                  13.067
## FocalityUni-Focal                        12.241
## AdenopathyRight                          11.741
## ResponseIndeterminate                    11.260
## SmokingYes                               10.432
## TT4a                                      9.718
## MM1                                       8.624
## StageII                                   8.462
## TT3b                                      7.815
## TT2                                       6.697
## StageIVB                                  5.789
## TT3a                                      4.774
## Physical.ExaminationMultinodular goiter   4.199

plot(vi_rf, top = 20)

5.5.6 Conclusion

The random forest model achieves solid performance, with high specificity (0.98) and good PPV (0.94), but its sensitivity drops to 0.76, meaning it misses more true recurrences than logistic regression (0.95) or decision trees (0.90). While its overall accuracy (0.92) is lower than both logistic regression and decision trees (0.96), its AUC (0.97) is among the highest, showing excellent ability to discriminate between recurrent and non-recurrent cases. Compared to KNN, random forest provides a better balance between sensitivity and specificity, though with slightly less calibration (Brier = 0.060 vs. 0.073 for KNN). In summary, random forest offers strong discriminative power and reliability, but logistic regression and decision trees maintain a better balance between sensitivity and specificity, whereas KNN remains more conservative but less sensitive.

5.6 Support Vector Machine (SVM)

5.6.1 Training model

# Fit SVM with radial basis kernel using caret
set.seed(123)
fit_svm <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "svmRadial",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 10   # tune over cost & sigma
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_svm <- predict(fit_svm, newdata = test_all, type = "prob")

# Class predictions
pred_class_svm <- predict(fit_svm, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_svm <- factor(pred_class_svm, levels = c("No","Yes"))

5.6.2 Confusion matrix

cm_svm <- caret::confusionMatrix(
  data      = pred_class_svm,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_svm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  54   2
##        Yes  1  19
##                                           
##                Accuracy : 0.9605          
##                  95% CI : (0.8889, 0.9918)
##     No Information Rate : 0.7237          
##     P-Value [Acc > NIR] : 9.227e-08       
##                                           
##                   Kappa : 0.8998          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.9818          
##          Pos Pred Value : 0.9500          
##          Neg Pred Value : 0.9643          
##              Prevalence : 0.2763          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2632          
##       Balanced Accuracy : 0.9433          
##                                           
##        'Positive' Class : Yes             
##

# Convert confusion matrix table to a data frame for plotting
cm_table_svm <- as.data.frame(cm_svm$table)

# Rename columns for clarity
colnames(cm_table_svm) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table_svm, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - SVM",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.6.3 ROC Curve & AUC

roc_obj_svm <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_svm[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj_svm,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for SVM"
)

# Print AUC
auc_val_svm <- pROC::auc(roc_obj_svm)
print(auc_val_svm)

## Area under the curve: 0.9697

5.6.4 Brier value

brier_val_svm <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # convert factor to 0/1
  pred_prob_svm[,"Yes"]                # predicted probabilities for "Yes"
)
print(brier_val_svm)

## [1] 0.03739232

5.6.5 Variable Importance

Not interpretable

5.6.6 Conclusion

The SVM model performs on par with logistic regression, achieving the same high accuracy (0.96) and nearly identical balance between sensitivity (0.90) and specificity (0.98). Both models also demonstrate excellent discrimination (AUC ≈ 0.97) and strong calibration (Brier ≈ 0.037 for SVM vs. 0.035 for logistic regression), making them equally reliable for recurrence prediction. In contrast, the random forest shows strong discriminative ability (AUC = 0.97) but lower accuracy (0.92) and reduced sensitivity (0.76), indicating it tends to under-detect recurrences despite high specificity. Overall, logistic regression and SVM stand out as the most balanced and dependable models, while random forest offers strong discrimination but at the cost of missing more true positive cases.

5.7 Artificial Neural Network (ANN)

5.7.1 Training model

# Fit ANN with caret (nnet)
set.seed(123)
fit_ann <- caret::train(
  Recurred ~ ., 
  data = train_all[, c(predictors, outcome)],
  method = "nnet",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 5,         # search over different hidden units/decay
  trace = FALSE           # suppress training output
)

# Predictions
# Probabilities for positive class "Yes"
pred_prob_ann <- predict(fit_ann, newdata = test_all, type = "prob")

# Class predictions
pred_class_ann <- predict(fit_ann, newdata = test_all)

# Make sure both outcome and predictions are factors with the same levels
test_all$Recurred <- factor(test_all$Recurred, levels = c("No","Yes"))
pred_class_ann <- factor(pred_class_ann, levels = c("No","Yes"))

5.7.2 Confusion matrix

cm_ann <- caret::confusionMatrix(
  data      = pred_class_ann,
  reference = test_all$Recurred,
  positive  = "Yes"
)
print(cm_ann)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  52   2
##        Yes  3  19
##                                           
##                Accuracy : 0.9342          
##                  95% CI : (0.8531, 0.9783)
##     No Information Rate : 0.7237          
##     P-Value [Acc > NIR] : 3.843e-06       
##                                           
##                   Kappa : 0.8379          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.9455          
##          Pos Pred Value : 0.8636          
##          Neg Pred Value : 0.9630          
##              Prevalence : 0.2763          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2895          
##       Balanced Accuracy : 0.9251          
##                                           
##        'Positive' Class : Yes             
##

# Convert confusion matrix table to a data frame for plotting
cm_table_ann <- as.data.frame(cm_ann$table)

# Rename columns for clarity
colnames(cm_table_ann) <- c("Prediction", "Reference", "Freq")

# Plot heatmap
ggplot(cm_table_ann, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  labs(title = "Confusion Matrix - ANN",
       x = "Actual",
       y = "Predicted") +
  theme_minimal(base_size = 14)

5.7.3 ROC Curve & AUC

roc_obj_ann <- pROC::roc(
  response  = test_all$Recurred,
  predictor = pred_prob_ann[,"Yes"],
  levels    = c("No", "Yes"),   # first = controls, second = cases
  direction = "<"
)

# Plot ROC curve
plot(
  roc_obj_ann,
  col = "#4978C9",
  lwd = 2,
  main = "ROC Curve for ANN"
)

# Print AUC
auc_val_ann <- pROC::auc(roc_obj_ann)
print(auc_val_ann)

## Area under the curve: 0.9732

5.7.4 Brier value

brier_val_ann <- ModelMetrics::brier(
  as.numeric(test_all$Recurred) - 1,   # convert factor ("No","Yes") → 0/1
  pred_prob_ann[,"Yes"]                # probability for positive class
)
print(brier_val_ann)

## [1] 0.04275516

5.7.5 Variable Importance

vi_ann <- varImp(fit_ann, scale = TRUE)
print(vi_ann)

## nnet variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                                            Overall
## ResponseExcellent                           100.00
## Thyroid.FunctionClinical Hypothyroidism      79.24
## ResponseIndeterminate                        76.51
## RiskLow                                      52.74
## ResponseStructural Incomplete                48.05
## Thyroid.FunctionEuthyroid                    45.73
## StageII                                      45.35
## NN1b                                         43.45
## PathologyHurthel cell                        42.00
## FocalityUni-Focal                            39.57
## TT2                                          27.32
## AdenopathyLeft                               26.43
## TT4a                                         24.90
## StageIVA                                     24.48
## RiskIntermediate                             23.60
## Hx.SmokingYes                                23.54
## TT1b                                         21.52
## Thyroid.FunctionSubclinical Hypothyroidism   21.49
## Physical.ExaminationNormal                   20.93
## GenderM                                      20.56

plot(vi_ann, top = 20)

5.7.6 Conclusion

The artificial neural network (ANN) achieves strong performance with accuracy of 0.93, sensitivity of 0.90, and specificity of 0.95, supported by an excellent AUC of 0.97. Compared to logistic regression and SVM, which both reach higher accuracy (0.96), slightly better calibration (Brier ≈ 0.035–0.037 vs. 0.043), and equally high AUC, the ANN is competitive but somewhat less precise in positive predictions (PPV = 0.86 vs. ~0.91–0.95). The decision tree matches SVM and logistic regression in accuracy (0.96) but lags in discrimination (AUC = 0.91), though its calibration is very strong (Brier ≈ 0.031). The random forest delivers excellent discrimination (AUC = 0.97) but sacrifices sensitivity (0.76) and accuracy (0.92), tending to under-detect recurrences despite high specificity. Finally, KNN stands out as the most unbalanced: it achieves perfect specificity (1.00) and PPV (1.00), but at the cost of poor sensitivity (0.67), missing many true recurrences, and a higher Brier score (0.073).

6. Summary

# Helper function to extract metrics
extract_metrics <- function(cm, auc_val) {
  out <- data.frame(
    Accuracy    = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    PPV         = cm$byClass["Pos Pred Value"],
    NPV         = cm$byClass["Neg Pred Value"],
    AUC         = as.numeric(auc_val)
  )
  return(out)
}

# Collect metrics for all models
results <- list(
  "Logistic Regression" = extract_metrics(cm_glm, auc_val_glm),
  "KNN"                 = extract_metrics(cm_knn, auc_val_knn),
  "Decision Tree"       = extract_metrics(cm_dt,  auc_val_dt),
  "Random Forest"       = extract_metrics(cm_rf,  auc_val_rf),
  "SVM"                 = extract_metrics(cm_svm, auc_val_svm),
  "ANN"                 = extract_metrics(cm_ann, auc_val_ann)
)

# Combine into one data frame
results_df <- do.call(rbind, results)

# Round for readability
results_df <- round(results_df, 3)

# Display
knitr::kable(results_df, caption = "Model Performance Summary")

Model Performance Summary
	Accuracy	Sensitivity	Specificity	PPV	NPV	AUC
Logistic Regression	0.961	0.952	0.964	0.909	0.981	0.967
KNN	0.908	0.667	1.000	1.000	0.887	0.958
Decision Tree	0.961	0.905	0.982	0.950	0.964	0.912
Random Forest	0.921	0.762	0.982	0.941	0.915	0.970
SVM	0.961	0.905	0.982	0.950	0.964	0.970
ANN	0.934	0.905	0.945	0.864	0.963	0.973

Across all six models, performance was generally strong, with accuracies above 90% and AUC values consistently high, indicating excellent discrimination between recurrent and non-recurrent cases.
Logistic regression and SVM provide the most balanced performance, with high accuracy (~0.96), strong sensitivity and specificity, excellent AUC (~0.97), and reliable calibration, making them the most dependable overall.
Decision trees achieve similar accuracy (0.96) and very high specificity (0.98), but their discrimination (AUC = 0.91) is weaker, though they remain interpretable and practical.
Random forest delivers excellent discrimination (AUC = 0.97) but lower sensitivity (0.76) and overall accuracy (0.92), showing a tendency to under-detect recurrences despite high specificity.
KNN and ANN sit at opposite ends: KNN is overly conservative with perfect specificity and PPV but very low sensitivity (0.67), while ANN achieves the highest AUC (0.97) but lower PPV (0.86), reflecting a higher false-positive rate.

Thank you for reading!