Clinical Data Interchange Standards Consortium (CDISC) Data Standardization with R and SAS

In this project, we provide a step-by-step walkthrough of a CDISC clinical data standardization workflow, with each step implemented in both R and SAS. Starting from raw data collected through electronic Case Report Forms (eCRFs) following the CDASH standard, the project demonstrates how these data are transformed into SDTM domains, then into ADaM analysis datasets, and finally summarized as Tables, Listings, and Figures (TLFs).

Project Objectives:

Simulate raw CDASH-like data (e.g., demographics, laboratory results)
Perform data cleaning and harmonization (variable renaming, visit mapping, missing value checks)
Derive SDTM-compliant domains (DM, LB)
Build ADaM datasets (ADSL, ADLB), including derived variables such as baseline, change from baseline, and analysis flags
Generate summary statistics and visualizations (mean change from baseline by treatment arm)
Demonstrate traceability from raw CDASH to final analytical results

Loading necessary R packages

library(tidyverse)

1. CDASH: Raw Data Collection

In this example, we simulate data for six patients (IDs 1–6) enrolled in a clinical trial comparing a new treatment, Drug A, with placebo. Each participant is randomly assigned to either the treatment or placebo group, and hemoglobin is measured at three scheduled visits: Screening, Week 4, and Week 8. These raw observations stand in for the data captured through electronic Case Report Forms (eCRFs) following the CDASH (Clinical Data Acquisition Standards Harmonization) standard, before any cleaning or transformation into SDTM format.

# Demographics form (raw CDASH-like)
raw_demographics <- data.frame(
  Subject_ID = c(1:6),
  Study_ID = "ABC123",
  Sex = c("M", "F", "M", "F", "M", "F"),
  Age_Years = c(54, 61, 49, 58, 62, 53),
  Treatment_Assigned = c("Drug A", "Placebo", "Drug A", "Placebo", "Drug A", "Placebo")
) |> print()

##   Subject_ID Study_ID Sex Age_Years Treatment_Assigned
## 1          1   ABC123   M        54             Drug A
## 2          2   ABC123   F        61            Placebo
## 3          3   ABC123   M        49             Drug A
## 4          4   ABC123   F        58            Placebo
## 5          5   ABC123   M        62             Drug A
## 6          6   ABC123   F        53            Placebo

# Lab form (raw CDASH-like)
raw_labs <- data.frame(
  Subject_ID = rep(1:6, each = 3),
  Visit_Name = rep(c("Screening", "Week 4", "Week 8"), times = 6),
  Test_Name = "Hemoglobin",
  Test_Result = c(13.2, 13.8, 14.0,
                  12.8, 13.1, 13.3,
                  14.1, 14.8, 15.0,
                  13.5, 13.6, 13.7,
                  13.0, 13.3, 13.5,
                  12.9, 13.0, 13.2),
  Units = "g/dL"
) |> print()

##    Subject_ID Visit_Name  Test_Name Test_Result Units
## 1           1  Screening Hemoglobin        13.2  g/dL
## 2           1     Week 4 Hemoglobin        13.8  g/dL
## 3           1     Week 8 Hemoglobin        14.0  g/dL
## 4           2  Screening Hemoglobin        12.8  g/dL
## 5           2     Week 4 Hemoglobin        13.1  g/dL
## 6           2     Week 8 Hemoglobin        13.3  g/dL
## 7           3  Screening Hemoglobin        14.1  g/dL
## 8           3     Week 4 Hemoglobin        14.8  g/dL
## 9           3     Week 8 Hemoglobin        15.0  g/dL
## 10          4  Screening Hemoglobin        13.5  g/dL
## 11          4     Week 4 Hemoglobin        13.6  g/dL
## 12          4     Week 8 Hemoglobin        13.7  g/dL
## 13          5  Screening Hemoglobin        13.0  g/dL
## 14          5     Week 4 Hemoglobin        13.3  g/dL
## 15          5     Week 8 Hemoglobin        13.5  g/dL
## 16          6  Screening Hemoglobin        12.9  g/dL
## 17          6     Week 4 Hemoglobin        13.0  g/dL
## 18          6     Week 8 Hemoglobin        13.2  g/dL

SAS

/* --- Demographics form (raw CDASH-like) --- */
data raw_demographics;
    length Study_ID $6 Sex $1 Treatment_Assigned $10;
    do Subject_ID = 1 to 6;
        Study_ID = "ABC123";
        if mod(Subject_ID,2)=1 then Sex = "M";
        else Sex = "F";
        select (Subject_ID);
            when (1,3,5) Treatment_Assigned = "Drug A";
            otherwise     Treatment_Assigned = "Placebo";
        end;
        select (Subject_ID);
            when (1) Age_Years = 54;
            when (2) Age_Years = 61;
            when (3) Age_Years = 49;
            when (4) Age_Years = 58;
            when (5) Age_Years = 62;
            when (6) Age_Years = 53;
        end;
        output;
    end;
run;

proc print data=raw_demographics;
    title "Raw CDASH-Like Demographics Data";
run;

/* --- Lab form (raw CDASH-like) --- */
data raw_labs;
    length Visit_Name $10 Test_Name $20 Units $5;
    array results[18] _temporary_ (13.2 13.8 14.0
                                   12.8 13.1 13.3
                                   14.1 14.8 15.0
                                   13.5 13.6 13.7
                                   13.0 13.3 13.5
                                   12.9 13.0 13.2);
    do i = 1 to 6;                     /* 6 subjects */
        Subject_ID = i;
        Units = "g/dL";
        Test_Name = "Hemoglobin";
        do j = 1 to 3;                 /* 3 visits per subject */
            select (j);
                when (1) Visit_Name = "Screening";
                when (2) Visit_Name = "Week 4";
                when (3) Visit_Name = "Week 8";
            end;
            Test_Result = results[(i-1)*3 + j];
            output;
        end;
    end;
    drop i j;
run;

proc print data=raw_labs;
    title "Raw CDASH-Like Laboratory Data";
run;

2. Cleaning and Quality Control (Pre-SDTM)

In this step, we clean and standardize the raw CDASH datasets to prepare them for SDTM mapping. The code below renames variables to align with CDISC SDTM naming conventions (e.g., Subject_ID → USUBJID, Age_Years → AGE), ensures consistent subject identifiers, and harmonizes visit labels by changing “Screening” to “Baseline.” This process produces clean, structured datasets for both demographics and laboratory results, ready for integration into SDTM domains.

# --- Clean Demographics ---
clean_DM <- raw_demographics |> 
  rename(USUBJID = Subject_ID,
         STUDYID = Study_ID,
         SEX = Sex,
         AGE = Age_Years,
         ARM = Treatment_Assigned) |> 
  mutate(USUBJID = paste0("SUBJ", USUBJID)) |> print()

##   USUBJID STUDYID SEX AGE     ARM
## 1   SUBJ1  ABC123   M  54  Drug A
## 2   SUBJ2  ABC123   F  61 Placebo
## 3   SUBJ3  ABC123   M  49  Drug A
## 4   SUBJ4  ABC123   F  58 Placebo
## 5   SUBJ5  ABC123   M  62  Drug A
## 6   SUBJ6  ABC123   F  53 Placebo

# --- Clean Labs ---
clean_LB <- raw_labs |> 
  rename(USUBJID = Subject_ID,
         VISIT = Visit_Name,
         LBTEST = Test_Name,
         LBSTRESN = Test_Result,
         LBSTRESU = Units) |> 
  mutate(USUBJID = paste0("SUBJ", USUBJID),
         VISIT = ifelse(VISIT == "Screening", "Baseline", VISIT)) |> print()

##    USUBJID    VISIT     LBTEST LBSTRESN LBSTRESU
## 1    SUBJ1 Baseline Hemoglobin     13.2     g/dL
## 2    SUBJ1   Week 4 Hemoglobin     13.8     g/dL
## 3    SUBJ1   Week 8 Hemoglobin     14.0     g/dL
## 4    SUBJ2 Baseline Hemoglobin     12.8     g/dL
## 5    SUBJ2   Week 4 Hemoglobin     13.1     g/dL
## 6    SUBJ2   Week 8 Hemoglobin     13.3     g/dL
## 7    SUBJ3 Baseline Hemoglobin     14.1     g/dL
## 8    SUBJ3   Week 4 Hemoglobin     14.8     g/dL
## 9    SUBJ3   Week 8 Hemoglobin     15.0     g/dL
## 10   SUBJ4 Baseline Hemoglobin     13.5     g/dL
## 11   SUBJ4   Week 4 Hemoglobin     13.6     g/dL
## 12   SUBJ4   Week 8 Hemoglobin     13.7     g/dL
## 13   SUBJ5 Baseline Hemoglobin     13.0     g/dL
## 14   SUBJ5   Week 4 Hemoglobin     13.3     g/dL
## 15   SUBJ5   Week 8 Hemoglobin     13.5     g/dL
## 16   SUBJ6 Baseline Hemoglobin     12.9     g/dL
## 17   SUBJ6   Week 4 Hemoglobin     13.0     g/dL
## 18   SUBJ6   Week 8 Hemoglobin     13.2     g/dL

# --- Quality checks ---
sum(is.na(clean_DM))  # missingness check

## [1] 0

sum(is.na(clean_LB))

## [1] 0
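
The two totals above confirm that nothing is missing, but a single number does not show which column is affected, and it would not catch a duplicated record. A minimal sketch of a slightly richer check in base R follows. The helper `qc_summary()` and the small `toy_LB` data frame are hypothetical and not part of the original workflow; `toy_LB` stands in for `clean_LB` and deliberately contains one missing result and one duplicated subject-visit pair.

```r
# Hypothetical helper (an assumption, not from the original project):
# counts rows, missing values per column, and duplicated key combinations.
qc_summary <- function(df, key_cols) {
  list(
    n_rows           = nrow(df),
    n_missing        = colSums(is.na(df)),
    n_duplicate_keys = sum(duplicated(df[, key_cols, drop = FALSE]))
  )
}

# Toy stand-in for clean_LB with one missing result and one duplicated record
toy_LB <- data.frame(
  USUBJID  = c("SUBJ1", "SUBJ1", "SUBJ1", "SUBJ2"),
  VISIT    = c("Baseline", "Week 4", "Week 4", "Baseline"),
  LBSTRESN = c(13.2, 13.8, NA, 12.8)
)

qc <- qc_summary(toy_LB, key_cols = c("USUBJID", "VISIT"))
print(qc)
```

Applied to the real `clean_LB`, the same call would be expected to report zero missing values and zero duplicated keys, since each subject has exactly one record per visit.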

SAS


/* --- Clean Demographics --- */
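data clean_DM;
    /* Declare character lengths first, before SET reads the raw variables.
       SEX is the same variable as Sex in raw_demographics (SAS names are case-insensitive). */
    length USUBJID $8 STUDYID $6 SEX $1 ARM $10;
    set raw_demographics;
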
    /* Rename and create SDTM-compliant variables */
    USUBJID = cats("SUBJ", Subject_ID);
    STUDYID = Study_ID;
    AGE = Age_Years;
    ARM = Treatment_Assigned;
    keep STUDYID USUBJID SEX AGE ARM;
run;

title "Cleaned Demographics Dataset";
proc print data=clean_DM noobs;
run;

/* --- Clean Labs --- */
data clean_LB;
    set raw_labs;
    length USUBJID $8 VISIT $10 LBTEST $20 LBSTRESU $5;
    /* Rename and create SDTM-compliant variables */
    USUBJID = cats("SUBJ", Subject_ID);
    VISIT = strip(Visit_Name);
    if VISIT = "Screening" then VISIT = "Baseline";  /* Harmonize visit names */
    LBTEST = Test_Name;
    LBSTRESN = Test_Result;
    LBSTRESU = Units;
    keep USUBJID VISIT LBTEST LBSTRESN LBSTRESU;
run;

title "Cleaned Laboratory Dataset";
proc print data=clean_LB noobs;
run;

/* --- Quality Checks: Missing Value Summary --- */
/* Note: PROC MEANS summarizes numeric variables only (AGE in clean_DM, LBSTRESN in clean_LB) */
title "Missing Value Check for Cleaned Datasets";
proc means data=clean_DM n nmiss;
run;

proc means data=clean_LB n nmiss;
run;

3. SDTM Construction

In this step, we organize the cleaned data into SDTM-compliant domains following CDISC standards. Each dataset is structured with standardized variable names and assigned a corresponding domain code. The Demographics (DM) domain contains one record per subject with key attributes such as study ID, subject ID, sex, age, and treatment arm. The Laboratory Tests (LB) domain captures repeated measures of laboratory results across study visits, including test name, result, and units. These structured datasets form the foundation for subsequent ADaM derivations and statistical analyses.

# a. Demographics (DM domain)
DM <- clean_DM |>
  select(STUDYID, USUBJID, SEX, AGE, ARM) |>
  mutate(DOMAIN = "DM") |> print()

##   STUDYID USUBJID SEX AGE     ARM DOMAIN
## 1  ABC123   SUBJ1   M  54  Drug A     DM
## 2  ABC123   SUBJ2   F  61 Placebo     DM
## 3  ABC123   SUBJ3   M  49  Drug A     DM
## 4  ABC123   SUBJ4   F  58 Placebo     DM
## 5  ABC123   SUBJ5   M  62  Drug A     DM
## 6  ABC123   SUBJ6   F  53 Placebo     DM

# b. Laboratory Tests (LB domain)
LB <- clean_LB |>
  select(USUBJID, VISIT, LBTEST, LBSTRESN, LBSTRESU) |>
  mutate(STUDYID = "ABC123",
         DOMAIN = "LB") |> print()

##    USUBJID    VISIT     LBTEST LBSTRESN LBSTRESU STUDYID DOMAIN
## 1    SUBJ1 Baseline Hemoglobin     13.2     g/dL  ABC123     LB
## 2    SUBJ1   Week 4 Hemoglobin     13.8     g/dL  ABC123     LB
## 3    SUBJ1   Week 8 Hemoglobin     14.0     g/dL  ABC123     LB
## 4    SUBJ2 Baseline Hemoglobin     12.8     g/dL  ABC123     LB
## 5    SUBJ2   Week 4 Hemoglobin     13.1     g/dL  ABC123     LB
## 6    SUBJ2   Week 8 Hemoglobin     13.3     g/dL  ABC123     LB
## 7    SUBJ3 Baseline Hemoglobin     14.1     g/dL  ABC123     LB
## 8    SUBJ3   Week 4 Hemoglobin     14.8     g/dL  ABC123     LB
## 9    SUBJ3   Week 8 Hemoglobin     15.0     g/dL  ABC123     LB
## 10   SUBJ4 Baseline Hemoglobin     13.5     g/dL  ABC123     LB
## 11   SUBJ4   Week 4 Hemoglobin     13.6     g/dL  ABC123     LB
## 12   SUBJ4   Week 8 Hemoglobin     13.7     g/dL  ABC123     LB
## 13   SUBJ5 Baseline Hemoglobin     13.0     g/dL  ABC123     LB
## 14   SUBJ5   Week 4 Hemoglobin     13.3     g/dL  ABC123     LB
## 15   SUBJ5   Week 8 Hemoglobin     13.5     g/dL  ABC123     LB
## 16   SUBJ6 Baseline Hemoglobin     12.9     g/dL  ABC123     LB
## 17   SUBJ6   Week 4 Hemoglobin     13.0     g/dL  ABC123     LB
## 18   SUBJ6   Week 8 Hemoglobin     13.2     g/dL  ABC123     LB
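
The LB data frame above carries only the variables needed for this walkthrough. The SDTM LB domain also defines variables such as a short test code (LBTESTCD), a numeric visit number (VISITNUM), and a within-subject sequence number (LBSEQ). The sketch below shows one way these could be derived in base R on a toy copy of the data. The visit numbering (1, 2, 3) and the code "HGB" are assumptions made for illustration and should be checked against the study's visit schedule and CDISC controlled terminology.

```r
# Toy LB records in the same shape as the LB data frame built above
LB_toy <- data.frame(
  STUDYID  = "ABC123",
  DOMAIN   = "LB",
  USUBJID  = rep(c("SUBJ1", "SUBJ2"), each = 3),
  VISIT    = rep(c("Baseline", "Week 4", "Week 8"), times = 2),
  LBTEST   = "Hemoglobin",
  LBSTRESN = c(13.2, 13.8, 14.0, 12.8, 13.1, 13.3),
  LBSTRESU = "g/dL"
)

# Assumed visit numbering, for this illustration only
visit_map <- c("Baseline" = 1, "Week 4" = 2, "Week 8" = 3)

LB_toy$VISITNUM <- unname(visit_map[LB_toy$VISIT])
LB_toy$LBTESTCD <- "HGB"  # assumed short code for Hemoglobin

# Sort by subject and visit, then number the records within each subject
LB_toy <- LB_toy[order(LB_toy$USUBJID, LB_toy$VISITNUM), ]
LB_toy$LBSEQ <- ave(seq_along(LB_toy$USUBJID), LB_toy$USUBJID, FUN = seq_along)

print(LB_toy[, c("USUBJID", "LBSEQ", "LBTESTCD", "VISITNUM", "VISIT", "LBSTRESN")])
```

A numeric visit variable also gives later steps a chronological sort key that does not depend on how the visit labels happen to sort alphabetically.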

SAS


/* --- a. Demographics (DM Domain) --- */
data DM;
    set clean_DM;
    length DOMAIN $2;
    DOMAIN = "DM";                    /* SDTM domain code */
    keep STUDYID USUBJID SEX AGE ARM DOMAIN;
run;

title "SDTM Demographics (DM) Domain";
proc print data=DM noobs;
run;

/* --- b. Laboratory Tests (LB Domain) --- */
data LB;
    set clean_LB;
    length STUDYID $6 DOMAIN $2;
    STUDYID = "ABC123";               /* Assign study ID */
    DOMAIN = "LB";                    /* SDTM domain code */
    keep USUBJID VISIT LBTEST LBSTRESN LBSTRESU STUDYID DOMAIN;
run;

title "SDTM Laboratory Tests (LB) Domain";
proc print data=LB noobs;
run;

4. Deriving ADaM (Analysis-Ready Data)

In this step, we begin transforming the standardized SDTM domains into ADaM datasets, which are structured for statistical analysis.

4a. Merge Demographics and Labs

The first task is to merge the Demographics (DM) and Laboratory (LB) domains by their common identifiers, USUBJID and STUDYID. This merge combines subject-level information (such as age, sex, and treatment arm) with laboratory results, creating a unified dataset called ADLB_raw. This combined data serves as the foundation for deriving analysis variables such as baseline values, change from baseline, and analysis flags in subsequent steps.

ADLB_raw <- merge(LB, 
                  DM, 
                  by = c("USUBJID", "STUDYID"))|> print()

##    USUBJID STUDYID    VISIT     LBTEST LBSTRESN LBSTRESU DOMAIN.x SEX AGE
## 1    SUBJ1  ABC123 Baseline Hemoglobin     13.2     g/dL       LB   M  54
## 2    SUBJ1  ABC123   Week 4 Hemoglobin     13.8     g/dL       LB   M  54
## 3    SUBJ1  ABC123   Week 8 Hemoglobin     14.0     g/dL       LB   M  54
## 4    SUBJ2  ABC123 Baseline Hemoglobin     12.8     g/dL       LB   F  61
## 5    SUBJ2  ABC123   Week 4 Hemoglobin     13.1     g/dL       LB   F  61
## 6    SUBJ2  ABC123   Week 8 Hemoglobin     13.3     g/dL       LB   F  61
## 7    SUBJ3  ABC123 Baseline Hemoglobin     14.1     g/dL       LB   M  49
## 8    SUBJ3  ABC123   Week 4 Hemoglobin     14.8     g/dL       LB   M  49
## 9    SUBJ3  ABC123   Week 8 Hemoglobin     15.0     g/dL       LB   M  49
## 10   SUBJ4  ABC123 Baseline Hemoglobin     13.5     g/dL       LB   F  58
## 11   SUBJ4  ABC123   Week 4 Hemoglobin     13.6     g/dL       LB   F  58
## 12   SUBJ4  ABC123   Week 8 Hemoglobin     13.7     g/dL       LB   F  58
## 13   SUBJ5  ABC123 Baseline Hemoglobin     13.0     g/dL       LB   M  62
## 14   SUBJ5  ABC123   Week 4 Hemoglobin     13.3     g/dL       LB   M  62
## 15   SUBJ5  ABC123   Week 8 Hemoglobin     13.5     g/dL       LB   M  62
## 16   SUBJ6  ABC123 Baseline Hemoglobin     12.9     g/dL       LB   F  53
## 17   SUBJ6  ABC123   Week 4 Hemoglobin     13.0     g/dL       LB   F  53
## 18   SUBJ6  ABC123   Week 8 Hemoglobin     13.2     g/dL       LB   F  53
##        ARM DOMAIN.y
## 1   Drug A       DM
## 2   Drug A       DM
## 3   Drug A       DM
## 4  Placebo       DM
## 5  Placebo       DM
## 6  Placebo       DM
## 7   Drug A       DM
## 8   Drug A       DM
## 9   Drug A       DM
## 10 Placebo       DM
## 11 Placebo       DM
## 12 Placebo       DM
## 13  Drug A       DM
## 14  Drug A       DM
## 15  Drug A       DM
## 16 Placebo       DM
## 17 Placebo       DM
## 18 Placebo       DM
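
Because both domains carry a DOMAIN column, the merged output above contains `DOMAIN.x` and `DOMAIN.y`. One possible way to avoid the collision, sketched here on toy stand-ins for DM and LB (the `_toy` objects are hypothetical), is to drop DOMAIN from the subject-level data before merging and then confirm that the one-to-many merge neither lost nor duplicated laboratory records.

```r
# Toy stand-ins for the DM and LB domains built above
DM_toy <- data.frame(
  STUDYID = "ABC123",
  DOMAIN  = "DM",
  USUBJID = c("SUBJ1", "SUBJ2"),
  ARM     = c("Drug A", "Placebo")
)
LB_toy <- data.frame(
  STUDYID  = "ABC123",
  DOMAIN   = "LB",
  USUBJID  = rep(c("SUBJ1", "SUBJ2"), each = 2),
  VISIT    = rep(c("Baseline", "Week 4"), times = 2),
  LBSTRESN = c(13.2, 13.8, 12.8, 13.1)
)

# Drop DM's DOMAIN so only the LB domain code is carried into the merged data
dm_keep <- DM_toy[, setdiff(names(DM_toy), "DOMAIN")]
merged  <- merge(LB_toy, dm_keep, by = c("USUBJID", "STUDYID"))

# Sanity checks for a one-to-many merge
stopifnot(
  anyDuplicated(DM_toy$USUBJID) == 0,   # DM has one record per subject
  nrow(merged) == nrow(LB_toy),         # no lab records lost or duplicated
  !anyNA(merged$ARM)                    # every lab record found its subject
)
print(merged)
```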

SAS


proc sort data=LB; by STUDYID USUBJID; run;
proc sort data=DM; by STUDYID USUBJID; run;

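data ADLB_raw;
    /* Drop DM's DOMAIN: both inputs carry DOMAIN, and in a one-to-many merge
       the shared variable would otherwise hold "DM" on the first record of
       each subject and "LB" on the rest */
    merge LB(in=inLB) DM(in=inDM drop=DOMAIN);
    by STUDYID USUBJID;
    if inLB and inDM;               /* Keep only matched subjects */
run;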

title "ADaM Raw Dataset (ADLB_raw)";
proc print data=ADLB_raw noobs;
run;

4b. Derive Baseline, Change from Baseline, and Analysis Flags

After merging the SDTM domains, the next step is to create analysis-ready variables needed for statistical evaluation. For each subject, we identify the baseline laboratory value (the measurement recorded at the “Baseline” visit) and calculate the change from baseline (CHG) at subsequent visits. Additional variables are defined to describe the parameter (PARAM), the numeric analysis value (AVAL), and the visit label (AVISIT). Finally, an analysis flag (ANL01FL) is created to indicate which records should be included in the analysis—typically all post-baseline observations. This process ensures that the dataset is structured and traceable for consistent statistical analysis in the ADaM framework.

ADLB <- ADLB_raw |>
  group_by(USUBJID) |>
  mutate(
    BASE = LBSTRESN[VISIT == "Baseline"],
    CHG = LBSTRESN - BASE,
    PARAM = paste0(LBTEST, " (", LBSTRESU, ")"),
    AVAL = LBSTRESN,
    AVISIT = VISIT,
    ANL01FL = ifelse(VISIT != "Baseline", "Y", "N")
  ) |>
  ungroup() |> print()

## # A tibble: 18 × 17
##    USUBJID STUDYID VISIT    LBTEST  LBSTRESN LBSTRESU DOMAIN.x SEX     AGE ARM  
##    <chr>   <chr>   <chr>    <chr>      <dbl> <chr>    <chr>    <chr> <dbl> <chr>
##  1 SUBJ1   ABC123  Baseline Hemogl…     13.2 g/dL     LB       M        54 Drug…
##  2 SUBJ1   ABC123  Week 4   Hemogl…     13.8 g/dL     LB       M        54 Drug…
##  3 SUBJ1   ABC123  Week 8   Hemogl…     14   g/dL     LB       M        54 Drug…
##  4 SUBJ2   ABC123  Baseline Hemogl…     12.8 g/dL     LB       F        61 Plac…
##  5 SUBJ2   ABC123  Week 4   Hemogl…     13.1 g/dL     LB       F        61 Plac…
##  6 SUBJ2   ABC123  Week 8   Hemogl…     13.3 g/dL     LB       F        61 Plac…
##  7 SUBJ3   ABC123  Baseline Hemogl…     14.1 g/dL     LB       M        49 Drug…
##  8 SUBJ3   ABC123  Week 4   Hemogl…     14.8 g/dL     LB       M        49 Drug…
##  9 SUBJ3   ABC123  Week 8   Hemogl…     15   g/dL     LB       M        49 Drug…
## 10 SUBJ4   ABC123  Baseline Hemogl…     13.5 g/dL     LB       F        58 Plac…
## 11 SUBJ4   ABC123  Week 4   Hemogl…     13.6 g/dL     LB       F        58 Plac…
## 12 SUBJ4   ABC123  Week 8   Hemogl…     13.7 g/dL     LB       F        58 Plac…
## 13 SUBJ5   ABC123  Baseline Hemogl…     13   g/dL     LB       M        62 Drug…
## 14 SUBJ5   ABC123  Week 4   Hemogl…     13.3 g/dL     LB       M        62 Drug…
## 15 SUBJ5   ABC123  Week 8   Hemogl…     13.5 g/dL     LB       M        62 Drug…
## 16 SUBJ6   ABC123  Baseline Hemogl…     12.9 g/dL     LB       F        53 Plac…
## 17 SUBJ6   ABC123  Week 4   Hemogl…     13   g/dL     LB       F        53 Plac…
## 18 SUBJ6   ABC123  Week 8   Hemogl…     13.2 g/dL     LB       F        53 Plac…
## # ℹ 7 more variables: DOMAIN.y <chr>, BASE <dbl>, CHG <dbl>, PARAM <chr>,
## #   AVAL <dbl>, AVISIT <chr>, ANL01FL <chr>
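
The expression `LBSTRESN[VISIT == "Baseline"]` works here because every subject has exactly one Baseline record; if a subject had none, or more than one, `mutate()` would stop with a size error. As a hedged alternative, the base R sketch below looks up the baseline by subject ID, so a subject without a baseline simply gets a missing BASE and CHG. The `adlb_toy` data frame is hypothetical, and SUBJ2 is deliberately given no Baseline record.

```r
# Toy data: SUBJ2 has no Baseline record (deliberate, for illustration)
adlb_toy <- data.frame(
  USUBJID  = c("SUBJ1", "SUBJ1", "SUBJ1", "SUBJ2", "SUBJ2"),
  VISIT    = c("Baseline", "Week 4", "Week 8", "Week 4", "Week 8"),
  LBSTRESN = c(13.2, 13.8, 14.0, 13.1, 13.3)
)

# Keep the baseline rows and insist on at most one per subject
base_rows <- adlb_toy[adlb_toy$VISIT == "Baseline", ]
stopifnot(anyDuplicated(base_rows$USUBJID) == 0)

# Look up each record's baseline by subject; unmatched subjects get NA
adlb_toy$BASE <- base_rows$LBSTRESN[match(adlb_toy$USUBJID, base_rows$USUBJID)]
adlb_toy$CHG  <- adlb_toy$LBSTRESN - adlb_toy$BASE

print(adlb_toy)
```

Whether a missing baseline should yield a missing change, as here, or should halt the program is a study-level decision; the sketch only makes the behavior explicit.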

SAS


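/* "Baseline" sorts before "Week 4" and "Week 8", so it is read first per subject */
proc sort data=ADLB_raw; by USUBJID VISIT; run;

data ADLB;
    set ADLB_raw;
    by USUBJID;
    retain BASE;

    /* Reset at each new subject so a missing baseline is never inherited */
    if first.USUBJID then BASE = .;

    /* Identify baseline value */
    if VISIT = "Baseline" then BASE = LBSTRESN;

    /* Calculate change from baseline */
    CHG = LBSTRESN - BASE;

    /* Define analysis variables */
    PARAM  = catx(" ", LBTEST, cats("(", LBSTRESU, ")"));  /* "Hemoglobin (g/dL)", as in R */
    AVAL   = LBSTRESN;
    AVISIT = VISIT;
    if VISIT ne "Baseline" then ANL01FL = "Y";
    else ANL01FL = "N";
run;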

title "ADaM Laboratory Dataset (ADLB)";
proc print data=ADLB noobs;
run;

4c. Create Subject-Level ADaM (ADSL)

The ADSL (Subject-Level Analysis Dataset) contains one record per subject and serves as the cornerstone of all ADaM datasets. It provides key demographic and treatment information used to define analysis populations. In this step, we derive flags such as the Intent-to-Treat (ITTFL) and Safety (SAFFL) indicators—both set to “Y” for subjects included in the analysis. We also define the planned (TRT01P) and actual (TRT01A) treatment arms based on the variable ARM from the Demographics dataset.

ADSL <- DM |>
  mutate(
    ITTFL = "Y",
    SAFFL = "Y",
    TRT01A = ARM,
    TRT01P = ARM
  ) |> print()

##   STUDYID USUBJID SEX AGE     ARM DOMAIN ITTFL SAFFL  TRT01A  TRT01P
## 1  ABC123   SUBJ1   M  54  Drug A     DM     Y     Y  Drug A  Drug A
## 2  ABC123   SUBJ2   F  61 Placebo     DM     Y     Y Placebo Placebo
## 3  ABC123   SUBJ3   M  49  Drug A     DM     Y     Y  Drug A  Drug A
## 4  ABC123   SUBJ4   F  58 Placebo     DM     Y     Y Placebo Placebo
## 5  ABC123   SUBJ5   M  62  Drug A     DM     Y     Y  Drug A  Drug A
## 6  ABC123   SUBJ6   F  53 Placebo     DM     Y     Y Placebo Placebo
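
Two properties of ADSL are worth checking before it is used downstream: there is exactly one record per subject, and the population flags give the expected subject counts per arm. The following is a small base R sketch on a hypothetical `adsl_toy` data frame that mirrors the structure above.

```r
# Toy stand-in for ADSL (same six subjects and arms as above)
adsl_toy <- data.frame(
  USUBJID = paste0("SUBJ", 1:6),
  TRT01P  = rep(c("Drug A", "Placebo"), times = 3),
  ITTFL   = "Y",
  SAFFL   = "Y"
)

# ADSL must contain one record per subject
stopifnot(anyDuplicated(adsl_toy$USUBJID) == 0)

# Subjects per planned arm in the ITT population
big_n <- table(adsl_toy$TRT01P[adsl_toy$ITTFL == "Y"])
print(big_n)
```

With the simulated data, each arm contains three subjects, which matches the N reported per arm in the summary table.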

SAS


data ADSL;
    set DM;
    length ITTFL SAFFL $1 TRT01A TRT01P $10;   /* $10 matches ARM so arm names are not truncated */
    ITTFL  = "Y";          /* Intent-to-treat flag */
    SAFFL  = "Y";          /* Safety flag */
    TRT01A = ARM;          /* Actual treatment */
    TRT01P = ARM;          /* Planned treatment */
run;

title "ADaM Subject-Level Dataset (ADSL)";
proc print data=ADSL noobs;
run;

5. Perform Analysis (TLFs)

With the ADaM datasets prepared, the next step is to generate Tables, Listings, and Figures (TLFs) — the key outputs used in clinical study reports and regulatory submissions. In this example, we summarize the mean change from baseline in hemoglobin by treatment arm and visit, calculating the sample size, mean, and standard deviation for each group. The results are then visualized through a line plot with error bars representing the mean ± standard deviation (SD), allowing for an easy comparison of treatment effects over time.

5a. Summary Table: Mean Change from Baseline

In this step, we use the analysis-ready dataset (ADLB) to generate a summary table of hemoglobin changes from baseline. The data are filtered to include only records flagged for analysis (ANL01FL = "Y"), then grouped by treatment arm (ARM) and visit (AVISIT). For each group, we calculate the sample size (N), the mean change from baseline (Mean_Change), and its standard deviation (SD_Change).

summary_stats <- ADLB |>
  filter(ANL01FL == "Y") |>
  group_by(ARM, AVISIT) |>
  summarise(
    N = n(),
    Mean_Change = round(mean(CHG), 2),
    SD_Change = round(sd(CHG), 2),
    .groups = "drop"  # return an ungrouped tibble and silence the grouping message
  )

summary_stats

## # A tibble: 4 × 5
##   ARM     AVISIT     N Mean_Change SD_Change
##   <chr>   <chr>  <int>       <dbl>     <dbl>
## 1 Drug A  Week 4     3        0.53      0.21
## 2 Drug A  Week 8     3        0.73      0.21
## 3 Placebo Week 4     3        0.17      0.12
## 4 Placebo Week 8     3        0.33      0.15
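
To make the table traceable by hand, the first row (Drug A, Week 4) can be reproduced directly from the raw values of subjects 1, 3, and 5. The sketch below spells out the arithmetic in base R, using the sample standard deviation with an n − 1 denominator, which is what `sd()` computes.

```r
# Drug A subjects (SUBJ1, SUBJ3, SUBJ5): hemoglobin at Baseline and Week 4
baseline <- c(13.2, 14.1, 13.0)
week4    <- c(13.8, 14.8, 13.3)

chg <- week4 - baseline                      # 0.6, 0.7, 0.3

# Mean and sample standard deviation written out explicitly
mean_chg <- sum(chg) / length(chg)
sd_chg   <- sqrt(sum((chg - mean_chg)^2) / (length(chg) - 1))

result <- round(c(mean = mean_chg, sd = sd_chg), 2)
print(result)                                # mean 0.53, sd 0.21
```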

SAS


proc sort data=ADLB; 
    by ARM AVISIT; 
run;

proc means data=ADLB n mean stddev maxdec=2 nway;
    where ANL01FL = "Y";                /* Include only analysis records */
    class ARM AVISIT;
    var CHG;
    /* NWAY keeps only the ARM-by-AVISIT rows (_TYPE_ = 3) in the output dataset */
    output out=summary_stats 
        n=N 
        mean=Mean_Change 
        stddev=SD_Change;
run;


/* Display summary table */
title "Summary Statistics: Mean Change from Baseline in Hemoglobin";
proc print data=summary_stats noobs;
    var ARM AVISIT N Mean_Change SD_Change;
run;

5b. Graphical Figure (Mean ± SD)

Now, we visualize the mean change from baseline in hemoglobin over time for each treatment arm. Using both R and SAS, line plots with markers and error bars are created to display the average change (mean) and the corresponding variability (± standard deviation) at each study visit. This visualization provides a clear and intuitive comparison of how hemoglobin levels evolve across visits between the treatment and placebo groups, effectively highlighting overall trends and treatment effects.

ggplot(summary_stats, aes(x = AVISIT, y = Mean_Change, group = ARM, color = ARM)) +
  geom_line(linewidth = 1.2) +  # `linewidth` replaces `size` for lines since ggplot2 3.4.0
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Mean_Change - SD_Change, ymax = Mean_Change + SD_Change), width = 0.2) +
  labs(
    title = "Mean Change from Baseline in Hemoglobin",
    x = "Visit",
    y = "Mean Change (g/dL)",
    color = "Treatment Arm"
  ) +
  theme_classic(base_size = 13)
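
In the R plot, AVISIT is a character column, so ggplot2 places the visits in alphabetical order. That happens to be chronological for "Week 4" and "Week 8", but it would not be for a later visit such as "Week 12". A hedged sketch of the usual remedy, converting the visit label to a factor with explicit levels, is shown below; the "Week 12" visit is hypothetical and not part of this study.

```r
# Visit labels including a hypothetical later visit, "Week 12"
visits <- c("Week 12", "Week 4", "Baseline", "Week 8")

# Alphabetical sorting puts "Week 12" before "Week 4"
alphabetical <- sort(visits)
print(alphabetical)

# Explicit factor levels fix the chronological order
visit_levels <- c("Baseline", "Week 4", "Week 8", "Week 12")
avisit <- factor(visits, levels = visit_levels)

chronological <- as.character(sort(avisit))
print(chronological)
```

Applying the same idea to `summary_stats$AVISIT` before plotting would make the x-axis order independent of how the labels are spelled.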

SAS


data summary_plot;
    set summary_stats;
    where _TYPE_ = 3;                   /* Keep group-level means only */
    LowBar  = Mean_Change - SD_Change;
    HighBar = Mean_Change + SD_Change;
run;

ods graphics / width=5in height=3.5in imagename="MeanChange_Hb_NoGrid";

proc sgplot data=summary_plot noborder;
    /* ggplot2-style colors */
    styleattrs datacontrastcolors=(CXE41A1C CX377EB8)
                datasymbols=(circlefilled)
                datalinepatterns=(solid);

    /* Error bars (±1 SD) */
    highlow x=AVISIT low=LowBar high=HighBar /
             group=ARM 
             type=line 
             lineattrs=(thickness=1 pattern=solid)
             transparency=0.2 
             name="ErrBars";

    /* Mean line with markers */
    series x=AVISIT y=Mean_Change / group=ARM 
           lineattrs=(thickness=2)
           markers markerattrs=(symbol=circlefilled size=9)
           name="MeanLine";

    /* Legend for treatment arm */
    keylegend "MeanLine" / title="Treatment Arm" position=right across=1;

    /* Clean axes — no grid lines */
    xaxis label="Visit" discreteorder=data display=(noticks);
    yaxis label="Mean Change (g/dL)" display=(noticks);

    /* Title */
    title "Mean Change from Baseline in Hemoglobin";
run;

ods graphics off;

6. Traceability Table

Traceability from CDASH → SDTM → ADaM
ADaM Variable	Derived From	Description
`AVAL`	`LB.LBSTRESN`	Lab test numeric result
`BASE`	`LB.LBSTRESN` (Baseline visit)	Baseline value
`CHG`	`AVAL - BASE`	Change from baseline
`ARM`	`DM.ARM`	Treatment arm
`ANL01FL`	Derived from `VISIT`	Analysis record flag ("Y" for post-baseline records, "N" for Baseline)
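
The relationships in this table can also be checked programmatically. The following base R sketch rebuilds the derived variables for a single toy subject and asserts each rule. The `adlb_toy` data frame and the check names are hypothetical; on the real data the same comparisons would be run against `ADLB`.

```r
# Toy ADLB-like records for one subject
adlb_toy <- data.frame(
  USUBJID  = rep("SUBJ1", 3),
  VISIT    = c("Baseline", "Week 4", "Week 8"),
  LBSTRESN = c(13.2, 13.8, 14.0)
)
adlb_toy$AVAL    <- adlb_toy$LBSTRESN
adlb_toy$BASE    <- adlb_toy$LBSTRESN[adlb_toy$VISIT == "Baseline"]
adlb_toy$CHG     <- adlb_toy$AVAL - adlb_toy$BASE
adlb_toy$ANL01FL <- ifelse(adlb_toy$VISIT != "Baseline", "Y", "N")

# One logical result per traceability rule
is_baseline <- adlb_toy$VISIT == "Baseline"
checks <- c(
  AVAL_equals_LBSTRESN   = isTRUE(all.equal(adlb_toy$AVAL, adlb_toy$LBSTRESN)),
  BASE_is_baseline_value = all(adlb_toy$BASE == adlb_toy$AVAL[is_baseline]),
  CHG_is_AVAL_minus_BASE = isTRUE(all.equal(adlb_toy$CHG, adlb_toy$AVAL - adlb_toy$BASE)),
  ANL01FL_post_baseline  = all((adlb_toy$ANL01FL == "Y") == !is_baseline)
)
print(checks)
stopifnot(all(checks))
```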

7. End-to-End Flow Recap

Step	Data Standard	Description	Example Output
Data Collection	CDASH	Raw CRF data collected at site	`raw_demographics`, `raw_labs`
Data Cleaning	Pre-SDTM QC	Check completeness, consistency	`clean_DM`, `clean_LB`
Data Tabulation	SDTM	Structured data by domain	`DM`, `LB`
Analysis Preparation	ADaM	Derived, merged, analysis-ready	`ADSL`, `ADLB`
Statistical Results	TLFs	Tables, Listings, Figures	`summary_stats`, plots

8. Key Learning Points

CDASH = standardized data collection
SDTM = standardized data organization for regulators
ADaM = standardized data preparation for analysts
TLFs = the final outputs (tables, listings, figures)
Each layer ensures traceability, quality, and regulatory compliance