This workflow uses several key functions from the mrgda package to streamline the data assembly process. We'll highlight three of them:

- read_src_dir(): Used for ingesting source data.
- assign_id(): For assigning consistent subject identifiers.
- write_derived(): For exporting the final analysis dataset along with its metadata.

read_src_dir()
The read_src_dir() function is a crucial first step in the data assembly pipeline. It is designed to read all data files (domains) from a specified source data directory, such as an SDTM or ADaM folder.
Key Features:

- Reads all data files from the source directory supplied via the .path argument.
- File types are detected automatically when .file_types is set to "detect" (the default). Alternatively, you can specify the type.
- A subset of domains can be loaded with the .read_domains argument (e.g., c('dm', 'lb')). By default, it loads all domains.
- The returned list also includes mrgda_labels (for data labels) and mrgda_src_meta (containing metadata such as MD5 checksums, file types, and the source path).

In this workflow, read_src_dir() is used to load the SDTM source data.
# Prepare storage list for subject-level (sl) and time-varying (tv) data
out <- list(sl = list(), tv = list())
# Load PK specification file
target_spec <- ys_load(here("data/derived/pk.yaml"))
# Read SDTM source data from the '100' subdirectory
src_100 <- read_src_dir(here("data", "source", "100"))
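To see what read_src_dir() returns, you can inspect the resulting list directly; a minimal sketch (output not shown):
# The result is a named list: one element per domain,
# plus mrgda_labels and mrgda_src_meta
names(src_100)
# Metadata captured at read time (MD5 checksums, file types, source path)
src_100$mrgda_src_meta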
First, we create a base dataframe from the DM domain for further processing.
# Source data for demographics processing.
# This dataframe will be used as the base for deriving SEX, RACE, and base demographic components.
dm_for_processing <- src_100$dm %>% filter(ACTARM != "NOT ASSIGNED")
Next, we process the SEX variable.
# Process SEX variable, mapping character codes to numeric values.
# This creates a tibble with USUBJID and the processed SEX, stored in out$sl$sex.
out$sl$sex <-
dm_for_processing %>%
mutate(
USUBJID,
SEX = case_when(
SEX == "F" ~ 1, # Female to 1
SEX == "M" ~ 2, # Male to 2
TRUE ~ -99 # Other/Missing to -99
),
.keep = "none"
)
Then, we process the RACE variable.
# Process RACE variable, mapping character codes to numeric values.
# This creates a tibble with USUBJID and the processed RACE, stored in out$sl$race.
out$sl$race <-
dm_for_processing %>%
mutate(
USUBJID,
RACE = case_when(
RACE == "WHITE" ~ 1,
RACE == "BLACK OR AFRICAN AMERICAN" ~ 2,
RACE == "AMERICAN INDIAN OR ALASKA NATIVE" ~ 3,
RACE == "OTHER" ~ 6,
TRUE ~ -99 # Other/Missing to -99
),
.keep = "none"
)
Finally, we create the age-related demographic information (STUDYID, BLAGE).
# Create the age-related demographic information table, including USUBJID, STUDYID,
# and BLAGE (derived from AGE). This is stored in out$sl$age.
out$sl$age <-
dm_for_processing %>%
mutate(
USUBJID,
STUDYID,
BLAGE = AGE, # Assign AGE to BLAGE
.keep = "none"
)
# Former assertion on a combined 'dm' table is removed as 'dm' is now assembled by 'reduce'.
# Specific assertions for SEX and RACE could be added in their respective processing chunks if needed.
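For example, the checks suggested above could look like this sketch, using the numeric codes from the mappings in the preceding chunks:
# Sketch: confirm the processed values fall within the expected code sets
assert_that(all(out$sl$sex$SEX %in% c(1, 2, -99)))
assert_that(all(out$sl$race$RACE %in% c(1, 2, 3, 6, -99)))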
This chunk processes the Vital Signs (VS) data to derive baseline weight and BMI.
# Step 1: Filter VS data for relevant baseline records (WEIGHT and HEIGHT)
# for subjects present in the demographic processing data.
vs_1 <-
src_100$vs %>%
filter(
USUBJID %in% dm_for_processing$USUBJID, # Subjects must be in the demographic data
VSBLFL == "Y", # Baseline flag must be 'Y'
VSTESTCD %in% c("WEIGHT", "HEIGHT") # Only include WEIGHT and HEIGHT tests
)
# Step 2: Select only the necessary columns: subject ID, test code,
# and standard result in numeric format.
vs_2 <-
vs_1 %>%
select(USUBJID, VSTESTCD, VSSTRESN)
# Step 3: Pivot the data from a long format to a wide format,
# so that WEIGHT and HEIGHT become separate columns.
vs_3 <-
vs_2 %>%
pivot_wider(names_from = VSTESTCD, values_from = VSSTRESN)
# Step 4: Calculate Baseline Weight (BLWT) and Baseline BMI (BLBMI),
# and assign the result to out$sl$vs.
out$sl$vs <-
vs_3 %>%
mutate(
USUBJID,
BLWT = WEIGHT, # Baseline Weight
BLBMI = WEIGHT / ((HEIGHT / 100)^2), # Baseline BMI; HEIGHT / 100 converts cm to m
.keep = "none"
)
# Assertions:
# 1. Check for duplicate USUBJIDs in the final baseline vital signs data.
# 2. Check for any NA values in the final baseline vital signs data.
assert_that(!any(duplicated(out$sl$vs$USUBJID)), !anyNA(out$sl$vs))
This chunk filters and transforms PK concentration data.
# Filter PK data (pc) for relevant subjects and tests, and exclude "NOT DONE" records.
pk_conc_01 <-
src_100$pc %>%
filter(
USUBJID %in% dm_for_processing$USUBJID, # Subject must be in demographics
PCTEST == "DRUG-X", # Specific drug test
PCSTAT != "NOT DONE" # Status is not "NOT DONE"
)
# Mutate to create analysis variables: DV, BLQ, EVID, MDV, DATETIME, LLOQ.
out$tv$pc <-
pk_conc_01 %>%
mutate(
USUBJID,
DV = PCSTRESN, # Dependent variable (concentration)
BLQ = if_else(PCORRES == "<LLOQ", 1, 0), # Below LLOQ flag
EVID = if_else(BLQ == 1, 2, 0), # Event ID (2 for BLQ observation, 0 otherwise)
MDV = if_else(is.na(DV), 1, 0), # Missing DV flag
DATETIME = ymd_hms(PCDTC), # Standardized date-time
LLOQ = PCLLOQ, # Lower Limit of Quantification
.keep = "none"
)
# Assertion: Check for duplicate records based on USUBJID and DATETIME.
assert_that(!any(duplicated(out$tv$pc[c("USUBJID", "DATETIME")])))
This chunk filters and transforms dosing event data from the EX domain.
# Filter dosing data (ex) for relevant subjects.
ex_filtered <-
src_100$ex %>%
filter(USUBJID %in% dm_for_processing$USUBJID) # Subject must be in demographics
# Mutate to create analysis variables: AMT, EVID, DATETIME, MDV.
out$tv$dosing <-
ex_filtered %>%
mutate(
USUBJID,
AMT = EXDOSE, # Dose amount
EVID = 1, # Event ID (1 for dosing event)
DATETIME = ymd_hms(EXSTDTC), # Standardized date-time of dosing
MDV = 1, # MDV is 1 for dosing records
.keep = "none"
)
# Assertion: Check for any placeholder -99 values.
assert_that(!any(out$tv$dosing == -99, na.rm = TRUE))
This chunk extracts time-varying body weight measurements from the VS domain.
# Filter VS data for "WEIGHT" tests to get time-varying weights.
vs_weight_records <-
src_100$vs %>%
filter(VSTEST == "WEIGHT")
# Mutate to create analysis variables: WT, DATETIME, EVID (placeholder).
out$tv$wt <-
vs_weight_records %>%
mutate(
USUBJID,
WT = VSSTRESN, # Body weight
DATETIME = ymd_hms(VSDTC), # Standardized date-time of measurement
EVID = -99, # Placeholder EVID, to be handled later
.keep = "none"
)
This chunk combines the subject-level and time-varying data into a single dataset. The out$sl list now contains age, sex, race, and vs; the reduce function will join these tables by USUBJID.
# Step 1: Combine all time-varying datasets (PK, dosing, time-varying weights)
# from the 'out$tv' list by stacking them vertically.
tv_combined_data <- bind_rows(out$tv)
# Step 2: Combine all subject-level datasets (age, sex, race, baseline VS)
# from the 'out$sl' list by iteratively joining them by USUBJID.
sl_combined_data <- reduce(out$sl, full_join, by = "USUBJID")
# Step 3: Join the combined time-varying data with the combined subject-level data.
# This uses a left_join, keeping all rows from tv_combined_data.
pk_joined_intermediate <- left_join(tv_combined_data, sl_combined_data, by = "USUBJID")
# Step 4: Arrange the resulting dataset by USUBJID and then by DATETIME
# for chronological order within each subject.
pk_00 <- arrange(pk_joined_intermediate, USUBJID, DATETIME)
This chunk processes baseline weight and removes placeholder rows for time-varying weights.
# Step 1: Group by USUBJID and carry forward (and backward using "downup")
# the WT (weight) column. This populates missing weights with the last known value.
pk_01_filled <-
pk_00 %>%
group_by(USUBJID) %>%
fill(WT, .direction = "downup") %>%
ungroup()
# Step 2: Filter out rows that were placeholders for time-varying weights (EVID == -99).
pk_01 <-
pk_01_filled %>%
filter(EVID != -99)
This chunk derives columns commonly used in NONMEM modeling.
# Step 1: Calculate FIRST_DOSE_TIME, N_DOSES, and TIME.
# - FIRST_DOSE_TIME: The earliest DATETIME where EVID is 1 (a dose).
# - N_DOSES: Cumulative count of dosing events (EVID == 1), NA replaced with 0.
# - TIME: Time in hours since the FIRST_DOSE_TIME.
pk_02_timecalc <-
pk_01 %>%
group_by(USUBJID) %>%
mutate(
FIRST_DOSE_TIME = min(DATETIME[EVID == 1], na.rm = TRUE),
N_DOSES = replace_na(cumsum(EVID == 1), 0),
TIME = as.numeric(difftime(DATETIME, FIRST_DOSE_TIME, units = "hours"))
) %>%
ungroup()
# Step 2: Calculate TAD (Time After Dose) and DOSE_HAD_PK.
# These are calculated per subject and per dose number (N_DOSES).
# - TAD: Time in hours since the start of the current dose.
# - DOSE_HAD_PK: A flag indicating if any PK sample (EVID == 0) exists for the current dose.
pk_02_tadcalc <-
pk_02_timecalc %>%
group_by(USUBJID, N_DOSES) %>%
mutate(
TAD = as.numeric(difftime(DATETIME, min(DATETIME), units = "hours")),
DOSE_HAD_PK = any(EVID == 0, na.rm = TRUE)
) %>%
ungroup()
# Step 3: Calculate NEW_DOSE and OCC (Occasion).
# - NEW_DOSE: A flag indicating if the current N_DOSES is different from the previous row's.
# - OCC: An occasion number, incrementing each time a NEW_DOSE occurs AND that dose
# interval had a PK sample (DOSE_HAD_PK). It's 0 if N_DOSES is 0.
pk_02_occcalc <-
pk_02_tadcalc %>%
group_by(USUBJID) %>%
mutate(
NEW_DOSE = N_DOSES != lag(N_DOSES, default = first(N_DOSES)),
OCC = cumsum(NEW_DOSE & DOSE_HAD_PK) * (N_DOSES != 0)
) %>%
ungroup()
# Assign the final prepared dataset for this stage
pk_02 <- pk_02_occcalc
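As a quick sanity check on the occasion logic, you could tabulate OCC within each subject (a sketch, not part of the original workflow):
# Sketch: count records per subject and occasion
pk_02 %>% count(USUBJID, OCC)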
This chunk ensures that dose amounts (AMT) are carried forward to observation records.
# Step 1: Initialize a DOSE column with the AMT (dose amount).
# AMT is typically present only on dosing records (EVID == 1).
pk_03_amtinit <-
pk_02 %>%
mutate(DOSE = AMT)
# Step 2: Group by USUBJID and carry forward (and backward using "downup") the DOSE amount.
# This populates the DOSE column for observation records based on the last known dose.
pk_03 <-
pk_03_amtinit %>%
group_by(USUBJID) %>%
fill(DOSE, .direction = "downup") %>%
ungroup()
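A simple follow-up check, sketched in the same style as the earlier assertions (this assumes every subject has at least one dosing record):
# Sketch: every record should now carry a dose amount
assert_that(!anyNA(pk_03$DOSE))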
assign_id()
The assign_id() function creates a new column in your dataset, typically named ID, containing a unique numerical identifier for each individual subject.
Here’s a breakdown of its core functionality:
- Identifying Subjects: The function first needs to know which column in your current dataset already holds the unique subject identifiers (e.g., USUBJID). You specify this using the .subject_col argument.
- Leveraging Previous Identifiers (for consistency): If an earlier version of the derived dataset exists, you can point to it via the .previously_derived_path argument. For subjects present in that dataset, assign_id() ensures they receive the exact same numerical ID they had in the previous version. This is vital for maintaining consistency and traceability of subjects across different iterations of your analysis dataset (see the sketch after the code below).
- Assigning New Identifiers: For subjects not found in the .previously_derived_path dataset (or if no previous dataset path was given), assign_id() will assign them a brand new, unique numerical ID.
- Outputting the ID Column: After determining the appropriate numerical ID for every subject, the function adds a new column named ID (by default) to your dataset, populating it with these unique numerical identifiers.
# Adds an integer ID column using the assign_id function.
pk_04 <- assign_id(pk_03, .subject_col = "USUBJID")
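When iterating on an existing deliverable, you could also point assign_id() at the previous version so subjects keep their IDs across releases; a sketch, assuming a prior pk.csv exists at this path:
# Sketch: re-use IDs from the previously derived dataset
pk_04 <- assign_id(
  pk_03,
  .subject_col = "USUBJID",
  .previously_derived_path = here("data/derived/pk.csv")
)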
This chunk adds final columns (NUM, C) and reorders all columns according to the target specification.
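A minimal sketch of this step, assuming NUM is a simple row counter, C is an empty comment/exclusion flag, and the column order comes from the names in target_spec:
# Sketch: add NUM and C, then order columns per the specification
pk_final <-
  pk_04 %>%
  mutate(
    NUM = 1:n(),       # unique row number across the dataset
    C = NA_character_  # comment column; NA unless a record is flagged
  ) %>%
  select(all_of(names(target_spec)))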
This chunk validates the final dataset (pk_final) against the target_spec.
ys_check(pk_final, target_spec)
write_derived()
The write_derived() function is responsible for exporting your final, processed dataset and creating a comprehensive set of associated metadata. This promotes reproducibility and makes it easier to track dataset versions and understand their contents.
Key Features:
Main arguments:

- .data: The R data frame to be written.
- .spec: A yspec object that defines the data specifications (column names, types, labels, etc.). The function checks the data against this specification before writing.
- .file: The path and filename for the output CSV file. An xpt version of the dataset is also written (via haven::write_xpt).

Metadata outputs:

- spec-list.yml: A YAML file containing the data specification (derived from the yspec object).
- A define document (via yspec::render_fda_define).
- subject-columns.yml: A YAML file identifying subject-level columns.
- history.csv: Tracks changes to the dataset across versions, including user, date/time, comments, and revision numbers (if applicable).
- sys-info.yml: System information at the time of writing (user, R version, OS).
- dependencies.yml: If run within an RStudio project, it attempts to find scripts that use this derived file.
- diffs.csv & subject-diffs.csv: If a previous version of the dataset is available (either locally or from SVN, controlled by .compare_from_svn), it generates diff reports highlighting changes between versions.

Comparison arguments:

- .prev_file: Path to the previous version of the CSV file (defaults to .file if not specified).
- .compare_from_svn: Logical; if TRUE, it compares against the latest SVN version of .prev_file.

Return value: NULL, invisibly. If .return_base_compare is TRUE, it returns a list containing the current and previous versions of the datasets used for comparison (see the sketch after the code below).

This function ensures that derived datasets are not just saved as simple CSVs but are accompanied by rich metadata that is essential for regulatory submissions, collaboration, and long-term project maintenance.
# Write the final PK dataset and associated metadata.
write_derived(pk_final, target_spec, here("data/derived/pk.csv"))
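Per the return-value behavior described above, the comparison datasets can also be captured directly; a sketch (arguments as named in the feature list, not run here):
# Sketch: write the dataset and keep the current/previous versions used for diffing
base_compare <- write_derived(
  pk_final,
  target_spec,
  here("data/derived/pk.csv"),
  .return_base_compare = TRUE
)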
# Display session info for reproducibility.
sessionInfo()