This vignette introduces assign_id(). See the function documentation for additional background and examples.

assign_id() helps you assign a unique numeric identifier to each subject in your data set. Its purpose is to ensure the same numeric identifier is consistently assigned to each subject throughout a project.

Below, we walk through the two main use cases of assign_id(): assigning IDs for the first time, and assigning IDs to match an existing file.

Load packages

library(mrgda)
library(dplyr)

Setup

We will use the example derived PK data set from mrgda.

For the purposes of this vignette, a subject identifier column USUBJID is created and the ID column is removed.

pk_data <-
  readr::read_csv(system.file("derived/pk.csv", package = "mrgda")) %>%
  mutate(USUBJID = paste0(STUDYID, "-", ID)) %>%
  select(-ID)

Initial ID assignment

When creating a new data set where IDs have not been assigned previously, assign_id() requires only two arguments:

  • the data to append the ID column to (.data)
  • the subject identifier column name (.subject_col)

data_w_id <-
  pk_data %>%
  assign_id(.subject_col = "USUBJID")
## ┌ ID Summary ───────────────────────────────────────────┐
## │                                                       │
## │   Number of subjects detected and assigned IDs: 178   │
## │                                                       │
## └───────────────────────────────────────────────────────┘

A summary is printed to the R console, reporting how many unique subjects were assigned an ID. Note also that the only change relative to pk_data is one additional column (ID): the number of rows is unchanged, and 28 columns become 29.

## [1] "Size of pk_data 2534 x 28"
## [1] "Size of data_w_id 2534 x 29"
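
To make the idea concrete, here is a minimal base R sketch of a first-time assignment on a toy data frame. The helper assign_first_ids() is hypothetical and is not how mrgda implements assign_id(); it assumes IDs are handed out in order of first appearance, which matches the output shown in this vignette but is not confirmed by it.

```r
# Hypothetical helper (not part of mrgda): number subjects in order of
# first appearance and append the result as a new ID column.
assign_first_ids <- function(df, subject_col) {
  subjects <- unique(df[[subject_col]])        # one entry per subject
  df$ID <- match(df[[subject_col]], subjects)  # position in that list = ID
  df
}

# Toy data: three subjects, some with repeated records.
toy <- data.frame(
  USUBJID = c("S-1", "S-1", "S-2", "S-3", "S-2"),
  DV      = c(1.2, 0.8, 2.1, 0.5, 1.9)
)

toy_w_id <- assign_first_ids(toy, "USUBJID")

toy_w_id$ID    # 1 1 2 3 2
dim(toy)       # 5 2
dim(toy_w_id)  # 5 3: same rows, one extra column
```

As with assign_id(), every record of a subject receives the same ID, and the input gains exactly one column.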

Existing ID assignment

When data sets are updated or new subsets are created, assign_id() ensures the ID mapping is consistent throughout. To do this, we use the .previously_derived_path argument, which points to a file containing the previously derived data.

First we will create a new data set, pk_data_reordered, which has the same contents as pk_data but is sorted by USUBJID, so its subjects appear in a different order.

pk_data_reordered <-
  pk_data %>%
  arrange(USUBJID)

head(pk_data_reordered %>% distinct(USUBJID))
## # A tibble: 6 × 1
##   USUBJID   
##   <chr>     
## 1 STUDY-X-1 
## 2 STUDY-X-10
## 3 STUDY-X-11
## 4 STUDY-X-12
## 5 STUDY-X-13
## 6 STUDY-X-14
head(pk_data %>% distinct(USUBJID))
## # A tibble: 6 × 1
##   USUBJID  
##   <chr>    
## 1 STUDY-X-1
## 2 STUDY-X-2
## 3 STUDY-X-3
## 4 STUDY-X-4
## 5 STUDY-X-5
## 6 STUDY-X-6

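Because the subjects now appear in a different order, assigning IDs from scratch could give some subjects different IDs. However, when we tell assign_id() to match the ID assignment of data_w_id, the mapping stays consistent. Here, lookup_path is assumed to point to a saved CSV copy of data_w_id.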

data_w_id_reordered <-
  pk_data_reordered %>%
  assign_id(.previously_derived_path = lookup_path, .subject_col = "USUBJID")
## ┌ ID Summary ───────────────────────────────────────────┐
## │                                                       │
## │   Number of subjects detected and assigned IDs: 178   │
## │                                                       │
## └───────────────────────────────────────────────────────┘

Notice that despite the sorting differences, both data_w_id and data_w_id_reordered have the same ID assignment.

head(data_w_id %>% distinct(USUBJID, ID) %>% arrange(ID))
## # A tibble: 6 × 2
##   USUBJID      ID
##   <chr>     <dbl>
## 1 STUDY-X-1     1
## 2 STUDY-X-2     2
## 3 STUDY-X-3     3
## 4 STUDY-X-4     4
## 5 STUDY-X-5     5
## 6 STUDY-X-6     6
head(data_w_id_reordered %>% distinct(USUBJID, ID) %>% arrange(ID))
## # A tibble: 6 × 2
##   USUBJID      ID
##   <chr>     <int>
## 1 STUDY-X-1     1
## 2 STUDY-X-2     2
## 3 STUDY-X-3     3
## 4 STUDY-X-4     4
## 5 STUDY-X-5     5
## 6 STUDY-X-6     6
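
Comparing the first six rows by eye does not cover every subject. One way to check the whole mapping is sketched below in base R, using two small toy mappings rather than the vignette data. The object names map_a, map_b, and same_mapping are illustrative only.

```r
# Two toy subject-to-ID mappings: same assignments, different row order.
# ID is double in one and integer in the other, mirroring the <dbl> and
# <int> columns in the output above.
map_a <- data.frame(USUBJID = c("S-1", "S-2", "S-3"), ID = c(1, 2, 3))
map_b <- data.frame(USUBJID = c("S-3", "S-1", "S-2"), ID = c(3L, 1L, 2L))

# Join on the subject identifier so row order no longer matters.
merged <- merge(map_a, map_b, by = "USUBJID", suffixes = c("_a", "_b"))

# The mappings agree if every subject is matched and every ID pair is equal.
same_mapping <- nrow(merged) == nrow(map_a) &&
  nrow(merged) == nrow(map_b) &&
  all(merged$ID_a == merged$ID_b)

same_mapping  # TRUE
```

For the vignette data, map_a and map_b would be the distinct USUBJID and ID pairs from data_w_id and data_w_id_reordered. The comparison uses == rather than identical() so that the difference in column type does not count as a mismatch.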

New subjects

If your data contains subjects that were not present in the previously derived data set, each of them is assigned a new unique ID.

assign_id() does this by finding the maximum ID present in the previous data and setting all new IDs higher than it. An example of this is shown below.

old_data <- dplyr::tibble(USUBJID = c("A", "B", "C", "D"), ID = c(1, 2, 4, 19))
lookup_path <- paste0(tempfile(), ".csv")
mrgda:::write_csv_dots(
  x = old_data,
  file = lookup_path
)

data <- dplyr::tibble(USUBJID = c("E", "F", "G"))
assign_id(.data = data, .subject_col = "USUBJID", .previously_derived_path = lookup_path)
## ┌ ID Summary ─────────────────────────────────────────┐
## │                                                     │
## │   Number of subjects detected and assigned IDs: 3   │
## │                                                     │
## └─────────────────────────────────────────────────────┘
## # A tibble: 3 × 2
##   USUBJID    ID
##   <chr>   <dbl>
## 1 E          20
## 2 F          21
## 3 G          22

Note that the largest ID in the previous data set was 19. Since subjects E, F, and G were not present in old_data, their IDs were all set to values larger than 19 (20, 21, and 22).
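
The rule described above can be sketched in a few lines of base R. The function assign_ids_from_lookup() is a hypothetical illustration, not mrgda's implementation: it assumes the lookup is a data frame with USUBJID and ID columns, and that new subjects are numbered consecutively from the previous maximum in order of first appearance.

```r
# Hypothetical sketch (not part of mrgda): reuse known IDs and give new
# subjects IDs above the largest ID already in the lookup.
assign_ids_from_lookup <- function(subjects, lookup) {
  # Known subjects keep their previous ID; unknown subjects get NA for now.
  ids <- lookup$ID[match(subjects, lookup$USUBJID)]
  is_new <- is.na(ids)

  # Number the new subjects starting just above the previous maximum.
  new_subjects <- unique(subjects[is_new])
  start <- if (nrow(lookup) > 0) max(lookup$ID) else 0
  new_ids <- start + seq_along(new_subjects)

  ids[is_new] <- new_ids[match(subjects[is_new], new_subjects)]
  ids
}

lookup <- data.frame(USUBJID = c("A", "B", "C", "D"), ID = c(1, 2, 4, 19))

assign_ids_from_lookup(c("E", "F", "G"), lookup)       # 20 21 22
assign_ids_from_lookup(c("B", "E", "D", "E"), lookup)  # 2 20 19 20
```

The second call mixes known and new subjects: B and D keep their previous IDs, while both records of E share the single new ID 20. Unused gaps in the old numbering (3 and 5 through 18) are not filled in this sketch.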