This is a vignette to help you become familiar with using assign_id().
Check out the function documentation for additional background and
examples.
assign_id() helps you assign a unique numeric identifier to each
subject in your data set. Its purpose is to ensure the same numeric
identifier is consistently assigned to each subject throughout a
project.
Below, we will go through the two main use cases of assign_id():
assigning IDs for the first time, and assigning IDs to match an
existing file.
We will use the example derived PK data set from mrgda. For the
purposes of this vignette, a subject identifier column USUBJID is
created and the ID column is removed.
When creating a new data set where IDs have not been assigned
previously, assign_id() only requires two arguments: the data set to
add the ID column to (.data) and the name of the subject identifier
column (.subject_col).
## ┌ ID Summary ───────────────────────────────────────────┐
## │ │
## │ Number of subjects detected and assigned IDs: 178 │
## │ │
## └───────────────────────────────────────────────────────┘
A summary is shown in the R console, detailing how many unique
subjects were assigned an ID. Notice as well that the only change to
pk_data is one additional column.
## [1] "Size of pk_data 2534 x 28"
## [1] "Size of data_w_id 2534 x 29"
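The "one additional column" check can be reproduced on a toy data set. This is a minimal base R sketch, not the behavior of assign_id() itself: the toy pk_data values are made up, and numbering subjects by order of first appearance is an assumption for illustration only (assign_id() may number subjects differently).

```r
# Toy stand-in for pk_data (hypothetical values, not the mrgda example data)
pk_data <- data.frame(
  USUBJID = c("STUDY-X-1", "STUDY-X-1", "STUDY-X-2", "STUDY-X-3", "STUDY-X-3"),
  DV = c(0.5, 1.2, 0.8, 2.1, 1.7),
  stringsAsFactors = FALSE
)

# Assumed numbering rule: 1, 2, 3, ... by order of first appearance
data_w_id <- pk_data
data_w_id$ID <- match(pk_data$USUBJID, unique(pk_data$USUBJID))

# Same number of rows, exactly one extra column
dim(pk_data)
dim(data_w_id)
setdiff(names(data_w_id), names(pk_data))

# Exactly one ID per subject
length(unique(data_w_id$ID)) == length(unique(data_w_id$USUBJID))
```

Comparing dim() before and after is a cheap way to confirm that an ID step did not drop or duplicate records.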
When data sets are updated or new subsets are created, assign_id()
ensures the ID mapping is consistent throughout. To do this, we can
leverage the .previously_derived_path argument.
First we will create a new data set pk_data_reordered which has the
same contents as pk_data but is arranged in a different order.
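One way such a reordering could be produced is a character sort on the subject column. This is a hedged base R sketch on toy data: the values are hypothetical, and sorting by USUBJID is an assumption about how pk_data_reordered was built, not something the vignette states.

```r
# Toy stand-in for pk_data: 20 subjects, 2 records each (hypothetical values)
pk_data <- data.frame(
  USUBJID = paste0("STUDY-X-", rep(1:20, each = 2)),
  TIME = rep(c(0, 1), times = 20),
  stringsAsFactors = FALSE
)

# Character sort on the subject column. method = "radix" sorts in the C locale,
# so "STUDY-X-10" comes before "STUDY-X-2".
pk_data_reordered <- pk_data[order(pk_data$USUBJID, method = "radix"), ]
rownames(pk_data_reordered) <- NULL

# First few subjects in each ordering
head(unique(pk_data_reordered$USUBJID))
head(unique(pk_data$USUBJID))
```

The rows are identical in both data sets; only the order in which subjects are first encountered changes, which is what makes order-dependent ID numbering fragile.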
## # A tibble: 6 × 1
## USUBJID
## <chr>
## 1 STUDY-X-1
## 2 STUDY-X-10
## 3 STUDY-X-11
## 4 STUDY-X-12
## 5 STUDY-X-13
## 6 STUDY-X-14
## # A tibble: 6 × 1
## USUBJID
## <chr>
## 1 STUDY-X-1
## 2 STUDY-X-2
## 3 STUDY-X-3
## 4 STUDY-X-4
## 5 STUDY-X-5
## 6 STUDY-X-6
However, when we tell assign_id() that the ID assignment should match
that of data_w_id, the mapping is consistent.
data_w_id_reordered <-
  pk_data_reordered %>%
  assign_id(.previously_derived_path = lookup_path, .subject_col = "USUBJID")
## ┌ ID Summary ───────────────────────────────────────────┐
## │ │
## │ Number of subjects detected and assigned IDs: 178 │
## │ │
## └───────────────────────────────────────────────────────┘
Notice that despite the sorting differences, both data_w_id and
data_w_id_reordered have the same ID assignment.
## # A tibble: 6 × 2
## USUBJID ID
## <chr> <dbl>
## 1 STUDY-X-1 1
## 2 STUDY-X-2 2
## 3 STUDY-X-3 3
## 4 STUDY-X-4 4
## 5 STUDY-X-5 5
## 6 STUDY-X-6 6
## # A tibble: 6 × 2
## USUBJID ID
## <chr> <int>
## 1 STUDY-X-1 1
## 2 STUDY-X-2 2
## 3 STUDY-X-3 3
## 4 STUDY-X-4 4
## 5 STUDY-X-5 5
## 6 STUDY-X-6 6
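Comparing the first few rows is a spot check. To compare the full subject-to-ID mapping of two data sets regardless of row order, something like the following base R sketch could be used. The helper same_id_mapping() is hypothetical (it is not part of mrgda), and the toy data frames stand in for data_w_id and data_w_id_reordered.

```r
# Hypothetical helper: TRUE when both data sets map every subject to the same ID
same_id_mapping <- function(a, b, subject_col = "USUBJID", id_col = "ID") {
  # Reduce each data set to its distinct subject/ID pairs
  map_a <- unique(a[, c(subject_col, id_col)])
  map_b <- unique(b[, c(subject_col, id_col)])
  # Join on subject; the ID columns become ID.a and ID.b
  both <- merge(map_a, map_b, by = subject_col, suffixes = c(".a", ".b"))
  # Every subject must appear in both maps, with equal IDs
  nrow(both) == nrow(map_a) &&
    nrow(both) == nrow(map_b) &&
    all(both[[paste0(id_col, ".a")]] == both[[paste0(id_col, ".b")]])
}

# Toy stand-ins (hypothetical values); note one ID column is double, one integer
data_w_id <- data.frame(USUBJID = c("S-1", "S-1", "S-2", "S-3"), ID = c(1, 1, 2, 3))
data_w_id_reordered <- data.frame(USUBJID = c("S-3", "S-2", "S-1", "S-1"), ID = c(3L, 2L, 1L, 1L))
mismatched <- data.frame(USUBJID = c("S-3", "S-2", "S-1"), ID = c(1L, 2L, 3L))

same_id_mapping(data_w_id, data_w_id_reordered)  # TRUE
same_id_mapping(data_w_id, mismatched)           # FALSE
```

The comparison uses == rather than identical() on purpose, so that an ID column stored as double in one data set and integer in the other still counts as a match.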
In the event that there are new subjects in your data that were not
present in the previously derived data set, a new unique ID will be
assigned to them. assign_id() does this by finding the maximum ID
present in the previous data and setting all new IDs higher than it.
An example of this is shown below.
old_data <- dplyr::tibble(USUBJID = c("A", "B", "C", "D"), ID = c(1, 2, 4, 19))
lookup_path <- paste0(tempfile(), ".csv")
mrgda:::write_csv_dots(
  x = old_data,
  file = lookup_path
)
data <- dplyr::tibble(USUBJID = c("E", "F", "G"))
assign_id(.data = data, .subject_col = "USUBJID", .previously_derived_path = lookup_path)
## ┌ ID Summary ─────────────────────────────────────────┐
## │ │
## │ Number of subjects detected and assigned IDs: 3 │
## │ │
## └─────────────────────────────────────────────────────┘
## # A tibble: 3 × 2
## USUBJID ID
## <chr> <dbl>
## 1 E 20
## 2 F 21
## 3 G 22
Notice how the largest ID in the previous data set was 19. Since
subjects E, F and G were not present in old_data, their IDs were all
set to be larger than 19.
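The rule described above (reuse known IDs, number new subjects above the previous maximum) can be sketched in base R. The helper assign_from_lookup() is hypothetical and is not how mrgda implements assign_id(). Numbering new subjects consecutively from max + 1 is an assumption that happens to match the output above.

```r
# Hypothetical helper illustrating the lookup rule; not part of mrgda
assign_from_lookup <- function(subjects, lookup) {
  subjects <- unique(subjects)
  # Reuse the ID of every subject already present in the lookup
  ids <- lookup$ID[match(subjects, lookup$USUBJID)]
  # Assumed rule for unseen subjects: consecutive IDs above the previous maximum
  is_new <- is.na(ids)
  ids[is_new] <- max(lookup$ID) + seq_len(sum(is_new))
  data.frame(USUBJID = subjects, ID = ids, stringsAsFactors = FALSE)
}

old_data <- data.frame(
  USUBJID = c("A", "B", "C", "D"),
  ID = c(1, 2, 4, 19),
  stringsAsFactors = FALSE
)

# All subjects are new: IDs start above 19
assign_from_lookup(c("E", "F", "G"), old_data)

# A mix of known and new subjects: B keeps its ID, E gets a new one
assign_from_lookup(c("B", "E"), old_data)
```

Starting from the maximum rather than filling gaps (such as the unused 3 in old_data) means an ID is never recycled, which keeps earlier outputs unambiguous.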