7  Make factors

yspec lets you turn discrete data items into factors with nicely-formed names using the ys_add_factors() function. This is my most-used function in the yspec workflow so I wanted to make a separate page to highlight this functionality.

7.1 Example spec

Here’s a very simple example of a data specification file with two variables, WT and FORM

# inline/factors-1.yml
 
WT: 
  short: weight
FORM:
  short: formulation
  values: [3, 2, 1]
  decode: [tablet, capsule, troche]

FORM is formulation, a discrete data item. The values in the data set are either 1, 2, or 3 and these decode to troche, capsule and tablet, respectively.

We can confirm the coding after reading this into our R session

spec <- ys_load("inline/factors-1.yml")
spec$FORM
 name  value      
 col   FORM       
 type  numeric    
 short formulation
 value 3 : tablet 
       2 : capsule
       1 : troche 

7.2 Example data

We also have a data set that goes along with this data specification object

7.3 Add factors to a data set

Working with this data set, I’d like to be able to see the decodes that I specified for FORM in addition to the numbers. We can use the spec object to add the factors to the data frame using ys_add_factors()

data <- ys_add_factors(data, spec)

Now we have an additional column

head(data)
    WT FORM  FORM_f
1 61.1    1  troche
2 50.3    3  tablet
3 80.2    2 capsule
4 91.8    1  troche
5 70.0    1  troche

which includes a factor version of FORM

count(data, FORM_f, FORM)
   FORM_f FORM n
1  tablet    3 1
2 capsule    2 1
3  troche    1 3

7.4 Looking for factors

Here, yspec looked through the spec object for data items that could be turned into a factor. It found FORM because FORM had the values field populated (we listed every possible value for FORM there). yspec turned that into a factor with a new name derived from the original name but with a _f suffix.

See the .suffix argument to customize how the new column name is formed.
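The lookup described above can be sketched in base R. This is only an illustration of the idea, not yspec's implementation: `spec_sketch` is a plain list standing in for a spec object, and `add_factors_sketch()` and the `toy` data frame are hypothetical names made up for this example.

```r
# A plain list standing in for a yspec object; this structure is an
# assumption made for illustration, not yspec's internal representation.
spec_sketch <- list(
  WT   = list(short = "weight"),
  FORM = list(
    short  = "formulation",
    values = c(3, 2, 1),
    decode = c("tablet", "capsule", "troche")
  )
)

# Hypothetical helper: add a factor column for every spec entry that
# lists its values, naming the new column <name><suffix>.
add_factors_sketch <- function(data, spec, suffix = "_f") {
  for (col in names(spec)) {
    item <- spec[[col]]
    # Skip items without a values field or without a matching data column
    if (is.null(item$values) || !col %in% names(data)) next
    # Fall back to the values themselves when no decode is given
    labels <- if (is.null(item$decode)) as.character(item$values) else item$decode
    data[[paste0(col, suffix)]] <- factor(
      data[[col]],
      levels = item$values,
      labels = labels
    )
  }
  data
}

# Toy data, not the data set used elsewhere on this page
toy <- data.frame(WT = c(61.1, 50.3, 80.2), FORM = c(1, 3, 2))
out <- add_factors_sketch(toy, spec_sketch)
names(out)
```

Only FORM gains a companion column here because WT has no values listed. Passing a different `suffix` to the helper changes the new column's name in the same way.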

7.5 Factor ordering

Notice that we coded FORM to be 3, 2 or 1, in that order. When ys_add_factors() creates FORM_f from FORM, it respects the order in which you listed the values, so the factor levels are tablet, capsule, troche. This is very important when you need control over, for example, the order in which data appear in plots or tables

library(ggplot2)

ggplot(data = data, aes(x = FORM_f)) + geom_bar() + theme_bw()
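To see why the level order matters, here is a small base R comparison. It is a sketch that builds the factor by hand rather than through yspec, using the FORM values shown in the head(data) output earlier; the object names are made up for this example.

```r
# FORM values as they appear in the head(data) output
form <- c(1, 3, 2, 1, 1)

# Levels follow the order given in the spec: 3, 2, 1
spec_order <- factor(
  form,
  levels = c(3, 2, 1),
  labels = c("tablet", "capsule", "troche")
)

# Without explicit levels, base R sorts the unique values: 1, 2, 3
default_order <- factor(form)

levels(spec_order)
levels(default_order)

# Tables (and plot axes) follow the level order, not the data order
table(spec_order)
```

With the spec ordering, tablet comes first in the table and would be the first bar in a plot, even though troche is the most common value.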

7.6 Forcing factors

Any column that has the values field filled in will be turned into a factor. You can also force a data item to be turned into a factor, even when values is not listed, by setting the make_factor field to true. For example

# inline/factors-2.yml
 
foo: 
  short: just for illustration
  make_factor: true

spec <- ys_load("inline/factors-2.yml")

From this data set

  foo
1  11
2  99
3 120
4   5

we can also add factors even though the values field was not populated

data <- ys_add_factors(data, spec)
str(data)
'data.frame':   4 obs. of  2 variables:
 $ foo  : num  11 99 120 5
 $ foo_f: Factor w/ 4 levels "5","11","99",..: 2 3 4 1
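The levels in this output are what base R produces when a numeric vector is passed to factor() with no explicit levels: the unique values, sorted. The snippet below reproduces that with base R alone. It is a sketch of the observed behavior, assuming nothing about how yspec builds the factor internally.

```r
# The foo values from the data set above
foo <- c(11, 99, 120, 5)

# With no levels supplied, factor() sorts the unique values numerically
foo_f <- factor(foo)

levels(foo_f)
as.integer(foo_f)
```

Because the sort is numeric, 5 comes before 11 rather than after 120 as it would in a character sort.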

7.7 Selectively adding factors

By default, yspec will add every factor that it can find. You can override this behavior by naming the columns that you want converted to factors.

This example yspec file and data set come with yspec

# ?ys_help
data <- ys_help$data()
spec <- ys_help$spec()

We can pull just the discrete columns

ys_filter(spec, discrete)
 name  info unit short                         source       
 C     cd-  .    comment character             ysdb_internal
 SEQ   -d-  .    SEQ                           .            
 EVID  -d-  .    event ID                      ysdb_internal
 CP    -d-  .    Child-Pugh score              look         
 MDV   -d-  .    MDV                           ysdb_internal
 BLQ   -d-  .    below limit of quantification .            
 PHASE ---  .    study phase indicator         .            
 STUDY -d-  .    study number                  .            
 RF    cd-  .    renal function stage          .            

Let’s only make factors for EVID and BLQ

data <- ys_add_factors(data, spec, EVID, BLQ)

head(data, n = 3)
   C NUM ID SUBJ TIME SEQ CMT EVID AMT     DV   AGE    WT   CRCL ALB   BMI
1 NA   1  1    1 0.00   0   1    1   5  0.000 28.03 55.16 114.45 4.4 21.67
2 NA   2  1    1 0.61   1   2    0  NA 61.005 28.03 55.16 114.45 4.4 21.67
3 NA   3  1    1 1.15   1   2    0  NA 90.976 28.03 55.16 114.45 4.4 21.67
     AAG  SCR   AST   ALT     HT CP TAFD  TAD LDOS MDV BLQ PHASE STUDY   RF
1 106.36 1.14 11.88 12.66 159.55  0 0.00 0.00    5   1   0     1     1 norm
2 106.36 1.14 11.88 12.66 159.55  0 0.61 0.61    5   0   0     1     1 norm
3 106.36 1.14 11.88 12.66 159.55  0 1.15 1.15    5   0   0     1     1 norm
       EVID_f    BLQ_f
1        dose above QL
2 observation above QL
3 observation above QL