7 Make factors – The yspec Book

yspec lets you turn discrete data items into factors with nicely-formed names using the ys_factors() function. This is my most-used function in the yspec workflow so I wanted to make a separate page to highlight this functionality.

7.1 Example spec

Here’s a very simple data specification file with two variables, WT and FORM:

# inline/factors-1.yml
 
WT: 
  short: weight
FORM:
  short: formulation
  values: [3, 2, 1]
  decode: [tablet, capsule, troche]

FORM is formulation, a discrete data item. The values in the data set are either 1, 2, or 3 and these decode to troche, capsule and tablet, respectively.

We can confirm the coding after reading this into our R session

spec <- ys_load("inline/factors-1.yml")
spec$FORM

 name  value      
 col   FORM       
 type  numeric    
 short formulation
 value 3 : tablet 
       2 : capsule
       1 : troche

7.2 Example data

We also have a data set that goes along with this data specification object
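The data set itself is not displayed on this page. A minimal reconstruction, assuming the five rows printed by head(data) in the next section are the complete data set, might look like this (the data.frame() call is my reconstruction, not the original code):

```r
# Reconstructed from the head(data) and count() output in the next section;
# assumes those five rows are the whole data set
data <- data.frame(
  WT   = c(61.1, 50.3, 80.2, 91.8, 70.0),
  FORM = c(1, 3, 2, 1, 1)
)
```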

7.3 Add factors to a data set

Working with this data set, I’d like to be able to see the decodes that I specified for FORM in addition to the numbers. We can use the spec object to add the factors to the data frame using ys_factors()

data <- ys_factors(data, spec)

Now FORM is a factor, and we have one additional column, FORM_v, which holds the original values of FORM:

head(data)

    WT    FORM FORM_v
1 61.1  troche      1
2 50.3  tablet      3
3 80.2 capsule      2
4 91.8  troche      1
5 70.0  troche      1

count(data, FORM, FORM_v)

     FORM FORM_v n
1  tablet      3 1
2 capsule      2 1
3  troche      1 3

7.4 Looking for factors

Here, yspec looked through the spec object for data items that could be turned into factors. It found FORM because FORM had the values field populated (we listed every possible value for FORM there). yspec turned that column into a factor and saved the original values in a new column with the _v suffix.
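To make the idea concrete, here is a rough base R sketch of the kind of conversion this amounts to. It is an illustration only, not how yspec is implemented; the values and decode vectors are copied from the spec above, and the small data frame df is a hypothetical stand-in.

```r
# Values and decodes as listed for FORM in the spec
values <- c(3, 2, 1)
decode <- c("tablet", "capsule", "troche")

# Hypothetical stand-in data with FORM coded as numbers
df <- data.frame(FORM = c(1, 3, 2, 1, 1))

# Keep the original values under the _v suffix,
# then replace FORM with a factor labeled by the decodes
df$FORM_v <- df$FORM
df$FORM   <- factor(df$FORM_v, levels = values, labels = decode)

df
```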

7.5 Factor ordering

Notice that we coded FORM to be 3, 2, or 1, in that order. When ys_factors() turns FORM into a factor, it respects the order in which you listed the values in the spec. This is very important when you need to control, for example, the order in which data appear in plots or tables:

library(ggplot2)

ggplot(data = data, aes(x = FORM)) + geom_bar() + theme_bw()
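One way to check the ordering without a plot is to look at the factor levels. The sketch below uses plain factor() rather than yspec, with the values and decodes from the spec above, to contrast the spec order with a sorted order; the codes vector is a hypothetical stand-in for the FORM column.

```r
# Hypothetical stand-in for the numeric FORM column
codes <- c(1, 3, 2, 1, 1)

# Levels given in spec order (3, 2, 1)
spec_order <- factor(codes, levels = c(3, 2, 1),
                     labels = c("tablet", "capsule", "troche"))
levels(spec_order)

# Levels given in sorted order (1, 2, 3)
sorted_order <- factor(codes, levels = c(1, 2, 3),
                       labels = c("troche", "capsule", "tablet"))
levels(sorted_order)
```

Because geom_bar() draws a discrete axis in level order, the first version puts tablet first and troche last.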

7.6 Forcing factors

By default, any column that yspec finds with the values field filled in will be turned into a factor. You can also force a data item to be turned into a factor by setting the make_factor field to true. For example:

# inline/factors-2.yml
 
foo: 
  short: just for illustration
  make_factor: true

spec <- ys_load("inline/factors-2.yml")

Given a data set with a numeric foo column, we can still add a factor even though the values field was not populated:

data <- ys_add_factors(data, spec)

str(data)

'data.frame':   4 obs. of  2 variables:
 $ foo  : num  11 99 120 5
 $ foo_f: Factor w/ 4 levels "5","11","99",..: 2 3 4 1
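The data set for this example is not displayed. Judging from the str() output, it has a single numeric column foo with four values, and the levels of foo_f appear to be the sorted unique values. A base R sketch that reproduces that structure (my reconstruction with a hypothetical df, not yspec's code):

```r
# Reconstructed from the str() output above
df <- data.frame(foo = c(11, 99, 120, 5))

# With no values listed in the spec, the levels here are
# simply the sorted unique values of the column
df$foo_f <- factor(df$foo)

str(df)
```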

7.7 Selectively adding factors

By default, yspec will add every factor that it can find. You can override this behavior by specifying the columns that you want processed to factors.

This example yspec file and data set come with yspec

# ?ys_help
data <- ys_help$data()
spec <- ys_help$spec()

We can pull just the discrete columns

ys_filter(spec, discrete)

 name  info unit short                         source       
 C     cd-  .    comment character             ysdb_internal
 SEQ   -d-  .    SEQ                           .            
 EVID  -d-  .    event ID                      ysdb_internal
 CP    -d-  .    Child-Pugh score              look         
 MDV   -d-  .    MDV                           ysdb_internal
 BLQ   -d-  .    below limit of quantification .            
 PHASE ---  .    study phase indicator         .            
 STUDY -d-  .    study number                  .            
 RF    cd-  .    renal function stage          .

Let’s only make factors for EVID and BLQ

data <- ys_factors(data, spec, EVID, BLQ)

head(data, n = 3)

   C NUM ID SUBJ TIME SEQ CMT        EVID AMT     DV   AGE    WT   CRCL ALB
1 NA   1  1    1 0.00   0   1        dose   5  0.000 28.03 55.16 114.45 4.4
2 NA   2  1    1 0.61   1   2 observation  NA 61.005 28.03 55.16 114.45 4.4
3 NA   3  1    1 1.15   1   2 observation  NA 90.976 28.03 55.16 114.45 4.4
    BMI    AAG  SCR   AST   ALT     HT CP TAFD  TAD LDOS MDV      BLQ PHASE
1 21.67 106.36 1.14 11.88 12.66 159.55  0 0.00 0.00    5   1 above QL     1
2 21.67 106.36 1.14 11.88 12.66 159.55  0 0.61 0.61    5   0 above QL     1
3 21.67 106.36 1.14 11.88 12.66 159.55  0 1.15 1.15    5   0 above QL     1
  STUDY   RF EVID_v BLQ_v
1     1 norm      1     0
2     1 norm      0     0
3     1 norm      0     0

7.8 Alternative interface

If you want to retain the data columns as they are in the data but add the factors as new columns, you can use ys_add_factors(). For example:

ys_add_factors(data, spec)

    WT FORM  FORM_f
1 61.1    1  troche
2 50.3    3  tablet
3 80.2    2 capsule
4 91.8    1  troche
5 70.0    1  troche

This did essentially the same thing as ys_factors(): we have both factor and value versions of FORM in the data set, but now the factor has the _f suffix (FORM_f) and the values are still under FORM.

Prior to ys_factors() we always used ys_add_factors(), but frequently wanted the original variable name to be the factor version. So either function will get you the factor, but we think ys_factors() is closer to what we want most of the time.