Introduction

This vignette takes you through some basic scenarios for modeling with Stan using bbr.bayes, introducing you to its standard workflow and functionality.

bbr.bayes is an extension to bbr which focuses on traceable and reproducible modeling using Bayesian methods. Note that you will also need to load bbr to use most bbr.bayes functionality.

This vignette focuses on models defined in Stan. The “Getting Started with bbr.bayes and NONMEM Bayes” vignette focuses on modeling in NONMEM using Bayesian estimation methods.

For details on modeling in NONMEM with non-Bayesian estimation methods, see the “Getting Started with bbr and NONMEM” vignette.

Setup

Installing cmdstan

bbr.bayes uses cmdstan and the R package cmdstanr to run Stan models. Please see the cmdstanr docs for instructions on installing cmdstan. (Note that, when modeling with Stan rather than NONMEM, bbi is not required.)

Creating a model object

The bbr.bayes workflow is organized around the concept of model objects. These are S3 objects which you will pass to bbr and bbr.bayes functions that submit, summarize, or manipulate models. A model object points to a path on disk; the set of files that define a model is derived from this path. For a Stan model, the basic files are:

  • <run>.yml - The bbr YAML file which stores model metadata like a description and tags.

  • <run>/<run>.stan - The Stan file.

  • <run>/<run>-stanargs.R - Contains a named list with all of the arguments that will be passed through to the $sample() method of ?cmdstanr::CmdStanModel.

  • <run>/<run>-standata.R - Contains all necessary R code to read in any source data and transform them to a Stan-ready data object (list).

  • <run>/<run>-init.R - This file contains all necessary R code to create the initial values passed to the cmdstanr’s $sample() method.

The Stan models in bbr reference page provides detailed information about the files necessary for running a Stan model using bbr.

bbr model objects can either be created de novo or copied from an existing model object. We outline the differences between these two approaches below.

Creating a model object de novo

To create a model object de novo, you use the new_model() function.

The new_model() function will look for (and, if not found, create) the files necessary to run a Stan model. The first argument (.path) must be the path to the directory where you want to write these files. For Stan models, you will also have to add .model_type = "stan" to the new_model() call.

mod1 <- new_model(file.path(MODEL_DIR, "mod1"), .model_type = "stan")
#> 
#> The following files, which are necessary to run a `bbi_stan_model`, are missing from mod1:
#>  * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1.stan
#>  * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-standata.R
#>  * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-init.R
#>  * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-stanargs.R
#> See `?add_file_to_model_dir` for helper functions to add them.
#> Automatically adding scaffolded .stan file
#> Automatically adding scaffolded -standata.R file
#> Automatically adding scaffolded -init.R file
#> Automatically adding scaffolded -stanargs.R file

The call above will look for a directory called {MODEL_DIR}/mod1. Inside this directory it will expect to find the files mod1.stan and mod1-standata.R. These “required files” will be discussed below, and you can find even more detail in ?bbr_stan.

For now notice that, if any of them are not found when you initialize the model, new_model() will create scaffolds (files with boilerplate, non-working code) of the missing files, and print a message to the console telling you it has done so.

You can use the helpers described in ?open_stan_file to open any of these scaffolded files for editing.

# Create Stan data
#
# This function must return the list that will be passed to `data` argument
#   of `cmdstanr::sample()`
#
# The `.dir` argument represents the absolute path to the directory containing
#   this file. This is useful for building file paths to the input files you will
#   load. Note: you _don't_ need to pass anything to this argument, you only use
#   it within the function. `bbr` will pass in the correct path when it calls
#   `make_standata()` under the hood.
make_standata <- function(.dir) {
  # read in any input data
  in_data <- readr::read_csv(file.path(.dir, '..', 'my_data.csv'))
  # do any transformations
  # return the list that can be passed to cmdstanr
}

The purpose of this scaffold is to provide a helpful framework to begin writing the necessary code for submitting a model with bbr.bayes. Below (in the copy_model_from() section) you will see how, once you have defined a model, you can use that model as your starting point for new models, instead of relying on writing new code from scaffolds.

If you already have the relevant code for your new model, you can manually create these files in the output directory before calling new_model(). In that case, you won’t see any of the messages shown above and the new_model() call will simply create the .yaml file (described below) next to your directory.

Alternatively, you can create the model with the scaffolds and use helpers like add_stanmod_file() to add your files to the model directory.

Reading in an existing model

Now assume that you have previously created a model and written all the necessary code. You can read this model into memory with read_model().

fxa <- read_model(file.path(MODEL_DIR, "fxa"))

If you take a look at this model’s -standata.R file, you can see an example of how this script might look in a real model.

make_standata <- function(.dir) {
  ## get data file
  xdata <- readr::read_csv(file.path(.dir, "..", "..", "..", "extdata", "fxa.data.csv"))

  ## create data set
  data <- with(
    xdata,
    list(
      nSubjects = max(subject),
      nObs = nrow(xdata),
      subject = subject,
      cObs = cobs,
      fxa = fxa.inh.obs,
      cmin = 0,
      cmax = 1600,
      nsim = 201
    )
  )
}

Creating a model by copy_model_from()

Much of the benefit of bbr is leveraged in the model iteration workflow, and the run log summaries that can be created afterwards.

The copy_model_from() function creates a new model object based on a parent model, copying and renaming the necessary files in the process. The copy_model_from() call below will create an fxaNew directory and fill it as described in the new_model() section above. However, instead of filling it with scaffolds, it will copy (and rename) all of the necessary files from the parent model’s directory into the new directory.

fxaNew <- copy_model_from(
  .parent_mod = fxa,
  .new_model = "fxaNew",
  .inherit_tags = TRUE
)

Note that, while the .path argument in new_model() and read_model() is relative to your working directory, the .new_model argument of copy_model_from() is relative to the directory containing the parent model. This means that, assuming you would like to create the new model in the same directory as its parent, you only have to pass a filename (without extension) for the new model.

You can now open the relevant files and edit them with the desired changes, starting from where you left off with the parent model.

Comparing model files

Imagine that, after opening the fxaNew.stan file above, you decided to change the emax parameter from a uniform to a Cauchy distribution. The model_diff() function makes it easy to compare models to each other. For Stan models, model_diff() defaults to comparing the <run>.stan files.

model_diff(fxa, fxaNew)
@@ 32,5 @@
@@ 32,5 @@
 
 
 
 
 
model{
 
model{
<
emax ~ uniform(0, 100);
>
emax ~ cauchy(0, 100);
 
ec50Hat ~ normal(0, 250);
 
ec50Hat ~ normal(0, 250);
 
gamma ~ normal(0, 5);
 
gamma ~ normal(0, 5);

You can also pass any of c("standata", "init", "stanargs") to have it compare the specified file instead.

# modify the iter_sampling arg and see the difference
set_stanargs(fxaNew, list(iter_sampling = 500))
model_diff(fxa, fxaNew, .file = "stanargs")
@@ 1 @@
@@ 1 @@
<
list(iter_sampling = 100, iter_warmup = 100, seed = 123456)
>
list(iter_sampling = 500, iter_warmup = 100, seed = 123456)
# compare the data files and see that they are identical
model_diff(fxa, fxaNew, .file = "standata")
#> Relevant model files are identical

The bbr YAML file

This new_model() and copy_model_from() calls will create the YAML file associated with the model object (e.g., mod1.yaml). This YAML file stores model metadata like the description, notes, and tags (discussed below). The YAML file from a model created using new_model() will be very bare, while the copy_model_from() call will copy the YAML file from the parent model. The purpose of the YAML file will become more clear when we begin comparing models and generating model run logs.

Submitting a model

For Stan models, model submission wraps the cmdstanr::sample() method.

Passing arguments to $sample()

You can use set_stanargs to control which arguments are passed to cmdstanr::sample().

fxa <- fxa %>%
  set_stanargs(list(seed = 123456,
                    iter_warmup = 300,
                    iter_sampling = 300))

bbr.bayes stores these arguments with the model files, making it easy to reproduce your results later, by ensuring that you can run your model with the same arguments that were previously used. You can use get_stanargs() at any time to inspect the currently set arguments.

get_stanargs(fxa)
#>  iter_sampling = 300
#>  iter_warmup = 300
#>  seed = 123456

Using check_stan_model()

You can check whether all the required files are present at any time by calling the check_stan_model() function. By default this will also check the syntax of your .stan file, so that you can correct any errors before actually submitting the model.

Model submission

To submit a model to be run, you simply pass the model object to submit_model().

fxa_res <- fxa %>%
  submit_model()
#> Compiling Stan program...
#> 
#> Running MCMC with 4 sequential chains...
#> 
#> Chain 1 Iteration:   1 / 600 [  0%]  (Warmup)
#> Chain 1 Iteration: 100 / 600 [ 16%]  (Warmup)
#> Chain 1 Iteration: 200 / 600 [ 33%]  (Warmup)
#> Chain 1 Iteration: 300 / 600 [ 50%]  (Warmup)
#> Chain 1 Iteration: 301 / 600 [ 50%]  (Sampling)
#> Chain 1 Iteration: 400 / 600 [ 66%]  (Sampling)
#> Chain 1 Iteration: 500 / 600 [ 83%]  (Sampling)
#> Chain 1 Iteration: 600 / 600 [100%]  (Sampling)
#> Chain 1 finished in 19.8 seconds.
#> Chain 2 Iteration:   1 / 600 [  0%]  (Warmup)
#> Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
#>   Chain 2 Exception: fxaInc_model_namespace::log_prob: fxaHat[sym1__] is -nan, but must be greater than or equal to 0 (in '/tmp/RtmpoWCmag/model-1e697474b655.stan', line 24, column 2 to column 33)
#> Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
#> Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
#> Chain 2
#> Chain 2 Iteration: 100 / 600 [ 16%]  (Warmup)
#> Chain 2 Iteration: 200 / 600 [ 33%]  (Warmup)
#> Chain 2 Iteration: 300 / 600 [ 50%]  (Warmup)
#> Chain 2 Iteration: 301 / 600 [ 50%]  (Sampling)
#> Chain 2 Iteration: 400 / 600 [ 66%]  (Sampling)
#> Chain 2 Iteration: 500 / 600 [ 83%]  (Sampling)
#> Chain 2 Iteration: 600 / 600 [100%]  (Sampling)
#> Chain 2 finished in 15.8 seconds.
#> Chain 3 Iteration:   1 / 600 [  0%]  (Warmup)
#> Chain 3 Iteration: 100 / 600 [ 16%]  (Warmup)
#> Chain 3 Iteration: 200 / 600 [ 33%]  (Warmup)
#> Chain 3 Iteration: 300 / 600 [ 50%]  (Warmup)
#> Chain 3 Iteration: 301 / 600 [ 50%]  (Sampling)
#> Chain 3 Iteration: 400 / 600 [ 66%]  (Sampling)
#> Chain 3 Iteration: 500 / 600 [ 83%]  (Sampling)
#> Chain 3 Iteration: 600 / 600 [100%]  (Sampling)
#> Chain 3 finished in 14.5 seconds.
#> Chain 4 Iteration:   1 / 600 [  0%]  (Warmup)
#> Chain 4 Iteration: 100 / 600 [ 16%]  (Warmup)
#> Chain 4 Iteration: 200 / 600 [ 33%]  (Warmup)
#> Chain 4 Iteration: 300 / 600 [ 50%]  (Warmup)
#> Chain 4 Iteration: 301 / 600 [ 50%]  (Sampling)
#> Chain 4 Iteration: 400 / 600 [ 66%]  (Sampling)
#> Chain 4 Iteration: 500 / 600 [ 83%]  (Sampling)
#> Chain 4 Iteration: 600 / 600 [100%]  (Sampling)
#> Chain 4 finished in 13.7 seconds.
#> 
#> All 4 chains finished successfully.
#> Mean chain execution time: 15.9 seconds.
#> Total execution time: 64.4 seconds.

As you can see, this also prints all of the diagnostic output from $sample() as it runs.

Overwriting output from a previously run model

It is common to run a model, make some tweaks to it, and then run it again. However, to avoid accidentally deleting model outputs, bbr will error by default if it sees existing output when trying to submit a model. To automatically overwrite any previous model output, just pass .overwrite = TRUE to submit_model().

Summarizing model output

The submit_model() call will return the fit model object from cmdstanr. You can read all about this object in the ?cmdstanr::CmdStanMCMC object docs. It has several methods for diagnostics and model summaries.

diagnostics from cmdstanr

# this is just diagnostics
fxa_res$sampler_diagnostics() %>% str()
#>  'draws_array' num [1:100, 1:4, 1:6] 7 7 7 7 7 7 7 7 7 7 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ iteration: chr [1:100] "1" "2" "3" "4" ...
#>   ..$ chain    : chr [1:4] "1" "2" "3" "4"
#>   ..$ variable : chr [1:6] "treedepth__" "divergent__" "energy__" "accept_stat__" ...
# this pulls in the actual posteriors
.s <- fxa_res$summary()
print(dim(.s))
#> [1] 7731   10
.s %>%
  slice_head(n = 5) %>%
  knitr::kable()
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -3680.2814083 -3680.0200000 8.6471405 8.6064930 -3694.8520000 -3666.4655000 1.0105079 303.0671 486.5886
emax 98.6297981 98.7957500 0.9983639 0.9986052 96.6787050 99.8746150 0.9999083 668.7926 708.7078
ec50Hat 98.1662662 98.1924000 3.8890264 3.9245905 91.8921650 104.6708500 1.0031086 595.0784 777.1432
gamma 0.9891423 0.9868940 0.0252873 0.0243814 0.9514651 1.0339120 1.0062567 781.3768 805.9938
sigma 10.4642216 10.4564500 0.2041405 0.2097138 10.1424650 10.7990600 1.0014663 2031.5236 988.9422
omegaEc50 0.2095165 0.2087590 0.0281342 0.0291323 0.1645385 0.2579202 1.0035262 535.3836 757.2863
eta[1] -0.6250711 -0.6201160 0.7675739 0.7734544 -1.8850430 0.6527469 1.0029982 1867.7706 907.5572
eta[2] -0.1153561 -0.0963942 0.8554481 0.8512741 -1.4946310 1.3240880 1.0055980 1743.3984 933.1333
eta[3] -0.8548926 -0.8518780 0.8032995 0.8149496 -2.1302315 0.4580917 1.0059036 2280.0438 787.9234
eta[4] -0.2127920 -0.2252210 0.8624235 0.8322746 -1.5880420 1.2248895 1.0045113 1825.3802 758.5875
eta[5] -0.0151050 -0.0544589 0.9292352 0.8987395 -1.4938490 1.5579505 1.0063809 1729.7763 719.3638
eta[6] -0.1180454 -0.1490070 0.8218678 0.8478478 -1.4854690 1.2749045 1.0014590 2242.0783 778.1252
eta[7] 0.2474624 0.2238200 0.7733967 0.7866735 -0.9878888 1.5436155 0.9996281 1736.8845 878.6668
eta[8] 0.0880659 0.0778770 0.7961751 0.7591082 -1.1910545 1.3971225 1.0049626 1689.6422 857.3040
eta[9] 0.0322074 0.0314970 0.5920293 0.5559127 -0.9368224 1.0589770 1.0042437 1411.5189 981.4068

posterior::draws methods

Many cmdstanr users will also be familiar with the posterior package and its draws objects and methods. bbr.bayes has also implemented the as_draws() method and its variants, which can be used on either a model object or the ?cmdstanr::CmdStanMCMC object.

d_array <- as_draws(fxa)
print(dim(d_array))
#> [1]  300   4   7731
d_df <- as_draws_df(fxa)
print(dim(d_df))
#> [1] 1200 7734

Using read_fit_model()

The submit_model() call automatically saves this fit object to disk as well. You can pass either a path or model object to read_fit_model() and it will re-load the ?cmdstanr::CmdStanMCMC object into memory.

new_res <- read_fit_model(file.path(MODEL_DIR, "fxa"))
new_res$sampler_diagnostics() %>% str()
#>  'draws_array' num [1:100, 1:4, 1:6] 7 7 7 7 7 7 7 7 7 7 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ iteration: chr [1:100] "1" "2" "3" "4" ...
#>   ..$ chain    : chr [1:4] "1" "2" "3" "4"
#>   ..$ variable : chr [1:6] "treedepth__" "divergent__" "energy__" "accept_stat__" ...

model_summary() function

We have not yet implemented model_summary() for bbi_stan_model objects. This is a work in progress, but the ?cmdstanr::CmdStanMCMC returned by read_fit_model() has various methods for summarizing the outputs.

Model annotation

Another useful feature of bbr is that it allows you to easily annotate and organize your models, for increased traceability and reproducibilty of your analysis. We demonstrate this a bit in this section, but it is described in more detail in both the “Model Management” page of MeRGE Expo 1 and the “Getting Started with bbr vignette.

Tags

Our first model already had two tags attached to it.

fxa$tags
#> [1] "fxaInhibit1Ncp" "unif emax"

Note in the call above that we told copy_model_from() to inherit the tags from the parent model and automatically add them to the new model. You can use replace_tag() (or any of the other helpers in ?modify_tags) to modify these at any time.

fxaNew <- fxaNew %>%
        replace_tag("unif emax", "cauchy emax")

print(fxaNew$tags)
#> [1] "fxaInhibit1Ncp" "cauchy emax"

You can also use tags_diff() to quickly compare two models.

tags_diff(fxa, fxaNew)
#> In fxa but not fxaNew:   unif emax
#> In fxaNew but not fxa:   cauchy emax

Notes

In addition to tags, you can add notes that can be referenced later. While tags are intended to be structured and consistent, such that you can filter or organize models based on them (shown below), notes can be more “free form” narrative comments.

fxaNew <- fxaNew %>%
  add_notes("Changed emax from uniform to cauchy distribution")

While both annotations can be easily accessed from the model object with fxaNew$tags or fxaNew$notes, notice that they are also displayed, along with other information, when you print the model object in your console

fxaNew
#> 
#> ── Status ──────────────────────────────────────────────────────────────────────
#> • Not Run
#> 
#> ── Absolute Model Path ─────────────────────────────────────────────────────────
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew
#> 
#> ── YAML & Model Files ──────────────────────────────────────────────────────────
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew.yaml
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew.stan
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew-standata.R
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew-init.R
#> 
#> ── Tags ────────────────────────────────────────────────────────────────────────
#> • fxaInhibit1Ncp
#> • cauchy emax
#> 
#> ── Notes ───────────────────────────────────────────────────────────────────────
#> • 1: Changed emax from uniform to cauchy distribution

Aside on reassignment

Importantly, you must reassign to the model object when calling any of these functions that modify it (notice the fxaNew <- fxaNew ... above). You can see a list of similar functions in the “See Also” section of ?modify_tags or ?modify_notes.

Run log

At any point, you can easily construct a “run log” tibble to summarize all models run up to this point. You can see much more detail about using run logs in both the “Model Summary” page of MeRGE Expo 1 and the “Getting Started with bbr vignette.

The run_log() returns a tibble which can be manipulated like any other tibble. You can also use the .include argument to filter to only specific tags or run names, and use formatting functions like collapse_to_string(), knitr::kable(), or many of the functions in pmtables.

log_df <- run_log(MODEL_DIR, .include = c("fxa", "fxaNew"))

log_df %>%
  collapse_to_string(tags, notes) %>%
  select(run, tags, notes) %>%
  knitr::kable()
run tags notes
fxa fxaInhibit1Ncp, unif emax NA
fxaNew fxaInhibit1Ncp, cauchy emax Changed emax from uniform to cauchy distribution

Summary log

It is often useful to add some basic diagnostic information to a run log table. For Stan models, this can be done with stan_summary_log() or stan_add_summary(). In addition to some sampling diagnostics and metadata, this function will summarize the specified variables.

log_df <- log_df %>%
  stan_add_summary(variables = "emax")
names(log_df)
 [1] "absolute_model_path" "run"                 "yaml_md5"           
 [4] "model_type"          "description"         "bbi_args"           
 [7] "based_on"            "tags"                "notes"              
[10] "star"                "method"              "stan_version"       
[13] "nchains"             "iter_warmup"         "iter_sampling"      
[16] "num_divergent"       "num_max_treedepth"   "bad_ebfmi"          
[19] "emax_mean"           "emax_median"         "emax_q5"            
[22] "emax_q95"            "emax_rhat"           "emax_ess_bulk"      
[25] "emax_ess_tail"      

There are default functions used for the summarization, which are shown below. You can also pass your own summarization functions via the summary_fns argument. See ?stan_summary_log for more details.

log_df %>%
  select(run, num_divergent, starts_with("emax")) %>%
  knitr::kable()
run num_divergent emax_mean emax_median emax_q5 emax_q95 emax_rhat emax_ess_bulk emax_ess_tail
fxa 1 98.59362 98.85785 96.74060 99.87313 1.009106 410.7121 700.1974
fxaNew 7 98.65914 98.89980 96.62748 99.90333 1.007200 901.7857 724.6135