vignettes/getting-started-stan.Rmd
getting-started-stan.Rmd
This vignette takes you through some basic scenarios for modeling
with Stan using bbr.bayes
, introducing you to its standard
workflow and functionality.
bbr.bayes
is an extension to bbr
which focuses on traceable and reproducible modeling using Bayesian
methods. Note that you will also need to load bbr
to use
most bbr.bayes
functionality.
This vignette focuses on models defined in Stan. The “Getting Started with bbr.bayes and NONMEM Bayes” vignette focuses on modeling in NONMEM using Bayesian estimation methods.
For details on modeling in NONMEM with non-Bayesian estimation methods, see the “Getting Started with bbr and NONMEM” vignette.
bbr.bayes
uses cmdstan and the R package
cmdstanr
to run Stan models. Please see the
cmdstanr docs for instructions on installing cmdstan. (Note that,
when modeling with Stan rather than NONMEM, bbi
is not required.)
The bbr.bayes
workflow is organized around the concept
of model objects. These are S3 objects which you will
pass to bbr
and bbr.bayes
functions that
submit, summarize, or manipulate models. A model object points to a path
on disk; the set of files that define a model is derived from this path.
For a Stan model, the basic files are:
<run>.yml
- The bbr
YAML file
which stores model metadata like a description and tags.
<run>/<run>.stan
- The Stan
file.
<run>/<run>-stanargs.R
- Contains a
named list with all of the arguments that will be passed through to the
$sample()
method of ?cmdstanr::CmdStanModel
.
<run>/<run>-standata.R
- Contains all
necessary R code to read in any source data and transform them to a
Stan-ready data object (list).
<run>/<run>-init.R
- This file contains
all necessary R code to create the initial values passed to the
cmdstanr’s $sample()
method.
The Stan
models in bbr
reference page provides detailed
information about the files necessary for running a Stan model using
bbr
.
bbr
model objects can either be created de novo
or copied from an existing model object. We outline the differences
between these two approaches below.
To create a model object de novo, you use the
new_model()
function.
The new_model()
function will look for (and, if not
found, create) the files necessary to run a Stan model. The first
argument (.path
) must be the path to the directory where
you want to write these files. For Stan models, you will also have to
add .model_type = "stan"
to the new_model()
call.
mod1 <- new_model(file.path(MODEL_DIR, "mod1"), .model_type = "stan")
#>
#> The following files, which are necessary to run a `bbi_stan_model`, are missing from mod1:
#> * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1.stan
#> * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-standata.R
#> * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-init.R
#> * /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/mod1/mod1-stanargs.R
#> See `?add_file_to_model_dir` for helper functions to add them.
#> Automatically adding scaffolded .stan file
#> Automatically adding scaffolded -standata.R file
#> Automatically adding scaffolded -init.R file
#> Automatically adding scaffolded -stanargs.R file
The call above will look for a directory called
{MODEL_DIR}/mod1
. Inside this directory it will
expect to find the files mod1.stan
and
mod1-standata.R
. These “required files” will be discussed
below, and you can find even more detail in ?bbr_stan
.
For now notice that, if any of them are not found when you
initialize the model, new_model()
will create scaffolds
(files with boilerplate, non-working code) of the missing files, and
print a message to the console telling you it has done so.
You can use the helpers described in ?open_stan_file
to
open any of these scaffolded files for editing.
open_standata_file(mod1)
# Create Stan data
#
# This function must return the list that will be passed to `data` argument
# of `cmdstanr::sample()`
#
# The `.dir` argument represents the absolute path to the directory containing
# this file. This is useful for building file paths to the input files you will
# load. Note: you _don't_ need to pass anything to this argument, you only use
# it within the function. `bbr` will pass in the correct path when it calls
# `make_standata()` under the hood.
make_standata <- function(.dir) {
# read in any input data
in_data <- readr::read_csv(file.path(.dir, '..', 'my_data.csv'))
# do any transformations
# return the list that can be passed to cmdstanr
}
The purpose of this scaffold is to provide a helpful framework to
begin writing the necessary code for submitting a model with
bbr.bayes
. Below (in the copy_model_from()
section) you will see how, once you have defined a model, you can use
that model as your starting point for new models, instead of relying on
writing new code from scaffolds.
If you already have the relevant code for your new model, you can
manually create these files in the output directory before
calling new_model()
. In that case, you won’t see any of the
messages shown above and the new_model()
call will simply
create the .yaml
file (described below) next to your
directory.
Alternatively, you can create the model with the scaffolds and use
helpers like add_stanmod_file()
to add your files to the
model directory.
Now assume that you have previously created a model and written all
the necessary code. You can read this model into memory with
read_model()
.
fxa <- read_model(file.path(MODEL_DIR, "fxa"))
If you take a look at this model’s -standata.R
file, you can see an example of how this script might look in a real
model.
open_standata_file(fxa)
make_standata <- function(.dir) {
## get data file
xdata <- readr::read_csv(file.path(.dir, "..", "..", "..", "extdata", "fxa.data.csv"))
## create data set
data <- with(
xdata,
list(
nSubjects = max(subject),
nObs = nrow(xdata),
subject = subject,
cObs = cobs,
fxa = fxa.inh.obs,
cmin = 0,
cmax = 1600,
nsim = 201
)
)
}
Much of the benefit of bbr
is leveraged in the model
iteration workflow, and the run log summaries that can be created
afterwards.
The copy_model_from()
function creates a new model
object based on a parent model, copying and renaming the necessary files
in the process. The copy_model_from()
call below will
create an fxaNew
directory and fill it as described in the
new_model()
section above. However, instead of
filling it with scaffolds, it will copy (and rename) all of the
necessary files from the parent model’s directory into the new
directory.
fxaNew <- copy_model_from(
.parent_mod = fxa,
.new_model = "fxaNew",
.inherit_tags = TRUE
)
Note that, while the
.path
argument innew_model()
andread_model()
is relative to your working directory, the.new_model
argument ofcopy_model_from()
is relative to the directory containing the parent model. This means that, assuming you would like to create the new model in the same directory as its parent, you only have to pass a filename (without extension) for the new model.
You can now open the relevant files and edit them with the desired changes, starting from where you left off with the parent model.
open_stanmod_file(fxaNew)
Imagine that, after opening the fxaNew.stan
file above,
you decided to change the emax
parameter from a uniform to
a Cauchy distribution. The model_diff()
function makes it
easy to compare models to each other. For Stan models,
model_diff()
defaults to comparing the
<run>.stan
files.
model_diff(fxa, fxaNew)
@@ 32,5 @@@@ 32,5 @@model{model{<emax ~ uniform(0, 100);>emax ~ cauchy(0, 100);ec50Hat ~ normal(0, 250);ec50Hat ~ normal(0, 250);gamma ~ normal(0, 5);gamma ~ normal(0, 5);
You can also pass any of
c("standata", "init", "stanargs")
to have it compare the
specified file instead.
# modify the iter_sampling arg and see the difference
set_stanargs(fxaNew, list(iter_sampling = 500))
model_diff(fxa, fxaNew, .file = "stanargs")
@@ 1 @@@@ 1 @@<list(iter_sampling = 100, iter_warmup = 100, seed = 123456)>list(iter_sampling = 500, iter_warmup = 100, seed = 123456)
# compare the data files and see that they are identical
model_diff(fxa, fxaNew, .file = "standata")
#> Relevant model files are identical
bbr
YAML file
This new_model()
and copy_model_from()
calls will create the YAML file associated with the model object (e.g.,
mod1.yaml
). This YAML file stores model metadata like the
description, notes, and tags (discussed below). The YAML file from a
model created using new_model()
will be very bare, while
the copy_model_from()
call will copy the YAML file from the
parent model. The purpose of the YAML file will become more clear when
we begin comparing models and generating model run logs.
For Stan models, model submission wraps the cmdstanr::sample() method.
$sample()
You can use set_stanargs
to control which arguments are
passed to cmdstanr::sample()
.
fxa <- fxa %>%
set_stanargs(list(seed = 123456,
iter_warmup = 300,
iter_sampling = 300))
bbr.bayes
stores these arguments with the model files,
making it easy to reproduce your results later, by ensuring that you can
run your model with the same arguments that were previously used. You
can use get_stanargs()
at any time to inspect the currently
set arguments.
get_stanargs(fxa)
#> • iter_sampling = 300
#> • iter_warmup = 300
#> • seed = 123456
check_stan_model()
You can check whether all the required files are present at any time
by calling the check_stan_model()
function. By default this
will also check the syntax of your .stan
file, so that you
can correct any errors before actually submitting the model.
check_stan_model(fxa)
To submit a model to be run, you simply pass the model object to submit_model()
.
fxa_res <- fxa %>%
submit_model()
#> Compiling Stan program...
#>
#> Running MCMC with 4 sequential chains...
#>
#> Chain 1 Iteration: 1 / 600 [ 0%] (Warmup)
#> Chain 1 Iteration: 100 / 600 [ 16%] (Warmup)
#> Chain 1 Iteration: 200 / 600 [ 33%] (Warmup)
#> Chain 1 Iteration: 300 / 600 [ 50%] (Warmup)
#> Chain 1 Iteration: 301 / 600 [ 50%] (Sampling)
#> Chain 1 Iteration: 400 / 600 [ 66%] (Sampling)
#> Chain 1 Iteration: 500 / 600 [ 83%] (Sampling)
#> Chain 1 Iteration: 600 / 600 [100%] (Sampling)
#> Chain 1 finished in 19.8 seconds.
#> Chain 2 Iteration: 1 / 600 [ 0%] (Warmup)
#> Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
#> Chain 2 Exception: fxaInc_model_namespace::log_prob: fxaHat[sym1__] is -nan, but must be greater than or equal to 0 (in '/tmp/RtmpoWCmag/model-1e697474b655.stan', line 24, column 2 to column 33)
#> Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
#> Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
#> Chain 2
#> Chain 2 Iteration: 100 / 600 [ 16%] (Warmup)
#> Chain 2 Iteration: 200 / 600 [ 33%] (Warmup)
#> Chain 2 Iteration: 300 / 600 [ 50%] (Warmup)
#> Chain 2 Iteration: 301 / 600 [ 50%] (Sampling)
#> Chain 2 Iteration: 400 / 600 [ 66%] (Sampling)
#> Chain 2 Iteration: 500 / 600 [ 83%] (Sampling)
#> Chain 2 Iteration: 600 / 600 [100%] (Sampling)
#> Chain 2 finished in 15.8 seconds.
#> Chain 3 Iteration: 1 / 600 [ 0%] (Warmup)
#> Chain 3 Iteration: 100 / 600 [ 16%] (Warmup)
#> Chain 3 Iteration: 200 / 600 [ 33%] (Warmup)
#> Chain 3 Iteration: 300 / 600 [ 50%] (Warmup)
#> Chain 3 Iteration: 301 / 600 [ 50%] (Sampling)
#> Chain 3 Iteration: 400 / 600 [ 66%] (Sampling)
#> Chain 3 Iteration: 500 / 600 [ 83%] (Sampling)
#> Chain 3 Iteration: 600 / 600 [100%] (Sampling)
#> Chain 3 finished in 14.5 seconds.
#> Chain 4 Iteration: 1 / 600 [ 0%] (Warmup)
#> Chain 4 Iteration: 100 / 600 [ 16%] (Warmup)
#> Chain 4 Iteration: 200 / 600 [ 33%] (Warmup)
#> Chain 4 Iteration: 300 / 600 [ 50%] (Warmup)
#> Chain 4 Iteration: 301 / 600 [ 50%] (Sampling)
#> Chain 4 Iteration: 400 / 600 [ 66%] (Sampling)
#> Chain 4 Iteration: 500 / 600 [ 83%] (Sampling)
#> Chain 4 Iteration: 600 / 600 [100%] (Sampling)
#> Chain 4 finished in 13.7 seconds.
#>
#> All 4 chains finished successfully.
#> Mean chain execution time: 15.9 seconds.
#> Total execution time: 64.4 seconds.
As you can see, this also prints all of the diagnostic output from
$sample()
as it runs.
It is common to run a model, make some tweaks to it, and then run it
again. However, to avoid accidentally deleting model outputs,
bbr
will error by default if it sees existing output when
trying to submit a model. To automatically overwrite any previous model
output, just pass .overwrite = TRUE
to submit_model()
.
The submit_model()
call will return the fit model object from cmdstanr
. You
can read all about this object in the
?cmdstanr::CmdStanMCMC
object docs. It has several methods
for diagnostics and model summaries.
# this is just diagnostics
fxa_res$sampler_diagnostics() %>% str()
#> 'draws_array' num [1:100, 1:4, 1:6] 7 7 7 7 7 7 7 7 7 7 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ iteration: chr [1:100] "1" "2" "3" "4" ...
#> ..$ chain : chr [1:4] "1" "2" "3" "4"
#> ..$ variable : chr [1:6] "treedepth__" "divergent__" "energy__" "accept_stat__" ...
#> [1] 7731 10
.s %>%
slice_head(n = 5) %>%
knitr::kable()
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -3680.2814083 | -3680.0200000 | 8.6471405 | 8.6064930 | -3694.8520000 | -3666.4655000 | 1.0105079 | 303.0671 | 486.5886 |
emax | 98.6297981 | 98.7957500 | 0.9983639 | 0.9986052 | 96.6787050 | 99.8746150 | 0.9999083 | 668.7926 | 708.7078 |
ec50Hat | 98.1662662 | 98.1924000 | 3.8890264 | 3.9245905 | 91.8921650 | 104.6708500 | 1.0031086 | 595.0784 | 777.1432 |
gamma | 0.9891423 | 0.9868940 | 0.0252873 | 0.0243814 | 0.9514651 | 1.0339120 | 1.0062567 | 781.3768 | 805.9938 |
sigma | 10.4642216 | 10.4564500 | 0.2041405 | 0.2097138 | 10.1424650 | 10.7990600 | 1.0014663 | 2031.5236 | 988.9422 |
omegaEc50 | 0.2095165 | 0.2087590 | 0.0281342 | 0.0291323 | 0.1645385 | 0.2579202 | 1.0035262 | 535.3836 | 757.2863 |
eta[1] | -0.6250711 | -0.6201160 | 0.7675739 | 0.7734544 | -1.8850430 | 0.6527469 | 1.0029982 | 1867.7706 | 907.5572 |
eta[2] | -0.1153561 | -0.0963942 | 0.8554481 | 0.8512741 | -1.4946310 | 1.3240880 | 1.0055980 | 1743.3984 | 933.1333 |
eta[3] | -0.8548926 | -0.8518780 | 0.8032995 | 0.8149496 | -2.1302315 | 0.4580917 | 1.0059036 | 2280.0438 | 787.9234 |
eta[4] | -0.2127920 | -0.2252210 | 0.8624235 | 0.8322746 | -1.5880420 | 1.2248895 | 1.0045113 | 1825.3802 | 758.5875 |
eta[5] | -0.0151050 | -0.0544589 | 0.9292352 | 0.8987395 | -1.4938490 | 1.5579505 | 1.0063809 | 1729.7763 | 719.3638 |
eta[6] | -0.1180454 | -0.1490070 | 0.8218678 | 0.8478478 | -1.4854690 | 1.2749045 | 1.0014590 | 2242.0783 | 778.1252 |
eta[7] | 0.2474624 | 0.2238200 | 0.7733967 | 0.7866735 | -0.9878888 | 1.5436155 | 0.9996281 | 1736.8845 | 878.6668 |
eta[8] | 0.0880659 | 0.0778770 | 0.7961751 | 0.7591082 | -1.1910545 | 1.3971225 | 1.0049626 | 1689.6422 | 857.3040 |
eta[9] | 0.0322074 | 0.0314970 | 0.5920293 | 0.5559127 | -0.9368224 | 1.0589770 | 1.0042437 | 1411.5189 | 981.4068 |
posterior::draws
methods
Many cmdstanr users will also be familiar with the
posterior package and its draws
objects and methods. bbr.bayes
has also implemented the
as_draws()
method and its variants, which can be used on either a model object or
the ?cmdstanr::CmdStanMCMC
object.
#> [1] 300 4 7731
#> [1] 1200 7734
read_fit_model()
The submit_model()
call automatically saves this fit object to disk as well. You can pass
either a path or model object to read_fit_model()
and it
will re-load the ?cmdstanr::CmdStanMCMC
object into
memory.
new_res <- read_fit_model(file.path(MODEL_DIR, "fxa"))
new_res$sampler_diagnostics() %>% str()
#> 'draws_array' num [1:100, 1:4, 1:6] 7 7 7 7 7 7 7 7 7 7 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ iteration: chr [1:100] "1" "2" "3" "4" ...
#> ..$ chain : chr [1:4] "1" "2" "3" "4"
#> ..$ variable : chr [1:6] "treedepth__" "divergent__" "energy__" "accept_stat__" ...
model_summary()
function
We have not yet implemented
model_summary()
for bbi_stan_model
objects.
This is a work in progress, but the ?cmdstanr::CmdStanMCMC
returned by read_fit_model()
has various methods for
summarizing the outputs.
Another useful feature of bbr
is that it allows you to
easily annotate and organize your models, for increased traceability and
reproducibilty of your analysis. We demonstrate this a bit in this
section, but it is described in more detail in both the “Model
Management” page of MeRGE Expo 1 and the “Getting
Started with bbr
” vignette.
Our first model already had two tags attached to it.
fxa$tags
#> [1] "fxaInhibit1Ncp" "unif emax"
Note in the call above that we told copy_model_from()
to
inherit the tags from the parent model and automatically add them to the
new model. You can use replace_tag()
(or any of the other
helpers in ?modify_tags
) to modify these at any time.
fxaNew <- fxaNew %>%
replace_tag("unif emax", "cauchy emax")
print(fxaNew$tags)
#> [1] "fxaInhibit1Ncp" "cauchy emax"
You can also use tags_diff()
to quickly compare two
models.
tags_diff(fxa, fxaNew)
#> In fxa but not fxaNew: unif emax
#> In fxaNew but not fxa: cauchy emax
In addition to tags, you can add notes that can be referenced later. While tags are intended to be structured and consistent, such that you can filter or organize models based on them (shown below), notes can be more “free form” narrative comments.
While both annotations can be easily accessed from the model object
with fxaNew$tags
or fxaNew$notes
, notice that
they are also displayed, along with other information, when you print
the model object in your console
fxaNew
#>
#> ── Status ──────────────────────────────────────────────────────────────────────
#> • Not Run
#>
#> ── Absolute Model Path ─────────────────────────────────────────────────────────
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew
#>
#> ── YAML & Model Files ──────────────────────────────────────────────────────────
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew.yaml
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew.stan
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew-standata.R
#> • /data/home/kylem/src/github/metrumresearchgroup/bbr.bayes/inst/model/stan/fxaNew/fxaNew-init.R
#>
#> ── Tags ────────────────────────────────────────────────────────────────────────
#> • fxaInhibit1Ncp
#> • cauchy emax
#>
#> ── Notes ───────────────────────────────────────────────────────────────────────
#> • 1: Changed emax from uniform to cauchy distribution
Importantly, you must reassign to the model object
when calling any of these functions that modify it (notice the
fxaNew <- fxaNew ...
above). You can see a list of
similar functions in the “See Also” section of ?modify_tags
or ?modify_notes
.
At any point, you can easily construct a “run log” tibble to
summarize all models run up to this point. You can see much more detail
about using run logs in both the “Model
Summary” page of MeRGE Expo 1 and the “Getting
Started with bbr
” vignette.
The run_log()
returns a tibble which can be manipulated
like any other tibble. You can also use the .include
argument to filter to only specific tags or run names, and use
formatting functions like collapse_to_string()
,
knitr::kable()
, or many of the functions
in pmtables
.
log_df <- run_log(MODEL_DIR, .include = c("fxa", "fxaNew"))
log_df %>%
collapse_to_string(tags, notes) %>%
select(run, tags, notes) %>%
knitr::kable()
run | tags | notes |
---|---|---|
fxa | fxaInhibit1Ncp, unif emax | NA |
fxaNew | fxaInhibit1Ncp, cauchy emax | Changed emax from uniform to cauchy distribution |
It is often useful to add some basic diagnostic information to a run
log table. For Stan models, this can be done with
stan_summary_log()
or stan_add_summary()
. In
addition to some sampling diagnostics and metadata, this function will
summarize the specified variables.
log_df <- log_df %>%
stan_add_summary(variables = "emax")
names(log_df)
[1] "absolute_model_path" "run" "yaml_md5"
[4] "model_type" "description" "bbi_args"
[7] "based_on" "tags" "notes"
[10] "star" "method" "stan_version"
[13] "nchains" "iter_warmup" "iter_sampling"
[16] "num_divergent" "num_max_treedepth" "bad_ebfmi"
[19] "emax_mean" "emax_median" "emax_q5"
[22] "emax_q95" "emax_rhat" "emax_ess_bulk"
[25] "emax_ess_tail"
There are default functions used for the summarization, which are
shown below. You can also pass your own summarization functions via the
summary_fns
argument. See ?stan_summary_log
for more details.
log_df %>%
select(run, num_divergent, starts_with("emax")) %>%
knitr::kable()
run | num_divergent | emax_mean | emax_median | emax_q5 | emax_q95 | emax_rhat | emax_ess_bulk | emax_ess_tail |
---|---|---|---|---|---|---|---|---|
fxa | 1 | 98.59362 | 98.85785 | 96.74060 | 99.87313 | 1.009106 | 410.7121 | 700.1974 |
fxaNew | 7 | 98.65914 | 98.89980 | 96.62748 | 99.90333 | 1.007200 | 901.7857 | 724.6135 |