This notebook contains several examples of how to use Stan in R with brms. It assumes basic knowledge of Bayesian inference and MCMC. The examples are related to the Bayesian data analysis course.
Load packages
Code
library(tidyr)
library(dplyr)
library(tibble)
library(pillar)
library(stringr)
library(brms)
options(brms.backend = "cmdstanr", mc.cores = 2)
library(posterior)
options(posterior.num_args = list(digits = 2))
options(pillar.negative = FALSE)
library(loo)
library(priorsense)
#options(priorsense.use_plot_theme=FALSE)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))
library(tidybayes)
library(ggdist)
library(patchwork)
library(RColorBrewer)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell", tinytable_format_digits = 2, tinytable_tt_digits = 2)
SEED <- 48927 # set random seed for reproducibility
2 Bernoulli model
Toy data with a sequence of failures (0) and successes (1). We would like to learn about the unknown probability of success. brms wants the data in data frame format (a tibble, which is a variant of a data frame, can be used, too).
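The data definition is not shown above; a minimal sketch, assuming 7 successes out of 10 trials (consistent with the Binomial data used later), is:

# toy data: 7 successes (1) and 3 failures (0); the ordering does not matter
data_bern <- data.frame(y = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0))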
As usual in the case of generalized linear models (GLMs), brms defines the priors on the latent model parameters. With the Bernoulli family the default link function is logit, and thus the prior is set on logit(theta). As there are no covariates, logit(theta) = Intercept. The brms default prior for Intercept is student_t(3, 0, 2.5), but we use student_t(7, 0, 1.5), which is close to the logistic distribution and thus makes the prior near-uniform for theta. We can simulate from these priors to check the implied prior on theta. We next compare the result to using a normal(0, 1) prior on the logit probability. We visualize the implied priors by sampling from the priors.
plogis() is the cumulative distribution function of the logistic distribution, which is equal to the inverse-logit transformation. Base R's Student's t-distribution functions do not have mu and sigma arguments, and thus we use rstudent_t() from the ggdist package.
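A sketch of the implied prior on theta from the student_t(7, 0, 1.5) prior on the intercept, following the same pattern as the normal(0, 1.5) plot below:

data.frame(theta = plogis(ggdist::rstudent_t(n = 20000, df = 7, mu = 0, sigma = 1.5))) |>
  mcmc_hist() +
  xlim(c(0, 1)) +
  labs(title = 'student_t(7, 0, 1.5) for Intercept')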
An almost uniform prior on theta could also be obtained with normal(0, 1.5):
data.frame(theta = plogis(rnorm(n = 20000, mean = 0, sd = 1.5))) |>
  mcmc_hist() +
  xlim(c(0, 1)) +
  labs(title = 'normal(0, 1.5) for Intercept')
Formula y ~ 1 corresponds to a model \mathrm{logit}(\theta) = \alpha\times 1 = \alpha. brms denotes the \alpha as Intercept.
fit_bern <- brm(y ~ 1, family = bernoulli(), data = data_bern,
                prior = prior(student_t(7, 0, 1.5), class = 'Intercept'),
                seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_bern
Family: bernoulli
Links: mu = logit
Formula: y ~ 1
Data: data_bern (Number of observations: 10)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.76 0.64 -0.45 2.05 1.00 1602 1574
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws in data frame (df) format
draws <- as_draws_df(fit_bern)
We can select a subset of the stored variables with subset_draws() and get summary information using summarise_draws().
draws |>
  subset_draws(variable = 'b_Intercept') |>
  summarise_draws()
Piping the summary to tt() (tinytable, with the options set above) makes the table show only 2 significant digits for each value, which is sufficient accuracy, and having fewer digits makes the table easier to read.
draws |>
  subset_draws(variable = 'b_Intercept') |>
  summarise_draws() |>
  tt()
 variable      mean  median    sd   mad     q5   q95  rhat  ess_bulk  ess_tail
 b_Intercept   0.76    0.73  0.64  0.62  -0.27   1.9     1      1602      1574
We can compute the probability of success by using plogis(), which is the inverse-logit function.
draws <- draws |> mutate_variables(theta = plogis(b_Intercept))
Summary of theta
draws |>
  subset_draws(variable = 'theta') |>
  summarise_draws() |>
  tt()
Prior and likelihood sensitivity plot shows posterior density estimate depending on amount of power-scaling. Overlapping lines indicate low sensitivity and wider gaps between lines indicate greater sensitivity.
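The plot itself is not shown above; a minimal sketch using the priorsense package (plotting all parameters, since the exact variable selection used here is an assumption):

# power-scaling sensitivity: density estimates under prior and likelihood power-scaling
powerscale_plot_dens(fit_bern)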
Instead of a sequence of 0's and 1's, we can summarize the data with the number of trials and the number of successes, and use a Binomial model. The prior is specified in the 'latent space'. The actual probability of success is theta = plogis(alpha), where plogis() is the inverse-logit (logistic) function.
Binomial model with the same data and prior
data_bin <- data.frame(N = c(10), y = c(7))
Formula y | trials(N) ~ 1 corresponds to a model \mathrm{logit}(\theta) = \alpha, and the number of trials for each observation is provided by | trials(N)
fit_bin <- brm(y | trials(N) ~ 1, family = binomial(), data = data_bin,
               prior = prior(student_t(7, 0, 1.5), class = 'Intercept'),
               seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_bin
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1
Data: data_bin (Number of observations: 1)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.75 0.63 -0.44 2.05 1.00 1330 1429
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws in data frame format
draws <- as_draws_df(fit_bin)
Summary of latent intercept
draws |>
  subset_draws(variable = 'b_Intercept') |>
  summarise_draws() |>
  tt()
 variable      mean  median    sd   mad     q5   q95  rhat  ess_bulk  ess_tail
 b_Intercept   0.75    0.73  0.63  0.62  -0.25   1.8     1      1330      1429
We can compute the probability of success by using plogis(), which is the inverse-logit function.
draws <- draws |> mutate_variables(theta = plogis(b_Intercept))
Summary of theta
draws |>
  subset_draws(variable = 'theta') |>
  summarise_draws() |>
  tt(digits = 2)
Re-run the model with a new dataset without recompiling, using the argument newdata.
data_bin <- data.frame(N = c(5), y = c(4))
fit_bin <- update(fit_bin, newdata = data_bin)
Check the summary of the posterior and inference diagnostics.
fit_bin
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1
Data: data_bin (Number of observations: 1)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.07 0.92 -0.54 2.95 1.00 1317 1454
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws in data frame format
draws <- as_draws_df(fit_bin)
Summary of latent intercept
draws |>
  subset_draws(variable = 'b_Intercept') |>
  summarise_draws() |>
  tt(digits = 2)
 variable      mean  median    sd   mad     q5   q95  rhat  ess_bulk  ess_tail
 b_Intercept    1.1     1.0  0.92  0.86  -0.29   2.6     1      1317      1454
We can compute the probability of success by using plogis(), which is the inverse-logit function.
draws <- draws |> mutate_variables(theta = plogis(b_Intercept))
Summary of theta
draws |>
  subset_draws(variable = 'theta') |>
  summarise_draws() |>
  tt(digits = 2)
An experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients were randomly assigned to treatment and control groups:
out of 674 patients receiving the control, 39 died
out of 680 receiving the treatment, 22 died
Data, where grp2 is an indicator variable defined as a factor (using the factor() function), which is useful for categorical variables.
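A sketch of the data construction using the counts given above (the variable names N, y, and grp2 follow the formulas used below):

data_bin2 <- data.frame(N = c(674, 680), y = c(39, 22),
                        grp2 = factor(c('control', 'treatment')))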
To analyse whether the treatment is useful, we can use a Binomial model for both groups and compute the odds-ratio. To recreate the model as two independent (separate) binomial models, we use the formula y | trials(N) ~ 0 + grp2, which corresponds to a model \mathrm{logit}(\theta) = \alpha \times 0 + \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment} = \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment}, where x_\mathrm{control} is a vector with 1 for control and 0 for treatment, and x_\mathrm{treatment} is a vector with 1 for treatment and 0 for control. As only one of the vectors has 1 for each observation, this corresponds to the separate models \mathrm{logit}(\theta_\mathrm{control}) = \beta_\mathrm{control} and \mathrm{logit}(\theta_\mathrm{treatment}) = \beta_\mathrm{treatment}. We can provide the same prior for all \beta's by setting the prior with class='b'. With the prior student_t(7, 0, 1.5), both \beta's are shrunk towards 0, but independently.
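A sketch of the corresponding brm() call (the prior follows the prose above; other options as in the earlier fits):

fit_bin2 <- brm(y | trials(N) ~ 0 + grp2, family = binomial(), data = data_bin2,
                prior = prior(student_t(7, 0, 1.5), class = 'b'),
                seed = SEED, refresh = 0)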
Check the summary of the posterior and inference diagnostics. With ~ 0 + grp2 there is no Intercept, and \beta_\mathrm{control} and \beta_\mathrm{treatment} are presented as grp2control and grp2treatment.
fit_bin2
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 0 + grp2
Data: data_bin2 (Number of observations: 2)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
grp2control -2.78 0.16 -3.11 -2.48 1.00 3187 2678
grp2treatment -3.37 0.21 -3.80 -2.97 1.00 2525 2312
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Compute theta for each group and the odds-ratio. brms uses variable names b_grp2control and b_grp2treatment for \beta_\mathrm{control} and \beta_\mathrm{treatment} respectively.
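A sketch of the computation (the derived variable names theta_control, theta_treatment, and oddsratio are choices made here):

draws <- as_draws_df(fit_bin2) |>
  mutate_variables(theta_control = plogis(b_grp2control),
                   theta_treatment = plogis(b_grp2treatment)) |>
  mutate_variables(oddsratio = (theta_treatment / (1 - theta_treatment)) /
                     (theta_control / (1 - theta_control)))
draws |>
  subset_draws(variable = c('theta_control', 'theta_treatment', 'oddsratio')) |>
  summarise_draws() |>
  tt()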
Compute the probability that oddsratio < 1 and the associated Monte Carlo standard error (MCSE). To compute the probability we use the comparison oddsratio < 1 to find out for which posterior draws this is true; when we compute the mean over the TRUE and FALSE values, they are converted to 1 and 0, and mean() is then equal to the proportion of 1's. The usual MCSE estimate for the mean can be used to get the MCSE for this proportion.
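A sketch of this computation (continuing from the draws object defined above):

draws |>
  mutate_variables(p_or_lt_1 = as.numeric(oddsratio < 1)) |>
  subset_draws(variable = 'p_or_lt_1') |>
  summarise_draws(mean, mcse_mean)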
Make prior sensitivity analysis by power-scaling both prior and likelihood. Focus on odds-ratio which is the quantity of interest. We see that the likelihood is much more informative than the prior, and we would expect to see a different posterior only with a highly informative prior (possibly based on previous similar experiments). Prior and likelihood sensitivity plot shows posterior density estimate depending on amount of power-scaling. Overlapping lines indicate low sensitivity and wider gaps between lines indicate greater sensitivity.
We can summarise the prior and likelihood sensitivity using the cumulative Jensen-Shannon distance. Here we prefer to show 2 decimal digits (instead of the 2 significant digits we used before).
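A sketch of the sensitivity summary (the exact output columns depend on the priorsense version, so this is an assumption rather than the original code):

powerscale_sensitivity(fit_bin2) |>
  as.data.frame() |>
  tt() |>
  format_tt(digits = 2, num_fmt = "decimal")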
Above we used the formula y | trials(N) ~ 0 + grp2 to have a separate model for the control and treatment groups. An alternative model y | trials(N) ~ grp2, which is equivalent to y | trials(N) ~ 1 + grp2, would correspond to a model \mathrm{logit}(\theta) = \alpha \times 1 + \beta_\mathrm{treatment}\times x_\mathrm{treatment} = \alpha + \beta_\mathrm{treatment}\times x_\mathrm{treatment}. Now \alpha models the probability of death (via the logistic link) in the control group, and \alpha + \beta_\mathrm{treatment} models the probability of death (via the logistic link) in the treatment group. Now the models for the groups are connected. Furthermore, if we set independent student_t(7, 0, 1.5) priors on \alpha and \beta_\mathrm{treatment}, the implied priors on \theta_\mathrm{control} and \theta_\mathrm{treatment} are different. We can verify this with a prior simulation.
data.frame(theta_control = plogis(ggdist::rstudent_t(n = 20000, df = 7, mu = 0, sigma = 1.5))) |>
  mcmc_hist() +
  xlim(c(0, 1)) +
  labs(title = 'student_t(7, 0, 1.5) for Intercept') +
data.frame(theta_treatment = plogis(ggdist::rstudent_t(n = 20000, df = 7, mu = 0, sigma = 1.5) +
                                      ggdist::rstudent_t(n = 20000, df = 7, mu = 0, sigma = 1.5))) |>
  mcmc_hist() +
  xlim(c(0, 1)) +
  labs(title = 'student_t(7, 0, 1.5) for Intercept and b_grp2treatment')
In this case, with relatively big treatment and control group, the likelihood is informative, and the difference between using y | trials(N) ~ 0 + grp2 or y | trials(N) ~ grp2 is negligible.
A third option would be a hierarchical model with formula y | trials(N) ~ 1 + (1 | grp2), which corresponds to a model \mathrm{logit}(\theta) = \alpha \times 1 + \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment}, but now the prior on \beta_\mathrm{control} and \beta_\mathrm{treatment} is \mathrm{normal}(0, \sigma_\mathrm{grp}). The default brms prior for \sigma_\mathrm{grp} is student_t(3, 0, 2.5). Now \alpha models the overall probability of death (via the logistic link), and \beta_\mathrm{control} and \beta_\mathrm{treatment} model the differences from that, with a common prior. The prior for \beta_\mathrm{control} and \beta_\mathrm{treatment} includes the unknown scale \sigma_\mathrm{grp}. If there is no difference between the control and treatment groups, the posterior of \sigma_\mathrm{grp} has more mass near 0, and the bigger the difference between the control and treatment groups, the more mass there is away from 0. With just two groups, there is not much information about \sigma_\mathrm{grp}, and unless there is an informative prior on \sigma_\mathrm{grp}, a two-group hierarchical model is not that useful. Hierarchical models are more useful with more than two groups. In the following, we use the previously used student_t(7, 0, 1.5) prior on the intercept and the default brms prior student_t(3, 0, 2.5) on \sigma_\mathrm{grp}.
Check the summary of the posterior and inference diagnostics. The summary reports that there are Group-Level Effects: ~grp2 with 2 levels (control and treatment), with sd(Intercept) denoting \sigma_\mathrm{grp}. In addition, the summary lists Population-Level Effects: Intercept (\alpha) as in the previous non-hierarchical models.
fit_bin2
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1 + (1 | grp2)
Data: data_bin2 (Number of observations: 2)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~grp2 (Number of levels: 2)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.63 1.58 0.12 5.95 1.01 480 752
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.24 1.21 -3.90 0.74 1.01 542 813
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can also look at the variable names brms uses internally. We exclude variables lprior (log prior density) and lp__ (log posterior density).
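A sketch of listing and summarising the variables while excluding lprior and lp__:

as_draws_df(fit_bin2) |>
  subset_draws(variable = c('lprior', 'lp__'), exclude = TRUE) |>
  summarise_draws()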
To analyse whether there has been a change in the average summer month temperature, we use a linear model with a Gaussian model for the unexplained variation. By default brms uses an improper uniform (flat) prior for the coefficients.
Formula temp ~ year corresponds to a model \mathrm{temp} \sim \mathrm{normal}(\alpha + \beta \times \mathrm{year}, \sigma). The model could also be defined as temp ~ 1 + year, which explicitly shows the intercept (\alpha) part. Using the variable names brms uses, the model can also be written as
\mathrm{temp} \sim \mathrm{normal}(\mathrm{b\_Intercept}\times 1 + \mathrm{b\_year}\times\mathrm{year}, \mathrm{sigma})
We start with the default priors to see some tricks that brms does behind the curtain.
fit_lin <- brm(temp ~ year, data = data_lin, family = gaussian(),
               seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_lin
Family: gaussian
Links: mu = identity; sigma = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -34.91 12.75 -60.01 -9.65 1.00 3675 2945
year 0.02 0.01 0.01 0.03 1.00 3679 2859
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.08 0.09 0.92 1.28 1.00 3346 2610
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Convergence diagnostics look good. We see that the posterior mean of Intercept is about -35, which may sound strange, but that is the intercept at year 0, that is, very far from the data range, and thus it doesn't have a directly meaningful interpretation. The posterior mean of the year coefficient is 0.02, that is, we estimate that the summer temperature is increasing by about 0.02°C per year (which would make 1°C in 50 years).
We can check R^2 which corresponds to the proportion of variance explained by the model. The linear model explains 0.16=16% of the total data variance.
bayes_R2(fit_lin) |> as_tibble() |> tt()
 Estimate  Est.Error   Q2.5  Q97.5
     0.16      0.071  0.032   0.31
We can check all the priors used with prior_summary().
prior_summary(fit_lin) |> tt()
 prior                   class      coef  group  resp  dpar  nlpar  lb  ub  source
                         b                                                  default
                         b          year                                    default
 student_t(3, 9.5, 2.5)  Intercept                                          default
 student_t(3, 0, 2.5)    sigma                                       0      default
We see that class=b and coef=year have a flat prior, that is, an improper uniform prior; Intercept has a student_t(3, 9.5, 2.5) prior, and sigma has a student_t(3, 0, 2.5) prior. In general it is good to use proper priors, but sometimes flat priors are fine and produce a proper posterior (like in this case). An important detail here is that, by default, brms sets the prior on Intercept after centering the covariate values (design matrix). In this case, brms centers the covariate year by subtracting the mean year (1987) and does the regression on the centered covariate year-1987. This in general improves the sampling efficiency. The Intercept is then defined at the mean year, and by default brms uses a weak proper prior on Intercept centered on the median of the target (temp). If we would like to set informative priors, we need to set the informative prior on Intercept given the centered covariate values.
Sometimes we prefer to turn off the automatic centering, and we can do that by replacing the formula with bf(temp ~ year, center=FALSE). Alternatively, we can set the prior on the original intercept by using the formula temp ~ 0 + Intercept + year.
In this case, we are happy with the default prior for the intercept. In this specific case, the flat prior on the coefficient is also fine, but we add a weakly informative prior just for illustration. Let's assume we expect the temperature to change less than 1°C in 10 years. With a student_t(3, 0, 0.03) prior, about 95% of the prior mass implies less than a 0.1°C change per year, and with low degrees of freedom (3) we have thick tails, making the likelihood dominate in case of prior-data conflict. In real life, we do have much more information about the temperature change, and naturally a hierarchical spatio-temporal model with all temperature measurement locations would be even better.
fit_lin <- brm(temp ~ year, data = data_lin, family = gaussian(),
               prior = prior(student_t(3, 0, 0.03), class = 'b'),
               seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_lin
Family: gaussian
Links: mu = identity; sigma = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -32.31 12.33 -56.76 -7.39 1.00 4687 2713
year 0.02 0.01 0.01 0.03 1.00 4687 2714
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.08 0.09 0.91 1.28 1.00 3363 2640
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Make prior sensitivity analysis by power-scaling both prior and likelihood.
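A sketch of the tabular sensitivity check with priorsense (the plot version would use powerscale_plot_dens()):

powerscale_sensitivity(fit_lin)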
Our weakly informative proper prior has negligible sensitivity, and the likelihood is informative. Extract the posterior draws and check the summaries. We exclude variables lprior (log prior density) and lp__ (log posterior density).
Plot posterior draws of the linear function values at each year. add_linpred_draws() takes the years from the data passed via pipe, and uses fit_lin to make the linear model predictions. add_linpred_draws() corresponds to brms::posterior_linpred()
data_lin |>
  add_linpred_draws(fit_lin) |>
  # plot data
  ggplot(aes(x = year, y = temp)) +
  geom_point(color = 2) +
  # plot lineribbon for the linear model
  stat_lineribbon(aes(y = .linpred), .width = c(.95), alpha = 1/2,
                  color = brewer.pal(5, "Blues")[[5]]) +
  # decoration
  scale_fill_brewer() +
  labs(x = "Year", y = 'Summer temp. @Kilpisjärvi') +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1950, 2020, by = 10))
Plot a spaghetti plot for 100 posterior draws.
data_lin |>
  add_linpred_draws(fit_lin, ndraws = 100) |>
  # plot data
  ggplot(aes(x = year, y = temp)) +
  geom_point(color = 2) +
  # plot a line for each posterior draw
  geom_line(aes(y = .linpred, group = .draw), alpha = 1/2,
            color = brewer.pal(5, "Blues")[[3]]) +
  # decoration
  scale_fill_brewer() +
  labs(x = "Year", y = 'Summer temp. @Kilpisjärvi') +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1950, 2020, by = 10))
Plot posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_lin to draw from the posterior predictive distribution. add_predicted_draws() corresponds to brms::posterior_predict().
data_lin |>
  add_row(year = 2023:2030) |>
  add_predicted_draws(fit_lin) |>
  # plot data
  ggplot(aes(x = year, y = temp)) +
  geom_point(color = 2) +
  # plot lineribbon for the predictive distribution
  stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2,
                  color = brewer.pal(5, "Blues")[[5]]) +
  # decoration
  scale_fill_brewer() +
  labs(x = "Year", y = 'Summer temp. @Kilpisjärvi') +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1950, 2030, by = 10))
Warning: Removed 32000 rows containing missing values or values outside the scale range
(`geom_point()`).
Posterior predictive check with density overlays examines the whole temperature distribution. We generate replicate data using 20 different posterior draws (with argument ndraws).
pp_check(fit_lin, type='dens_overlay', ndraws=20)
The LOO-PIT check is good for checking whether the normal distribution describes the variation well, as it examines the calibration of the LOO predictive distributions conditional on each year. The LOO-PIT plot looks good. We use all posterior draws to estimate the LOO predictive distributions.
pp_check(fit_lin, type='loo_pit_qq', ndraws=4000)
6 Linear Student’s t model
The temperatures used in the above analyses are averages over three months, which makes it more likely that they are normally distributed, but there can be extreme events in the weather, so we can check whether a more robust Student's t observation model would give different results (although the LOO-PIT check did already indicate that the normal model is adequate).
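A sketch of the Student's t fit, assuming the same weakly informative slope prior as above:

fit_lin_t <- brm(temp ~ year, data = data_lin, family = student(),
                 prior = prior(student_t(3, 0, 0.03), class = 'b'),
                 seed = SEED, refresh = 0)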
Check the summary of the posterior and inference diagnostics. The b_year posterior looks similar to before, and the posterior for the degrees of freedom nu has most of its mass at quite large values, indicating there is no strong support for thick-tailed variation in average summer temperatures.
fit_lin_t
Family: student
Links: mu = identity; sigma = identity; nu = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -33.84 12.36 -58.32 -9.51 1.00 3748 2631
year 0.02 0.01 0.01 0.03 1.00 3753 2631
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.03 0.10 0.86 1.25 1.00 3228 2900
nu 24.40 13.97 6.41 59.17 1.00 3496 2620
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can use leave-one-out cross-validation to compare the expected predictive performance.
The LOO comparison shows the normal and Student's t models have similar performance. As loo_compare() returns its own specific object type, we need to do some manipulation to change it to a data frame suitable for tt().
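A sketch of the comparison and conversion (rownames_to_column() is from the tibble package loaded above; the model names in the row names may depend on how the loo objects are created):

loo_compare(loo(fit_lin), loo(fit_lin_t)) |>
  as.data.frame() |>
  rownames_to_column("model") |>
  select(model, elpd_diff, se_diff) |>
  tt()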
Heteroskedasticity means that the variation around the linear mean can also vary, so we allow sigma to depend on year, too. brms supports specifying several model formulas with the bf() function, and sigma is a special keyword for the residual standard deviation. Although the additional component is written as sigma ~ year, the log link function is used, and the model is for log(sigma).
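A sketch of the heteroskedastic fit (the prior for the mean slope follows the earlier fits; default priors otherwise):

fit_lin_h <- brm(bf(temp ~ year, sigma ~ year),
                 data = data_lin, family = gaussian(),
                 prior = prior(student_t(3, 0, 0.03), class = 'b'),
                 seed = SEED, refresh = 0)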
Check the summary of the posterior and inference diagnostics. The b_year posterior looks similar to before. The posterior for sigma_year has most of its mass at negative values, indicating a decrease in temperature variation around the mean.
fit_lin_h
Family: gaussian
Links: mu = identity; sigma = log
Formula: temp ~ year
sigma ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -36.23 12.41 -60.50 -11.96 1.00 3533 2896
sigma_Intercept 18.64 8.82 1.30 35.92 1.00 3566 3006
year 0.02 0.01 0.01 0.04 1.00 3545 2896
sigma_year -0.01 0.00 -0.02 -0.00 1.00 3562 3028
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
As the model is for log(sigma) and exp(x) ≈ 1 + x when x is close to zero, we can see that sigma is decreasing by about 1% per year (95% interval from 0% to 2%).
Plot the posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_lin_h to draw from the posterior predictive distribution.
data_lin |>
  add_row(year = 2023:2030) |>
  add_predicted_draws(fit_lin_h) |>
  # plot data
  ggplot(aes(x = year, y = temp)) +
  geom_point(color = 2) +
  # plot lineribbon for the predictive distribution
  stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2,
                  color = brewer.pal(5, "Blues")[[5]]) +
  # decoration
  scale_fill_brewer() +
  labs(x = "Year", y = 'Summer temp. @Kilpisjärvi') +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1950, 2030, by = 10))
Warning: Removed 32000 rows containing missing values or values outside the scale range
(`geom_point()`).
Make prior sensitivity analysis by power-scaling both prior and likelihood.
We can test the linearity assumption by using non-linear spline functions, added with s(year) terms. Sampling is slower as the posterior gets more complex.
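A sketch of the spline fit with heteroskedastic noise (default priors assumed):

fit_spline_h <- brm(bf(temp ~ s(year), sigma ~ s(year)),
                    data = data_lin, family = gaussian(),
                    seed = SEED, refresh = 0)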
We get warnings about divergences and try rerunning with a higher adapt_delta, which leads to smaller step sizes. Often adapt_delta=0.999 leads to very slow sampling and is not generally recommended, but with this small dataset, this is not an issue.
fit_spline_h <- update(fit_spline_h, control = list(adapt_delta = 0.999))
Check the summary of the posterior and inference diagnostics. We’re not anymore able to make interpretation of the temperature increase based on this summary. For splines, we see prior scales sds for the spline coefficients.
fit_spline_h
Family: gaussian
Links: mu = identity; sigma = log
Formula: temp ~ s(year)
sigma ~ s(year)
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Smoothing Spline Hyperparameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1) 1.02 0.95 0.03 3.61 1.00 1562 1917
sds(sigma_syear_1) 0.90 0.90 0.04 3.31 1.00 1265 1981
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 9.43 0.13 9.16 9.69 1.00 4663 2963
sigma_Intercept 0.04 0.09 -0.12 0.21 1.00 4392 2522
syear_1 2.93 2.59 -2.47 8.41 1.00 1930 1744
sigma_syear_1 -1.05 2.27 -6.27 3.47 1.00 1803 1165
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can still plot the posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_spline_h to draw from the posterior predictive distribution.
data_lin |>
  add_row(year = 2023:2030) |>
  add_predicted_draws(fit_spline_h) |>
  # plot data
  ggplot(aes(x = year, y = temp)) +
  geom_point(color = 2) +
  # plot lineribbon for the predictive distribution
  stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2,
                  color = brewer.pal(5, "Blues")[[5]]) +
  # decoration
  scale_fill_brewer() +
  labs(x = "Year", y = 'Summer temp. @Kilpisjärvi') +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1950, 2030, by = 10))
Warning: Removed 32000 rows containing missing values or values outside the scale range
(`geom_point()`).
And we can use LOO-CV to compare the expected predictive performance. The LOO comparison shows the homoskedastic normal linear and heteroskedastic normal spline models have similar performance. There are not enough observations to make a clear difference between the models.
For spline and other non-parametric models, we can use predictive estimates and predictions to get interpretable quantities. Let’s examine the difference of estimated average temperature in years 1952 and 2022.
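A sketch of that comparison (pivot_wider() is from the tidyr package loaded above; the use of stat_halfeye() for the plot is a choice made here):

data.frame(year = c(1952, 2022)) |>
  add_linpred_draws(fit_spline_h) |>
  ungroup() |>
  pivot_wider(id_cols = .draw, names_from = year, values_from = .linpred,
              names_prefix = "y") |>
  mutate(diff = y2022 - y1952) |>
  ggplot(aes(x = diff)) +
  stat_halfeye() +
  labs(x = 'Difference in estimated average temperature, 2022 vs 1952 (°C)')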
10 Comparison of k groups with hierarchical normal models
Load the factory data, which contain 5 quality measurements for each of 6 machines. We are interested in analysing whether there are quality differences between the machines.
fit_pooled <- brm(quality ~ 1, data = factory, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: gaussian
Links: mu = identity; sigma = identity
Formula: quality ~ 1
Data: factory (Number of observations: 30)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 93.09 3.23 86.65 99.54 1.00 3162 2409
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 18.42 2.49 14.19 23.92 1.00 3011 2034
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
10.2 Separate model
As a comparison, we also make a separate model. To make it completely separate, we need a different sigma for each machine, too.
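A sketch of one way to write the completely separate model (the exact formula used originally is an assumption; here each machine gets its own mean and its own sigma):

fit_separate <- brm(bf(quality ~ 0 + machine, sigma ~ 0 + machine),
                    data = factory, refresh = 0)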
We also fit a hierarchical model with a common mean and machine-specific effects, quality ~ 1 + (1 | machine); a sketch is shown below. Check the summary of the posterior and inference diagnostics.
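A sketch of the hierarchical fit (default priors assumed):

fit_hier <- brm(quality ~ 1 + (1 | machine), data = factory, refresh = 0)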
fit_hier
Warning: There were 1 divergent transitions after warmup. Increasing
adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Family: gaussian
Links: mu = identity; sigma = identity
Formula: quality ~ 1 + (1 | machine)
Data: factory (Number of observations: 30)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~machine (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 12.81 6.07 3.20 27.65 1.00 942 1214
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 92.81 5.80 80.78 103.97 1.00 1303 1254
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 15.06 2.32 11.31 20.27 1.00 1520 2044
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
LOO comparison shows the hierarchical model is the best. The differences are small as the number of observations is small and there is a considerable prediction (aleatoric) uncertainty.
Warning: Found 4 observations with a pareto_k > 0.7 in model 'fit_separate'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
 model          elpd_diff  se_diff
 fit_hier            0.0      0.0
 fit_separate       -3.8      2.7
 fit_pooled         -4.0      2.0
The different models' posterior distributions for the mean quality. The pooled model ignores the variation between machines. The separate model doesn't benefit from the similarity of the machines and has higher uncertainty.
Each study uses 2–6 different dose levels. The three studies that include only two dose levels (200 and 400) are likely to provide weak information about the slope.
Pooled model assumes all studies have the same dose effect (reminder: ~ dose is equivalent to ~ 1 + dose). We use similar priors as in earlier binomial models.
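A sketch of the data loading and pooled fit (the URL follows the same pattern as for dat.baker2009 later; the intercept prior follows the earlier binomial examples and the slope prior is an assumption):

# load dat.ursino2021 from the metadat package repository (URL assumed)
load(url('https://github.com/wviechtb/metadat/raw/master/data/dat.ursino2021.rda'))
fit_pooled <- brm(events | trials(total) ~ dose, family = binomial(),
                  data = dat.ursino2021,
                  prior = c(prior(student_t(7, 0, 1.5), class = 'Intercept'),
                            prior(normal(0, 1), class = 'b')),
                  seed = SEED, refresh = 0)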
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: binomial
Links: mu = logit
Formula: events | trials(total) ~ dose
Data: dat.ursino2021 (Number of observations: 49)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -3.16 0.37 -3.87 -2.46 1.00 1420 2308
dose 0.00 0.00 0.00 0.01 1.00 2951 2809
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Dose coefficient seems to be very small. Looking at the posterior, we see that it is positive with high probability.
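A sketch of computing the posterior probability that the dose coefficient is positive, with its MCSE:

as_draws_df(fit_pooled) |>
  mutate_variables(b_dose_pos = as.numeric(b_dose > 0)) |>
  subset_draws(variable = 'b_dose_pos') |>
  summarise_draws(mean, mcse_mean)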
The dose was reported in mg, and most values are in hundreds. It is often sensible to switch to a scale in which the range of values is closer to unit range. In this case it is natural to use g instead of mg.
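A sketch of the rescaling and refit (the new column name doseg matches the formula in the summary below; priors as before):

dat.ursino2021 <- dat.ursino2021 |>
  mutate(doseg = dose / 1000)
fit_pooled <- brm(events | trials(total) ~ doseg, family = binomial(),
                  data = dat.ursino2021,
                  prior = c(prior(student_t(7, 0, 1.5), class = 'Intercept'),
                            prior(normal(0, 1), class = 'b')),
                  seed = SEED, refresh = 0)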
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: binomial
Links: mu = logit
Formula: events | trials(total) ~ doseg
Data: dat.ursino2021 (Number of observations: 49)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.59 0.29 -3.17 -2.02 1.00 2321 2255
doseg 2.42 0.58 1.28 3.56 1.00 2649 2500
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Now it is easier to interpret the presented values.
Prior and likelihood sensitivity plot shows posterior density estimate depending on amount of power-scaling. Overlapping lines indicate low sensitivity and wider gaps between lines indicate greater sensitivity.
Comparing the posterior of b_doseg (95% interval [1.3, 3.6]) to the prior normal(0, 1), we see that when we scaled the covariate, we forgot to check that the prior still makes sense.
We do prior predictive checking by fixing intercept = 0 (otherwise prior variation from the intercept would hide the prior variation from b_doseg) and centering the covariate (which is what brms does by default).
The separate model assumes all studies have a different dose effect. It would be a bit complicated to set a different prior on the study-specific intercepts and other coefficients, so we use the same prior for all.
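A sketch of one way to write the separate model (the exact formula used originally is an assumption; ~ 0 + study + study:doseg gives each study its own intercept and slope, with a common prior on all coefficients):

fit_separate <- brm(events | trials(total) ~ 0 + study + study:doseg,
                    family = binomial(), data = dat.ursino2021,
                    prior = prior(student_t(7, 0, 1.5), class = 'b'),
                    seed = SEED, refresh = 0)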
We build two different hierarchical models. The first one has hierarchical model for the intercept, that is, each study has a parameter telling how much that study differs from the common population intercept.
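Sketches of the two hierarchical fits (priors as in the pooled model; the hierarchical structure is inferred from the summaries shown later):

# hierarchical intercept only
fit_hier1 <- brm(events | trials(total) ~ doseg + (1 | study),
                 family = binomial(), data = dat.ursino2021,
                 prior = c(prior(student_t(7, 0, 1.5), class = 'Intercept'),
                           prior(normal(0, 1), class = 'b')),
                 seed = SEED, refresh = 0)
# hierarchical intercept and slope
fit_hier2 <- brm(events | trials(total) ~ doseg + (doseg | study),
                 family = binomial(), data = dat.ursino2021,
                 prior = c(prior(student_t(7, 0, 1.5), class = 'Intercept'),
                           prior(normal(0, 1), class = 'b')),
                 seed = SEED, refresh = 0)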
We see some divergences due to the highly varying posterior curvature. We repeat the sampling with a higher adapt_delta, which adjusts the step size to be smaller. A higher adapt_delta makes the computation slower, but that is not an issue in this case. If you get divergences with adapt_delta=0.99, it is likely that even larger values won't help, and you need to consider a different parameterization, a different model, or more informative priors.
Warning: Found 10 observations with a pareto_k > 0.7 in model 'fit_separate'.
We recommend to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_hier1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning: Found 2 observations with a pareto_k > 0.7 in model 'fit_hier2'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
 model          elpd_diff  se_diff
 fit_hier1           0.00     0.00
 fit_hier2          -0.79     0.45
 fit_pooled         -2.0      2.5
 fit_separate      -14.0      4.1
We get warnings about several Pareto k’s > 0.7 in PSIS-LOO for separate model, but as in that case the LOO-CV estimate is usually overoptimistic and the separate model is the worst, there is no need to use more accurate computation for the separate model.
We get warnings about a few Pareto k's > 0.7 in PSIS-LOO for both hierarchical models. We can improve the accuracy by running MCMC for these LOO folds. We use the add_criterion() function to store the LOO computation results, as they take a bit longer now. We get some divergences in the case of the second hierarchical model, as leaving out an observation from a study that has only two dose levels gives the posterior a difficult shape.
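A sketch of refitting the problematic folds with reloo and storing the criterion:

fit_hier1 <- add_criterion(fit_hier1, criterion = 'loo', reloo = TRUE)
fit_hier2 <- add_criterion(fit_hier2, criterion = 'loo', reloo = TRUE)
# compare the two hierarchical models using the stored criteria
loo_compare(fit_hier1, fit_hier2, criterion = 'loo')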
The results did not change much. The first hierarchical model is slightly better than the other models, but for predictive purposes there is not much difference (there is high aleatoric uncertainty in the predictions). Adding a hierarchical model for the slope decreased the predictive performance, and thus it is likely that there is not enough information about the variation in slopes between studies.
Posterior predictive checking showing the observed and predicted number of events. Rootogram uses square root of counts on y-axis for better scaling. Rootogram is useful for count data when the range of counts is small or moderate.
pp_check(fit_pooled, type = "rootogram") + labs(title = 'Pooled model')
pp_check(fit_hier1, type = "rootogram") + labs(title = 'Hierarchical model 1')
pp_check(fit_hier2, type = "rootogram") + labs(title = 'Hierarchical model 2')
We see that the hierarchical models have higher probability for future counts bigger than the maximum observed count, and a longer predictive distribution tail. This is natural, as the uncertainty in the variation between studies increases the predictive uncertainty, too, especially as the number of studies is relatively small.
The population-level coefficient posterior given the pooled model:
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).
All models agree that the slope is very likely positive. The hierarchical models have more uncertainty, but also higher posterior mean.
When we look at the study-specific parameters, we see that the Miller study has a slightly higher intercept (leading to a higher theta).
(mcmc_areas(as_draws_df(fit_hier1), regex_pars = 'r_study\\[.*Intercept') +
   labs(title = 'Hierarchical model 1')) /
  (mcmc_areas(as_draws_df(fit_hier2), regex_pars = 'r_study\\[.*Intercept') +
     labs(title = 'Hierarchical model 2'))
There are no clear differences in slopes.
mcmc_areas(as_draws_df(fit_hier2), regex_pars = 'r_study\\[.*doseg') +
  labs(title = 'Hierarchical model 2')
Based on the LOO comparison we could continue with any of the models except the separate one, but if we want to take into account the unknown possible study variation, it is best to continue with hierarchical model 2. We could reduce the uncertainty by spending some effort to elicit more informative priors for the between-study variation, for example by searching open study databases for similar studies. In this example, we skip that and continue with other parts of the workflow. We check the prior sensitivity in hierarchical model 2.
If we plot individual posterior draws, we see that there is a lot of uncertainty about the overall probability (explained by the variation in Intercept in different studies), but less uncertainty about the slope.
data.frame(study = 'new',
           doseg = seq(0.1, 1, by = 0.1),
           dose = seq(100, 1000, by = 100),
           total = 1) |>
  add_linpred_draws(fit_hier2, transform = TRUE, allow_new_levels = TRUE, ndraws = 100) |>
  ggplot(aes(x = dose, y = .linpred)) +
  geom_line(aes(group = .draw), alpha = 1/2, color = brewer.pal(5, "Blues")[[3]]) +
  scale_fill_brewer() +
  labs(x = "Dose (mg)", y = 'Probability of event') +
  theme(legend.position = "none") +
  geom_hline(yintercept = 0) +
  scale_x_continuous(breaks = seq(100, 1000, by = 100))
The dataset comes from a systematic review of randomized controlled trials on pharmacologic treatments for chronic obstructive pulmonary disease (COPD) (Baker et al., 2009).
The primary outcome, occurrence of one or more episodes of COPD exacerbation, is binary (yes / no). For this outcome, five drug treatments (fluticasone, budesonide, salmeterol, formoterol, tiotropium) and two combinations (fluticasone + salmeterol, budesonide + formoterol) were compared to placebo.
Load data
load(url('https://github.com/wviechtb/metadat/raw/master/data/dat.baker2009.rda'))
# force character strings to factors for easier plotting
dat.baker2009 <- dat.baker2009 |>
  mutate(study = factor(study),
         treatment = factor(treatment),
         id = factor(id))
Total number of patients in each study varies a lot
dat.baker2009 |>
  group_by(study) |>
  summarise(N = sum(total)) |>
  ggplot(aes(x = N, y = study)) +
  geom_col(fill = 4) +
  labs(x = 'Number of patients per study', y = 'Study')
None of the treatments is included in every study, and each study includes 2–4 treatments.
crosstab <- with(dat.baker2009, table(study, treatment))

plot_treatments <- data.frame(number_of_studies = colSums(crosstab),
                              treatment = colnames(crosstab)) |>
  ggplot(aes(x = number_of_studies, y = treatment)) +
  geom_col(fill = 4) +
  labs(x = 'Number of studies with a treatment X', y = 'Treatment') +
  geom_vline(xintercept = nrow(crosstab), linetype = 'dashed') +
  scale_x_continuous(breaks = c(0, 10, 20, 30, 39))

plot_studies <- data.frame(number_of_treatments = rowSums(crosstab),
                           study = rownames(crosstab)) |>
  ggplot(aes(x = number_of_treatments, y = study)) +
  geom_col(fill = 4) +
  labs(x = 'Number of treatments in a study Y', y = 'Study') +
  geom_vline(xintercept = ncol(crosstab), linetype = 'dashed') +
  scale_x_continuous(breaks = c(0, 2, 4, 6, 8))

plot_treatments + plot_studies
The following plot shows which treatments were in which studies.
We see big variation between treatments, and two treatments even seem to be harmful, which is suspicious. Looking at the data we see that not all studies included all treatments, and thus if some of the studies had more events overall, the above estimates can be wrong.
The target is a discrete count, but as the range of counts is large, a rootogram would look messy, and a density overlay plot is a better choice. Posterior predictive checking with kernel density estimates for the data and 10 posterior predictive replicates shows a clear discrepancy.
pp_check(fit_pooled, type = 'dens_overlay') +
  labs(x = "Number of individuals with COPD exacerbation(s)")
Posterior predictive checking with PIT values and ECDF difference plot with envelope shows clear discrepancy.
Warning: Found 22 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning in .loo_pit(y = y, yrep = object, lw = lw):
Some PIT values larger than 1! Largest: 1
Rounding PIT > 1 to 1.
The second model uses a hierarchical model both for treatment effects and study effects.
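A sketch of that hierarchical fit, assuming the outcome column in dat.baker2009 is exac (number of patients with exacerbations) and the group size column is total:

fit_hier <- brm(exac | trials(total) ~ 1 + (1 | treatment) + (1 | study),
                family = binomial(), data = dat.baker2009,
                prior = prior(student_t(7, 0, 1.5), class = 'Intercept'),
                seed = SEED, refresh = 0)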
Warning: Found 22 observations with a pareto_k > 0.7 in model 'fit_pooled'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning: Found 20 observations with a pareto_k > 0.7 in model 'fit_hier'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
 model       elpd_diff  se_diff
 fit_hier           0        0
 fit_pooled     -1950      301
We get warnings about Pareto k's > 0.7 in PSIS-LOO, but as the difference between the models is huge, we can be confident that the order would be the same if we fixed the computation: the hierarchical model is much better, and there is high variation between studies. Clearly there are many highly influential observations.
Posterior predictive checking with kernel density estimates for the data and 10 posterior predictive replicates looks good (although with this many parameters, this check is likely to be optimistic).
pp_check(fit_hier, type = 'dens_overlay') +
  labs(x = "Number of individuals with COPD exacerbation(s)")
Posterior predictive checking with PIT values and ECDF difference plot with envelope looks good (although with this many parameters, this check is likely to be optimistic).
pp_check(fit_hier, type='pit_ecdf', ndraws=4000)
Posterior predictive checking with LOO-PIT values look good (although as there are Pareto-khat warnings, it is possible that this diagnostic is optimistic).
Warning: Found 20 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Treatment effect posteriors have now much less variation.
fit_hier |>
  spread_rvars(b_Intercept, r_treatment[treatment,]) |>
  mutate(theta_treatment = rfun(plogis)(b_Intercept + r_treatment)) |>
  ggplot(aes(xdist = theta_treatment, y = treatment)) +
  stat_halfeye() +
  labs(x = 'theta', y = 'Treatment',
       title = 'Hierarchical over studies, hierarchical over treatments')
Study effect posteriors show the expected high variation.
fit_hier |>
  spread_rvars(b_Intercept, r_study[study,]) |>
  mutate(theta_study = rfun(plogis)(b_Intercept + r_study)) |>
  ggplot(aes(xdist = theta_study, y = study)) +
  stat_halfeye() +
  labs(x = 'theta', y = 'Study',
       title = 'Hierarchical over studies, hierarchical over treatments')
The treatment effect odds-ratios now look more reasonable. As all treatments were compared to placebo, there is less overlap in the distributions than when looking at the thetas, since all thetas include similar uncertainty about the overall theta due to the high variation between studies. We check the prior sensitivity with a focus on the odds-ratios in the hierarchical model.
Warning: Found 20 observations with a pareto_k > 0.7 in model 'fit_hier'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning: Found 41 observations with a pareto_k > 0.7 in model 'fit_hier2'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
 model      elpd_diff  se_diff
 fit_hier2        0.0      0.0
 fit_hier        -3.4      3.1
We get warnings about Pareto k’s > 0.7 in PSIS-LOO, but as the models are similar, and the difference is small, we can be relatively confident that the more complex model is not better. In this case, the likely reason is that the data do not have enough information to learn about the interactions and adding them just increases the posterior uncertainty.