if(!require(aaltobda)){
install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
library(aaltobda)
}
Loading required package: aaltobda
This assignment is related to Lecture 6 and Chapter 12.
We recommend using JupyterHub (which has all the needed packages pre-installed).
Reading instructions:
The following installs and loads the aaltobda
package:
if(!require(aaltobda)){
install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
library(aaltobda)
}
Loading required package: aaltobda
The following installs and loads the latex2exp
package, which allows us to use LaTeX in plots:
Loading required package: latex2exp
The following installs and loads the posterior
package which imports the rhat_basic()
function:
Loading required package: posterior
This is posterior version 1.6.0
Attaching package: 'posterior'
The following object is masked from 'package:aaltobda':
mcse_quantile
The following objects are masked from 'package:stats':
mad, sd, var
The following objects are masked from 'package:base':
%in%, match
The following installs and loads the ggplot2
package and the bayesplot
package
Loading required package: ggplot2
Loading required package: bayesplot
This is bayesplot version 1.11.1
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
* Does _not_ affect other ggplot2 plots
* See ?bayesplot_theme_set for details on theme setting
Attaching package: 'bayesplot'
The following object is masked from 'package:posterior':
rhat
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Loading required package: tidyr
The following installs and loads the cmdstanr
package and tries to install cmdstan
.
if(!require(cmdstanr)){
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
library(cmdstanr)
}
Loading required package: cmdstanr
This is cmdstanr version 0.8.1
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /home/runner/.cmdstan/cmdstan-2.35.0
- CmdStan version: 2.35.0
From 2018 to 2022, we have been keeping track of assignment submissions for the BDA course given the number of submissions for the 1st assignment. We will fit a simple linear model to answer two questions of interest:
The author has given you the broken Stan code below, which they intend to encode the following linear model: \[ \begin{aligned} p(\alpha,\beta,\sigma) &= \mathrm{const.} & \text{(improper flat prior)}&\text{ and}\\ p(y|x,\alpha,\beta,\sigma) &= p_\mathrm{normal}(y|\alpha + \beta x, \sigma) & \text{(normal likelihood)} &\text{.} \end{aligned} \] In both the statistical model above and in the Stan model below, \(x \in \mathbb{R}^N\) and \(y \in \mathbb{R}^N\) are vectors of the covariates / predictors (the assignment number) and vectors of the observation (proportions of students who have handed in the respective assignment). \(\alpha \in \mathbb{R}\) is the unknown scalar intercept, \(\beta \in \mathbb{R}\) is the unknown scalar slope and \(\sigma \in \mathbb{R}_{>0}\) is the unknown scalar observation standard deviation. The statistical model further implies \[ p(y_\mathrm{pred.}|x_\mathrm{pred.},\alpha,\beta,\sigma) = p_\mathrm{normal}(y_\mathrm{pred.}|\alpha + \beta x_\mathrm{pred.}, \sigma) \] as the predictive distribution for a new observation \(y_\mathrm{pred.}\) at a given new covariate value \(x_\mathrm{pred.}\).
You can download the broken stan file from github.
data {
// number of data points
int<lower=0> N;
// covariate / predictor
vector[N] x;
// observations
vector[N] y;
// number of covariate values to make predictions at
int<lower=0> no_predictions;
// covariate values to make predictions at
vector[no_predictions] x_predictions;
}
parameters {
// intercept
real alpha;
// slope
real beta;
// the standard deviation should be constrained to be positive
real<upper=0> sigma;
}
transformed parameters {
// deterministic transformation of parameters and data
vector[N] mu = alpha + beta * x // linear model
}
model {
// observation model / likelihood
y ~ normal(mu, sigma);
}
generated quantities {
// compute the means for the covariate values at which to make predictions
vector[no_predictions] mu_pred = alpha + beta * x_predictions;
// sample from the predictive distribution, a normal(mu_pred, sigma).
array[no_predictions] real y_pred = normal_rng(mu, sigma);
}
This is Stan’s data block: “The data
block is for the declaration of variables that are read in as data. […] Each variable’s value is validated against its declaration as it is read. For example, if a variable sigma is declared as real<lower=0>
, then trying to assign it a negative value will raise an error. As a result, data type errors will be caught as early as possible. Similarly, attempts to provide data of the wrong size for a compound data structure will also raise an error.” For more information, follow the link.
This is Stan’s parameters block: “The variables declared in the parameters
program block correspond directly to the variables being sampled by Stan’s samplers (HMC and NUTS). From a user’s perspective, the parameters in the program block are the parameters being sampled by Stan.” For more information, follow the link.
This is Stan’s transformed parameters block: “The transformed parameters
program block consists of optional variable declarations followed by statements. After the statements are executed, the constraints on the transformed parameters are validated. Any variable declared as a transformed parameter is part of the output produced for draws.” For more information, follow the link.
This is Stan’s model block: “The model
program block consists of optional variable declarations followed by statements. The variables in the model
block are local variables and are not written as part of the output. […] The statements in the model
block typically define the model. This is the block in which probability (sampling notation) statements are allowed.” For more information, follow the link.
This is Stan’s generated quantities block: “The generated quantities
program block is rather different than the other blocks. Nothing in the generated quantities
block affects the sampled parameter values. The block is executed only after a sample has been generated.” For more information, follow the link.
Data assembly happens here:
# These are our observations y: the proportion of students handing in each assignment (1-8),
# sorted by year (row-wise) and assignment (column-wise).
# While the code suggest a matrix structure,
# the result will actually be a vector of length N = no_years * no_assignments
propstudents<-c(c(176, 174, 158, 135, 138, 129, 126, 123)/176,
c(242, 212, 184, 177, 174, 172, 163, 156)/242,
c(332, 310, 278, 258, 243, 242, 226, 224)/332,
c(301, 269, 231, 232, 217, 208, 193, 191)/301,
c(245, 240, 228, 217, 206, 199, 191, 182)/245,
c(264, 249, 215, 221, 215, 206, 192, 186)/264)
# These are our predictors x: for each observation, the corresponding assignment number.
assignment <- rep(1:8, 6)
# These are in some sense our test data: the proportion of students handing in the last assignment (9),
# sorted by year.
# Usually, we would not want to split our data like that and instead
# use e.g. Leave-One-Out Cross-Validation (LOO-CV, see e.g. http://mc-stan.org/loo/index.html)
# to evaluate model performance.
propstudents9 = c(121/176, 153/242, 218/332, 190/301, 175/245, 179/264)
# The total number of assignments
no_assignments = 9
# The assignment numbers for which we want to generate predictions
x_predictions = 1:no_assignments
# (Cmd)Stan(R) expects the data to be passed in the below format:
model_data = list(N=length(assignment),
x=assignment,
y=propstudents,
no_predictions=no_assignments,
x_predictions=x_predictions)
Sampling from the posterior distribution happens here:
# This reads the file at the specified path and tries to compile it.
# If it fails, an error is thrown.
retention_model = cmdstan_model("assignment6_linear_model.stan")
# This "out <- capture.output(...)" construction suppresses output from cmdstanr
# See also https://github.com/stan-dev/cmdstanr/issues/646
out <- capture.output(
# Sampling from the posterior distribution happens here:
fit <- retention_model$sample(data=model_data, refresh=0, show_messages=FALSE)
)
Draws postprocessing happens here:
# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")
# This does some data/draws wrangling to compute the 5, 50 and 95 percentiles of
# the mean at the specified covariate values (x_predictions).
# It can be instructive to play around with each of the data processing steps
# to find out what each step does, e.g. by removing parts from the back like "|> gather(pct,y,-x)"
# and printing the resulting data.frame.
mu_quantiles_df = draws_df |>
subset_draws(variable = c("mu_pred")) |>
summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
mutate(x = 1:9) |>
pivot_longer(c(q5, q50, q95), names_to = c("pct"))
# Same as above, but for the predictions.
y_quantiles_df = draws_df |>
subset_draws(variable = c("y_pred")) |>
summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
mutate(x = 1:9) |>
pivot_longer(c(q5, q50, q95), names_to = c("pct"))
Plotting happens here:
ggplot() +
# scatter plot of the training data:
geom_point(
aes(x, y, color=assignment),
data=data.frame(x=assignment, y=propstudents, assignment="1-8")
) +
# scatter plot of the test data:
geom_point(
aes(x, y, color=assignment),
data=data.frame(x=no_assignments, y=propstudents9, assignment="9")
) +
# you have to tell us what this plots:
geom_line(aes(x,y=value,linetype=pct), data=mu_quantiles_df, color='grey', linewidth=1.5) +
# you have to tell us what this plots:
geom_line(aes(x,y=value,linetype=pct), data=y_quantiles_df, color='red') +
# adding xticks for each assignment:
scale_x_continuous(breaks=1:no_assignments) +
# adding labels to the plot:
labs(y="assignment submission %", x="assignment number") +
# specifying that line types repeat:
scale_linetype_manual(values=c(2,1,2)) +
# Specify colours of the observations:
scale_colour_manual(values = c("1-8"="black", "9"="blue")) +
# remove the legend for the linetypes:
guides(linetype="none")
If your model is correctly implemented, sampling from the posterior distribution should have been successful. You can check whether Stan thinks that sampling succeeded by inspecting the output of the below command, which you should be able to interpret with a little help from the CmdStan User’s Guide.
An unbounded likelihood without a proper prior can lead to an improper posterior. We recommend to always use proper priors (integral over a proper distribution is finite) to guarantee proper posteriors.
A commonly used model that can have unbounded likelihood is logistic regression with complete separation in data.
Univariate continous predictor \(x\), binary target \(y\), and the two classes are completely separable, which leads to unbounded likelihood.
# Get the summary diagnostics using [fit object name]]$diagnostic_summary()
# Get summary statistics from your chains using summarize_draws() (hint: look at how we used the function last week)
# Plot histograms and bivariate scatter plots using mcmc_pairs(). See for guidance on how to use the function here: https://mc-stan.org/bayesplot/reference/MCMC-scatterplots.html
# To plot the tracplot of individual chains, use the mcmc_trace() function. See for guidance on how to use the function here: https://mc-stan.org/bayesplot/reference/MCMC-traces.html
Replicate the computations for the bioassay example of section 3.7 (BDA3) using Stan.
Write down the model for the bioassay data in Stan syntax. Use the slightly modified Gaussian prior from Assignment 4 and 5, that is \[ \alpha \sim normal(0,2), \quad \beta \sim normal(10,10) \]
Write your Stan code here
# Load data
data =list(n = bioassay$n, x = bioassay$x, y = bioassay$y, N = nrow(bioassay))
# Compile Model
mod <- cmdstan_model(# your stan model location)
# Estimate model
out <- capture.output(
# Sampling from the posterior distribution happens here:
fit <- mod$sample(data=data, refresh=0, show_messages=FALSE, seed = 4711, chains = 4, iter_warmup = 1000,
iter_sampling = 3000)
)
# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")
# This is what you'll need for convergence diagnostics
summarise_draws(draws_df, Rhat=rhat_basic, ESS= ess_mean, ~ess_quantile(.x, probs = 0.05))
# Scatter plot of the draws
mcmc_scatter(draws_df, pars=c("alpha", "beta"))
# Plot the autocorrelation function and compare to last week
mcmc_acf(draws_df, pars = c("alpha","beta"))