Notebook for Assignment 6

Author

Aki Vehtari et al.

1 General information

This assignment is related to Lecture 6 and Chapter 12.

We recommend using JupyterHub (which has all the needed packages pre-installed).

General Instructions for Answering the Assignment Questions
  • Questions below are exact copies of the text found in the MyCourses quiz and should serve as a notebook where you can store notes and code.
  • We recommend opening these notebooks in the Aalto JupyterHub, see how to use R and RStudio remotely.
  • For inspiration for code, have a look at the BDA R Demos and the specific Assignment code notebooks
  • Recommended additional self study exercises for each chapter in BDA3 are listed in the course web page. These will help to gain deeper understanding of the topic.
  • Common questions and answers regarding installation and technical problems can be found in Frequently Asked Questions (FAQ).
  • Deadlines for all assignments can be found on the course web page and in MyCourses.
  • You are allowed to discuss assignments with your friends, but it is not allowed to copy solutions directly from other students or from internet.
  • Do not share your answers publicly.
  • Do not copy answers from the internet or from previous years. We compare the answers to the answers from previous years and to the answers from other students this year.
  • Use of AI is allowed on the course, but the most of the work needs to by the student, and you need to report whether you used AI and in which way you used them (See points 5 and 6 in Aalto guidelines for use of AI in teaching).
  • All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.
  • If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository!
Setup

The following installs and loads the aaltobda package:

if(!require(aaltobda)){
    install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
    library(aaltobda)
}
Loading required package: aaltobda

The following installs and loads the latex2exp package, which allows us to use LaTeX in plots:

if(!require(latex2exp)){
    install.packages("latex2exp")
    library(latex2exp)
}
Loading required package: latex2exp

The following installs and loads the posterior package which imports the rhat_basic() function:

if(!require(posterior)){
    install.packages("posterior")
    library(posterior)
}
Loading required package: posterior
This is posterior version 1.6.0

Attaching package: 'posterior'
The following object is masked from 'package:aaltobda':

    mcse_quantile
The following objects are masked from 'package:stats':

    mad, sd, var
The following objects are masked from 'package:base':

    %in%, match

The following installs and loads the ggplot2 package and the bayesplot package

if(!require(ggplot2)){
    install.packages("ggplot2")
    library(ggplot2)
}
Loading required package: ggplot2
if(!require(bayesplot)){
    install.packages("bayesplot")
    library(bayesplot)
}
Loading required package: bayesplot
This is bayesplot version 1.11.1
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
   * Does _not_ affect other ggplot2 plots
   * See ?bayesplot_theme_set for details on theme setting

Attaching package: 'bayesplot'
The following object is masked from 'package:posterior':

    rhat
if(!require(dplyr)){
    install.packages("dplyr")
    library(dplyr)
}
Loading required package: dplyr

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
if(!require(tidyr)){
    install.packages("tidyr")
    library(tidyr)
}
Loading required package: tidyr
# Some additional set-up to make plots legible
ggplot2::theme_set(theme_minimal(base_size = 14))
bayesplot::bayesplot_theme_set(theme_minimal(base_size = 14))

The following installs and loads the cmdstanr package and tries to install cmdstan.

if(!require(cmdstanr)){
    install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
    library(cmdstanr)
}
Loading required package: cmdstanr
This is cmdstanr version 0.8.1
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /home/runner/.cmdstan/cmdstan-2.35.0
- CmdStan version: 2.35.0
cmdstan_installed <- function(){
  res <- try(out <- cmdstanr::cmdstan_path(), silent = TRUE)
  !inherits(res, "try-error")
}
if(!cmdstan_installed()){
    install_cmdstan()
}

2 Stan warm-up: linear model of BDA retention with Stan

From 2018 to 2022, we have been keeping track of assignment submissions for the BDA course given the number of submissions for the 1st assignment. We will fit a simple linear model to answer two questions of interest:

  • What is the trend of student retention as measured by assignment submissions?
  • Given the submission rates for assignments 1–8, how many students will complete the final 9th assignment (and potentially pass the course)?

The author has given you the broken Stan code below, which they intend to encode the following linear model: \[ \begin{aligned} p(\alpha,\beta,\sigma) &= \mathrm{const.} & \text{(improper flat prior)}&\text{ and}\\ p(y|x,\alpha,\beta,\sigma) &= p_\mathrm{normal}(y|\alpha + \beta x, \sigma) & \text{(normal likelihood)} &\text{.} \end{aligned} \] In both the statistical model above and in the Stan model below, \(x \in \mathbb{R}^N\) and \(y \in \mathbb{R}^N\) are vectors of the covariates / predictors (the assignment number) and vectors of the observation (proportions of students who have handed in the respective assignment). \(\alpha \in \mathbb{R}\) is the unknown scalar intercept, \(\beta \in \mathbb{R}\) is the unknown scalar slope and \(\sigma \in \mathbb{R}_{>0}\) is the unknown scalar observation standard deviation. The statistical model further implies \[ p(y_\mathrm{pred.}|x_\mathrm{pred.},\alpha,\beta,\sigma) = p_\mathrm{normal}(y_\mathrm{pred.}|\alpha + \beta x_\mathrm{pred.}, \sigma) \] as the predictive distribution for a new observation \(y_\mathrm{pred.}\) at a given new covariate value \(x_\mathrm{pred.}\).

You can download the broken stan file from github.

data {
    // number of data points
    int<lower=0> N; 
    // covariate / predictor
    vector[N] x; 
    // observations
    vector[N] y; 
    // number of covariate values to make predictions at
    int<lower=0> no_predictions;
    // covariate values to make predictions at
    vector[no_predictions] x_predictions; 
} 
parameters {
    // intercept
    real alpha; 
    // slope
    real beta; 
    // the standard deviation should be constrained to be positive
    real<upper=0> sigma; 
} 
transformed parameters {
    // deterministic transformation of parameters and data
    vector[N] mu = alpha + beta * x // linear model
} 
model {
    // observation model / likelihood
    y ~ normal(mu, sigma); 
} 
generated quantities {
    // compute the means for the covariate values at which to make predictions
    vector[no_predictions] mu_pred = alpha + beta * x_predictions;
    // sample from the predictive distribution, a normal(mu_pred, sigma).
    array[no_predictions] real y_pred = normal_rng(mu, sigma);
} 
  1. This is Stan’s data block: “The data block is for the declaration of variables that are read in as data. […] Each variable’s value is validated against its declaration as it is read. For example, if a variable sigma is declared as real<lower=0>, then trying to assign it a negative value will raise an error. As a result, data type errors will be caught as early as possible. Similarly, attempts to provide data of the wrong size for a compound data structure will also raise an error.” For more information, follow the link.

  2. This is Stan’s parameters block: “The variables declared in the parameters program block correspond directly to the variables being sampled by Stan’s samplers (HMC and NUTS). From a user’s perspective, the parameters in the program block are the parameters being sampled by Stan.” For more information, follow the link.

  3. This is Stan’s transformed parameters block: “The transformed parameters program block consists of optional variable declarations followed by statements. After the statements are executed, the constraints on the transformed parameters are validated. Any variable declared as a transformed parameter is part of the output produced for draws.” For more information, follow the link.

  4. This is Stan’s model block: “The model program block consists of optional variable declarations followed by statements. The variables in the model block are local variables and are not written as part of the output. […] The statements in the model block typically define the model. This is the block in which probability (sampling notation) statements are allowed.” For more information, follow the link.

  5. This is Stan’s generated quantities block: “The generated quantities program block is rather different than the other blocks. Nothing in the generated quantities block affects the sampled parameter values. The block is executed only after a sample has been generated.” For more information, follow the link.

2.1 Data preparation and sampling from the posterior

Data assembly happens here:

# These are our observations y: the proportion of students handing in each assignment (1-8),
# sorted by year (row-wise) and assignment (column-wise).
# While the code suggest a matrix structure, 
# the result will actually be a vector of length N = no_years * no_assignments
propstudents<-c(c(176, 174, 158, 135, 138, 129, 126, 123)/176,
                c(242, 212, 184, 177, 174, 172, 163, 156)/242,
                c(332, 310, 278, 258, 243, 242, 226, 224)/332,
                c(301, 269, 231, 232, 217, 208, 193, 191)/301,
                c(245, 240, 228, 217, 206, 199, 191, 182)/245,
                c(264, 249, 215, 221, 215, 206, 192, 186)/264)
# These are our predictors x: for each observation, the corresponding assignment number.
assignment <- rep(1:8, 6)
# These are in some sense our test data: the proportion of students handing in the last assignment (9),
# sorted by year. 
# Usually, we would not want to split our data like that and instead
# use e.g. Leave-One-Out Cross-Validation (LOO-CV, see e.g. http://mc-stan.org/loo/index.html)
# to evaluate model performance.
propstudents9 = c(121/176, 153/242, 218/332, 190/301, 175/245, 179/264)
# The total number of assignments
no_assignments = 9
# The assignment numbers for which we want to generate predictions
x_predictions = 1:no_assignments
# (Cmd)Stan(R) expects the data to be passed in the below format:
model_data = list(N=length(assignment),
                 x=assignment,
                 y=propstudents,
                 no_predictions=no_assignments,
                 x_predictions=x_predictions)

Sampling from the posterior distribution happens here:

# This reads the file at the specified path and tries to compile it. 
# If it fails, an error is thrown.
retention_model = cmdstan_model("assignment6_linear_model.stan")
# This "out <- capture.output(...)" construction suppresses output from cmdstanr
# See also https://github.com/stan-dev/cmdstanr/issues/646
out <- capture.output(
    # Sampling from the posterior distribution happens here:
    fit <- retention_model$sample(data=model_data, refresh=0, show_messages=FALSE)
)

Draws postprocessing happens here:

# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")

# This does some data/draws wrangling to compute the 5, 50 and 95 percentiles of 
# the mean at the specified covariate values (x_predictions). 
# It can be instructive to play around with each of the data processing steps
# to find out what each step does, e.g. by removing parts from the back like "|>  gather(pct,y,-x)"
# and printing the resulting data.frame.
mu_quantiles_df = draws_df |> 
      subset_draws(variable = c("mu_pred")) |> 
      summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |> 
      mutate(x = 1:9) |> 
      pivot_longer(c(q5, q50, q95), names_to = c("pct"))
# Same as above, but for the predictions.
y_quantiles_df = draws_df |> 
      subset_draws(variable = c("y_pred")) |> 
      summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |> 
      mutate(x = 1:9) |> 
      pivot_longer(c(q5, q50, q95), names_to = c("pct"))

Plotting happens here:

ggplot() +
  # scatter plot of the training data:  
  geom_point(
    aes(x, y, color=assignment), 
    data=data.frame(x=assignment, y=propstudents, assignment="1-8")
) +
  # scatter plot of the test data:
  geom_point(
    aes(x, y, color=assignment), 
    data=data.frame(x=no_assignments, y=propstudents9, assignment="9")
) +
  # you have to tell us what this plots:
  geom_line(aes(x,y=value,linetype=pct), data=mu_quantiles_df, color='grey', linewidth=1.5) +
  # you have to tell us what this plots:
  geom_line(aes(x,y=value,linetype=pct), data=y_quantiles_df, color='red') +
  # adding xticks for each assignment:
  scale_x_continuous(breaks=1:no_assignments) +
  # adding labels to the plot:
  labs(y="assignment submission %", x="assignment number") +
  # specifying that line types repeat:
  scale_linetype_manual(values=c(2,1,2)) +
  # Specify colours of the observations:
  scale_colour_manual(values = c("1-8"="black", "9"="blue")) +
  # remove the legend for the linetypes:
  guides(linetype="none")

2.2 Quick check for sampling convergence

If your model is correctly implemented, sampling from the posterior distribution should have been successful. You can check whether Stan thinks that sampling succeeded by inspecting the output of the below command, which you should be able to interpret with a little help from the CmdStan User’s Guide.

fit$cmdstan_diagnose()

3 Diagnosing Problems in HMC-NUTS Sampling

3.1 Improper Posterior

An unbounded likelihood without a proper prior can lead to an improper posterior. We recommend to always use proper priors (integral over a proper distribution is finite) to guarantee proper posteriors.

A commonly used model that can have unbounded likelihood is logistic regression with complete separation in data.

3.1.1 Data

Univariate continous predictor \(x\), binary target \(y\), and the two classes are completely separable, which leads to unbounded likelihood.

set.seed(48927+4)
M=1;
N=10;
x=matrix(sort(rnorm(N)),ncol=M)
y=rep(c(0,1), each=N/2)
data_logit <-list(M = M, N = N, x = x, y = y)
data.frame(data_logit) |>
  ggplot(aes(x, y)) +
  geom_point(size = 3, shape=1, alpha=0.6) +
  scale_y_continuous(breaks=c(0,1))

3.1.2 Compile and Sample from the following model

// logistic regression
data {
  int<lower=0> N;
  int<lower=0> M;
  array[N] int<lower=0,upper=1> y;
  matrix[N,M] x;
}
parameters {
  real alpha;
  vector[M] beta;
}
model {
  y ~ bernoulli_logit_glm(x, alpha, beta);
}

3.1.3 Convergence diagnostics

# Get the summary diagnostics using [fit object name]]$diagnostic_summary()

# Get summary statistics from your chains using summarize_draws() (hint: look at how we used the function last week)

# Plot histograms and bivariate scatter plots using mcmc_pairs(). See for guidance on how to use the function here: https://mc-stan.org/bayesplot/reference/MCMC-scatterplots.html

# To plot the tracplot of individual chains, use the mcmc_trace() function. See for guidance on how to use the function here: https://mc-stan.org/bayesplot/reference/MCMC-traces.html 

4 Generalized linear model: Bioassay model with Stan

Replicate the computations for the bioassay example of section 3.7 (BDA3) using Stan.

Write down the model for the bioassay data in Stan syntax. Use the slightly modified Gaussian prior from Assignment 4 and 5, that is \[ \alpha \sim normal(0,2), \quad \beta \sim normal(10,10) \]

Write your Stan code here

4.1 Run model and analyse output

# Load data
data =list(n = bioassay$n, x = bioassay$x, y = bioassay$y, N = nrow(bioassay))

# Compile Model
mod <- cmdstan_model(# your stan model location)

# Estimate model
out <- capture.output(
  # Sampling from the posterior distribution happens here:
  fit <- mod$sample(data=data, refresh=0, show_messages=FALSE, seed = 4711, chains = 4, iter_warmup = 1000, 
                    iter_sampling = 3000)
)

# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")

# This is what you'll need for convergence diagnostics
summarise_draws(draws_df, Rhat=rhat_basic, ESS= ess_mean, ~ess_quantile(.x, probs = 0.05))

# Scatter plot of the draws
mcmc_scatter(draws_df, pars=c("alpha", "beta"))

# Plot the autocorrelation function and compare to last week
mcmc_acf(draws_df, pars = c("alpha","beta"))