Assignment 6

Author

Aki Vehtari et al.

1 General information

The exercises here refer to the Lecture 6/BDA Chapter 12 content. The main topics for this assignment are Hamiltonian Monte Carlo, NUTS, and the Stan probabilistic programming language.

The exercises constitute 96% of the Quiz 6 grade.

We prepared a Quarto notebook specific to this assignment to help you get started. You still need to fill in your answers in MyCourses! You can inspect this and future templates

  • as a rendered html file (to access the qmd file click the “</> Code” button at the top right hand corner of the template)
General Instructions for Answering the Assignment Questions
  • The questions below are exact copies of the text found in the MyCourses quiz; this notebook can serve as a place where you can store notes and code.
  • We recommend opening these notebooks in the Aalto JupyterHub, see how to use R and RStudio remotely.
  • For inspiration for code, have a look at the BDA R Demos and the specific Assignment code notebooks
  • Recommended additional self study exercises for each chapter in BDA3 are listed in the course web page. These will help to gain deeper understanding of the topic.
  • Common questions and answers regarding installation and technical problems can be found in Frequently Asked Questions (FAQ).
  • Deadlines for all assignments can be found on the course web page and in MyCourses.
  • You are allowed to discuss assignments with your friends, but it is not allowed to copy solutions directly from other students or from the internet.
  • Do not share your answers publicly.
  • Do not copy answers from the internet or from previous years. We compare the answers to the answers from previous years and to the answers from other students this year.
  • Use of AI is allowed on the course, but most of the work needs to be done by the student, and you need to report whether you used AI and in which way you used it (see points 5 and 6 in the Aalto guidelines for use of AI in teaching).
  • All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.
  • If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository!

1.1 Assignment questions

For convenience the assignment questions are copied below. Answer the questions in MyCourses.

Lecture 6/Chapter 12 of BDA Quiz (96% of grade)

This week's assignment focuses on building the foundations for using modern MCMC methods and the probabilistic programming language Stan. We will write Stan code directly, but later assignments will use the brms package for estimating Stan models.


1. A brief overview of Hamiltonian Monte Carlo (HMC)

1.1: Based on the lecture slides for this week, what is the intuition behind HMC?

In order to generate good proposals, HMC uses the Hamiltonian function and its partial derivatives to generate a path along the surface of the log posterior. As in the lecture, define \( \phi \) as the momentum variable, which has the same dimension as the parameter vector \( \theta \). The Hamiltonian function has two terms, \( U(\theta) \) and \( K(\phi) \).

1.2: What is the intuition behind the term \( U( \theta ) \)?
  

1.3: What is the intuition behind the term  \( K( \phi ) \)?

1.4: The partial derivatives of the Hamiltonian, also known as Hamilton's equations, determine how \( \theta \) and \( \phi \) change during MCMC. What problem occurs when implementing Hamilton's equations computationally?

1.5: Therefore, all HMC implementations on the computer need to discretise the simulated trajectory dictated by Hamilton's equations. Which computational method does Stan (and most other HMC-based algorithms) use for discretisation?

1.6: It is not necessary in this course to know the computational details behind the leapfrog integrator, only that it is applied for L steps along the Hamiltonian trajectory with step size \(\epsilon\). The figure below by Neal (2012) shows dynamic simulation in the joint position-momentum space using the leapfrog method. Based on these figures, which of the following statements is false?

Figure 1: Hamiltonian dynamics in the joint position-momentum space, simulated with the leapfrog method (Neal, 2012).


HMC has two general steps at each iteration of the MCMC chain. Denote the current state of the MCMC chain as \(t\); then
1) we draw a new momentum variable \( \phi \) (often assumed to be marginally Gaussian distributed) and
2) perform a Metropolis update with \( r=\exp\left(-H(\theta^*,\phi^*)+H(\theta^{(t-1)},\phi^{(t-1)})\right) \), where Hamiltonian dynamics are used to produce the proposal \( (\theta^*, \phi^*) \).
The two parameters of the algorithm, the number of steps \(L\) and the step size \(\epsilon\), need to be tuned.
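
For reference, with a Gaussian momentum the Hamiltonian in the acceptance ratio above takes the standard form (a sketch following the convention in Neal (2012); the mass matrix \(M\) is commonly set to the identity):

\( H(\theta, \phi) = U(\theta) + K(\phi), \qquad U(\theta) = -\log p(\theta \mid y), \qquad K(\phi) = \tfrac{1}{2}\,\phi^\top M^{-1} \phi, \)

and one leapfrog step of size \(\epsilon\) along the trajectory updates the state as

\( \phi \leftarrow \phi + \tfrac{\epsilon}{2}\,\nabla_\theta \log p(\theta \mid y), \qquad \theta \leftarrow \theta + \epsilon\, M^{-1} \phi, \qquad \phi \leftarrow \phi + \tfrac{\epsilon}{2}\,\nabla_\theta \log p(\theta \mid y). \)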

1.7: What can eventually happen when the trajectory length, defined as step size times the number of steps, \(\epsilon L\), is allowed to be long enough? Check out this interactive demo (algorithm=HamiltonianMC, target=standard) and set Leapfrog steps equal to 75 for visual intuition (if the demo freezes, close the demo, restart, and instead of the sliders, make the control changes by editing the values in the numeric fields). Do not adjust anything else in the demo.


To avoid this behaviour that static HMC with a fixed integration time (number of steps times the step size) may have, the No-U-Turn Sampler (NUTS) by Hoffman and Gelman (2014) performs automatic tuning: neither the step size nor the number of steps needs to be specified by the user. NUTS uses a tree-building algorithm (see the slides) to adaptively determine the number of steps, \(L\), while the step size is adapted during the warm-up phase according to a target average acceptance ratio. To gain some more intuition on the behaviour of the adaptivity of NUTS, open the interactive demo again with this link (algorithm=EfficientNUTS, target=banana). Set Autoplay delay to around 500, and set Leapfrog \(\Delta t\) to 0.03. This algorithm corresponds to Algorithm 3 in Hoffman and Gelman (2014), where for a fixed \( \epsilon \), the algorithm adaptively determines the number of steps, \(L\).

1.8: What do you observe?


1.9: On the other hand, set Leapfrog \(\Delta t\) to 0.6, all else equal. What do you observe?


The user does not need to select the step size directly. Stan also adapts the step size during the warm-up phase using a stochastic optimisation method called dual averaging. The user specifies a target acceptance ratio (in Stan, actually a target for the expected discretisation error) with the argument adapt_delta, and a number of iterations during which adaptation of \( \epsilon L \) occurs (warm-up draws). Open the interactive demo again with this link (algorithm=DualAveragingNUTS, target=banana), and adjust the Autoplay delay to around 70, but otherwise keep the default options; in particular, keep the target acceptance ratio (here \(\delta\)) at 0.65. In the top left-hand corner, m/M_adapt tells you how many draws have been taken compared to the number of warm-up draws. Wait until m/M_adapt is at least 50/200.

1.10: What do you observe with the first couple of iterations?
 

1.11: What do you observe with sufficient warm-up?
 

1.12: Now set the target acceptance ratio to 0.95. What do you observe?


1.13: Now set the target acceptance ratio to 0.05. What do you observe?


The current Stan HMC-NUTS implementation has some further enhancements, but we will not go into the details of those. The main algorithm parameters are the adapt_delta and max_treedepth options.

adapt_delta specifies the target expected discretisation error (on the same scale as the average proposal acceptance ratio) during the warm-up phase, and max_treedepth determines the maximum number of tree doublings in the dynamic simulation and thus also the maximum number of steps taken. The default in most packages using Stan is adapt_delta = 0.8.

1.14: What should happen when you increase adapt_delta, all else equal?


max_treedepth controls the maximum number of doublings in the tree-building algorithm and thus the maximum number of steps in the Hamiltonian simulation. This allows the sampler to explore further away in the distribution, which can be beneficial when dealing with complex posteriors. The default in Stan is max_treedepth = 10.
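
As a quick arithmetic sketch: since each doubling doubles the number of leapfrog steps, the maximum number of steps per iteration is

\( L_{\max} = 2^{\texttt{max\_treedepth}}, \)

so the default allows up to \( 2^{10} = 1024 \) leapfrog steps, each requiring a gradient evaluation of the log density.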

1.15: What is the main cost to increasing the max_treedepth?


1.16: Despite the adapted step size, you may encounter challenging posteriors, e.g. ones with highly varying curvature in the log density. What can happen if the step size is too big compared to the curvature of the log density?



2. Stan warmup

From 2018 to 2023, we have been keeping track of assignment submissions for the BDA course, relative to the number of submissions for the 1st assignment. We will fit a simple linear model to answer two questions of interest:

  • What is the trend of student retention as measured by assignment submissions?
  • Given the submission rates for assignments 1–8, how many students will complete the final 9th assignment (and potentially pass the course)?

Below is broken Stan code for a linear model. In the following, we write the equations following the Stan distributional definitions. See Stan documentation for the definitions related to the normal distribution.

\(p(y | x, \alpha, \beta, \sigma) = \text{normal}(y | \alpha + \beta x, \sigma) \) (normal model)
\(p(\alpha, \beta, \sigma) \propto \text{const.} \) (improper flat prior)

In both the statistical model above and in the Stan model below, \(x \in \mathbb{R}^N\) and \(y \in \mathbb{R}^N \) are the vector of covariates/predictors (the assignment number) and the vector of observations (proportions of students who handed in the respective assignment). \(\alpha \in \mathbb{R}\) is the unknown scalar intercept, \(\beta \in \mathbb{R}\) is the unknown scalar slope, and \(\sigma \in \mathbb{R}_{>0}\) is the unknown scalar observation standard deviation. The statistical model further implies

\( p(y_{pred.} | x_{pred.}, \alpha, \beta, \sigma) = \text{normal}(y_{pred.} | \alpha + \beta x_{pred.}, \sigma) \)

as the predictive distribution for a new observation \(y_{pred.}\) at a given new covariate value \( x_{pred.} \). The broken Stan model code:
 
data {
    // number of data points
    int<lower=0> N; 
    // covariate / predictor
    vector[N] x; 
    // observations
    vector[N] y; 
    // number of covariate values to make predictions at
    int<lower=0> N_predictions;
    // covariate values to make predictions at
    vector[N_predictions] x_predictions; 
}
parameters {
    // intercept
    real alpha; 
    // slope
    real beta; 
    // the standard deviation should be constrained to be positive
    real<upper=0> sigma; 
}
transformed parameters {
    // deterministic transformation of parameters and data
    vector[N] mu = alpha + beta * x // linear model
}
model {
    // observation model
    y ~ normal(mu, sigma); 
}
generated quantities {
    // compute the means for the covariate values at which to make predictions
    vector[N_predictions] mu_pred = alpha + beta * x_predictions;
    // sample from the predictive distribution normal(mu_pred, sigma).
    array[N_predictions] real y_pred = normal_rng(to_array_1d(mu), sigma);
}

Firstly, let's review some Stan syntax.

2.1: What is the function of the data block?

2.2: Why should you be careful about declarations of constraints in the data block?

2.3: What is the function of the parameters block?

2.4: Can you use statements (e.g. real<lower=0> theta = lambda^2) in the parameters code block, where lambda is some other variable?

2.5: What is the function of the transformed parameters block?

2.6: What is the function of the model block?

2.7: As you've learned in the lecture or from the Stan documentation, log densities can be added to the target density either via a distribution statement using ~ or via a log probability increment statement using target +=. Furthermore, we don't need to include constant terms. Assume the model block has the line theta ~ normal(0,1);. How could you equivalently increment the log density to get the same increment (ignoring the possible difference in the constants)?
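
As a generic illustration of the two styles (a minimal sketch using a hypothetical exponential model, deliberately not the quiz case above), the distribution statement and the commented-out increment statement below add the same log density to target, up to an additive constant:

model {
    // distribution statement: may drop constant terms of the log density
    y ~ exponential(lambda);
    // equivalent log probability increment: keeps all terms of the log density
    // target += exponential_lpdf(y | lambda);
}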

2.8: Why can the normalising constant(s) be dropped when using MCMC (this holds for e.g. variational inference and optimisation too)?


The Stan language allows writing models as they are usually written in books and articles. As you've learned during the course, a collection of distribution statements defines a joint distribution as a product of component distributions.
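
For example (a minimal sketch with a hypothetical Poisson-gamma model, not the model below), the two distribution statements in the following model block define the joint density \( p(y, \lambda) = \text{poisson}(y \mid \lambda) \, \text{gamma}(\lambda \mid 2, 1) \):

model {
    // prior factor of the joint density
    lambda ~ gamma(2, 1);
    // likelihood factor of the joint density
    y ~ poisson(lambda);
}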

Suppose your model is the following:


\( p(y|\mu,\sigma) = \text{normal}(y|\mu,\sigma) \)
\( p(\mu) = \text{normal}(\mu|0,1) \)
\( p(\sigma) = \text{normal}^+(\sigma|0,10) \)


2.9: What are correct ways to write your model in the model block? Below, assume that there is a line break after each semicolon, and that the variables have been appropriately declared in the parameters block.

When writing complicated models, it may happen that some distribution statement is accidentally repeated. Check out this page in the Stan manual to see what happens in this case.
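
As a sketch of why this matters (with a hypothetical parameter theta, not one from the models above): each distribution statement multiplies another factor into the joint density, so an accidental repetition counts the same factor twice.

model {
    theta ~ normal(0, 1);
    // if the line below were uncommented, the normal(0, 1) density would be
    // multiplied in a second time: the implied prior becomes proportional to
    // normal(theta | 0, 1)^2, i.e. a normal(0, 1/sqrt(2)) prior, not normal(0, 1)
    // theta ~ normal(0, 1);
}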

2.10: What is the function of the generated quantities block?


Now, let's return to the student retention model code, shown above.

2.11: What are the three mistakes in the Stan model code above?
 

You may find some of the mistakes in the code using the Stan syntax checker. If you copy the Stan code to a file ending in .stan and open it in RStudio (you can also choose from the RStudio menu File \(\rightarrow\) New File \(\rightarrow\) Stan File to create a new Stan file), the editor will show you some syntax errors. More syntax errors might be detected by clicking 'Check' in the bar just above the Stan file in the RStudio editor. Note that some of the errors in the presented Stan code may not be syntax errors.

Use the code in the template to first compile your model using CmdStanR (a handy interface to CmdStan, which transpiles Stan model code to C++ that is then compiled into an executable program that does the actual sampling), run the inference with the supplied data, and finally create the plot to answer the following questions.


Define the model term \( \mu = \alpha + \beta x \) as the linear predictor.

2.12: What is the solid red line plotting?

2.13: What are the dashed red lines referring to?

2.14: How and why are these different from the corresponding grey lines?

2.15: What is the general trend of student retention as measured by the assignment submissions?

2.16: Does it do a good job predicting the proportion of students who submit the final 9th assignment?

2.17: What modeling choice could you make to improve the prediction for the given data set?


3. Diagnosing Problems in HMC-NUTS sampling 

Another benefit of the HMC-NUTS algorithm over simpler, non-gradient-based MCMC algorithms is that we have access to many diagnostic tools related to the posterior geometry that we would otherwise not have. This feedback is helpful for modeling and also for debugging code, and is therefore an essential part of the Bayesian workflow. A model type which is likely to cause problems is logistic regression with complete separation in the data. This creates an unbounded likelihood, in the sense that the likelihood keeps increasing as the coefficients grow along the separating direction and never attains a finite maximum.
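
As a one-line sketch of this claim, let \( (\alpha_0, \beta_0) \) denote a hypothetical intercept and coefficient direction that completely separate the two classes; then

\( \lim_{c \to \infty} \prod_{i=1}^{N} \text{bernoulli\_logit}\left(y_i \mid c\,(\alpha_0 + x_i^\top \beta_0)\right) = 1, \)

so the likelihood grows monotonically along this direction without attaining its maximum at any finite parameter value.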

3.1: Why is it problematic to use flat, improper priors for this likelihood?

Using the data generated from the template, compile and run the following Stan model:

// logistic regression
data {
  int<lower=0> N;
  int<lower=0> M;
  array[N] int<lower=0,upper=1> y;
  matrix[N,M] x;
}
parameters {
  real alpha;
  vector[M] beta;
}
model {
  y ~ bernoulli_logit_glm(x, alpha, beta);
}
3.2: You can use the function [fit object name]$diagnostic_summary() to check the number of divergent transitions and max_treedepth exceedances. What do you see?


3.3: You can examine the problematic behaviour of MCMC by looking at parameter-specific sampling diagnostics using the summarize_draws function discussed last week. To gain visual intuition, use bayesplot's mcmc_pairs function to plot histograms and bivariate scatter plots of the posteriors. What do you observe?


Before thinking about changing the model, a first step should be to check for any easy mistakes made in the Stan code. The Stan compiler has a pedantic mode to help spot such issues. By default the pedantic mode is not enabled, but we can use option pedantic = TRUE at compilation time, or after compilation with the check_syntax method.

3.4: Use the function [Stan model]$check_syntax(pedantic = TRUE). What warning message(s) do you get?


3.5: Set normal(0,10) priors for both parameters. What do you find from the MCMC output?


As you've learned during the course, using proper priors, even when they are wide, is always preferable to stabilise inference and to make the model generative. You'll learn more about priors next week.

3.6: Sometimes, when adjusting Stan model code, you may have removed a variable from the model block but left its declaration in the parameters block. Add such a variable to your model. What do you see from the convergence diagnostics (it may also help to visualise the MCMC chains using mcmc_pairs and mcmc_trace)?
 

3.7: Use check_syntax(pedantic = TRUE). Would you have been able to detect this problem from the output?


A further important problem is that of high correlation between parameters in the posterior, which may lead to slow exploration and divergent transitions. Sometimes we can address this by slightly re-writing the model. Check out the lecture's discussion of the Kilpisjärvi case study. More on that in the weeks to follow.

4. Generalised Linear Model: Bioassay with Stan

Now, we re-investigate the Bioassay model from last week with Stan. If you need a reminder about the model and likelihood definition, check out section 3.7 in BDA3. To make the implementation simpler, we consider the following priors:

alpha ~ normal(0,2)

beta ~ normal(10,10)

4.1: What is the difference between this prior and last week's prior?

4.2: Which pre-built function can you use in Stan to compute the likelihood? 

Use the code hints in the template to complete your Stan model. For questions about declarations and functionality in Stan, your first point of reference should always be the Stan Manual (it has recently been revised; be sure to open at least version 2.35). Use 4 chains with 1000 warm-up draws and 2000 total draws each, use the default options for adapt_delta and max_treedepth, and use seed = 4911. Compile your model, define the data inputs needed, sample from the posterior, and answer the following. To make sure that your model is correct, compare a scatter plot of alpha and beta to last week's (the priors are not exactly the same, but the posterior should be quite similar to last week's as the priors are relatively weak):

4.3: Did the Stan model produce any warnings about the sampling efficiency? 

4.4: Use the Rhat function from the posterior package, rhat() (used in last week's assignment). The Rhat for alpha= and for beta=.

4.5: The ESS mean for alpha= and for beta=.

4.6: ESS q0.05 for alpha=, ESS q0.05 for beta=.

Plot the autocorrelation function of the draws for the chains of alpha and beta using the mcmc_acf function from the bayesplot package, and compare it to the autocorrelation function from the Metropolis-Hastings (MH) algorithm you developed last week. You don't need to re-run the MH algorithm with the new priors, but for a better comparison that is what you should do outside of this exercise.

4.7: What do you see?