if(!require(aaltobda)){
install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
library(aaltobda)
}
Loading required package: aaltobda
The maximum amount of points from this assignment is 6.
We have prepared a quarto template specific to this assignment (html, qmd, pdf) to help you get started.
We recommend you use jupyter.cs.aalto.fi or the docker container.
Reading instructions:
Grading instructions:
The grading will be done in peergrade. All grading questions and evaluations for this assignment are contained within this document in the collapsible Rubric blocks.
Installing and using CmdStanR
:
See the Stan demos on how to use Stan in R (or Python). Aalto JupyterHub has working R and CmdStanR/RStan environment and is probably the easiest way to use Stan. * To use CmdStanR in Aalto JupyterHub:
library(cmdstanr)
set_cmdstan_path('/coursedata/cmdstan')
The Aalto Ubuntu desktops also have the necessary libraries installed.]{.aalto}
To install Stan on your laptop, run ‘install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))’ in R. If you encounter problems, see additional answers in FAQ. If you don’t succeed in short amount of time, it is probably easier to use Aalto JupyterHub.
If you use Aalto JupyterHub
, all necessary packages have been pre-installed. In your laptop, install package cmdstanr
. Installation instructions on Linux, Mac and Windows can be found at https://mc-stan.org/cmdstanr/. Additional useful packages are loo
, bayesplot
and posterior
(but you don’t need these in this assignment). For Python users, PyStan
, CmdStanPy
, and ArviZ
packages are useful.
Stan manual can be found at https://mc-stan.org/users/documentation/. From this website, you can also find a lot of other useful material about Stan.
If you edit files ending .stan
in RStudio, you can click “Check” in the editor toolbar to make syntax check. This can significantly speed-up writing a working Stan model.
For posterior statistics of interest, only report digits that are not completely random based on the Monte Carlo standard error (MCSE).
Example: If you estimate \(E(\mu) \approx 1.234\) with MCSE(\(E(\mu)\)) = 0.01, then the true expectation is likely to be between \(1.204\) and \(1.264\), it makes sense to report \(E(\mu) \approx 1.2\).
See Lecture video 4.1, the chapter notes, and a case study for more information.
quarto
and the provided template. The template includes the formatting instructions and how to include code and figures.quarto
, you can use other software to make the PDF report, but the the same instructions for formatting should be used.aaltobda
with data and functionality to simplify coding. The package is pre-installed in JupyterHub. To install the package on your own system, run the following code (upgrade="never" skips question about updating other packages):install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
markmyassignment
(pre-installed in JupyterHub). Information on how to install and use the package can be found in the markmyassignment
documentation. There is no need to include markmyassignment
results in the report.JupyterHub has all the needed packages pre-installed.
The following installs and loads the aaltobda
package:
if(!require(aaltobda)){
install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
library(aaltobda)
}
Loading required package: aaltobda
The following installs and loads the latex2exp
package, which allows us to use LaTeX in plots:
Loading required package: latex2exp
The following installs and loads the posterior
package which imports the rhat_basic()
function:
Loading required package: posterior
This is posterior version 1.4.0
Attaching package: 'posterior'
The following object is masked from 'package:aaltobda':
mcse_quantile
The following objects are masked from 'package:stats':
mad, sd, var
The following objects are masked from 'package:base':
%in%, match
The following installs and loads the ggplot2
package, the bayesplot
package and the dplyr
package
Loading required package: ggplot2
Loading required package: bayesplot
This is bayesplot version 1.10.0
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
* Does _not_ affect other ggplot2 plots
* See ?bayesplot_theme_set for details on theme setting
Attaching package: 'bayesplot'
The following object is masked from 'package:posterior':
rhat
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Loading required package: tidyr
The following installs and loads the cmdstanr
package and tries to install cmdstan
.
if(!require(cmdstanr)){
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
library(cmdstanr)
}
Loading required package: cmdstanr
This is cmdstanr version 0.5.3
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /root/.cmdstan/cmdstan-2.31.0
- CmdStan version: 2.31.0
A newer version of CmdStan is available. See ?install_cmdstan() to install it.
To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.
From 2018 to 2022, we have been keeping track of assignment submissions for the BDA course given the number of submissions for the 1st assignment. We will fit a simple linear model to answer two questions of interest:
The author has given you the broken Stan code below, which they intend to encode the following linear model: \[ \begin{aligned} p(\alpha,\beta,\sigma) &= \mathrm{const.} & \text{(improper flat prior)}&\text{ and}\\ p(y|x,\alpha,\beta,\sigma) &= p_\mathrm{normal}(y|\alpha + \beta x, \sigma) & \text{(normal likelihood)} &\text{.} \end{aligned} \] In both the statistical model above and in the Stan model below, \(x \in \mathbb{R}^N\) and \(y \in \mathbb{R}^N\) are vectors of the covariates / predictors (the assignment number) and vectors of the observation (proportions of students who have handed in the respective assignment). \(\alpha \in \mathbb{R}\) is the unknown scalar intercept, \(\beta \in \mathbb{R}\) is the unknown scalar slope and \(\sigma \in \mathbb{R}_{>0}\) is the unknown scalar observation standard deviation. The statistical model further implies \[ p(y_\mathrm{pred.}|x_\mathrm{pred.},\alpha,\beta,\sigma) = p_\mathrm{normal}(y_\mathrm{pred.}|\alpha + \beta x_\mathrm{pred.}, \sigma) \] as the predictive distribution for a new observation \(y_\mathrm{pred.}\) at a given new covariate value \(x_\mathrm{pred.}\).
You can download the broken stan file from github.
1data {
// number of data points
int<lower=0> N;
// covariate / predictor
vector[N] x;
// observations
vector[N] y;
// number of covariate values to make predictions at
int<lower=0> no_predictions;
// covariate values to make predictions at
vector[no_predictions] x_predictions;
}
2parameters {
// intercept
real alpha;
// slope
real beta;
// the standard deviation should be constrained to be positive
real<upper=0> sigma;
}
3transformed parameters {
// deterministic transformation of parameters and data
vector[N] mu = alpha + beta * x // linear model
}
4model {
// observation model / likelihood
y ~ normal(mu, sigma);
}
5generated quantities {
// compute the means for the covariate values at which to make predictions
vector[no_predictions] mu_pred = alpha + beta * x_predictions;
// sample from the predictive distribution, a normal(mu_pred, sigma).
array[no_predictions] real y_pred = normal_rng(mu, sigma);
}
data
block is for the declaration of variables that are read in as data. […] Each variable’s value is validated against its declaration as it is read. For example, if a variable sigma is declared as real<lower=0>
, then trying to assign it a negative value will raise an error. As a result, data type errors will be caught as early as possible. Similarly, attempts to provide data of the wrong size for a compound data structure will also raise an error.” For more information, follow the link.
parameters
program block correspond directly to the variables being sampled by Stan’s samplers (HMC and NUTS). From a user’s perspective, the parameters in the program block are the parameters being sampled by Stan.” For more information, follow the link.
transformed parameters
program block consists of optional variable declarations followed by statements. After the statements are executed, the constraints on the transformed parameters are validated. Any variable declared as a transformed parameter is part of the output produced for draws.” For more information, follow the link.
model
program block consists of optional variable declarations followed by statements. The variables in the model
block are local variables and are not written as part of the output. […] The statements in the model
block typically define the model. This is the block in which probability (sampling notation) statements are allowed.” For more information, follow the link.
generated quantities
program block is rather different than the other blocks. Nothing in the generated quantities
block affects the sampled parameter values. The block is executed only after a sample has been generated.” For more information, follow the link.
A normal linear model is actually not the best model to use for this type of data, but we will use it here to illustrate the first step in building up to more appropriate, complicated models.
You may find some of the mistakes in the code using Stan syntax checker. If you copy the Stan code to a file ending .stan
and open it in RStudio (you can also choose from RStudio menu File\(\rightarrow\)New File\(\rightarrow\)Stan file to create a new Stan file), the editor will show you some syntax errors. More syntax errors might be detected by clicking `Check’ in the bar just above the Stan file in the RStudio editor. Note that some of the errors in the presented Stan code may not be syntax errors.
The author runs the corrected Stan file using the following R code and plots the returned MCMC sample. Read through the code below to understand what is being plotted.
Data assembly happens here:
# These are our observations y: the proportion of students handing in each assignment (1-8),
# sorted by year (row-wise) and assignment (column-wise).
# While the code suggest a matrix structure,
# the result will actually be a vector of length N = no_years * no_assignments
propstudents<-c(c(176, 174, 158, 135, 138, 129, 126, 123)/176,
c(242, 212, 184, 177, 174, 172, 163, 156)/242,
c(332, 310, 278, 258, 243, 242, 226, 224)/332,
c(301, 269, 231, 232, 217, 208, 193, 191)/301,
c(245, 240, 228, 217, 206, 199, 191, 182)/245)
# These are our predictors x: for each observation, the corresponding assignment number.
assignment <- rep(1:8, 5)
# These are in some sense our test data: the proportion of students handing in the last assignment (9),
# sorted by year.
# Usually, we would not want to split our data like that and instead
# use e.g. Leave-One-Out Cross-Validation (LOO-CV, see e.g. http://mc-stan.org/loo/index.html)
# to evaluate model performance.
propstudents9 = c(121/176, 153/242, 218/332, 190/301, 175/245)
# The total number of assignments
no_assignments = 9
# The assignment numbers for which we want to generate predictions
x_predictions = 1:no_assignments
# (Cmd)Stan(R) expects the data to be passed in the below format:
model_data = list(N=length(assignment),
x=assignment,
y=propstudents,
no_predictions=no_assignments,
x_predictions=x_predictions)
Sampling from the posterior distribution happens here:
# This reads the file at the specified path and tries to compile it.
# If it fails, an error is thrown.
retention_model = cmdstan_model("./additional_files/assignment6_linear_model.stan")
Error in initialize(...): Assertion on 'stan_file' failed: File does not exist: './additional_files/assignment6_linear_model.stan'.
# This "out <- capture.output(...)" construction suppresses output from cmdstanr
# See also https://github.com/stan-dev/cmdstanr/issues/646
out <- capture.output(
# Sampling from the posterior distribution happens here:
fit <- retention_model$sample(data=model_data, refresh=0, show_messages=FALSE)
)
Error in withVisible(...elt(i)): object 'retention_model' not found
Draws postprocessing happens here:
# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")
Error in eval(expr, envir, enclos): object 'fit' not found
# This does some data/draws wrangling to compute the 5, 50 and 95 percentiles of
# the mean at the specified covariate values (x_predictions).
# It can be instructive to play around with each of the data processing steps
# to find out what each step does, e.g. by removing parts from the back like "|> gather(pct,y,-x)"
# and printing the resulting data.frame.
mu_quantiles_df = draws_df |>
subset_draws(variable = c("mu_pred")) |>
summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
mutate(x = 1:9) |>
pivot_longer(c(q5, q50, q95), names_to = c("pct"))
Error in UseMethod("subset_draws"): no applicable method for 'subset_draws' applied to an object of class "function"
# Same as above, but for the predictions.
y_quantiles_df = draws_df |>
subset_draws(variable = c("y_pred")) |>
summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
mutate(x = 1:9) |>
pivot_longer(c(q5, q50, q95), names_to = c("pct"))
Error in UseMethod("subset_draws"): no applicable method for 'subset_draws' applied to an object of class "function"
Plotting happens here:
ggplot() +
# scatter plot of the training data:
geom_point(
aes(x, y, color=assignment),
data=data.frame(x=assignment, y=propstudents, assignment="1-8")
) +
# scatter plot of the test data:
geom_point(
aes(x, y, color=assignment),
data=data.frame(x=no_assignments, y=propstudents9, assignment="9")
) +
# you have to tell us what this plots:
geom_line(aes(x,y=value,linetype=pct), data=mu_quantiles_df, color='grey', linewidth=1.5) +
# you have to tell us what this plots:
geom_line(aes(x,y=value,linetype=pct), data=y_quantiles_df, color='red') +
# adding xticks for each assignment:
scale_x_continuous(breaks=1:no_assignments) +
# adding labels to the plot:
labs(y="assignment submission %", x="assignment number") +
# specifying that line types repeat:
scale_linetype_manual(values=c(2,1,2)) +
# Specify colours of the observations:
scale_colour_manual(values = c("1-8"="black", "9"="blue")) +
# remove the legend for the linetypes:
guides(linetype="none")
Error in fortify(data): object 'mu_quantiles_df' not found
If your model is correctly implemented, sampling from the posterior distribution should have been successful. You can check whether Stan thinks that sampling succeeded by inspecting the output of the below command, which you should be able to interpret with a little help from the CmdStan User’s Guide.
Based on the above plot, answer the following questions:
Replicate the computations for the bioassay example of section 3.7 (BDA3) using Stan.
You will need Stan functions multi_normal
and binomial_logit
for implementing the prior and observation model, respectively. In Stan code, it is easiest to declare a variable (say theta
) which is a two-element vector so that the first value denotes \(\alpha\) and latter one \(\beta\). This is because the multi_normal
function that you need for implementing the prior requires a vector as an input.