Assignment 6

Author

anonymous

1 General information

This is the template for assignment 6. You can download the broken stan-file and the qmd-file or copy the code from this rendered document after clicking on </> Code in the top right corner.

Please replace the instructions in this template by your own text, explaining what you are doing in each exercise.

Setup

This block will only be visible in your HTML output, but will be hidden when rendering to PDF with quarto for the submission. Make sure that this does not get displayed in the PDF!

JupyterHub has all the needed packages pre-installed.

The following installs and loads the aaltobda package:

if(!require(aaltobda)){
    install.packages("aaltobda", repos = c("https://avehtari.github.io/BDA_course_Aalto/", getOption("repos")))
    library(aaltobda)
}
Loading required package: aaltobda

The following installs and loads the latex2exp package, which allows us to use LaTeX in plots:

if(!require(latex2exp)){
    install.packages("latex2exp")
    library(latex2exp)
}
Loading required package: latex2exp

The following installs and loads the posterior package which imports the rhat_basic() function:

if(!require(posterior)){
    install.packages("posterior")
    library(posterior)
}
Loading required package: posterior
This is posterior version 1.4.0

Attaching package: 'posterior'
The following object is masked from 'package:aaltobda':

    mcse_quantile
The following objects are masked from 'package:stats':

    mad, sd, var
The following objects are masked from 'package:base':

    %in%, match

The following installs and loads the ggplot2 package, the bayesplot package and the dplyr package

if(!require(ggplot2)){
    install.packages("ggplot2")
    library(ggplot2)
}
Loading required package: ggplot2
if(!require(bayesplot)){
    install.packages("bayesplot")
    library(bayesplot)
}
Loading required package: bayesplot
This is bayesplot version 1.10.0
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
   * Does _not_ affect other ggplot2 plots
   * See ?bayesplot_theme_set for details on theme setting

Attaching package: 'bayesplot'
The following object is masked from 'package:posterior':

    rhat
if(!require(dplyr)){
    install.packages("dplyr")
    library(dplyr)
}
Loading required package: dplyr

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
if(!require(tidyr)){
    install.packages("tidyr")
    library(tidyr)
}
Loading required package: tidyr
# Some additional set-up to make plots legible
ggplot2::theme_set(theme_minimal(base_size = 14))
bayesplot::bayesplot_theme_set(theme_minimal(base_size = 14))
# register_knitr_engine()

The following installs and loads the cmdstanr package and tries to install cmdstan.

if(!require(cmdstanr)){
    install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
    library(cmdstanr)
}
Loading required package: cmdstanr
This is cmdstanr version 0.5.3
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /root/.cmdstan/cmdstan-2.31.0
- CmdStan version: 2.31.0

A newer version of CmdStan is available. See ?install_cmdstan() to install it.
To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.
cmdstan_installed <- function(){
  res <- try(out <- cmdstanr::cmdstan_path(), silent = TRUE)
  !inherits(res, "try-error")
}
if(!cmdstan_installed()){
    install_cmdstan()
}

2 Stan warm-up: linear model of BDA retention with Stan (2 points)

2.1 (b)

Write your answers/code here!

Data preparation and sampling from the posterior

This block will only be visible in your HTML output, but will be hidden when rendering to PDF with quarto for the submission. Make sure that this does not get displayed in the PDF!

Data assembly happens here:

# These are our observations y: the proportion of students handing in each assignment (1-8),
# sorted by year (row-wise) and assignment (column-wise).
# While the code suggest a matrix structure,
# the result will actually be a vector of length N = no_years * no_assignments
propstudents<-c(c(176, 174, 158, 135, 138, 129, 126, 123)/176,
                c(242, 212, 184, 177, 174, 172, 163, 156)/242,
                c(332, 310, 278, 258, 243, 242, 226, 224)/332,
                c(301, 269, 231, 232, 217, 208, 193, 191)/301,
                c(245, 240, 228, 217, 206, 199, 191, 182)/245)
# These are our predictors x: for each observation, the corresponding assignment number.
assignment <- rep(1:8, 5)
# These are in some sense our test data: the proportion of students handing in the last assignment (9),
# sorted by year.
# Usually, we would not want to split our data like that and instead
# use e.g. Leave-One-Out Cross-Validation (LOO-CV, see e.g. http://mc-stan.org/loo/index.html)
# to evaluate model performance.
propstudents9 = c(121/176, 153/242, 218/332, 190/301, 175/245)
# The total number of assignments
no_assignments = 9
# The assignment numbers for which we want to generate predictions
x_predictions = 1:no_assignments
# (Cmd)Stan(R) expects the data to be passed in the below format:
model_data = list(N=length(assignment),
                 x=assignment,
                 y=propstudents,
                 no_predictions=no_assignments,
                 x_predictions=x_predictions)

Sampling from the posterior distribution happens here:

# This reads the file at the specified path and tries to compile it.
# If it fails, an error is thrown.
retention_model = cmdstan_model("./additional_files/assignment6_linear_model.stan")
Error in initialize(...): Assertion on 'stan_file' failed: File does not exist: './additional_files/assignment6_linear_model.stan'.
# This "out <- capture.output(...)" construction suppresses output from cmdstanr
# See also https://github.com/stan-dev/cmdstanr/issues/646
out <- capture.output(
    # Sampling from the posterior distribution happens here:
    fit <- retention_model$sample(data=model_data, refresh=0, show_messages=FALSE)
)
Error in withVisible(...elt(i)): object 'retention_model' not found

Draws postprocessing happens here:

# This extracts the draws from the sampling result as a data.frame.
draws_df = fit$draws(format="draws_df")
Error in eval(expr, envir, enclos): object 'fit' not found
# This does some data/draws wrangling to compute the 5, 50 and 95 percentiles of
# the mean at the specified covariate values (x_predictions).
# It can be instructive to play around with each of the data processing steps
# to find out what each step does, e.g. by removing parts from the back like "|>  gather(pct,y,-x)"
# and printing the resulting data.frame.
mu_quantiles_df = draws_df |>
      subset_draws(variable = c("mu_pred")) |>
      summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
      mutate(x = 1:9) |>
      pivot_longer(c(q5, q50, q95), names_to = c("pct"))
Error in UseMethod("subset_draws"): no applicable method for 'subset_draws' applied to an object of class "function"
# Same as above, but for the predictions.
y_quantiles_df = draws_df |>
      subset_draws(variable = c("y_pred")) |>
      summarise_draws(~quantile2(.x, probs = c(0.05, .5, 0.95))) |>
      mutate(x = 1:9) |>
      pivot_longer(c(q5, q50, q95), names_to = c("pct"))
Error in UseMethod("subset_draws"): no applicable method for 'subset_draws' applied to an object of class "function"

Plotting happens here:

ggplot() +
  # scatter plot of the training data:
  geom_point(
    aes(x, y, color=assignment),
    data=data.frame(x=assignment, y=propstudents, assignment="1-8")
) +
  # scatter plot of the test data:
  geom_point(
    aes(x, y, color=assignment),
    data=data.frame(x=no_assignments, y=propstudents9, assignment="9")
) +
  # you have to tell us what this plots:
  geom_line(aes(x,y=value,linetype=pct), data=mu_quantiles_df, color='grey', linewidth=1.5) +
  # you have to tell us what this plots:
  geom_line(aes(x,y=value,linetype=pct), data=y_quantiles_df, color='red') +
  # adding xticks for each assignment:
  scale_x_continuous(breaks=1:no_assignments) +
  # adding labels to the plot:
  labs(y="assignment submission %", x="assignment number") +
  # specifying that line types repeat:
  scale_linetype_manual(values=c(2,1,2)) +
  # Specify colours of the observations:
  scale_colour_manual(values = c("1-8"="black", "9"="blue")) +
  # remove the legend for the linetypes:
  guides(linetype="none")
Error in fortify(data): object 'mu_quantiles_df' not found
Quick check for sampling convergence

This block will only be visible in your HTML output, but will be hidden when rendering to PDF with quarto for the submission. Make sure that this does not get displayed in the PDF!

If your model is correctly implemented, sampling from the posterior distribution should have been successful. You can check whether Stan thinks that sampling succeeded by inspecting the output of the below command, which you should be able to interpret with a little help from the CmdStan User’s Guide.

fit$cmdstan_diagnose()
Error in eval(expr, envir, enclos): object 'fit' not found

2.2 (c)

Write your answers/code here!

3 Generalized linear model: Bioassay with Stan (4 points)

3.1 (d)

Write your answers/code here!

data("bioassay")

3.2 (e)

Write your answers/code here!

3.3 (f)

Write your answers/code here!

3.4 (g)

Write your answers/code here!