Assignment 8

LOO-CV model comparison


Aki Vehtari et al.

1 General information

The maximum amount of points from this assignment is 6.

We have prepared a quarto template specific to this assignment (html, qmd, pdf) to help you get started.

We recommend you use or the docker container.


Reading instructions:

Grading instructions:

The grading will be done in peergrade. All grading questions and evaluations for this assignment are contained within this document in the collapsible Rubric blocks.

Installing and using CmdStanR:

See the Stan demos on how to use Stan in R (or Python). Aalto JupyterHub has working R and CmdStanR/RStan environment and is probably the easiest way to use Stan. * To use CmdStanR in Aalto JupyterHub:

The Aalto Ubuntu desktops also have the necessary libraries installed.]{.aalto}

To install Stan on your laptop, run ‘install.packages("cmdstanr", repos = c("", getOption("repos")))’ in R. If you encounter problems, see additional answers in FAQ. If you don’t succeed in short amount of time, it is probably easier to use Aalto JupyterHub.

If you use Aalto JupyterHub, all necessary packages have been pre-installed. In your laptop, install package cmdstanr. Installation instructions on Linux, Mac and Windows can be found at Additional useful packages are loo, bayesplot and posterior (but you don’t need these in this assignment). For Python users, PyStan, CmdStanPy, and ArviZ packages are useful.

Stan manual can be found at From this website, you can also find a lot of other useful material about Stan.

If you edit files ending .stan in RStudio, you can click “Check” in the editor toolbar to make syntax check. This can significantly speed-up writing a working Stan model.

Reporting accuracy

For posterior statistics of interest, only report digits that are not completely random based on the Monte Carlo standard error (MCSE).

Example: If you estimate \(E(\mu) \approx 1.234\) with MCSE(\(E(\mu)\)) = 0.01, then the true expectation is likely to be between \(1.204\) and \(1.264\), it makes sense to report \(E(\mu) \approx 1.2\).

See Lecture video 4.1, the chapter notes, and a case study for more information.

  • The recommended tool in this course is R (with the IDE RStudio).
  • Instead of installing R and RStudio on you own computer, see how to use R and RStudio remotely.
  • If you want to install R and RStudio locally, download R and RStudio.
  • There are tons of tutorials, videos and introductions to R and RStudio online. You can find some initial hints from RStudio Education pages.
  • When working with R, we recommend writing the report using quarto and the provided template. The template includes the formatting instructions and how to include code and figures.
  • Instead of quarto, you can use other software to make the PDF report, but the the same instructions for formatting should be used.
  • Report all results in a single, anonymous *.pdf -file and submit it in
  • The course has its own R package aaltobda with data and functionality to simplify coding. The package is pre-installed in JupyterHub. To install the package on your own system, run the following code (upgrade="never" skips question about updating other packages):
install.packages("aaltobda", repos = c("", getOption("repos")))
  • Many of the exercises can be checked automatically using the R package markmyassignment (pre-installed in JupyterHub). Information on how to install and use the package can be found in the markmyassignment documentation. There is no need to include markmyassignment results in the report.
  • Recommended additional self study exercises for each chapter in BDA3 are listed in the course web page. These will help to gain deeper understanding of the topic.
  • Common questions and answers regarding installation and technical problems can be found in Frequently Asked Questions (FAQ).
  • Deadlines for all assignments can be found on the course web page and in Peergrade. You can set email alerts for the deadlines in Peergrade settings.
  • You are allowed to discuss assignments with your friends, but it is not allowed to copy solutions directly from other students or from internet.
  • You can copy, e.g., plotting code from the course demos, but really try to solve the actual assignment problems with your own code and explanations.
  • Do not share your answers publicly.
  • Do not copy answers from the internet or from previous years. We compare the answers to the answers from previous years and to the answers from other students this year.
  • Use of AI is allowed on the course, but the most of the work needs to by the student, and you need to report whether you used AI and in which way you used them (See points 5 and 6 in Aalto guidelines for use of AI in teaching).
  • All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.
  • Do not submit empty PDFs, almost empty PDFs, copy of the questions, nonsense generated by yourself or AI, as these are just harming the other students as they can’t do peergrading for the empty or nonsense submissions. Violations of this rule will be reported and investigated in the same way was plagiarism.
  • If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository!
Rubric S1: General information - 7.5/100 points
  • Q1: Can you open the PDF and it’s not blank nor nonsense? If the pdf is blank, nonsense, or something like only a copy of the questions, 1) report it as problematic in Peergrade-interface to get another report to review, and 2) send a message to TAs.
  • Q2: Is the report anonymous?

2 A hierarchical model for chicken weight time series

2.1 Exploratory data analysis

In the first part of this assignment, you will explore the dataset ChickWeight. In particular, you will see what information is recorded in the dataset, and how you can use visualisation to learn more about the dataset. More information can be found on the corresponding page of the R documentation.

head(ChickWeight, 10)
   weight Time Chick Diet
1      42    0     1    1
2      51    2     1    1
3      59    4     1    1
4      64    6     1    1
5      76    8     1    1
6      93   10     1    1
7     106   12     1    1
8     125   14     1    1
9     149   16     1    1
10    171   18     1    1
Subtask 2.a)

Create a histogram to explore the range of chicken weights. Describe what you see in the plot. What is the qualitative range of the data?

Subrubric 2.a)
  • Q3: Does the plot look correct and is it readable?
  • Q4: Has it been stated that the data is ?
Subtask 2.b)

Plot the weight of each chicken over time in a line plot. Add colours based on the diet. Describe what you see in the plot.

Subrubric 2.b)
  • Q5: Does the plot look correct and is it readable?

2.2 Linear regression

In this section, you will build a model that predicts the weight of a chicken over time and depending on the diet. After sampling from the posteriors, you will use posterior predictive checks to see how well the predictions match the observations. Then you will adjust the model by adding more complexity, and check again.

Subtask 2.c)

Using brms, implement a pooled linear regression with a normal model and weight as the predicted variable using Diet and Time as predictors. Try to use weakly informative priors.


For the prior on Time, consider how much the weight of a chicken (in grams) could possibly change each day. For the priors on the effects of different diets, consider how much average weight difference would be possible between diets.

Note that as Diet is a categorical variable, the priors need to be specified for each category (apart from Diet1 which is taken to be the baseline).

Subrubric 2.c)
  • Q6: Is the brms-formula ?
  • Q7: Is the family ?
  • Q8: Are the prior standard deviations around ?

Next, you can use the bayesplot package to check the posterior predictions in relation to the observed data using the pp_check function. The function plots the \(y\) values, which are the observed data, and the \(y_\text{rep}\) values, which are replicated data sets from the posterior predictive distribution.

Subtask 2.d)

Perform the posterior predictive check with the default arguments. What do you observe? Based on the plot, do the posterior predictions encapsulate the main features of the observed data? Point out any major differences between the predictions and the observed data. Answer the following questions:

  • Are there qualitative differences between the observed data and the predicted data?
  • Do the observed data seem quantitatively similar?
Subrubric 2.d)
  • Q9: Does the plot look correct and is it readable?
  • Q10: Has it been recognized that the predicted data include ?
  • Q11: Has it been recognized that the observed and predicted data ?

The default density plot is not always informative, but bayesplot has different settings that can be used to create plots more appropriate for specific data.

Subtask 2.e)

Create another plot with grouping to the PPC plot using the arguments type = "intervals_grouped" and group = "Diet". What do you observe? Point out any major differences between the predictions and the observed data. Based on your visualisations, how could the model be improved?

Subrubric 2.e)
  • Q12: Does the plot look correct and is it readable?
  • Q13: Is there at least one reasonable way to improve the model, e.g. ?

2.3 Log-normal linear regression

Based on the identified issues from the posterior predictive check, the model can be improved. It is advisable to change only one or a few things about a model at once. At this stage, focus on changing the observation model family to better account for the observed data.

One option is to use the lognormal observation model, which only allows positive values. In brms you can change the observation model family to this by setting the argument family = "lognormal". Note that when using the log-normal observation model, the regression coefficients represent the change in the log weight of a chicken. The priors have been adjusted accordingly in the template.

Subtask 2.f)

Adjust the model, sample from the posterior and create the same two posterior predictive check plots. Comment on your observations. Does the new model better capture some aspects of the data?

Subrubric 2.f)
  • Q14: Do the plots look correct and are they readable?
  • Q15: Has it been recognized that the fit to data is ?

2.4 Hierarchical log-normal linear regression

The model can further be improved by directly considering potential differences in growth rate for individual chicken. Some chickens may innately grow faster than others, and this difference can be included by including both population and group level effects in to the model.

To include a group effect in brms, the code + (predictor|group) can be added to the model formula. In this case, the predictor is Time and the group is Chick.

Subtask 2.g)

Create the same two plots as for the previous models. Comment on what you see. Do the predictions seem to better capture the observed data? Are there remaining discrepancies between the predictions and observed data that could be addressed?

Subrubric 2.g)
  • Q16: Do the plots look correct and are they readable?
  • Q17: Has it been recognized that the fit to data is ?
Subtask 2.h)

Have you encountered any convergence issues in the above models? Report and comment.

Subrubric 2.h)
  • Q18: Has there been a potentially brief discussion of the standard convergence criteria (Rhat, ESS, divergent transitions) for all models?

2.5 Model comparison using the ELPD

There are many ways of comparing models1. Commonly, we evaluate point predictions, such as the mean of the predictive distribution2, or accuracy of the whole posterior predictive. Whether we prioritise point or density predictive accuracy may serve different purposes and lead to different outcomes for model choice 3. It is common, however, to report predictive accuracy via log-scores and point-predictive accuracy via root-mean-squared-error based on the empirical average of the predictive distribution. To cross-validate both metrics on left out observations without need to sample from each leave-one-out posterior, we use Pareto-smoothed importance sampling as discussed in the course materials (see Lecture 9).

We start comparing models based on the log-score. Use loo::loo() and loo::loo_compare() to quantify the differences in predictive performance.

Subtask 2.i)

Answer the following questions using loo/loo_compare:

  • Which model has the best predictive performance?
  • Does the uncertainty influence the decision of which model is best?
Subrubric 2.i)
  • Q19: Do the results look correct and have they been presented in a readable way? They should be roughly .
  • Q20: Has it been recognized that the best model is and that the uncertainty is ?
Subtask 2.j)

Assess whether the approximation to the LOO-CV distributions are reliable. Consult the \(\hat{k}\) statistic which informs on the reliability of PSIS computation in PSIS-LOO. Plot the \(\hat{k}\) values for each model against the data point ID and discuss. Are they as expected?


For hierarchical models, it may be more important to think about how well the individual group is predicted and how many observations there are in a group compared to the number of parameters estimated. Also check out CV-FAQ on high Pareto-\(\hat{k}\) values.

Subrubric 2.j)
  • Q21: Do the plots look correct and are they readable?
  • Q22: Has it been explained why the \(\hat{k}\) values are highest for the ? .
Subtask 2.k)

Perform a PPC for the hierarchical model for

  • a few of the chickens with the highest \(\hat{k}\) values and
  • a few of the chickens with the lowest \(\hat{k}\) values

using the code in the template. What do you observe?

Subrubric 2.k)
  • Q23: Does the plot look correct and is it readable?
  • Q24: Has it been recognized that the chickens with high \(\hat{k}\) values ?

2.6 Model comparison using the RMSE

Subtask 2.l)

Use the function in the template to compare the RMSE and the LOO-RMSE for the three models. Explain the difference between the RMSE and the LOO-RMSE in 1–3 sentences. Is one generally lower than the other? Why?

Subrubric 2.l)
  • Q25: Do the results look correct and have they been presented in a readable way? They should be roughly: .
  • Q26: Has it been recognized that the RMSE is than the LOO-RMSE because ?

3 Overall quality of the report

Rubric S3: Overall quality of the report - 7.5/100 points
  • Q27: Does the report include comment on whether AI was used, and if AI was used, explanation on how it was used?
    • No
    • Yes
  • Q28: Does the report follow the formatting instructions?
    • Not at all
    • Little
    • Mostly
    • Yes
  • Q29: In case the report doesn’t fully follow the general and formatting instructions, specify the instructions that have not been followed. If applicable, specify the page of the report, where this difference is visible. This will help the other student to improve their reports so that they are easier to read and review. If applicable, specify the page of the report, where this difference in formatting is visible.
  • Q30: Please also provide feedback on the presentation (e.g. text, layout, flow of the responses, figures, figure captions). Part of the course is practicing making data analysis reports. By providing feedback on the report presentation, other students can learn what they can improve or what they already did well. You should be able to provide constructive or positive feedback for all non-empty and non-nonsense reports. If you think the report is perfect, and you can’t come up with any suggestions how to improve, you can provide feedback on what you liked and why you think some part of the report is better than yours.


  1. In principle, when comparing models based on accuracy in predictions or parameter estimation (if true parameter values are available to you, as e.g. in simulation studies), we want to use so called strictly proper scoring rules that will always indicate when a “better” model is better and the score reaches its uniquely defined best value at the “true” model, if it is also well defined. See Gneiting and Raftery, (2007) for and in depth treatment of this topic.↩︎

  2. NOT predictions based on the mean of the posterior parameters, but first generating the predictive distribution and then computing an average.↩︎

  3. For instance, a unimodal and bimodal predictive density may have the same expected value, but very different areas of high posterior density and therefore very different log-scores.↩︎