Assignment 2

Author

Aki Vehtari et al.

1 General information

The exercises here cover the content of lecture 2 and BDA3 chapters 1-2. All questions check your understanding of a simple posterior analysis using a binomial model for the observations and a beta prior.

The exercises constitute 90% of the Quiz 2 grade.

We prepared a Quarto notebook specific to this assignment to help you get started. You still need to fill in your answers in MyCourses! You can inspect this and future templates

General Instructions for Answering the Assignment Questions
  • Questions below are exact copies of the text found in the MyCourses quiz and should serve as a notebook where you can store notes and code.
  • We recommend opening these notebooks in the Aalto JupyterHub, see how to use R and RStudio remotely.
  • For code inspiration, have a look at the BDA R Demos and the specific Assignment code notebooks.
  • Recommended additional self-study exercises for each chapter in BDA3 are listed on the course web page. These will help you gain a deeper understanding of the topic.
  • Common questions and answers regarding installation and technical problems can be found in Frequently Asked Questions (FAQ).
  • Deadlines for all assignments can be found on the course web page and in MyCourses.
  • You are allowed to discuss assignments with your friends, but you are not allowed to copy solutions directly from other students or from the internet.
  • Do not share your answers publicly.
  • Do not copy answers from the internet or from previous years. We compare the answers to the answers from previous years and to the answers from other students this year.
  • Use of AI is allowed on the course, but most of the work needs to be done by the student, and you need to report whether you used AI and in which way you used it (see points 5 and 6 in Aalto guidelines for use of AI in teaching).
  • All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.
  • If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository!

1.1 Assignment questions

For convenience the assignment questions are copied below. Answer the questions in MyCourses.

Inference for binomial proportion


Algae status is monitored in 274 sites at Finnish lakes and rivers. The observations for the 2008 algae status at each site are presented in the dataset algae in the aaltobda package ('0': no algae, '1': algae present).
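As a minimal sketch of how the counts \( y \) and \( n \) can be obtained from a 0/1 vector of this kind, the R snippet below uses a short hypothetical vector `obs` (not the real algae data). With the aaltobda package installed, the real data should be loadable with `data("algae")`, as noted in the comment; treat that call as an assumption to check against the package documentation.

```r
# Hypothetical stand-in for the real data: 1 = algae present, 0 = no algae.
# With aaltobda installed, the real vector should be available via:
#   library(aaltobda); data("algae")
obs <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)

n <- length(obs)  # total number of sites surveyed
y <- sum(obs)     # number of sites with algae detected

cat("n =", n, " y =", y, "\n")
print(table(obs))  # counts of 0s and 1s
```

Replacing `obs` with the real vector gives the counts used in the rest of the assignment.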

Let \( \theta \) be the probability of a monitoring site having detectable blue-green algae levels, \( y \) the number of observed sites with algae detected, and \( n \) the total number of sites surveyed. Use a binomial model for the observations and a \( \text{Beta}(2,10) \) prior for the binomial model parameter \( \theta \) to formulate a Bayesian model. We will not test you on the individual mathematical operations needed to derive the posterior distribution for \( \theta \), as the derivation is already given in the book (and lecture), so make sure to look it up.

Your task is to perform Bayesian inference for a binomial model and fill in the quiz below based on it. 

For questions with checkboxes, more than one answer may be correct.

1. Formulating Probabilities

The algae dataset contains the results of 274 measurements from Finnish lakes, with the following results:

  • No Algae: 230 sites
  • Algae: 44 sites

Our goal for the following set of questions is to find the formulation of the posterior using a binomial likelihood and a beta prior on the unknown probability parameter \( \theta \).


  • 1.1 The prior \(p(\theta)\) can be expressed as:
  • 1.2 The likelihood \( p(y = 44 | \theta, n = 274) \) as a function of \( \theta \) can be expressed as:
  • 1.3 The resulting posterior \( p(\theta|y = 44, n = 274) \) can be expressed as:


2. Summary of the posterior distribution of \( \theta \)


The posterior distribution \( p(\theta|y) \) is analytically available as \( \text{Beta}(\alpha, \beta) \), so we can use the properties of that distribution to summarise what we know about \( \theta \). In particular, we can make probability statements about ranges of values for \( \theta \). Let's start, however, with the average value of \( \theta \) you expect after having conditioned on the data.

  • 2.1 Which of the following is the correct formula for the mean (\( E[\cdot] \)) of a \( \text{Beta}(\alpha, \beta) \) distribution:
  • 2.2 Using your answer above, what is the mean of our posterior (i.e.,  \( E(\theta|y) \))? Report the result in decimals with two decimal digits.

Posterior intervals are sometimes called credible intervals and are different from confidence intervals (for more on this, see here). These are computed using the quantile function of the posterior distribution. As the quantiles of a \( \text{Beta}(\alpha, \beta) \) distribution do not have a simple analytical form like the expectation, you can use R to compute the posterior intervals.

  • 2.3 What R function would you use here to compute posterior intervals?

Using your answer above, calculate (report the results in decimals with two decimal digits):

  • 2.4 90% posterior interval lower bound:
  • 2.5 90% posterior interval upper bound:
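
The summaries asked for in this section can be sketched in R on a deliberately small toy example. The numbers below (a uniform \( \text{Beta}(1,1) \) prior with 3 successes in 10 trials) and all variable names are hypothetical and are not the assignment data. The sketch assumes the standard conjugate result from BDA3 chapter 2 that a beta prior combined with a binomial likelihood gives a beta posterior.

```r
# Toy example only: hypothetical prior and data, not the algae values.
a_prior <- 1
b_prior <- 1
y_toy <- 3   # hypothetical number of successes
n_toy <- 10  # hypothetical number of trials

# Conjugate update: add successes to the first parameter, failures to the second.
a_post <- a_prior + y_toy
b_post <- b_prior + n_toy - y_toy

# Posterior mean from the closed-form mean of a beta distribution.
post_mean <- a_post / (a_post + b_post)

# Central 90% posterior interval from the 5% and 95% quantiles.
interval <- qbeta(c(0.05, 0.95), a_post, b_post)

cat("posterior mean:", round(post_mean, 2), "\n")
cat("90% interval:  ", round(interval, 2), "\n")
```

The same three steps apply to the assignment once the toy values are replaced by the prior and data given above.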

3. Comparison to historical records

We are interested in using our posterior distribution to estimate the probability that the proportion of detected algae samples (\( \theta \)) is smaller than the historical detection rate \( \theta_0 = 0.2 \), i.e. \( p(\theta \leq \theta_0 \mid y) \). 
  • 3.1 Which of the following approaches would we take?
  • 3.2 What statistical function computes this probability for us?
  • 3.3 Which R function does this for you? 
  • 3.4 Using your answers above, report this probability (report the result in decimals with two decimal digits):
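
A probability of the form \( p(\theta \leq \theta_0 \mid y) \) can be sketched in R as follows. The posterior \( \text{Beta}(4, 8) \) and the threshold `theta0_toy = 0.5` are hypothetical illustration values, not the assignment's. The Monte Carlo estimate is included only as a sanity check on the closed-form value.

```r
# Toy example only: hypothetical posterior parameters and threshold.
a_post <- 4
b_post <- 8
theta0_toy <- 0.5

# Closed form: the beta cumulative distribution function evaluated at the threshold.
p_exact <- pbeta(theta0_toy, a_post, b_post)

# Sanity check: the fraction of posterior draws at or below the threshold.
set.seed(1)
draws <- rbeta(1e5, a_post, b_post)
p_mc <- mean(draws <= theta0_toy)

cat("exact:", round(p_exact, 3), " Monte Carlo:", round(p_mc, 3), "\n")
```

The two numbers should agree up to Monte Carlo error, which shrinks as the number of draws grows.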


4. Prior sensitivity analysis

Redo the analysis using a uniform prior, \( \text{Beta}(1,1) \).

  • 4.1 What is the mean of our posterior (i.e.,  \( E(\theta|y) \))? Report the result in decimals with two decimal digits.
  • 4.2 90% posterior interval lower bound. Report the result in decimals with two decimal digits.
  • 4.3 90% posterior interval upper bound. Report the result in decimals with two decimal digits.
  • 4.4 Probability \( p(\theta \leq \theta_0 \mid y) \). Report the result in decimals with two decimal digits. 

Redo the analysis using as prior \( \text{Beta}(0.5,0.5) \).

  • 4.5 What is the mean of our posterior (i.e.,  \( E(\theta|y) \))? Report the result in decimals with two decimal digits.
  • 4.6 90% posterior interval lower bound. Report the result in decimals with two decimal digits.
  • 4.7 90% posterior interval upper bound. Report the result in decimals with two decimal digits.
  • 4.8 Probability \( p(\theta \leq \theta_0 \mid y) \). Report the result in decimals with two decimal digits. 

Redo the analysis using as prior \( \text{Beta}(100,2) \).

  • 4.9 What is the mean of our posterior (i.e.,  \( E(\theta|y) \))? Report the result in decimals with two decimal digits.
  • 4.10 90% posterior interval lower bound. Report the result in decimals with two decimal digits.
  • 4.11 90% posterior interval upper bound. Report the result in decimals with two decimal digits.
  • 4.12 Probability \( p(\theta \leq \theta_0 \mid y) \). Report the result in decimals with two decimal digits. 


4.13 Based on testing different priors, would you consider the posterior results believable and defensible (with respect to this data set)? To help your reasoning, you can plot the priors and posteriors using the code template for Assignment 2.
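
One way to organise a prior sensitivity check is to wrap the posterior summaries in a small function and apply it to each prior in turn. The sketch below does this in R with hypothetical toy data (3 successes in 10 trials, threshold 0.5) and hypothetical priors. The helper `summarise_posterior` is an illustrative name and is not part of the course template or of aaltobda.

```r
# Toy example only: hypothetical data, threshold, and priors.
y_toy <- 3
n_toy <- 10
theta0_toy <- 0.5

# Hypothetical helper: posterior summaries for a Beta(a, b) prior
# combined with y successes in n binomial trials.
summarise_posterior <- function(a, b, y, n, theta0) {
  a_post <- a + y
  b_post <- b + n - y
  data.frame(
    prior   = sprintf("Beta(%g, %g)", a, b),
    mean    = a_post / (a_post + b_post),
    lower   = qbeta(0.05, a_post, b_post),     # 90% interval lower bound
    upper   = qbeta(0.95, a_post, b_post),     # 90% interval upper bound
    p_below = pbeta(theta0, a_post, b_post)    # p(theta <= theta0 | y)
  )
}

priors <- list(c(1, 1), c(2, 2), c(20, 5))
results <- do.call(rbind, lapply(priors, function(p) {
  summarise_posterior(p[1], p[2], y_toy, n_toy, theta0_toy)
}))

print(results, digits = 3)
```

Comparing the rows shows how far each prior pulls the summaries. With only 10 toy observations, the strong \( \text{Beta}(20, 5) \) prior dominates, whereas a larger data set would reduce its influence.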