Assignment 1

Author

Aki Vehtari et al.

1 General information

The exercises here refer to the lecture 1/BDA chapter 1 content, not the course infrastructure quiz. This assignment is meant to test whether you have sufficient knowledge to participate in the course. The first exercise checks that you remember the basic terms of probability calculus. The second exercise checks that you recognise the most important notation used throughout the course and in BDA3. In the third and fourth exercises you will solve some basic Bayes’ theorem questions to check your understanding of the basics of probability theory. The fifth exercise checks whether you recall the three steps of Bayesian Data Analysis as mentioned in chapter 1 of BDA3. The last exercise walks you through an example of how we can use models to generate distributions for outcomes of interest, applied to a simplified roulette table.

The exercises constitute 86% of the Quiz 1 grade.

We prepared a quarto notebook specific to this assignment to help you get started. You still need to fill in your answers on MyCourses! You can inspect this and future templates as a rendered html file (to access the qmd file click the “</> Code” button at the top right hand corner of the template).

General Instructions for Answering the Assignment Questions

Questions below are exact copies of the text found in the MyCourses quiz and should serve as a notebook where you can store notes and code.
We recommend opening these notebooks in the Aalto JupyterHub, see how to use R and RStudio remotely.
For code inspiration, have a look at the BDA R Demos and the specific Assignment code notebooks.
Recommended additional self study exercises for each chapter in BDA3 are listed in the course web page. These will help to gain deeper understanding of the topic.
Common questions and answers regarding installation and technical problems can be found in Frequently Asked Questions (FAQ).
Deadlines for all assignments can be found on the course web page and in MyCourses.
You are allowed to discuss assignments with your friends, but you are not allowed to copy solutions directly from other students or from the internet.
Do not share your answers publicly.
Do not copy answers from the internet or from previous years. We compare the answers to the answers from previous years and to the answers from other students this year.
Use of AI is allowed on the course, but most of the work needs to be done by the student, and you need to report whether you used AI and in which way you used it (see points 5 and 6 in Aalto guidelines for use of AI in teaching).
All suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.
If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository!
The decimal separator throughout the whole course is a dot, “.” (following the English language convention). Please be aware that MyCourses will not accept numerical answers with “,” as the decimal separator.
Unless stated otherwise: if the question instructions ask for reporting of numerical values in terms of the ith decimal digit, round to the ith decimal digit. For example, 0.0075 for two decimal digits should be reported as 0.01. More on this in Assignment 4.

1.1 Assignment questions

For convenience the assignment questions are copied below. Answer the questions in MyCourses.

Lecture 1/Chapter 1 of BDA Quiz (86% of grade)

1. Terminology

Match the following terms with the correct definition. Note that the order and set of possible answers are the same for questions 1.1 - 1.8. Check BDA chapter 1, the lecture slides, and Wikipedia if you are uncertain about the terms below.

1.1 Probability:

1.2 Probability mass (function):

1.3 Probability density (function):

1.4 Probability distribution:

1.5 Discrete probability distribution:

1.6 Continuous probability distribution:

1.7 Cumulative distribution function (cdf):

1.8 Likelihood:

2. Notation

Match the following notation with the correct definition:

2.1 \(\sim\) :

2.2 \(\propto\) :

2.3 \(\mathbb{E}\left[\right]\) :

2.4 \(p\left(y\vert\theta\right)\):

3. Bayes’ Theorem 1

A group of researchers has designed a new inexpensive and painless test for detecting lung cancer. The test is intended to be an initial screening test for the general population. A positive result (indicating the presence of lung cancer) from the test would be followed up immediately with medication, surgery or a more extensive and expensive test.

The researchers know from their studies the following facts:

The test gives a positive result 0.98 of the time when the test subject has lung cancer.
The test gives a negative result 0.96 of the time when the test subject does not have lung cancer.
In the general population, approximately one person in 1000 has lung cancer.

Here are some probability values that can help you figure out if you copied the right conditional probabilities from the question:

P(Test gives positive | Subject does not have lung cancer) = 0.04
P(Test gives positive and Subject has lung cancer) = 9.8 × 10^-4
- this is also referred to as the joint probability of the test being positive and the subject having lung cancer

Your goal is to calculate the probability of having cancer given a positive test result: P(cancer|positive).

3.1 Which quantity in Bayes’ Theorem does this represent?

3.2 What is the probability of the test having a positive result, given that the test subject has cancer (P(B|A))?

3.3 What is the probability of having cancer (P(A))?

3.5 What is the probability of having a positive test (P(B))?

3.6 Using your previous answers, what is the probability of having cancer given a positive test?
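The calculation pattern behind questions 3.2 to 3.6 can be sketched in R as follows. This is only an illustrative sketch: the numbers are hypothetical and deliberately differ from the ones in the question, and the variable names are not part of the course material.

```r
# Hypothetical inputs (NOT the values from the question above)
p_b_given_a     <- 0.90  # P(B | A)
p_b_given_not_a <- 0.20  # P(B | not A)
p_a             <- 0.10  # P(A)

# Law of total probability:
# P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b <- p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b <- p_b_given_a * p_a / p_b

print(p_b)
print(p_a_given_b)
```

To use the same pattern for the screening test, replace the three inputs with the conditional probabilities and the prevalence stated in the question.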

4. Bayes’ Theorem 2

We have three boxes, A, B, and C. There are

2 red balls and 5 white balls in the box A
4 red balls and 1 white ball in the box B
1 red ball and 3 white balls in the box C.

Consider a random experiment in which one of the boxes is randomly selected and from that box, one ball is randomly picked up. After observing the color of the ball it is replaced in the box it came from. Suppose also that on average box A is selected 40% of the time and box B 10% of the time (i.e. P(A) = 0.4).

4.1 What is the probability of picking a red ball from box A?

4.2 What is the probability of picking a red ball from box B?

4.3 What is the probability of picking a red ball from box C?

4.4 Considering the probabilities of selecting each box, what is the probability of picking a red ball (enter as a number between 0 and 1 with 2 decimal digit accuracy)?

4.5 If a red ball was picked, calculate the probability that it was picked from (enter as a number between 0 and 1 with 2 decimal digit accuracy):

Box A:
Box B:
Box C:
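With more than two alternatives, the same calculation can be written in vectorised form. The sketch below assumes three hypothetical boxes whose selection probabilities and red-ball probabilities are made up for illustration and do not match the boxes in the question.

```r
# Hypothetical selection probabilities for three boxes (must sum to 1)
prior <- c(A = 0.5, B = 0.3, C = 0.2)

# Hypothetical probabilities of drawing a red ball from each box
p_red_given_box <- c(A = 0.2, B = 0.5, C = 0.9)

# Joint probability of selecting each box AND drawing a red ball
joint <- prior * p_red_given_box

# Marginal probability of a red ball (law of total probability)
p_red <- sum(joint)

# Probability of each box given that a red ball was drawn (Bayes' theorem)
p_box_given_red <- joint / p_red

print(p_red)
print(round(p_box_given_red, 2))
```

Dividing the joint probabilities by their sum guarantees that the resulting conditional probabilities add up to 1, which is a useful sanity check for answers to 4.5.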

5. Three Steps of Bayesian Data Analysis

5.1 Select the three steps of Bayesian data analysis (see BDA3 p. 3):

6. A Binomial Model for the Roulette Table

In this course, models are used to explain social and physical data, and we will be able to generate data from our models which we can use for checking how well our model does. In this example, we show how to generate outcomes from a binomial model to explain outcomes of a roulette game (there is a connection to the history of statistics). Suppose a roulette table with only red and black colours. Roulette tables won’t be perfect and it’s likely that the probability of red vs black is not exactly 0.5 (the tables can have adjustments that are randomized each day to avoid long term bias).

Suppose your model for the count of reds is a Binomial, given the total number of trials and a probability of red as parameter theta. Set theta to 0.6 (this is much bigger than what we would expect in real roulette, but makes it easier as a teaching example) and, for a sequence of 100 equally spaced trial values between 10 and 1000, generate the proportion of observed reds (number of reds / number of trials). Generate 1000 random draws from your model for each trial value and save the data in a data frame with columns Proportions, Nsims and Trials. Incomplete code can be found below.

# load the tidyverse package for data manipulation and plotting
library(tidyverse)
library(ggplot2)

# proportion of red/black
theta <- # declare probability parameter for the binomial model

# Sequence of trials
trials <- seq(#start value of sequence,#end value of sequence,#value for spacing)

# Number of simulation draws from the model
nsims <- # number of simulations from the binomial model

# Helper function for getting the proportions
binom_gen <- function(trials,theta,nsims){
    df <-  as.data.frame(rbinom(nsims,trials,theta)/trials) |> mutate(nsims = nsims,trials = trials)
    colnames(df) <- c("Proportions","Nsims","Trials")
  return(df)
}

# Create a data frame containing the draws for each number of trials
proportion_60 <- do.call(rbind, lapply(trials, binom_gen, theta, nsims)) # lapply calls binom_gen once for each element of trials; do.call(rbind, ...) then stacks the resulting data frames row-wise
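To see what the building blocks of the template do before filling it in, the following minimal sketch may help. It uses small hypothetical values (not the ones asked for in the exercise), and the names ending in _demo are made up for this illustration.

```r
set.seed(1)  # make the random draws reproducible

# seq(from, to, by) returns an equally spaced sequence
trial_grid_demo <- seq(2, 10, by = 2)
print(trial_grid_demo)

# Hypothetical model settings for the demonstration
n_trials_demo <- 20   # number of spins per simulated session
theta_demo    <- 0.3  # probability of red
n_draws_demo  <- 5    # number of simulated sessions

# rbinom(n, size, prob) returns n counts of successes out of size trials
counts_demo <- rbinom(n_draws_demo, size = n_trials_demo, prob = theta_demo)

# Dividing by the number of trials turns counts into proportions in [0, 1]
props_demo <- counts_demo / n_trials_demo
print(props_demo)
```

The helper function binom_gen in the template performs the same division for a single trial value and then attaches the Nsims and Trials columns.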

6.1 Suppose you are unsure whether the code to create the data frame worked. Which of the following functions should you use in order to check on the structure of the dataframe object (assuming df below stands for a generic dataframe object)?

6.2 The structure checks out, but now you want to print the first 5 rows of the dataframe to check whether the values are as expected. Which of the following functions should you use?

6.3 The quick peek also checks out, but you would be more at ease scrolling through all the data; perhaps you’ll find some interesting patterns. Which of the following actions allows you to scroll through the data in a separate window (for the below, we assume that you have the code loaded in an RStudio session)?

Now, plot a histogram of the computed proportions for 10, 50 and 1000 trials, using the code below.

# Plot the Distributions
subset_df <- proportion_60[proportion_60$Trials %in% c(#trial values), ] # Subset your 

subset_df |> ggplot(aes(Proportions)) +
  geom_histogram(position = "identity" ,bins = 40) +
  facet_grid(cols = vars(Trials))  +
  ggtitle("Proportions for specific trials")

6.4 Which histogram below is the correct one for theta = 0.6?

6.5 What do these distributions refer to?

6.6 Given these histograms, which number of trials gives you the most certainty about the likely red/black proportion for that table?

6.7 Given the draws from the model, estimate the probability p(proportion<=0.5) for the model with 1000 trials (enter as a number between 0 and 1 with 2 decimal digit accuracy).
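A probability of this kind can be estimated from simulation draws as the fraction of draws that satisfy the condition. The sketch below illustrates the idea with hypothetical settings that differ from the exercise; the variable names are made up for this illustration.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical settings (NOT the ones used in the exercise)
n_trials_mc <- 50
theta_mc    <- 0.3
n_draws_mc  <- 10000

# Simulated proportions of successes
props_mc <- rbinom(n_draws_mc, size = n_trials_mc, prob = theta_mc) / n_trials_mc

# Comparing with a threshold gives TRUE/FALSE values; their mean is the
# fraction of draws at or below the threshold, i.e. a Monte Carlo estimate
# of p(proportion <= 0.25)
p_est_mc <- mean(props_mc <= 0.25)
print(p_est_mc)
```

The estimate varies slightly between runs unless the seed is fixed, and it becomes more precise as the number of draws grows.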

Suppose you are now certain that theta = 0.6. Plot the probability mass function (PMF) given 1000 trials using the code below.

size =  # number of trials
prob =  # probability of success

binom_data <- data.frame(
  Success = 0:size,
  Probability = dbinom(0:size, size = size, prob = prob)
)

ggplot(binom_data, aes(x = Success, y = Probability)) +
  geom_point() +
  geom_line() +
  labs(title = "PMF of Binomial Distribution", x = "Number of Successes", y = "Probability")

6.8 Which plot of the PMF is the correct one?

6.9 How does the PMF plot relate to the histogram of proportions plotted earlier?

6.10 Given the PMF for your model, calculate the probability of observing at most 500 red outcomes in 1000 trials using theta = 0.6. Use the pbinom function in R.
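As a reminder of how pbinom relates to the PMF, the following sketch checks, for small hypothetical values (not those of the exercise), that the cumulative probability equals the sum of the PMF values up to the same count. The variable names are made up for this illustration.

```r
# Hypothetical settings (NOT the ones used in the exercise)
size_demo <- 10   # number of trials
prob_demo <- 0.3  # probability of success
k_demo    <- 3    # upper limit for the number of successes

# P(X <= k) computed directly from the cumulative distribution function
cdf_direct <- pbinom(k_demo, size = size_demo, prob = prob_demo)

# The same probability as a sum of PMF values for 0, 1, ..., k successes
cdf_summed <- sum(dbinom(0:k_demo, size = size_demo, prob = prob_demo))

print(c(direct = cdf_direct, summed = cdf_summed))
```

Both values agree up to floating point error, since the cdf of a discrete distribution is the running sum of its PMF.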