This assignment is related to Lecture 3 and BDA3 Chapters 2 and 3. Use Frank Harrell’s recommendations on how to state results in Bayesian two group comparisons (and note that there is no point null hypothesis testing in this assignment).
Reading instructions:
The following installs and loads the `aaltobda` package:
if(!require(aaltobda)){
    install.packages("remotes")
    remotes::install_github("avehtari/BDA_course_Aalto", subdir = "rpackage", upgrade="never")
    library(aaltobda)
}
Loading required package: aaltobda
The following will set up the `markmyassignment` package to check your functions at the end of the notebook:
Loading required package: markmyassignment
assignment_path = "https://github.com/avehtari/BDA_course_Aalto/blob/master/tests/assignment3.yml"
set_assignment(assignment_path)
Assignment set:
assignment3: Bayesian Data Analysis: Assignment 3
The assignment contain the following (6) tasks:
- mu_point_est
- mu_interval
- mu_pred_interval
- mu_pred_point_est
- posterior_odds_ratio_point_est
- posterior_odds_ratio_interval
The following installs and loads the `latex2exp` package, which allows us to use LaTeX in plots:

The following installs and loads the `posterior` package, which allows us to use its `rvar` random variable datatype:
Loading required package: posterior
This is posterior version 1.6.0
Attaching package: 'posterior'
The following object is masked from 'package:aaltobda':
mcse_quantile
The following objects are masked from 'package:stats':
mad, sd, var
The following objects are masked from 'package:base':
%in%, match
The following installs and loads the `ggdist` package along with other packages for advanced plotting functions:
A factory has a production line for manufacturing car windshields. A sample of windshields has been taken for testing hardness. The observed hardness values \(\mathbf{y}_1\) can be found in the dataset `windshieldy1` in the `aaltobda` package.
We may assume that the observations follow a normal distribution with an unknown standard deviation \(\sigma\). We wish to obtain information about the unknown average hardness \(\mu\). For simplicity, we assume the standard uninformative prior discussed in the book, that is, \(p(\mu, \sigma) \propto \sigma^{-1}\). It is not necessary to derive the posterior distribution in the report, as it has already been done in the book (see Section 3.2).
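As a reminder (the full derivation is in BDA3, Section 3.2), the marginal posterior of \(\mu\) under this prior is a Student's \(t\) distribution with \(n - 1\) degrees of freedom: \[ \left. \frac{\mu - \bar{y}}{s/\sqrt{n}} \,\right|\, y \sim t_{n-1}, \] where \(\bar{y}\) is the sample mean, \(s^2\) the sample variance, and \(n\) the number of observations.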
Loading the library and the data.
data("windshieldy1")
# The data are now stored in the variable `windshieldy1`.
# The below displays the data:
windshieldy1
[1] 13.357 14.928 14.896 15.297 14.820 12.067 14.824 13.865 17.447
The data below is only for the tests; you need to change to the full data `windshieldy1` when reporting your results.
Write your answers and code here!
Keep the below name and format for the functions to work with `markmyassignment`:
# Useful functions: mean(), length(), sqrt(), sum()
# and qtnew(), dtnew() (from aaltobda)
mu_point_est <- function(data) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
14.5
}
mu_interval <- function(data, prob = 0.95) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
c(13.3, 15.7)
}
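If a starting point helps, below is a minimal sketch (one possible approach, with assumed `_sketch` names so it does not overwrite the graded stubs) that uses the Student's \(t\) marginal posterior above together with `qtnew()` from `aaltobda`, here assumed to take its arguments in the same order as `dtnew()` used later in this notebook:
# Minimal sketch, assuming (mu - mean(y)) / (sd(y)/sqrt(n)) | y ~ t_{n-1} (BDA3, Section 3.2).
mu_point_est_sketch <- function(data) {
  # Under this prior, the posterior mean of mu is the sample mean.
  mean(data)
}
mu_interval_sketch <- function(data, prob = 0.95) {
  n <- length(data)
  location <- mean(data)
  scale <- sd(data) / sqrt(n)
  alpha <- 1 - prob
  # Central posterior interval from the quantiles of the shifted and scaled t distribution.
  qtnew(c(alpha / 2, 1 - alpha / 2), n - 1, location, scale)
}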
You can plot the density as below if you implement `mu_pdf` to compute the PDF of the posterior \(p(\mu|y)\) of the average hardness \(\mu\).
mu_pdf <- function(data, x){
# Compute necessary parameters here.
# These are the correct parameters for `windshieldy_test`
# with the provided uninformative prior.
df = 3
location = 14.5
scale = 0.3817557
# Use the computed parameters as below to compute the PDF:
dtnew(x, df, location, scale)
}
x_interval = mu_interval(windshieldy1, .999)
lower_x = x_interval[1]
upper_x = x_interval[2]
x = seq(lower_x, upper_x, length.out=1000)
plot(
x, mu_pdf(windshieldy1, x), type="l",
xlab=TeX(r'(average hardness $\mu$)'),
ylab=TeX(r'(PDF of the posterior $p(\mu|y)$)')
)
# Define the sufficient statistics used by the code below
y <- windshieldy1
n <- length(y)
s2 <- var(y)
my <- mean(y)
# helper functions to sample from and evaluate
# scaled inverse chi-squared distribution
rsinvchisq <- function(n, nu, s2, ...) nu*s2 / rchisq(n , nu, ...)
dsinvchisq <- function(x, nu, s2){
exp(log(nu/2)*nu/2 - lgamma(nu/2) + log(s2)/2*nu - log(x)*(nu/2+1) - (nu*s2/2)/x)
}
# Sample 1000 draws from marginal posteriors
ns <- 1000
sigma2 <- rsinvchisq(ns, n-1, s2)
mu <- my + sqrt(sigma2/n)*rnorm(length(sigma2))
sigma <- sqrt(sigma2)
# Compute the densities in a grid; ranges for the grids:
t1l <- c(10, 20)
t2l <- c(0.5, 7)
nl <- c(0.001, 50)
t1 <- seq(t1l[1], t1l[2], length.out = ns)
t2 <- seq(t2l[1], t2l[2], length.out = ns)
# Compute the exact marginal density of mu:
# multiplication by 1./sqrt(s2/n) is due to the transformation of
# variable z=(x-mean(y))/sqrt(s2/n), see BDA3 p. 21
pm <- dt((t1-my) / sqrt(s2/n), n-1) / sqrt(s2/n)
# Estimate the marginal density using samples and ad hoc Gaussian kernel approximation
pmk <- density(mu, adjust = 2, n = ns, from = t1l[1], to = t1l[2])$y
# Compute the exact marginal density of sigma:
# the multiplication by 2*t2 is due to the transformation of
# variable z=t2^2, see BDA3 p. 21
ps <- dsinvchisq(t2^2, n-1, s2) * 2*t2
# Estimate the marginal density using samples and ad hoc Gaussian kernel approximation
psk <- density(sigma, n = ns, from = t2l[1], to = t2l[2])$y
# Evaluate the joint density in a grid. Note that the following is not normalized, but for plotting contours it does not matter:
# Combine grid points into another data frame
# with all pairwise combinations
dfj <- data.frame(t1 = rep(t1, each = length(t2)),
t2 = rep(t2, length(t1)))
dfj$z <- dsinvchisq(dfj$t2^2, n-1, s2) * 2*dfj$t2 * dnorm(dfj$t1, my, dfj$t2/sqrt(n))
# breaks for plotting the contours
cl <- seq(1e-5, max(dfj$z), length.out = 6)
# Now, visualise the joint and marginal densities
dfm <- data.frame(t1, Exact = pm, Empirical = pmk) |>
pivot_longer(cols = !t1, names_to="grp", values_to="p")
margmu <- ggplot(dfm) +
geom_line(aes(t1, p, color = grp)) +
coord_cartesian(xlim = t1l) +
labs(title = 'Marginal of mu', x = '', y = '') +
scale_y_continuous(breaks = NULL) +
theme(legend.background = element_blank(),
      legend.position = "inside",
      legend.position.inside = c(0.75, 0.8),
      legend.title = element_blank())
dfs <- data.frame(t2, Exact = ps, Empirical = psk) |>
pivot_longer(cols = !t2, names_to="grp", values_to="p")
margsig <- ggplot(dfs) +
geom_line(aes(t2, p, color = grp)) +
coord_flip(xlim = t2l) +
labs(title = 'Marginal of sigma', x = '', y = '') +
scale_y_continuous(breaks = NULL) +
theme(legend.background = element_blank(),
      legend.position = "inside",
      legend.position.inside = c(0.75, 0.8),
      legend.title = element_blank())
joint1labs <- c('Samples','Exact contour')
joint1 <- ggplot() +
geom_point(data = data.frame(mu,sigma), aes(mu, sigma, col = '1'), size = 0.1) +
geom_contour(data = dfj, aes(t1, t2, z = z, col = '2'), breaks = cl) +
coord_cartesian(xlim = t1l,ylim = t2l) +
labs(title = 'Joint posterior', x = '', y = '') +
scale_y_continuous(labels = NULL) +
scale_x_continuous(labels = NULL) +
scale_color_manual(values=c('blue', 'black'), labels = joint1labs) +
guides(color = guide_legend(nrow = 1, override.aes = list(
shape = c(16, NA), linetype = c(0, 1), size = c(2, 1)))) +
theme(legend.background = element_blank(),
      legend.position = "inside",
      legend.position.inside = c(0.5, 0.9),
      legend.title = element_blank())
# blank plot for combining the plots
bp <- grid.rect(gp = gpar(col = 'white'))
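The panels defined above can then be combined into one figure, for example with `grid.arrange()` from the `gridExtra` package (assumed to be among the plotting packages loaded earlier; the layout below is just one possibility):
# Arrange the joint posterior together with the two marginals; `bp` fills the empty corner.
grid.arrange(joint1, margsig, margmu, bp, nrow = 2)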
Write your answers and code here!
Keep the below name and format for the functions to work with `markmyassignment`:
# Useful functions: mean(), length(), sqrt(), sum()
# and qtnew(), dtnew() (from aaltobda)
mu_pred_point_est <- function(data) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
14.5
}
mu_pred_interval <- function(data, prob = 0.95) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
c(11.8, 17.2)
}
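A corresponding minimal sketch for the predictive quantities, assuming the posterior predictive distribution of a new observation satisfies \((\tilde{y} - \bar{y}) / (s\sqrt{1 + 1/n}) \mid y \sim t_{n-1}\) (BDA3, Chapter 3), again with assumed `_sketch` names:
# Minimal sketch for the posterior predictive of a new hardness observation.
mu_pred_point_est_sketch <- function(data) {
  # The posterior predictive distribution is centered at the sample mean.
  mean(data)
}
mu_pred_interval_sketch <- function(data, prob = 0.95) {
  n <- length(data)
  location <- mean(data)
  scale <- sd(data) * sqrt(1 + 1 / n)
  alpha <- 1 - prob
  qtnew(c(alpha / 2, 1 - alpha / 2), n - 1, location, scale)
}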
You can plot the density as below if you implement `mu_pred_pdf` to compute the PDF of the posterior predictive \(p(\tilde{y}|y)\) of a new hardness observation \(\tilde{y}\).
mu_pred_pdf <- function(data, x){
# Compute necessary parameters here.
# These are the correct parameters for `windshieldy_test`
# with the provided uninformative prior.
df = 3
location = 14.5
scale = 0.8536316
# Use the computed parameters as below to compute the PDF:
dtnew(x, df, location, scale)
}
x_interval = mu_pred_interval(windshieldy1, .999)
lower_x = x_interval[1]
upper_x = x_interval[2]
x = seq(lower_x, upper_x, length.out=1000)
plot(
x, mu_pred_pdf(windshieldy1, x), type="l",
xlab=TeX(r'(new hardness observation $\tilde{y}$)'),
ylab=TeX(r'(PDF of the posterior predictive $p(\tilde{y}|y)$)')
)
An experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients was randomly assigned to treatment and control groups: out of 674 patients receiving the control, 39 died, and out of 680 receiving the treatment, 22 died. Assume that the outcomes are independent and binomially distributed, with probabilities of death of \(p_0\) and \(p_1\) under the control and treatment, respectively. Set up a noninformative or weakly informative prior distribution on \((p_0,p_1)\).
Summarize the posterior distribution for the odds ratio, \[ \mathrm{OR} = (p_1/(1-p_1))/(p_0/(1-p_0)). \]
In order to help you answer the quiz questions, we recommend that you:
- Use Frank Harrell’s recommendations on how to state results in Bayesian two group comparisons.
- Note that with a conjugate prior, the posterior has a closed-form Beta distribution for each group separately (see the equations in the book). You can use `rbeta()` to sample from the posterior distributions of \(p_0\) and \(p_1\), and use these samples together with the odds ratio equation to get a sample from the distribution of the odds ratio, as sketched below.
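A minimal sketch of that sampling step, assuming uniform Beta(1, 1) priors on \(p_0\) and \(p_1\) (one noninformative choice; other weakly informative priors are equally acceptable) and using the death counts given above; the variable names and the number of draws are illustrative:
ndraws <- 100000                                   # number of posterior draws (arbitrary)
p0_draws <- rbeta(ndraws, 1 + 39, 1 + 674 - 39)    # control: 39 deaths out of 674 patients
p1_draws <- rbeta(ndraws, 1 + 22, 1 + 680 - 22)    # treatment: 22 deaths out of 680 patients
or_draws <- (p1_draws / (1 - p1_draws)) / (p0_draws / (1 - p0_draws))  # odds ratio draws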
Write your answers and code here!
The below data is only for the tests:
Keep the below name and format for the functions to work with `markmyassignment`:
# Useful function: mean(), quantile()
posterior_odds_ratio_point_est <- function(p0, p1) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
2.650172
}
posterior_odds_ratio_interval <- function(p0, p1, prob = 0.95) {
# Do computation here, and return as below.
# This is the correct return value for the test data provided above.
c(0.6796942,7.3015964)
}
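A minimal sketch of how these could be computed from posterior draws of \(p_0\) and \(p_1\) (with assumed `_sketch` names so they do not collide with the graded stubs):
posterior_odds_ratio_point_est_sketch <- function(p0, p1) {
  # Posterior mean of the odds ratio, estimated from the draws.
  or <- (p1 / (1 - p1)) / (p0 / (1 - p0))
  mean(or)
}
posterior_odds_ratio_interval_sketch <- function(p0, p1, prob = 0.95) {
  # Central posterior interval from the empirical quantiles of the odds ratio draws.
  or <- (p1 / (1 - p1)) / (p0 / (1 - p0))
  alpha <- 1 - prob
  unname(quantile(or, c(alpha / 2, 1 - alpha / 2)))
}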
The `posterior` package’s random variable datatype `rvar` is a “sample-based representation of random variables” which makes handling of random samples (of draws), such as the ones contained in the above variables `p0` and `p1`, easier. By default, it prints as the mean and standard deviation of the draws, such that `rvar(p0)` prints as 0.05 ± 0.021 and `rvar(p1)` prints as 0.1 ± 0.029.
The datatype is “designed to […] be able to be used inside `data.frame()`s and `tibble()`s, and to be used with distribution visualizations in the ggdist package.” The code below sets up an R `data.frame()` with the draws in `p0` and `p1` wrapped in an `rvar`, and uses that data frame to visualize the draws using `ggdist`, an R package building on `ggplot2` and “designed for both frequentist and Bayesian uncertainty visualization”.
The below plot, Figure 3, uses `ggdist`’s `stat_dotsinterval()`, which by default visualizes
- the `rvar`’s median and central 66% and 95% intervals using a black dot and lines of varying thicknesses, as when using `ggdist`’s `stat_pointinterval()`, and
- the `rvar`’s draws using grey dots, as when using `ggdist`’s `stat_dots()`:
r0 = rvar(p0)
r1 = rvar(p1)
ggplot(data.frame(
rv_name=c("control", "treatment"), rv=c(r0, r1)
)) +
aes(xdist=rv, y=rv_name) +
labs(x="probabilities of death", y="patient group") +
stat_dotsinterval()
`rvar`s make it easy to compute functions of random variables, such as:
- `r0 - r1` computes an `rvar` which prints as -0.05 ± 0.037, indicating the sample mean and the sample standard deviation of the difference of the probabilities of death,
- `r0 * r1` computes an `rvar` which prints as 0.005 ± 0.0026, which in this case has no great interpretation.

Below, in Figure 4, we compute the odds ratio using the `rvar`s and visualize its median, central intervals and draws, as above in Figure 3:
rodds_ratio = (r1/(1-r1))/(r0/(1-r0))
ggplot(data.frame(
rv=c(rodds_ratio)
)) +
aes(xdist=rv) +
labs(x="odds ratio", y="relative amount of draws") +
stat_dotsinterval()
You can use Figure 4 to visually check whether the answers you computed make sense.
Consider a case where the same factory has two production lines for manufacturing car windshields. Independent samples from the two production lines were tested for hardness. The hardness measurements for the two samples \(\mathbf{y}_1\) and \(\mathbf{y}_2\) can be found in the datasets `windshieldy1` and `windshieldy2` in the `aaltobda` package.
We assume that the samples have unknown standard deviations \(\sigma_1\) and \(\sigma_2\). Use the uninformative prior.
data("windshieldy2")
# The new data are now stored in the variable `windshieldy2`.
# The below displays the first few rows of the new data:
head(windshieldy2)
[1] 15.980 14.206 16.011 17.250 15.993 15.722
With a conjugate prior, the posterior has a closed-form Student’s \(t\) form for each group separately (see the equations in the book). You can use the `rtnew()` function to sample from the posterior distributions of \(\mu_1\) and \(\mu_2\), and use these samples to get a sample from the distribution of the difference \(\mu_d = \mu_1 - \mu_2\), as sketched below.
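A minimal sketch of that sampling step, assuming `rtnew()` takes its arguments in the same order as `dtnew()` above (number of draws, degrees of freedom, location, scale); the number of draws is arbitrary:
ndraws <- 100000
# Marginal posteriors: mu_i | y_i follows a t distribution with n_i - 1 degrees of freedom,
# centered at mean(y_i) with scale sd(y_i)/sqrt(n_i).
n1 <- length(windshieldy1)
n2 <- length(windshieldy2)
mu1_draws <- rtnew(ndraws, n1 - 1, mean(windshieldy1), sd(windshieldy1) / sqrt(n1))
mu2_draws <- rtnew(ndraws, n2 - 1, mean(windshieldy2), sd(windshieldy2) / sqrt(n2))
mud_draws <- mu1_draws - mu2_draws  # draws from the distribution of mu_d = mu_1 - mu_2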
Write your answers and code here!