These notes help you focus on the most important parts of each chapter related to the Bayesian Data Analysis course. Before reading a chapter, you can check below which sections, pages, and terms are the most important. After reading the chapter or following the corresponding lecture, you can check here for additional clarifications. There are also some notes for the chapters not included in the course.

Chapter 1
is related to the pre-requisites and Lecture 1
*Introduction*.

- 1.1-1.3 important terms, especially 1.3 for the notation
- 1.4 an example related to the first exercise, and another practical example
- 1.5 foundations
- 1.6 good example related to visualization exercise
- 1.7 example which can be skipped
- 1.8 background material, good to read before doing the first assignment
- 1.9 background material, good to read before doing the second assignment
- 1.10 a point of view for using Bayesian inference

Find all the terms and symbols listed below. Note that some of the terms are for now only briefly introduced and will be covered later in more detail. When reading the chapter, write down questions about things that are unclear to you or that you think might be unclear for others.

- full probability model
- posterior distribution
- potentially observable quantity
- quantities that are not directly observable
- exchangeability
- independently and identically distributed
- \(\theta, y, \tilde{y}, x, X, p(\cdot|\cdot), p(\cdot), \operatorname{Pr}(\cdot), \sim, H\)
- sd, E, var
- Bayes rule
- prior distribution
- sampling distribution, data distribution
- joint probability distribution
- posterior density
- probability
- density
- distribution
- \(p(y|\theta)\) as a function of \(y\) or \(\theta\)
- likelihood
- posterior predictive distribution
- probability as measure of uncertainty
- subjectivity and objectivity
- transformation of variables
- simulation
- inverse cumulative distribution function

Optional but recommended end-of-chapter exercises in BDA3 to get a better understanding of the chapter topic:

The symbol \(\propto\) means *proportional to*: the left hand side is equal to the right hand side up to a constant multiplier. For instance, if \(y=2x\), then \(y \propto x\). It's `\propto` in LaTeX. See Proportionality in Wikipedia.

The term \(p(y|\theta,M)\) has two different names depending on the situation. Due to the short notation used, there is a possibility of confusion.

- The term \(p(y|\theta,M)\) is called a *model* (sometimes more specifically an *observation model* or *statistical model*) when it is used to describe uncertainty about \(y\) given \(\theta\) and \(M\). The longer notation \(p_y(y|\theta,M)\) shows explicitly that it is a function of \(y\).
- In Bayes rule, the term \(p(y|\theta,M)\) is called the *likelihood function*. The posterior distribution describes the probability (or probability density) of different values of \(\theta\) given a fixed \(y\), and thus when the posterior is computed, the terms on the right hand side of Bayes rule are also evaluated as functions of \(\theta\) given fixed \(y\). The longer notation \(p_\theta(y|\theta,M)\) shows explicitly that it is a function of \(\theta\). The term has its own name (likelihood) to distinguish it from the model. The likelihood function is an unnormalized probability distribution describing uncertainty about \(\theta\) (and that is why Bayes rule has a normalization term, which produces the posterior distribution).
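The distinction can be checked numerically. The sketch below uses a binomial model purely as an illustrative choice (the values of `n`, `theta`, and `y_obs` are made up): as a function of \(y\) with \(\theta\) fixed, \(p(y|\theta)\) sums to 1; as a function of \(\theta\) with \(y\) fixed, the same expression is the likelihood and does not integrate to 1.

```python
import numpy as np
from scipy.stats import binom

n, theta, y_obs = 10, 0.3, 4  # illustrative values only

# Observation model: p(y | theta) as a function of y, with theta fixed.
# This is a proper probability mass function: it sums to 1 over y.
pmf = binom.pmf(np.arange(n + 1), n, theta)
print(pmf.sum())  # 1.0 (up to floating point)

# Likelihood: the same expression as a function of theta, with y fixed.
# It is not a normalized density in theta: for the binomial, its integral
# over [0, 1] is 1 / (n + 1), which is why Bayes rule needs the
# normalization term.
theta_grid = np.linspace(0.0, 1.0, 10001)
lik = binom.pmf(y_obs, n, theta_grid)
integral = lik.sum() * (theta_grid[1] - theta_grid[0])
print(integral)  # approx 1 / (n + 1) = 0.0909..., not 1
```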

Epistemic and aleatory uncertainty are reviewed nicely in the article: Tony O'Hagan, "Dicing with the unknown", Significance 1(3):132-133, 2004.

In that paper, there is one typo: the word *aleatory* is used where *epistemic* is meant (if you notice it, it's quite obvious).

- See BDA3 p. 21

- In \(p(y|\theta)\)
- \(y\) can be variable or value
- we could clarify by using \(p(Y|\theta)\) or \(p(y|\theta)\)

- \(\theta\) can be variable or value
- we could clarify by using \(p(y|\Theta)\) or \(p(y|\theta)\)

- \(p\) can be a discrete or
continuous function of \(y\) or \(\theta\)
- we could clarify by using \(P_Y\), \(P_\Theta\), \(p_Y\) or \(p_\Theta\)

- \(P_Y(Y|\Theta=\theta)\) is a probability mass function, sampling distribution, observation model
- \(P(Y=y|\Theta=\theta)\) is a probability
- \(P_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
- \(p_Y(Y|\Theta=\theta)\) is a probability density function, sampling distribution, observation model
- \(p(Y=y|\Theta=\theta)\) is a density
- \(p_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
- \(y\) and \(\theta\) can also be mix of continuous and discrete
- Due to sloppy notation, *likelihood* is sometimes used to refer to \(P_{Y,\theta}(Y|\Theta)\) or \(p_{Y,\theta}(Y|\Theta)\)


You don’t need to understand or use the term exchangeability before Chapter 5 and Lecture 7. Until then, it is sufficient to know that 1) independence is a stronger condition than exchangeability, 2) independence implies exchangeability, 3) exchangeability does not imply independence, and 4) exchangeability is related to what information is available, rather than to the properties of the unknown underlying data generating mechanism. If you want to know more about exchangeability right now, read BDA3 Section 5.2 and the notes for Chapter 5.

Chapter 2
is related to the prerequisites and Lecture 2 *Basics of Bayesian
inference*.

- 2.1 Binomial model (e.g. biased coin flipping)
- 2.2 Posterior as compromise between data and prior information
- 2.3 Posterior summaries
- 2.4 Informative prior distributions (skip exponential families and sufficient statistics)
- 2.5 Gaussian model with known variance
- 2.6 Other single parameter models
- in this course the normal distribution with known mean but unknown variance is the most important
- glance through Poisson and exponential

- 2.7 glance through this example, which illustrates the benefits of prior information; no need to read all the details (it's quite a long example)
- 2.8 Noninformative priors
- 2.9 Weakly informative priors
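As a small sketch combining Sections 2.1 and 2.3: with a binomial observation and a uniform Beta(1,1) prior, the posterior is Beta(\(y+1\), \(n-y+1\)), and the usual posterior summaries can be read off directly. The data values below are made up for illustration:

```python
from scipy.stats import beta

# Made-up data for illustration: y successes in n Bernoulli trials.
n, y = 20, 7

# Uniform Beta(1, 1) prior + binomial likelihood -> Beta posterior.
posterior = beta(y + 1, n - y + 1)

print("mean:", posterior.mean())      # (y + 1) / (n + 2)
print("median:", posterior.median())
print("sd:", posterior.std())
# 95% central posterior interval = 2.5% and 97.5% quantiles.
lo, hi = posterior.ppf([0.025, 0.975])
print("95% central interval:", (lo, hi))
```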

Laplace’s approach for approximating integrals is discussed in more detail in Chapter 4.

Find all the terms and symbols listed below. When reading the chapter, write down questions about things that are unclear to you or that you think might be unclear for others.

- binomial model
- Bernoulli trial
- \(\mathop{\mathrm{Bin}}\), \(\binom{n}{y}\)
- Laplace’s law of succession
- think about which expectations are used in eqs. (2.7)-(2.8)
- summarizing posterior inference
- mode, mean, median, standard deviation, variance, quantile
- central posterior interval
- highest posterior density interval / region
- uninformative / informative prior distribution
- principle of insufficient reason
- hyperparameter
- conjugacy, conjugate family, conjugate prior distribution, natural conjugate prior
- nonconjugate prior
- normal distribution
- conjugate prior for mean of normal distribution with known variance
- posterior for mean of normal distribution with known variance
- precision
- posterior predictive distribution
- normal model with known mean but unknown variance
- proper and improper prior
- unnormalized density
- difficulties with noninformative priors
- weakly informative priors

- 2.1: Binomial model and Beta posterior. R. Python.
- 2.2: Comparison of posterior distributions with different parameter values for the Beta prior distribution. R. Python.
- 2.3: Use samples to plot histogram with quantiles, and the same for a transformed variable. R. Python.
- 2.4: Grid sampling using inverse-cdf method. R. Python.
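The inverse-cdf grid sampling idea of demo 2.4 can be sketched as follows. The unnormalized target density here is an arbitrary hypothetical choice, not the one used in the actual demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized posterior density evaluated on a grid
# (an arbitrary shape chosen just for illustration).
grid = np.linspace(0.0, 1.0, 1000)
unnorm = grid**2 * (1.0 - grid) ** 4

# Normalize into a discrete distribution over the grid points.
probs = unnorm / unnorm.sum()

# Inverse-cdf sampling: draw u ~ Uniform(0, 1) and pick the first grid
# point whose cumulative probability exceeds u.
cdf = np.cumsum(probs)
u = rng.uniform(size=10000)
idx = np.minimum(np.searchsorted(cdf, u), grid.size - 1)
samples = grid[idx]

print(samples.mean())  # close to the mean of the discrete target
```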

Optional but recommended end-of-chapter exercises in BDA3 to get a better understanding of the chapter topic:

- 2.1-2.5, 2.8, 2.9, 2.14, 2.17, 2.22 (model solutions available for 2.1-2.5, 2.7-2.13, 2.16, 2.17, 2.20, and 2.14 is in course slides)

Chapter 2 has an example of analysing the ratio of girls born in
Paris 1745–1770. Laplace used a binomial model and a uniform prior, which
produces a Beta distribution as the posterior distribution. Laplace wanted to
calculate \(p(\theta \geq 0.5)\), which
is obtained as \[
\begin{aligned}
p(\theta \geq 0.5) &= \int_{0.5}^1 p(\theta|y,n,M) d\theta \\
&= \frac{493473!}{241945!\,251527!} \int_{0.5}^1
\theta^y(1-\theta)^{n-y} d\theta
\end{aligned}
\] Note that \(\Gamma(n)=(n-1)!\). The integral has a form
called the *incomplete Beta function*. Bayes and Laplace had
difficulties in computing it, but nowadays there are several series
and continued fraction expressions. Furthermore, the
normalization term is usually computed via \(\log(\Gamma(\cdot))\) without
explicitly computing \(\Gamma(\cdot)\).
Bayes was able to solve the integral for small \(n\) and \(y\). For large \(n\) and \(y\), Laplace used a Gaussian approximation of
the posterior (more in Chapter 4). In this specific case, R's
`pbeta` gives the same result as Laplace's with at
least 3 digit accuracy.
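A Python analogue of the `pbeta` computation can be sketched with `scipy.special.betainc`, the regularized incomplete Beta function \(I_x(a,b)\). The upper tail is computed directly via the symmetry \(I_x(a,b) = 1 - I_{1-x}(b,a)\), using the Paris counts from the example above:

```python
from scipy.special import betainc

# Girl and total birth counts for Paris 1745-1770 from the example above.
y, n = 241945, 493472

# The posterior is Beta(y + 1, n - y + 1).  Using the symmetry
# I_x(a, b) = 1 - I_{1-x}(b, a) of the regularized incomplete Beta
# function, the upper tail at 0.5 is
#   p(theta >= 0.5) = I_{0.5}(n - y + 1, y + 1).
tail = betainc(n - y + 1, y + 1, 0.5)
print(tail)  # approx 1.15e-42, matching Laplace's result

# The naive route 1 - cdf fails: the cdf is so close to 1 that the
# tiny tail mass is lost to double-precision rounding.
naive = 1.0 - betainc(y + 1, n - y + 1, 0.5)
print(naive)
```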

Laplace calculated \[
p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}.
\] Correspondingly, Laplace could have calculated \[
p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M),
\] which in theory could be computed in R with
`1-pbeta(0.5,y+1,n-y+1)`. In practice this fails due to the
limitations of the floating point representation used by computers.
In R the largest floating point number w