These notes help you focus on the most important parts of each chapter related to the Bayesian Data Analysis course. Before reading a chapter, you can check below which sections, pages, and terms are the most important. After reading the chapter or following the corresponding lecture, you can check here for additional clarifications. There are also some notes for the chapters not included in the course.

Chapter 1
is related to the pre-requisites and Lecture 1
*Introduction*.

- 1.1-1.3 important terms, especially 1.3 for the notation
- 1.4 an example related to the first exercise, and another practical example
- 1.5 foundations
- 1.6 good example related to visualization exercise
- 1.7 example which can be skipped
- 1.8 background material, good to read before doing the first assignment
- 1.9 background material, good to read before doing the second assignment
- 1.10 a point of view for using Bayesian inference

Find all the terms and symbols listed below. Note that some of the terms are for now only briefly introduced and will be covered later in more detail. When reading the chapter, write down questions about things that are unclear to you or that you think might be unclear for others.

- full probability model
- posterior distribution
- potentially observable quantity
- quantities that are not directly observable
- exchangeability
- independently and identically distributed
- \(\theta, y, \tilde{y}, x, X, p(\cdot|\cdot), p(\cdot), \operatorname{Pr}(\cdot), \sim, H\)
- sd, E, var
- Bayes rule
- prior distribution
- sampling distribution, data distribution
- joint probability distribution
- posterior density
- probability
- density
- distribution
- \(p(y|\theta)\) as a function of \(y\) or \(\theta\)
- likelihood
- posterior predictive distribution
- probability as measure of uncertainty
- subjectivity and objectivity
- transformation of variables
- simulation
- inverse cumulative distribution function

Optional but recommended end-of-chapter exercises in BDA3 to get a better understanding of the chapter topic:

The symbol \(\propto\) means *proportional to*: the left hand side is equal to the right hand side up to a constant multiplier. For instance, if \(y=2x\), then \(y \propto x\). It's `\propto` in LaTeX. See Proportionality in Wikipedia.

The term \(p(y|\theta,M)\) has two different names depending on the situation. Due to the short notation used, there is a possibility of confusion.

- The term \(p(y|\theta,M)\) is called a *model* (sometimes more specifically an *observation model* or *statistical model*) when it is used to describe uncertainty about \(y\) given \(\theta\) and \(M\). The longer notation \(p_y(y|\theta,M)\) shows explicitly that it is a function of \(y\).
- In Bayes rule, the term \(p(y|\theta,M)\) is called the *likelihood function*. The posterior distribution describes the probability (or probability density) of different values of \(\theta\) given a fixed \(y\), and thus when the posterior is computed, the terms on the right hand side of Bayes rule are also evaluated as functions of \(\theta\) given fixed \(y\). The longer notation \(p_\theta(y|\theta,M)\) shows explicitly that it is a function of \(\theta\). The term has its own name (likelihood) to distinguish it from the model. The likelihood function is an unnormalized probability distribution describing uncertainty about \(\theta\) (and that is why Bayes rule has a normalization term, which produces the posterior distribution).
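The distinction can be checked numerically. The sketch below uses a binomial model purely as an illustrative choice (the values of `n`, `theta`, and `y_obs` are made up): as a function of \(y\) with \(\theta\) fixed, \(p(y|\theta)\) sums to 1; as a function of \(\theta\) with \(y\) fixed, the same expression is the likelihood and does not integrate to 1.

```python
import numpy as np
from scipy.stats import binom

n, theta, y_obs = 10, 0.3, 4  # illustrative values only

# Observation model: p(y | theta) as a function of y, with theta fixed.
# This is a proper probability mass function: it sums to 1 over y.
pmf = binom.pmf(np.arange(n + 1), n, theta)
print(pmf.sum())  # 1.0 (up to floating point)

# Likelihood: the same expression as a function of theta, with y fixed.
# It is not a normalized density in theta: for the binomial, its integral
# over [0, 1] is 1 / (n + 1), which is why Bayes rule needs the
# normalization term.
theta_grid = np.linspace(0.0, 1.0, 10001)
lik = binom.pmf(y_obs, n, theta_grid)
integral = lik.sum() * (theta_grid[1] - theta_grid[0])
print(integral)  # approx 1 / (n + 1) = 0.0909..., not 1
```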

Epistemic and aleatory uncertainty are reviewed nicely in the article: Tony O'Hagan, "Dicing with the unknown", Significance 1(3):132-133, 2004.

In that paper, there is one typo: the word *aleatory* is used where *epistemic* is meant (if you notice it, it's quite obvious).

- See BDA3 p. 21

- In \(p(y|\theta)\)
- \(y\) can be variable or value
- we could clarify by using \(p(Y|\theta)\) or \(p(y|\theta)\)

- \(\theta\) can be variable or value
- we could clarify by using \(p(y|\Theta)\) or \(p(y|\theta)\)

- \(p\) can be a discrete or
continuous function of \(y\) or \(\theta\)
- we could clarify by using \(P_Y\), \(P_\Theta\), \(p_Y\) or \(p_\Theta\)

- \(P_Y(Y|\Theta=\theta)\) is a probability mass function, sampling distribution, observation model
- \(P(Y=y|\Theta=\theta)\) is a probability
- \(P_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
- \(p_Y(Y|\Theta=\theta)\) is a probability density function, sampling distribution, observation model
- \(p(Y=y|\Theta=\theta)\) is a density
- \(p_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
- \(y\) and \(\theta\) can also be mix of continuous and discrete
- Due to sloppy notation, *likelihood* is sometimes used to refer to \(P_{Y,\theta}(Y|\Theta)\) or \(p_{Y,\theta}(Y|\Theta)\)


You don’t need to understand or use the term exchangeability before Chapter 5 and Lecture 7. Until then, it is sufficient to know that 1) independence is a stronger condition than exchangeability, 2) independence implies exchangeability, 3) exchangeability does not imply independence, and 4) exchangeability is related to what information is available, rather than to the properties of the unknown underlying data generating mechanism. If you want to know more about exchangeability right now, read BDA3 Section 5.2 and the notes for Chapter 5.

Chapter 2
is related to the prerequisites and Lecture 2 *Basics of Bayesian
inference*.

- 2.1 Binomial model (e.g. biased coin flipping)
- 2.2 Posterior as compromise between data and prior information
- 2.3 Posterior summaries
- 2.4 Informative prior distributions (skip exponential families and sufficient statistics)
- 2.5 Gaussian model with known variance
- 2.6 Other single parameter models
- in this course the normal distribution with known mean but unknown variance is the most important
- glance through Poisson and exponential

- 2.7 glance through this example, which illustrates the benefits of prior information; no need to read all the details (it's quite a long example)
- 2.8 Noninformative priors
- 2.9 Weakly informative priors
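As a small sketch combining Sections 2.1 and 2.3: with a binomial observation and a uniform Beta(1,1) prior, the posterior is Beta(\(y+1\), \(n-y+1\)), and the usual posterior summaries can be read off directly. The data values below are made up for illustration:

```python
from scipy.stats import beta

# Made-up data for illustration: y successes in n Bernoulli trials.
n, y = 20, 7

# Uniform Beta(1, 1) prior + binomial likelihood -> Beta posterior.
posterior = beta(y + 1, n - y + 1)

print("mean:", posterior.mean())      # (y + 1) / (n + 2)
print("median:", posterior.median())
print("sd:", posterior.std())
# 95% central posterior interval = 2.5% and 97.5% quantiles.
lo, hi = posterior.ppf([0.025, 0.975])
print("95% central interval:", (lo, hi))
```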

Laplace’s approach for approximating integrals is discussed in more detail in Chapter 4.

Find all the terms and symbols listed below. When reading the chapter, write down questions about things that are unclear to you or that you think might be unclear for others.

- binomial model
- Bernoulli trial
- \(\mathop{\mathrm{Bin}}\), \(\binom{n}{y}\)
- Laplace’s law of succession
- think about which expectations are used in eqs. (2.7)-(2.8)
- summarizing posterior inference
- mode, mean, median, standard deviation, variance, quantile
- central posterior interval
- highest posterior density interval / region
- uninformative / informative prior distribution
- principle of insufficient reason
- hyperparameter
- conjugacy, conjugate family, conjugate prior distribution, natural conjugate prior
- nonconjugate prior
- normal distribution
- conjugate prior for mean of normal distribution with known variance
- posterior for mean of normal distribution with known variance
- precision
- posterior predictive distribution
- normal model with known mean but unknown variance
- proper and improper prior
- unnormalized density
- difficulties with noninformative priors
- weakly informative priors

- 2.1: Binomial model and Beta posterior. R. Python.
- 2.2: Comparison of posterior distributions with different parameter values for the Beta prior distribution. R. Python.
- 2.3: Use samples to plot histogram with quantiles, and the same for a transformed variable. R. Python.
- 2.4: Grid sampling using inverse-cdf method. R. Python.
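The inverse-cdf grid sampling idea of demo 2.4 can be sketched as follows. The unnormalized target density here is an arbitrary hypothetical choice, not the one used in the actual demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized posterior density evaluated on a grid
# (an arbitrary shape chosen just for illustration).
grid = np.linspace(0.0, 1.0, 1000)
unnorm = grid**2 * (1.0 - grid) ** 4

# Normalize into a discrete distribution over the grid points.
probs = unnorm / unnorm.sum()

# Inverse-cdf sampling: draw u ~ Uniform(0, 1) and pick the first grid
# point whose cumulative probability exceeds u.
cdf = np.cumsum(probs)
u = rng.uniform(size=10000)
idx = np.minimum(np.searchsorted(cdf, u), grid.size - 1)
samples = grid[idx]

print(samples.mean())  # close to the mean of the discrete target
```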

Optional but recommended end-of-chapter exercises in BDA3 to get a better understanding of the chapter topic:

- 2.1-2.5, 2.8, 2.9, 2.14, 2.17, 2.22 (model solutions available for 2.1-2.5, 2.7-2.13, 2.16, 2.17, 2.20, and 2.14 is in course slides)

Chapter 2 has an example of analysing the ratio of girls born in
Paris 1745–1770. Laplace used a binomial model and a uniform prior, which
produces a Beta distribution as the posterior distribution. Laplace wanted to
calculate \(p(\theta \geq 0.5)\), which
is obtained as \[
\begin{aligned}
p(\theta \geq 0.5) &= \int_{0.5}^1 p(\theta|y,n,M) d\theta \\
&= \frac{493473!}{241945!\,251527!} \int_{0.5}^1
\theta^y(1-\theta)^{n-y} d\theta
\end{aligned}
\] Note that \(\Gamma(n)=(n-1)!\). The integral has a form
called the *incomplete Beta function*. Bayes and Laplace had
difficulties in computing it, but nowadays there are several series
and continued fraction expressions. Furthermore, the
normalization term is usually computed via \(\log(\Gamma(\cdot))\) without
explicitly computing \(\Gamma(\cdot)\).
Bayes was able to solve the integral for small \(n\) and \(y\). For large \(n\) and \(y\), Laplace used a Gaussian approximation of
the posterior (more in Chapter 4). In this specific case, R's
`pbeta` gives the same result as Laplace's with at
least 3 digit accuracy.
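A Python analogue of the `pbeta` computation can be sketched with `scipy.special.betainc`, the regularized incomplete Beta function \(I_x(a,b)\). The upper tail is computed directly via the symmetry \(I_x(a,b) = 1 - I_{1-x}(b,a)\), using the Paris counts from the example above:

```python
from scipy.special import betainc

# Girl and total birth counts for Paris 1745-1770 from the example above.
y, n = 241945, 493472

# The posterior is Beta(y + 1, n - y + 1).  Using the symmetry
# I_x(a, b) = 1 - I_{1-x}(b, a) of the regularized incomplete Beta
# function, the upper tail at 0.5 is
#   p(theta >= 0.5) = I_{0.5}(n - y + 1, y + 1).
tail = betainc(n - y + 1, y + 1, 0.5)
print(tail)  # approx 1.15e-42, matching Laplace's result

# The naive route 1 - cdf fails: the cdf is so close to 1 that the
# tiny tail mass is lost to double-precision rounding.
naive = 1.0 - betainc(y + 1, n - y + 1, 0.5)
print(naive)
```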

Laplace calculated \[
p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}.
\] Correspondingly, Laplace could have calculated \[
p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M),
\] which in theory could be computed in R with
`1-pbeta(0.5,y+1,n-y+1)`. In practice this fails due to the
limitations of the floating point representation used by computers.
In R the largest floating point number w