These notes help you to focus on the most important parts of each chapter related to the Bayesian Data Analysis course. Before reading a chapter, you can check below which sections, pages, and terms are the most important. After reading the chapter or following the corresponding lecture, you can check here for additional clarifications. There are also some notes for the chapters not included in the course.

Chapter 1 Probability and inference

Chapter 1 is related to the pre-requisites and Lecture 1 Introduction.

Outline

  • 1.1-1.3 important terms, especially 1.3 for the notation
  • 1.4 an example related to the first exercise, and another practical example
  • 1.5 foundations
  • 1.6 good example related to visualization exercise
  • 1.7 example which can be skipped
  • 1.8 background material, good to read before doing the first assignment
  • 1.9 background material, good to read before doing the second assignment
  • 1.10 a point of view for using Bayesian inference

The most important terms

Find all the terms and symbols listed below. Note that some of the terms are only briefly introduced at this point and will be covered later in more detail. When reading the chapter, write down questions about anything that is unclear to you or that you think might be unclear for others.

  • full probability model
  • posterior distribution
  • potentially observable quantity
  • quantities that are not directly observable
  • exchangeability
  • independently and identically distributed
  • \(\theta, y, \tilde{y}, x, X, p(\cdot|\cdot), p(\cdot), \operatorname{Pr}(\cdot), \sim, H\)
  • sd, E, var
  • Bayes rule
  • prior distribution
  • sampling distribution, data distribution
  • joint probability distribution
  • posterior density
  • probability
  • density
  • distribution
  • \(p(y|\theta)\) as a function of \(y\) or \(\theta\)
  • likelihood
  • posterior predictive distribution
  • probability as measure of uncertainty
  • subjectivity and objectivity
  • transformation of variables
  • simulation
  • inverse cumulative distribution function

Proportional to, \(\propto\)

The symbol \(\propto\) means proportional to: the left hand side is equal to the right hand side multiplied by some constant. For instance, if \(y=2x\), then \(y \propto x\). In LaTeX it is written \propto. See Proportionality in Wikipedia.

Model and likelihood

The term \(p(y|\theta,M)\) has two different names depending on the situation. Due to the short notation used, there is a possibility of confusion.

  • The term \(p(y|\theta,M)\) is called a model (sometimes more specifically an observation model or a statistical model) when it is used to describe uncertainty about \(y\) given \(\theta\) and \(M\). The longer notation \(p_y(y|\theta,M)\) shows explicitly that it is a function of \(y\).
  • In the Bayes rule, the term \(p(y|\theta,M)\) is called the likelihood function. The posterior distribution describes the probability (or probability density) of different values of \(\theta\) given a fixed \(y\), and thus when the posterior is computed, the terms on the right hand side of the Bayes rule are also evaluated as functions of \(\theta\) given the fixed \(y\). The longer notation \(p_\theta(y|\theta,M)\) shows explicitly that it is a function of \(\theta\). The term has its own name (likelihood) to distinguish it from the model. The likelihood function is an unnormalized distribution describing uncertainty about \(\theta\) (and that is why the Bayes rule has the normalization term, which gives the posterior distribution). A small sketch illustrating the two readings follows after this list.
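A minimal sketch of the two readings of \(p(y|\theta)\), using a binomial example with made-up numbers (the specific values below are only for illustration, not from the book):

    # Binomial observation model / likelihood: the same expression read in two ways
    n <- 10
    theta <- 0.3
    # 1) Observation model: a function of y for fixed theta; sums to 1 over y
    p_y <- dbinom(0:n, size = n, prob = theta)
    sum(p_y)                                   # equals 1
    # 2) Likelihood: a function of theta for fixed observed y; does not integrate to 1
    y <- 4
    theta_grid <- seq(0, 1, length.out = 101)
    lik <- dbinom(y, size = n, prob = theta_grid)
    # Normalizing over theta (Bayes rule with a uniform prior) gives a posterior density
    post <- lik / sum(lik * 0.01)              # 0.01 is the grid spacing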

Two types of uncertainty

Epistemic and aleatory uncertainty are reviewed nicely in the article: Tony O'Hagan, "Dicing with the unknown", Significance 1(3):132–133, 2004.

In that paper, there is one typo: the word aleatory is used where it should be epistemic (once you notice it, it's quite obvious).

Transformation of variables
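A minimal sketch of transformation of variables when working with simulation draws (the Beta posterior and the numbers below are made up for illustration, not from the book): for a monotone transformation you can simply transform the draws, and posterior quantiles transform along with them.

    # Draws from an assumed Beta posterior for a proportion theta
    theta <- rbeta(10000, 5, 15)
    # Transformed quantity of interest: the odds phi = theta / (1 - theta)
    phi <- theta / (1 - theta)
    quantile(theta, c(0.025, 0.5, 0.975))
    # Because the map is monotone, these are (up to Monte Carlo error)
    # the transformed versions of the theta quantiles above
    quantile(phi, c(0.025, 0.5, 0.975))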

Ambiguous notation in statistics

  • In \(p(y|\theta)\)
    • \(y\) can be variable or value
      • we could clarify by using \(p(Y|\theta)\) or \(p(y|\theta)\)
    • \(\theta\) can be variable or value
      • we could clarify by using \(p(y|\Theta)\) or \(p(y|\theta)\)
    • \(p\) can be a discrete or continuous function of \(y\) or \(\theta\)
      • we could clarify by using \(P_Y\), \(P_\Theta\), \(p_Y\) or \(p_\Theta\)
    • \(P_Y(Y|\Theta=\theta)\) is a probability mass function, sampling distribution, observation model
    • \(P(Y=y|\Theta=\theta)\) is a probability
    • \(P_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
    • \(p_Y(Y|\Theta=\theta)\) is a probability density function, sampling distribution, observation model
    • \(p(Y=y|\Theta=\theta)\) is a density
    • \(p_\Theta(Y=y|\Theta)\) is a likelihood function (can be discrete or continuous)
    • \(y\) and \(\theta\) can also be mix of continuous and discrete
    • Due to this sloppiness, likelihood is sometimes used to refer to \(P_{Y,\Theta}(Y|\Theta)\) or \(p_{Y,\Theta}(Y|\Theta)\), that is, to the expression viewed as a function of both \(Y\) and \(\Theta\)

Exchangeability

You don’t need to understand or use the term exchangeability before Chapter 5 and Lecture 7. At this point and until Chapter 5 and Lecture 7, it is sufficient that you know that 1) independence is a stronger condition than exchangeability, 2) independence implies exchangeability, 3) exchangeability does not imply independence, and 4) exchangeability is related to what information is available, not to the properties of the unknown underlying data generating mechanism. If you want to know more about exchangeability right now, read BDA Section 5.2 and the notes for Chapter 5.

Chapter 2 Single-parameter models

Chapter 2 is related to the prerequisites and Lecture 2 Basics of Bayesian inference.

Outline

  • 2.1 Binomial model (e.g. biased coin flipping)
  • 2.2 Posterior as compromise between data and prior information
  • 2.3 Posterior summaries
  • 2.4 Informative prior distributions (skip exponential families and sufficient statistics)
  • 2.5 Gaussian model with known variance
  • 2.6 Other single parameter models
    • in this course the normal distribution with known mean but unknown variance is the most important
    • glance through Poisson and exponential
  • 2.7 glance through this example, which illustrates the benefits of prior information; no need to read all the details (it's quite a long example)
  • 2.8 Noninformative priors
  • 2.9 Weakly informative priors

Laplace’s approach for approximating integrals is discussed in more detail in Chapter 4.

The most important terms

Find all the terms and symbols listed below. When reading the chapter, write down questions about anything that is unclear to you or that you think might be unclear for others.

  • binomial model
  • Bernoulli trial
  • \(\mathop{\mathrm{Bin}}\), \(\binom{n}{y}\)
  • Laplace’s law of succession
  • think about which expectations are taken in eqs. (2.7)–(2.8)
  • summarizing posterior inference
  • mode, mean, median, standard deviation, variance, quantile
  • central posterior interval
  • highest posterior density interval / region
  • uninformative / informative prior distribution
  • principle of insufficient reason
  • hyperparameter
  • conjugacy, conjugate family, conjugate prior distribution, natural conjugate prior
  • nonconjugate prior
  • normal distribution
  • conjugate prior for mean of normal distribution with known variance
  • posterior for mean of normal distribution with known variance
  • precision
  • posterior predictive distribution
  • normal model with known mean but unknown variance
  • proper and improper prior
  • unnormalized density
  • difficulties with noninformative priors
  • weakly informative priors

R and Python demos

  • 2.1: Binomial model and Beta posterior. R. Python.
  • 2.2: Comparison of posterior distributions with different parameter values for the Beta prior distribution. R. Python.
  • 2.3: Use samples to plot histogram with quantiles, and the same for a transformed variable. R. Python.
  • 2.4: Grid sampling using the inverse-cdf method (a rough sketch of the idea follows after this list). R. Python.
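The following is only a rough sketch of the inverse-cdf grid sampling idea; the Beta(5, 15) target, grid size, and number of draws are made up for illustration and are not taken from the actual demo.

    # Evaluate the target density on a grid (here a Beta(5, 15) as a stand-in)
    theta_grid <- seq(0.001, 0.999, length.out = 1000)
    dens <- dbeta(theta_grid, 5, 15)
    # Discrete approximation of the cumulative distribution function
    cdf <- cumsum(dens) / sum(dens)
    # Inverse-cdf sampling: map uniform draws back to grid points through the cdf
    u <- runif(2000)
    draws <- theta_grid[findInterval(u, cdf) + 1]
    hist(draws, breaks = 50)   # should resemble the Beta(5, 15) density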

Integration over Beta distribution

Chapter 2 has an example of analysing the ratio of girls born in Paris 1745–1770. Laplace used a binomial model and a uniform prior, which produce a Beta distribution as the posterior distribution. Laplace wanted to calculate \(p(\theta \geq 0.5)\), which is obtained as \[ \begin{aligned} p(\theta \geq 0.5) &= \int_{0.5}^1 p(\theta|y,n,M) d\theta \\ &= \frac{493473!}{241945!\,251527!} \int_{0.5}^1 \theta^y(1-\theta)^{n-y} d\theta.\end{aligned} \] Note that \(\Gamma(n)=(n-1)!\). The integral has a form called the incomplete Beta function. Bayes and Laplace had difficulties in computing it, but nowadays there are several series and continued fraction expressions. Furthermore, the normalization term is usually computed via \(\log\Gamma(\cdot)\) directly, without explicitly evaluating \(\Gamma(\cdot)\). Bayes was able to solve the integral for small \(n\) and \(y\). In the case of large \(n\) and \(y\), Laplace used a Gaussian approximation of the posterior (more in Chapter 4). In this specific case, the R function pbeta gives the same result as Laplace obtained, with at least 3 digit accuracy.
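As a quick check of the last sentence (a small sketch; the variable names are mine, the counts are the ones given above), the posterior under the uniform prior is Beta(y+1, n-y+1), and the upper tail can be computed directly:

    y <- 241945              # girls born in Paris 1745-1770
    n <- 241945 + 251527     # total births
    # Upper tail p(theta >= 0.5 | y, n, M) of the Beta(y+1, n-y+1) posterior
    pbeta(0.5, y + 1, n - y + 1, lower.tail = FALSE)   # approximately 1.15e-42
    # Log of the normalization term B(y+1, n-y+1), computed internally via log-gamma
    lbeta(y + 1, n - y + 1)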

Numerical accuracy

Laplace calculated \[ p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}. \] Correspondingly, Laplace could have calculated \[ p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M), \] which in theory could be computed in R with 1-pbeta(0.5,y+1,n-y+1). In practice this fails due to the limitations of the floating point representation used by computers. In R, the largest floating point number smaller than 1 is about \(1-10^{-16}\), so \(p(\theta \leq 0.5 | y, n, M)\approx 1-1.15\times 10^{-42}\) is rounded to exactly 1, and the subtraction returns 0 instead of the correct tail probability.
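A small sketch of the failure and of two ways around it, reusing y and n from the snippet above:

    1 - pbeta(0.5, y + 1, n - y + 1)       # returns 0: the lower tail has rounded to exactly 1
    # Computing the upper tail directly avoids the subtraction
    pbeta(0.5, y + 1, n - y + 1, lower.tail = FALSE)
    # The log scale is useful when even the tail probability is too small to represent
    pbeta(0.5, y + 1, n - y + 1, lower.tail = FALSE, log.p = TRUE)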