These notes help you to focus on the most important parts of each chapter related to the Bayesian Data Analysis course. Before reading a chapter, you can check below which sections, pages, and terms are the most important. After reading the chapter or following the corresponding lecture, you can check here for additional clarifications. There are also some notes for the chapters not included in the course.
Chapter 1 is related to the pre-requisites and Lecture 1 Introduction.
Find all the terms and symbols listed below. Note that some of the terms are now only briefly introduced and will be covered later in more detail. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
The symbol \(\propto\) means proportional to, that is, the left hand side is equal to the right hand side up to a constant multiplier. For instance, if \(y=2x\), then \(y \propto x\). It's \propto in LaTeX. See Proportionality in Wikipedia.
The term \(p(y|\theta,M)\) has two different names depending on the situation: as a function of \(y\) with \(\theta\) fixed it is the sampling (observation) model, and as a function of \(\theta\) with \(y\) observed it is the likelihood. Due to the short notation used, there is a possibility of confusion.
Epistemic and aleatory uncertainty are reviewed nicely in the article: Tony O'Hagan, "Dicing with the unknown," Significance 1(3):132-133, 2004.
In that paper, there is one typo using the word aleatory instead of epistemic (if you notice this, it’s then quite obvious).
You don't need to understand or use the term exchangeability before Chapter 5 and Lecture 7. Until then, it is sufficient that you know that 1) independence is a stronger condition than exchangeability, 2) independence implies exchangeability, 3) exchangeability does not imply independence, and 4) exchangeability is related to what information is available, not to the properties of the unknown underlying data generating mechanism. If you want to know more about exchangeability right now, read BDA3 Section 5.2 and the notes for Chapter 5.
Chapter 2 is related to the prerequisites and Lecture 2 Basics of Bayesian inference.
Laplace’s approach for approximating integrals is discussed in more detail in Chapter 4.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
Chapter 2 has an example of analysing the ratio of girls born in Paris 1745–1770. Laplace used a binomial model and a uniform prior, which produce a Beta distribution as the posterior distribution. Laplace wanted to calculate \(p(\theta \geq 0.5)\), which is obtained as \[
\begin{aligned}
p(\theta \geq 0.5) &= \int_{0.5}^1 p(\theta|y,n,M) d\theta \\
&= \frac{493473!}{241945!\,251527!} \int_{0.5}^1 \theta^y(1-\theta)^{n-y} d\theta
\end{aligned}
\] Note that \(\Gamma(n)=(n-1)!\). The integral has a form which is called the incomplete Beta function. Bayes and Laplace had difficulties in computing it, but nowadays there are several series and continued fraction expressions. Furthermore, the normalization term is usually computed via \(\log\Gamma(\cdot)\) directly, without explicitly computing \(\Gamma(\cdot)\).
Bayes was able to solve the integral for small \(n\) and \(y\). For large \(n\) and \(y\), Laplace used a Gaussian approximation of the posterior (more in Chapter 4). In this specific case, R's `pbeta` gives the same result as Laplace's with at least 3-digit accuracy.
Laplace calculated \[
p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}.
\] Correspondingly, Laplace could have calculated \[
p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M),
\] which in theory could be computed in R with `1-pbeta(0.5,y+1,n-y+1)`. In practice this fails due to the limitations of the floating point representation used by computers. In R, the largest floating point number smaller than 1 is \(1-\mathrm{eps}/2\), where eps is about \(2.22 \times 10^{-16}\) (the smallest floating point number larger than 1 is \(1+\mathrm{eps}\)). Thus the result from `pbeta(0.5,y+1,n-y+1)` will be rounded to 1, and \(1-1=0\neq 1.15 \times 10^{-42}\). We can compute \(p(\theta \geq 0.5 | y, n, M)\) in R with `pbeta(0.5, y+1, n-y+1, lower.tail=FALSE)`.
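The same floating-point limitation can be demonstrated in a few lines of Python, which uses the same IEEE double-precision format (the value \(1.15\times 10^{-42}\) below is Laplace's result from above):

```python
import sys

# Machine epsilon: the gap between 1.0 and the next larger double
eps = sys.float_info.epsilon
print(eps)                      # 2.220446049250313e-16

# Any upper-tail probability far below eps is lost when computed as
# 1 - lower_tail: here Laplace's result 1.15e-42 rounds away entirely.
p_upper = 1.15e-42
p_lower = 1 - p_upper           # rounds to exactly 1.0
print(p_lower == 1.0)           # True
print(1 - p_lower)              # 0.0, not 1.15e-42
```

This is why computing the upper tail directly (`lower.tail=FALSE` in R, or, e.g., `scipy.stats.beta.sf` in Python) is preferable to subtracting the lower tail from 1.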
The HPD interval is not invariant to reparametrization. Here's an illustrative example (using R and the package HDInterval):
> r <- exp(rnorm(1000))
> quantile(log(r),c(.05, .95))
5% 95%
-1.532931 1.655137
> log(quantile(r,c(.05, .95)))
5% 95%
-1.532925 1.655139
> hdi(log(r), credMass = 0.9)
lower upper
-1.449125 1.739169
attr(,"credMass")
[1] 0.9
> log(hdi(r, credMass = 0.9))
lower upper
-2.607574 1.318569
attr(,"credMass")
[1] 0.9
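The same non-invariance can be reproduced in Python without any packages; here `central90` and `hdi90` are hypothetical helpers, with `hdi90` estimating the HPD interval as the shortest interval containing 90% of the sorted draws:

```python
import math
import random

random.seed(1)
# Log-normal draws: r = exp(z), z ~ N(0, 1)
z = [random.gauss(0, 1) for _ in range(100000)]
r = [math.exp(v) for v in z]

def central90(draws):
    # Central 90% interval from the 5% and 95% quantiles
    s = sorted(draws)
    return s[int(0.05 * len(s))], s[int(0.95 * len(s))]

def hdi90(draws):
    # Shortest interval containing 90% of the draws (HPD estimate)
    s = sorted(draws)
    m = int(0.9 * len(s))
    return min((s[i + m] - s[i], (s[i], s[i + m])) for i in range(len(s) - m))[1]

# Central intervals transform consistently under the monotone map log():
lo, hi = central90(r)
print(central90(z), (math.log(lo), math.log(hi)))  # essentially identical

# HPD intervals do not:
lo, hi = hdi90(r)
print(hdi90(z), (math.log(lo), math.log(hi)))      # clearly different
```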
The Gaussian distribution is commonly used in mixture models, hierarchical models, hierarchical prior structures, scale mixture distributions, Gaussian latent variable models, Gaussian processes, Gaussian Markov random fields, Kalman filters, proposal distributions in Monte Carlo methods, etc.
Often the predictive distribution is more interesting than the posterior distribution. The posterior distribution describes the uncertainty in the parameters (like the proportion of red chips in the bag), but the predictive distribution also describes the uncertainty about a future event (like which color is picked next). This difference is important, for example, if we want to know what could happen if some treatment is given to a patient.
In the case of a Gaussian distribution with known variance \(\sigma^2\) the model is \[ \begin{aligned} y\sim \operatorname{N}(\theta,\sigma^2), \end{aligned} \] where \(\sigma^2\) describes the aleatoric uncertainty. Using a uniform prior, the posterior is \[ \begin{aligned} p(\theta|y) = \operatorname{N}(\theta|\bar{y},\sigma^2/n), \end{aligned} \] where \(\sigma^2/n\) describes the epistemic uncertainty related to \(\theta\). Using a uniform prior, the posterior predictive distribution for a new \(\tilde{y}\) is \[ \begin{aligned} p(\tilde{y}|y) = \operatorname{N}(\tilde{y}|\bar{y},\sigma^2+\sigma^2/n), \end{aligned} \] where the uncertainty is the sum of the epistemic (\(\sigma^2/n\)) and the aleatoric (\(\sigma^2\)) uncertainty.
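A quick numeric sketch of how the two variances combine, using made-up illustrative values \(\sigma=2\), \(n=25\):

```python
import random

random.seed(0)
# Assumed illustrative values: known sigma, n observations
sigma, n, theta_true = 2.0, 25, 1.0
y = [random.gauss(theta_true, sigma) for _ in range(n)]
ybar = sum(y) / n

post_var = sigma**2 / n           # epistemic: uncertainty about theta
pred_var = sigma**2 + post_var    # aleatoric + epistemic for a new y
print(ybar, post_var, pred_var)   # post_var is sigma^2/n, pred_var adds sigma^2
```

Note that the predictive variance never drops below \(\sigma^2\) however large \(n\) is: more data removes epistemic but not aleatoric uncertainty.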
Our thinking has advanced since Sections 2.8 and 2.9 were written. We're even more strongly in favor of weakly informative priors, and in favor of more information in the priors. Non-informative priors are likely to produce more unstable estimates (higher variance), and the lectures also include examples of how seemingly non-informative priors can actually be informative in some aspects. See further discussion and examples in the Prior Choice Recommendations Wiki, which will also see further updates (we're doing research and learning more all the time).
Andrew Gelman's blog post answering worries that a data analyst might choose a too optimistic prior.
You don't need to understand or use the term exchangeability before Chapter 5 and Lecture 7. Until then, it is sufficient that you know that 1) independence is a stronger condition than exchangeability, 2) independence implies exchangeability, 3) exchangeability does not imply independence, and 4) exchangeability is related to what information is available, not to the properties of the unknown underlying data generating mechanism. If you want to know more about exchangeability right now, read BDA3 Section 5.2 and the notes for Chapter 5.
Chapter 3 is related to the Lecture 3 Multidimensional posterior.
The normal model is used a lot as a building block of the models in the later chapters, so it is important to learn it now. The bioassay example is a good example used to illustrate many important concepts, and it is used in several exercises over the course.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
BDA3 p. 67 mentions that the conjugate prior for the normal distribution has to have the product form \(p(\sigma^2)p(\mu|\sigma^2)\). The book refers to (3.2) and the following discussion. As an additional hint, it is useful to think about the relation of the terms \((n-1)s^2\) and \(n(\bar{y}-\mu)^2\) in (3.2) to equations (3.3) and (3.4).
The trace of a square matrix, \(\operatorname{trace}\), \(\operatorname{tr}A\), \(\operatorname{trace}(A)\), \(\operatorname{tr}(A)\), is the sum of its diagonal elements. To derive equation (3.11), the following cyclic property has been used: \(\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)\).
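The cyclic property is easy to verify numerically; a small pure-Python check with arbitrary 2×2 matrices:

```python
# Plain-list matrix helpers for a quick check of tr(ABC) = tr(CAB) = tr(BCA)
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Arbitrary example matrices
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 1]]

t1 = trace(matmul(matmul(A, B), C))
t2 = trace(matmul(matmul(C, A), B))
t3 = trace(matmul(matmul(B, C), A))
print(t1, t2, t3)  # 8 8 8, all equal
```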
See Earliest Known Uses of Some of the Words of Mathematics.
Chapter 3 discusses closed-form posteriors for the binomial and normal models given conjugate priors. These are also used as part of the assignment. The assignment also requires forming a posterior for derived quantities, and these posteriors don't have a closed form (so there is no need to try to derive them). As we know how to sample from the posteriors of the binomial and normal models, we can use these posterior draws to get draws from the posterior of a derived quantity.
For example, given posteriors \(p(\theta_1|y_1)\) and \(p(\theta_2|y_2)\) we want to find the posterior for the difference \(p(\theta_1-\theta_2|y_1,y_2)\).
This is one reason why Monte Carlo approaches are so commonly used.
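A minimal sketch of this idea, using hypothetical binomial data (the counts below are made up) with Beta(1,1) priors so that the posteriors are Beta distributions we can draw from directly:

```python
import random

random.seed(2)
# Hypothetical data: y1/n1 and y2/n2 successes in two groups
y1, n1 = 30, 100
y2, n2 = 45, 100

S = 10000
# Draws from the two Beta posteriors (uniform Beta(1,1) priors)
theta1 = [random.betavariate(y1 + 1, n1 - y1 + 1) for _ in range(S)]
theta2 = [random.betavariate(y2 + 1, n2 - y2 + 1) for _ in range(S)]

# Draws from the posterior of the derived quantity theta1 - theta2:
# just take the difference draw by draw
diff = [t1 - t2 for t1, t2 in zip(theta1, theta2)]

print(sum(diff) / S)                 # posterior mean of the difference
print(sum(d < 0 for d in diff) / S)  # p(theta1 < theta2 | y1, y2)
```

The same pairing trick works for any derived quantity (ratios, odds ratios, probabilities of events), which is exactly why posterior draws are so convenient.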
This will be discussed in Lecture 4 and Chapter 10. Meanwhile, e.g., 1000 draws is sufficient.
The bioassay example is an example of a very common statistical inference task, typical, for example, in medicine, pharmacology, health care, cognitive science, genomics, industrial processes, etc.
The example is from Racine et al. (1986) (see the reference at the end of BDA3). A Swiss company classifies chemicals into different toxicity categories defined by authorities (like the EU). Toxicity classification is based on the lethal dose 50% (LD50), which tells what amount of the chemical kills 50% of the subjects. The smaller the LD50, the more lethal the chemical. The original paper mentions the "1983 Swiss poison Regulation", which defines the following categories for chemicals orally given to rats (mg/ml):
| Class | LD50 (mg/ml) |
|---|---|
| 1 | <5 |
| 2 | 5–50 |
| 3 | 50–500 |
| 4 | 500–2000 |
| 5 | 2000–5000 |
To reduce the number of rats needed in the experiments, the company started to use Bayesian methods. The paper mentions that in those days using just 20 rats for the classification was considered very few. The book gives LD50 in log(g/ml). When the result in 3.6 is transformed to the mg/ml scale, we see that the mean LD50 is about 900 and \(p(500<\text{LD50}<2000)\approx 0.99\). Thus, the tested chemical can be classified as category 4 toxic.
Note that chemical testing is moving away from using rats and other animals towards using, for example, human cells grown in chips, tissue models, and human blood cells. The human-cell-based approaches are also more accurate in predicting the effects on humans.
The \(\operatorname{logit}\) transformation can be justified information-theoretically when the binomial likelihood is used.
Code in demo 3.6 (R, Python) can be helpful in exercises related to Bioassay example.
When asked to compare groups, some students get confused because frequentist testing is quite different. Frequentist testing often focuses on a) differently named tests for different models and b) null hypothesis testing. In Bayesian inference, a) the same Bayes rule and investigation of the posterior are used for all models, and b) null hypothesis testing is less common. We come later to decision making given the posterior and a utility/cost function (Lecture 10.1), and to null hypothesis testing (Lecture 12.1). For now, it is assumed you will report the posterior (e.g. a histogram) and possible summaries, and report what you can infer from them. Specifically, as the group comparisons in this assignment are based on a continuous model parameter, the probability of a difference of exactly 0 is 0 (the later Lecture 12.1 covers null hypothesis testing). Instead of forcing a dichotomous answer (yes/no) about whether there is a difference, report the whole posterior, which also tells how big that difference might be. What big means depends on the application, which brings us back to the importance of domain expertise. You are not experts on the application examples used in the assignment, but you can think about how you would report what you have learned to a domain expert.
Frank Harrell's recommendations on how to state results in two-group comparisons are excellent.
Chapter 4 is related to the Lecture 11 Normal approximation, frequency properties.
The normal approximation is often used as part of posterior computation (more about this in Chapter 13, which is not part of the course).
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Other normal posterior approximations are discussed in Chapter 13. For example, variational and expectation propagation methods improve the approximation by fitting globally instead of matching only the curvature at the mode. The normal approximation at the mode is often also called the Laplace method, as Laplace used it first.
Several researchers have provided partial proofs that the posterior converges towards a normal distribution. In the mid-20th century, Le Cam was the first to provide a strict proof.
When \(n\rightarrow\infty\), the posterior distribution approaches a normal distribution. As the log density of the normal is a quadratic function, the higher derivatives of the log posterior approach zero. The curvature at the mode describes the information only in the case of asymptotic normality; in the case of the normal distribution, the curvature also describes the width of the normal. The information matrix is a precision matrix, which is the inverse of a covariance matrix.
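A small sketch of the curvature-at-the-mode (Laplace) idea, using an assumed illustrative Beta posterior so the approximation can be compared against the known standard deviation:

```python
import math

# Illustrative Beta(a, b) posterior (assumed values, not from the book)
a, b = 10.0, 30.0

def log_post(t):
    # Log density up to a constant
    return (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)

# Mode of Beta(a, b)
mode = (a - 1) / (a + b - 2)

# Curvature at the mode via a central finite difference of the log density
h = 1e-5
curv = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h**2

# Normal approximation: sd = sqrt(inverse of the negative curvature)
approx_sd = math.sqrt(-1 / curv)
# Exact Beta standard deviation for comparison
exact_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(mode, approx_sd, exact_sd)  # the two sds are close for these a, b
```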
In Finnish: valetoisto.
Aliasing is a special case of under-identifiability, where the likelihood repeats in separate points of the parameter space. That is, the likelihood gets exactly the same values and has the same shape, although possibly mirrored or otherwise projected. For example, the mixture model \[ p(y_i|\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\lambda)=\lambda\operatorname{N}(\mu_1,\sigma_1^2)+(1-\lambda)\operatorname{N}(\mu_2,\sigma_2^2), \] has two normals with their own means and variances. With probability \(\lambda\) the observation comes from \(\operatorname{N}(\mu_1,\sigma_1^2)\) and with probability \(1-\lambda\) from \(\operatorname{N}(\mu_2,\sigma_2^2)\). This kind of model could be used, for example, for Newcomb's data, so that the second normal component would model the faulty measurements. The model does not state which of the components, 1 or 2, models the good measurements and which the faulty ones. Thus it is possible to interchange the values of \((\mu_1,\mu_2)\) and \((\sigma_1^2,\sigma_2^2)\) and replace \(\lambda\) with \((1-\lambda)\) to get an equivalent model. The posterior distribution then has two modes which are mirror images of each other. When \(n\rightarrow\infty\) the modes get narrower, but the posterior does not converge to a single point.
If we can integrate over the whole posterior, aliasing is not a problem. However, aliasing makes approximate inference more difficult.
Bayesians can evaluate frequency properties of Bayesian estimates without being frequentist. For Bayesians the starting point is the Bayes rule and decision theory. Bayesians care more about efficiency than unbiasedness. For frequentists the starting point is to find an estimator with desired frequency properties and quite often unbiasedness is chosen as the first restriction.
See BDA3 p. 21 for the explanation how to derive densities for transformed variables. This explains, for example, why uniform prior \(p(\log(\sigma^2))\propto 1\) for \(\log(\sigma^2)\) corresponds to prior \(p(\sigma^2)=\sigma^{-2}\) for \(\sigma^{2}\).
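As a worked check of that transformation rule, \(p_v(v) = p_u(u(v))\,|du/dv|\) with \(u = \log(\sigma^2)\):

\[
p(\sigma^2) = p(\log(\sigma^2)) \left| \frac{d\log(\sigma^2)}{d\sigma^2} \right| = 1 \cdot \frac{1}{\sigma^2} = \sigma^{-2}.
\]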
Here's a reminder of how to differentiate with respect to \(g(\theta)\). For example \[ \frac{d}{d\log\sigma}\sigma^{-2}=-2 \sigma^{-2} \] is easily solved by setting \(z=\log\sigma\) to get \[ \frac{d}{dz}\exp(z)^{-2}=-2\exp(z)^{-3}\exp(z)=-2\exp(z)^{-2}=-2\sigma^{-2}. \]
Chapter 5 is related to the Lecture 7 Hierarchical models and exchangeability.
The hierarchical models in the chapter are simple to keep computation simple. More advanced computational tools are presented in Chapters 10, 11 and 12 (part of the course), and 13 (not part of the course).
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
Examples in BDA3 Sections 5.3 and 5.4 continue the computation with factorization and a grid, but there is no need to go deep into the computational details, as in the assignments you will use MCMC and Stan instead.
Exchangeability and independence are two separate concepts. Neither necessarily implies the other. Independent identically distributed variables/parameters are exchangeable. Exchangeability is a less strict condition than independence. Often we may assume that observations or unobserved quantities are in fact dependent, but if we can't get information about these dependencies we may treat those observations or unobserved quantities as exchangeable. "Ignorance implies exchangeability."
In case of exchangeable observations, we may sometimes act as if observations were independent if the additional potential information gained from the dependencies is very small. This is related to de Finetti’s theorem (BDA3 p. 105), which applies formally only when \(J\rightarrow\infty\), but in practice difference may be small if \(J\) is finite but relatively large (see examples below).
Here are some examples you may consider.
Ex 5.1. Exchangeability with known model parameters: For each of following three examples, answer: (i) Are observations \(y_1\) and \(y_2\) exchangeable? (ii) Are observations \(y_1\) and \(y_2\) independent? (iii) Can we act as if the two observations are independent?
Ex 5.2. Exchangeability with unknown model parameters: For each of following three examples, answer: (i) Are observations \(y_1\) and \(y_2\) exchangeable? (ii) Are observations \(y_1\) and \(y_2\) independent? (iii) Can we act as if the two observations are independent?
Note that, for example, in opinion polls, the balls (i.e., humans) are not put back, and there is a large but finite number of humans.
The following complements the divorce example in the book by discussing the effect of additional observations.
Often observations are not fully exchangeable, but are partially or conditionally exchangeable. Two basic cases:
Here are additional examples (Bernardo & Smith, Bayesian Theory, 1994), which illustrate the above basic cases. Think of an old-fashioned thumb pin. This kind of pin can stay flat on its base or slanted so that the pin head and the edge of the base touch the table. Such a pin represents a realistic version of an "unfair" coin.
Our thinking has advanced since Section 5.7 was written. Section 5.7 (BDA3 p. 128–) recommends the half-Cauchy as a weakly informative prior for hierarchical variance parameters. A more recent recommendation is the half-normal if you have substantial information on the high-end values, or the half-\(t_4\) if there might be a possibility of surprise. Often we don't have enough prior information to define the exact tail shape of the prior well, but the half-normal usually produces more sensible prior predictive distributions and is thus better justified. The half-normal also usually leads to easier inference.
See the Prior Choice Wiki for more recent general discussion and model specific recommendations.
Chapter 6 is related to the Lecture 8 Model checking & cross-validation.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
Predictive \(\tilde{y}\) is the next not yet observed possible observation. \(y^{\mathrm{rep}}\) refers to replicating the whole experiment (with same values of \(x\)) and obtaining as many replicated observations as in the original data.
Section 6.3 discusses posterior predictive \(p\)-values, which we don't recommend anymore, especially in the form of hypothesis testing.
Prior predictive checking, using just the prior predictive distribution, is a very useful tool for assessing the sensibility of the model and priors even before observing any data or doing posterior inference. See the additional reading below for examples.
The following article has some useful discussion and examples also about prior and posterior predictive checking.
And some additional useful demos
Chapter 7 is related to the Lecture 8 Model checking & cross-validation and the Lecture 9 Model comparison and selection.
Instead of Sections 7.2 and 7.3 it’s better to read
In Sections 7.2 and 7.3 of BDA, for historical reasons, a multiplier \(-2\) is used. After the book was published, we concluded that it causes too much confusion, and we now recommend not multiplying by \(-2\). The above paper does not use \(-2\) anymore.
The following article provides excellent discussion about “How should I evaluate my modeling choices?” from a scientific perspective.
Danielle J. Navarro (2019). Between the devil and the deep blue sea: Tensions between scientific judgment and statistical model selection. Computational Brain & Behavior 2:28–34. Online.
Extra videos, slides, case studies, and references on model selection
Sections 1 and 5 (less than 3 pages) of “Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison” clarify how to interpret standard error in model comparison
Find all the terms and symbols listed below. When reading the chapter and the above mentioned article, write down questions related to things unclear for you or things you think might be unclear for others.
More theoretical details can be found in
More experimental comparisons can be found in
Gelman: “To take a historical example, I don’t find it useful, from a statistical perspective, to say that in 1850, say, our posterior probability that Newton’s laws were true was 99%, then in 1900 it was 50%, then by 1920, it was 0.01% or whatever. I’d rather say that Newton’s laws were a good fit to the available data and prior information back in 1850, but then as more data and a clearer understanding became available, people focused on areas of lack of fit in order to improve the model.”
Newton's laws are still sufficient for prediction in specific contexts (non-relativistic speeds and gravitational fields, negligible effects of air resistance or other friction). See more in the course video 1.1 Introduction to uncertainty and modeling.
Chapter 8 is not part of the BDA Aalto course.
In the earlier chapters it was assumed that the data collection is ignorable. Chapter 8 explains when data collection can be ignorable and when we need to model also the data collection. We don't have time to go through Chapter 8 in the BDA course at Aalto, but it is highly recommended that you read it at the end of or after the course. The most important sections are 8.1, 8.5, pp. 220–222 of 8.6, and 8.8; you can get back to the other sections later.
Chapter 9 is related to the Lecture 10 Decision analysis.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
The lectures have simpler examples and also discuss some challenges in selecting utilities or costs.
Chapter 7 discusses how model selection can be considered as a decision problem.
Chapter 10 is related to the Lecture 4 Monte Carlo.
Sections 10.1–10.4 give an overview of different computational methods. Some of them have already been used in the book.
Section 10.5 is very important and related to the exercises.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
Many models use continuous real-valued parameters. Computers have finite memory, and thus continuous values are represented with a finite number of bits and hence with finite accuracy. The most commonly used representations are floating-point representations that aim at balanced accuracy over the range of values where it mostly matters. As the representation has finite accuracy, there are limitations; for example, with the IEC 60559 floating-point (double precision) arithmetic used in current R:
the smallest positive floating-point number \(x\) such that \(1 + x \neq 1\) is \(2.220446\cdot 10^{-16}\)
the smallest non-zero normalized floating-point number is \(2.225074\cdot 10^{-308}\)
the largest normalized floating-point number is \(1.797693\cdot 10^{308}\)
the largest integer which can be represented is \(2^{31} - 1 = 2147483647\)
see more in R documentation: Numerical Characteristics of the Machine.
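The same IEEE double-precision limits can be inspected from Python, which uses the same floating-point format (note that Python's own integers are arbitrary precision; the \(2^{31}-1\) limit is specific to R's integer type):

```python
import sys

print(sys.float_info.epsilon)   # 2.220446049250313e-16
print(sys.float_info.min)       # 2.2250738585072014e-308, smallest normalized
print(sys.float_info.max)       # 1.7976931348623157e+308
print(2**31 - 1)                # 2147483647, R's largest (32-bit) integer
```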
What Is Floating-Point Arithmetic? blog post by Nick Higham provides a nice short introduction.
What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg (1991) provides a nice overview of floating-point arithmetic and how computations should be arranged for improved accuracy.
Stat 3701 Lecture Notes: Computer Arithmetic by Geyer (2020) provides code examples in R illustrating the most common issues in floating-point arithmetic, including examples similar to those shown in the BDA course lecture.
Stan User Guide Chapter 15 discusses floating-point arithmetic in the context of Stan.
A group of draws is a sample. A sample can consist of one draw, and thus some people use the word sample both for a single item and for the group. For clarity, we prefer separate words for a single item (draw) and for the group (sample). A sample can also mean a group of observations, and thus talking about posterior draws reduces ambiguity.
Sometimes ‘quadrature’ is used to refer generically to any numerical integration method (including Monte Carlo), sometimes it is used to refer just to deterministic numerical integration methods.
Rejection sampling is mostly used as a part of fast methods for univariate sampling. For example, sampling from the normal distribution is often done using the Ziggurat method, which uses a proposal distribution resembling stairs.
Rejection sampling is also commonly used for truncated distributions, in which case all draws from the truncated part are rejected.
The popularity of importance sampling is increasing. It is used, for example, as part of other methods such as particle filters and pseudo-marginal likelihood approaches, and to improve distributional approximations (including variational inference in machine learning).
Importance sampling is useful in importance sampling leave-one-out cross-validation. Cross-validation is discussed in Chapter 7 and importance sampling leave-one-out cross-validation is discussed in the article
After the book was published, we have developed Pareto smoothed importance sampling which is more stable than plain importance sampling and has very useful Pareto-\(k\) diagnostic to check the reliability
BDA3 p. 266 recommends importance resampling without replacement. At the time of writing in 2013, we had less experience with importance sampling, and there were some reasonable papers showing reduced variance when resampling without replacement. We don't recommend this anymore, as Pareto smoothed importance sampling works better and is also applicable when the resample size is equal to the original sample size.
BDA3 1st (2013) and 2nd (2014) printings have an error in \(\tilde{w}(\theta^s)\) used in the effective sample size equation (10.4). The normalized weights equation should not have the multiplier \(S\) (the normalized weights should sum to one). The effective sample size estimate mentioned in the book is a generic approximation; a more accurate estimate would also take into account the functional. For example, the importance sampling effective sample size can be different when estimating \(\operatorname{E}[\theta]\) or \(\operatorname{E}[\theta^2]\). If you are interested, see more details, for example, in the Pareto smoothed importance sampling paper.
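A minimal importance sampling sketch with a toy setup (assumed target N(0,1) and wider proposal N(0,2²)), showing the normalized weights, an expectation estimate, and the effective sample size estimate of equation (10.4):

```python
import math
import random

random.seed(3)
# Toy setup: target N(0, 1), proposal N(0, 2^2)
S = 10000
draws = [random.gauss(0, 2) for _ in range(S)]

def log_target(t):        # N(0, 1) log density, up to a constant
    return -0.5 * t * t

def log_proposal(t):      # N(0, 4) log density, up to a constant
    return -0.5 * t * t / 4

# Log weights, stabilized by subtracting the max before exponentiating
logw = [log_target(t) - log_proposal(t) for t in draws]
m = max(logw)
w = [math.exp(lw - m) for lw in logw]
sw = sum(w)
wn = [wi / sw for wi in w]            # normalized weights, sum to one

est = sum(wi * ti for wi, ti in zip(wn, draws))  # estimate of E[theta], ~0
ess = 1 / sum(wi * wi for wi in wn)              # ESS as in equation (10.4)
print(est, ess)
```

With this mildly mismatched proposal the ESS stays a large fraction of \(S\); a poorly matched proposal would concentrate the weight on a few draws and collapse the ESS.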
The derivation for the effective sample size and Monte Carlo standard error (MCSE) for importance sampling can be found, for example, in Chapter 9 of Monte Carlo theory, methods and examples by Art B. Owen.
Buffon's needle, presented in 1777, is considered to be the first Monte Carlo style approach. It is not known whether Buffon actually did the experiment in addition to describing the approach for estimating the value of \(\pi\). Check out a nice computer simulation of Buffon's needle dropping method.
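A quick simulation of Buffon's needle (assuming needle length equal to line spacing, \(l = d = 1\), so the crossing probability is \(2/\pi\)):

```python
import math
import random

random.seed(4)
# Needle length and line spacing both 1; the needle crosses a line when
# the distance from its center to the nearest line is below (1/2)sin(angle).
S = 200000
crossings = sum(
    random.uniform(0, 0.5) <= 0.5 * math.sin(random.uniform(0, math.pi / 2))
    for _ in range(S)
)
# P(cross) = 2/pi, so pi can be estimated as 2*S/crossings
pi_hat = 2 * S / crossings
print(pi_hat)  # close to 3.14
```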
Chapter 11 is related to the Lecture 5 Markov chain Monte Carlo.
Find all the terms and symbols listed below. When reading the chapter, write down questions related to things unclear for you or things you think might be unclear for others.
Optional but recommended end of the chapter exercises in BDA3 to get a better understanding of the chapter topic:
Slides by J. Virtamo for the course S-38.143 Queuing Theory (Finnish version) have a nice review of the fundamental terms. See especially the slides for lecture 4. To prove that the Metropolis algorithm works, it is sufficient to show that the chain is irreducible, aperiodic and not transient.
There is a lot of freedom in the selection of the proposal distribution in the Metropolis algorithm. There are some restrictions, but we don't go into the mathematical details in this course.
Don't confuse rejection in rejection sampling and in the Metropolis algorithm. In rejection sampling, the rejected draws are thrown away. In the Metropolis algorithm, the rejected proposals are thrown away, but time moves on and the previous sample \(x_t\) becomes also the sample \(x_{t+1}\).
When a proposal is rejected, the previous sample is repeated in the chain; these repeats have to be included, and they are valid samples from the distribution. For basic Metropolis, it can be shown that the optimal rejection rate is 55%–77%, so even in the optimal case quite many of the samples are repeats. However, a high number of rejections is acceptable, as then the accepted proposals are on average further away from the previous point. It is better to jump further away 23%–45% of the time than to jump really close more often. Methods for estimating the effective sample size are useful for measuring how effective a given chain is.
The transition distribution is a property of the Markov chain. In the Metropolis algorithm, the transition distribution is a mixture of the proposal distribution and a point mass at the current point. The book also uses the term jumping distribution to refer to the proposal distribution.
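A minimal random-walk Metropolis sketch (toy target N(0,1); the proposal scale 2.4 is an assumed value, close to the known 1-D optimum of about 2.38 times the target sd), illustrating how rejected proposals repeat the previous value in the chain:

```python
import math
import random

random.seed(5)

def log_target(t):                # toy target: N(0, 1), up to a constant
    return -0.5 * t * t

S, scale = 20000, 2.4
chain = [0.0]
accepted = 0
for _ in range(S - 1):
    prop = random.gauss(chain[-1], scale)
    # Metropolis: accept with probability min(1, p(prop)/p(current)),
    # computed on the log scale for numerical stability
    if math.log(random.random()) < log_target(prop) - log_target(chain[-1]):
        chain.append(prop)
        accepted += 1
    else:
        chain.append(chain[-1])   # rejection: previous value is repeated

print(accepted / (S - 1))         # acceptance rate, in the 23%-45% ballpark
print(sum(chain) / len(chain))    # estimate of E[theta], near 0
```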
Theoretical convergence in infinite time is different from practical convergence in finite time. There is no exact moment when the chain has converged, and thus it is not possible to detect when the chain has converged (except for rare perfect sampling methods not discussed in BDA3). Convergence diagnostics can help to find out if the chain is unlikely to be representative of the target distribution. Furthermore, even if we were able to start from an independent sample from the posterior, so that the chain starts converged, the mixing can be so slow that we may require a very large number of draws before the draws are representative of the target distribution.
If the starting point is selected at or near the mode, less time is needed to reach the area of essential probability mass, but the samples in the beginning of the chain are still not representative of the target distribution unless the starting point was somehow sampled directly from the target distribution.
There are many versions of \(\widehat{R}\) and effective sample size. Beware that some software packages compute \(\widehat{R}\) using old inferior approaches.
The \(\widehat{R}\) and the approach to estimate the effective sample size were updated in BDA3, and a slightly updated version of this is described in the Stan 2.18+ user guide. Since then we have developed an even better \(\widehat{R}\) and ESS (the change in notation from \(n_\mathrm{eff}\) to ESS is due to improved consistency) in Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2021), Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC.
The new \(\widehat{R}\), ESS, and Monte Carlo error estimates are available in the RStan monitor function in R, in the posterior package in R, and in the ArviZ package in Python.
Due to randomness in chains, \(\widehat{R}\) may get values slightly below 1.
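A minimal sketch of the basic split-\(\widehat{R}\) computation (the BDA3-style between/within variance version, not the newer rank-normalized estimator; all names here are illustrative):

```python
import numpy as np

def split_rhat(chains):
    """Basic split-R-hat (BDA3-style); newer rank-normalized versions
    improve on this but follow the same between/within variance idea."""
    chains = np.asarray(chains)
    n = chains.shape[1]
    # split each chain in half to also detect within-chain trends
    half = n // 2
    split = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n2 = split.shape[1]
    chain_means = split.mean(axis=1)
    B = n2 * chain_means.var(ddof=1)        # between-chain variance
    W = split.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_plus = (n2 - 1) / n2 * W + B / n2   # overestimate of target variance
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 1000))           # 4 well-mixed chains
bad = good + np.arange(4)[:, None]          # chains stuck at different levels
```

For the well-mixed chains the value is close to 1, and because of the \((n-1)/n\) factor and randomness it can dip slightly below 1; for the chains stuck at different levels it is far above 1.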
Brief Guide to Stan’s Warnings provides a summary of the convergence diagnostics available in Stan and how to interpret them.
Sometimes people write “the number of effective samples”, which is wrong (it is possible that the notation \(n_\mathrm{eff}\) is partially to blame for this misconception). All the posterior draws in an autocorrelated Markov chain are effective, but their efficiency for estimating an expectation depends on the autocorrelation. The effective sample size is not a property of individual draws, but a joint property of all draws in a sample. The effective sample size also depends on the functional, and the effective sample size for a given dependent sample is often different when estimating, for example, \(\operatorname{E}[\theta]\) or \(\operatorname{E}[\theta^2]\).
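The dependence of the effective sample size on the functional can be illustrated with a crude autocorrelation-based ESS estimate (a simplified sketch, not the estimator used by Stan or the posterior package, which use more robust truncation rules and multiple chains):

```python
import numpy as np

def ess_basic(x, max_lag=200):
    """Crude ESS: n / (1 + 2 * sum of autocorrelations up to the first
    negative one)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    tau = 1.0
    for t in range(1, max_lag):
        rho = np.dot(x[: n - t], x[t:]) / denom
        if rho < 0:
            break
        tau += 2 * rho
    return n / tau

rng = np.random.default_rng(2)
# AR(1) chain: strongly autocorrelated draws mimicking slow MCMC mixing
phi, n = 0.9, 20000
innov = rng.normal(size=n) * np.sqrt(1 - phi**2)
theta = np.empty(n)
theta[0] = rng.normal()
for t in range(1, n):
    theta[t] = phi * theta[t - 1] + innov[t]

# the same draws give different ESS for different functionals
ess_mean = ess_basic(theta)       # ESS for estimating E[theta]
ess_sq = ess_basic(theta**2)      # ESS for estimating E[theta^2]
```

Here \(\theta^2\) has weaker autocorrelation than \(\theta\) (for an AR(1) chain it decays as \(\phi^{2t}\) instead of \(\phi^t\)), so the same draws give a clearly larger ESS for \(\operatorname{E}[\theta^2]\) than for \(\operatorname{E}[\theta]\), and both are far below the nominal sample size.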
Chapter 12 is related to the Lecture 6 Stan, HMC, PPL.
For the BDA course, there are only 8 pages to read (Sections 12.4–12.6) about what is inside Stan.
12.1: The No-U-Turn Sampler (NUTS) / Dynamic Hamiltonian Monte Carlo. See also CmdStanR demos and PyStan demos.
These don’t include the specific version of dynamic HMC in Stan, but are useful illustrations anyway.
An excellent review of static HMC (where the number of steps in the dynamic simulation is not adaptively selected) is Neal (2011), MCMC using Hamiltonian dynamics, in the Handbook of Markov Chain Monte Carlo.
Stan uses a variant of dynamic Hamiltonian Monte Carlo (using an adaptive number of steps in the dynamic simulation), which has been further developed since BDA3 was published. The first dynamic HMC variant was NUTS, the No-U-Turn Sampler, which gave the name NUTS that you often see associated with Stan, but the current variant implemented in Stan has some further developments, described (mostly) in Betancourt (2017), A conceptual introduction to Hamiltonian Monte Carlo.
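To make the static/dynamic distinction concrete, here is a minimal one-dimensional static HMC sketch with a leapfrog integrator (fixed step size and a fixed number of steps; Stan's dynamic HMC additionally adapts the trajectory length, which is not shown here; the names are illustrative):

```python
import numpy as np

def static_hmc(grad_log_p, log_p, x0, n_draws, eps=0.1, n_steps=20, seed=0):
    """Static HMC: resample momentum, simulate Hamiltonian dynamics with a
    leapfrog integrator, then accept/reject to correct discretization error."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    x = x0
    for i in range(n_draws):
        p = rng.normal()                     # resample momentum
        x_new, p_new = x, p
        # leapfrog: half momentum step, alternate full steps, half momentum step
        p_new = p_new + 0.5 * eps * grad_log_p(x_new)
        for _ in range(n_steps - 1):
            x_new = x_new + eps * p_new
            p_new = p_new + eps * grad_log_p(x_new)
        x_new = x_new + eps * p_new
        p_new = p_new + 0.5 * eps * grad_log_p(x_new)
        # Metropolis correction using the Hamiltonian -log p(x) + p^2/2
        log_accept = (log_p(x_new) - 0.5 * p_new**2) - (log_p(x) - 0.5 * p**2)
        if np.log(rng.uniform()) < log_accept:
            x = x_new
        draws[i] = x
    return draws

# standard normal target: log p(x) = -x^2/2 up to a constant
draws = static_hmc(lambda x: -x, lambda x: -0.5 * x**2, 0.0, 2000)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance rate is high even though each proposal moves far from the current point; the draws match the standard normal target closely.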
This is a minor comment on terminology. As a shorthand it’s common to see just “Stan compiler” mentioned, but sometimes the term “transpiler” is also used, as in the slides for this part.
A Wikipedia article states: A source-to-source translator, source-to-source compiler (S2S compiler), transcompiler, or transpiler is a type of translator that takes the source code of a program written in a programming language as its input and produces an equivalent source code in the same or a different programming language. A source-to-source translator converts between programming languages that operate at approximately the same level of abstraction, while a traditional compiler translates from a higher level programming language to a lower level programming language.
So it is more accurate to say that the Stan model code is first transpiled to C++ code, and then that C++ code is compiled to machine code to create an executable program. A cool thing about the new stanc3 transpiler is that it can also create, for example, LLVM IR or TensorFlow code.
Using a transpiler plus a compiler allows the Stan language to be developed to be good for writing models, while getting the benefit of the speed and external libraries of C++, TensorFlow, and whatever comes in the future.
Chapter 13 is not part of the BDA Aalto course.
Chapter 4 presented the normal distribution approximation at the mode (aka the Laplace approximation). Chapter 13 discusses distributional approximations in more detail.
Chapter 14 is not part of the BDA Aalto course.
Chapter 15 is not part of the BDA Aalto course.
Chapter 15 combines hierarchical models from Chapter 5 and linear models from Chapter 14. The chapter discusses some computational issues, but probabilistic programming frameworks make computation for hierarchical linear models easy.
y ~ 1 + x : fixed / population effect; pooled model
y ~ 1 + (0 + x | g) : random / group effects
y ~ 1 + x + (1 + x | g) : mixed effects; hierarchical model
Chapter 16 is not part of the BDA Aalto course.
Chapter 16 extends linear models to non-normal observation models. The model in the Bioassay example in Chapter 3 is also a generalized linear model. The chapter reviews the basics and discusses some computational issues, but probabilistic programming frameworks make computation for generalized linear models easy (especially with rstanarm and brms). See ROS for a more thorough discussion of generalized linear models from the modeling perspective.
Chapter 17 is not part of the BDA Aalto course.
Chapter 17 discusses over-dispersed observation models. The discussion is useful beyond generalized linear models. The computation is outdated. See ROS for more examples.
Chapter 18 is not part of the BDA Aalto course.
Chapter 18 extends the data collection modeling from Chapter 8. See ROS for more examples.