This is the online appendix of the paper

- Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, Paul-Christian Bürkner (2019): Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

The code for the paper is available on Github (https://github.com/avehtari/rhat_ess). Here, we introduce all the code related to the examples presented in the paper and more numerical experiments not discussed in the paper itself.

To help you finding your way through all the examples presented in this online appendix, below please find a list of links to the examples discussed in the paper:

- Cauchy: A distribution with infinite mean and variance
- Hierarchical model: Eight schools
- Appendix A: Computing the effective sample size
- Appendix B: Normal distributions with additional trend, shift or scaling
- Appendix C: More experiments with the Cauchy distribution
- Appendix D: A centered eight schools model with very long chains and thinning
- Appendix E: A centered eight schools model fit using a Gibbs sampler

In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices B-E contain more detailed analysis of different algorithm variants and further examples.

First, we load all the necessary R packages and additional functions.

```
library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
source('monitornew.R')
source('monitorplot.R')
```

This section relates to the examples presented in Section 5.1 of the paper.

Traditional \(\widehat{R}\) is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is infinite, this approach is not well justified, as we demonstrate here with a Cauchy-distributed example. The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution. Appendix C contains more detailed analysis of different algorithm variants and further Cauchy examples.

The nominal Cauchy model with direct parameterization is as follows:

`writeLines(readLines("cauchy_nom.stan"))`

```
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
```

Run the nominal model:

`fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)`

```
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`Warning: Examine the pairs() plot to diagnose sampling problems`

We get HMC specific diagnostic (Betancourt, 2017) warnings about a very large number of transitions after warmup that exceed the maximum treedepth and low estimated Bayesian fraction of missing information, indicating slow mixing likely due to very long tails of the Cauchy distribution.

```
mon <- monitor(fit_nom)
print(mon)
```

```
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -5.87 0.02 6.78 0.92 13.64 1.01 1972 356
x[2] -5.76 -0.03 5.11 -0.27 7.53 1.00 2411 797
x[3] -6.12 0.00 7.73 0.77 10.28 1.01 440 112
x[4] -5.30 -0.01 5.41 -0.06 7.19 1.01 2263 691
x[5] -13.14 -0.04 5.83 -2.78 17.58 1.03 167 44
x[6] -8.10 -0.06 6.39 -0.92 15.59 1.01 1603 398
x[7] -7.25 -0.08 7.55 -0.29 9.68 1.01 1220 457
x[8] -4.79 0.02 6.23 0.57 8.65 1.02 1559 428
x[9] -4.69 0.01 4.82 0.13 5.43 1.00 2603 778
x[10] -8.43 -0.04 6.50 -1.92 32.96 1.01 1599 257
x[11] -6.21 0.02 7.09 0.43 16.21 1.01 2458 469
x[12] -5.55 0.02 6.62 0.17 6.37 1.00 2830 398
x[13] -6.16 0.00 6.13 0.44 12.78 1.01 2277 561
x[14] -5.45 0.02 6.79 0.29 8.44 1.01 3118 642
x[15] -8.17 0.05 9.55 0.26 30.61 1.01 2121 517
x[16] -5.83 0.00 6.23 0.31 8.60 1.01 2171 583
x[17] -18.10 -0.07 7.39 -6.89 47.38 1.03 207 65
x[18] -5.89 0.05 5.90 0.10 5.42 1.01 3409 660
x[19] -7.04 0.03 6.23 -0.23 12.26 1.01 1422 416
x[20] -6.58 0.03 7.53 -0.19 7.81 1.02 947 441
x[21] -8.11 0.01 6.57 -2.16 28.47 1.00 1785 391
x[22] -5.47 0.04 5.24 0.25 8.30 1.00 2836 706
x[23] -10.04 -0.05 6.55 -11.16 87.97 1.04 314 68
x[24] -13.18 -0.07 5.72 -4.58 30.89 1.01 431 95
x[25] -6.89 -0.02 5.60 -0.44 8.76 1.02 1888 500
x[26] -5.33 -0.02 4.69 -0.30 6.55 1.01 3095 727
x[27] -5.19 -0.04 5.17 -0.26 8.50 1.00 3113 868
x[28] -6.81 -0.02 6.81 0.15 7.71 1.00 4186 919
x[29] -9.96 -0.02 6.99 -1.66 18.87 1.01 926 219
x[30] -6.20 0.02 6.62 0.44 14.44 1.01 3500 552
x[31] -18.84 -0.04 5.77 -4.22 29.48 1.02 208 48
x[32] -6.33 -0.01 6.21 -0.02 6.73 1.02 3044 743
x[33] -4.81 0.03 4.93 0.22 7.89 1.01 3504 1003
x[34] -5.75 0.02 5.75 -0.12 9.23 1.00 1889 656
x[35] -5.98 0.03 7.16 0.06 8.68 1.01 2701 634
x[36] -7.37 0.00 9.19 0.08 10.82 1.01 1404 394
x[37] -5.03 0.02 7.82 2.52 29.21 1.01 2020 353
x[38] -5.40 -0.01 5.59 -0.06 6.64 1.01 3644 738
x[39] -4.78 -0.02 5.36 0.27 6.40 1.00 2849 752
x[40] -6.16 0.00 4.58 -0.91 10.45 1.01 1843 452
x[41] -5.46 0.07 6.55 0.05 6.06 1.01 2322 616
x[42] -7.63 -0.05 6.38 -0.75 10.10 1.01 1786 510
x[43] -7.47 -0.04 6.39 -0.59 8.32 1.01 1871 555
x[44] -5.22 0.00 5.17 0.11 6.15 1.01 1613 569
x[45] -4.44 0.04 13.98 3.19 17.67 1.01 295 70
x[46] -5.04 0.01 5.60 0.22 9.06 1.02 3151 432
x[47] -5.87 0.03 5.46 0.19 11.73 1.02 2247 495
x[48] -4.99 -0.03 4.91 0.05 10.22 1.01 2914 718
x[49] -6.42 0.01 10.78 10.69 83.66 1.02 223 51
x[50] -6.24 0.02 5.71 0.41 19.28 1.00 2610 783
I 0.00 1.00 1.00 0.51 0.50 1.00 683 4000
lp__ -88.54 -68.47 -50.77 -69.10 11.75 1.01 249 449
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
```

`which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])`

Several values of Rhat greater than 1.01 and some ESS less than 400 also indicate convergence propblems. The Appendix C contains more results with longer chains.

We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[5], which, in this specific run, had the smallest tail-ESS of 44.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures mixing of an function which indicates when the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.

`plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)`

The efficiency of sampling is low in the tails, which is clearly caused by slow mixing in long tails of the Cauchy distribution.

Orange ticks show iterations that exceeded the maximum treedepth.

An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures mixing of a function which indicates when the values are smaller than or equal to the corresponding quantile.

`plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)`

Similar as above, we see that the efficiency of sampling is low in the tails. Orange ticks show iterations that exceeded the maximum treedepth.

We may also investigate how the estimated effective sample sizes change when we use more and more draws; Brooks and Gelman (1998) proposed to use similar graph for \(\widehat{R}\). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.

`plot_change_ess(fit = fit_nom, par = which_min_ess)`