1 Overview

This is the online appendix of the paper

  • Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, Paul-Christian Bürkner (2019): Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

The code for the paper is available on GitHub (https://github.com/avehtari/rhat_ess). Here, we present all the code for the examples in the paper, along with additional numerical experiments not discussed in the paper itself.

To help you find your way through this online appendix, the examples below are presented in the order in which they are discussed in the paper.

2 Examples

In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices B-E contain more detailed analysis of different algorithm variants and further examples.

First, we load all the necessary R packages and additional functions.

library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
source('monitornew.R')
source('monitorplot.R')

2.1 Cauchy: A distribution with infinite mean and variance

This section relates to the examples presented in Section 5.1 of the paper.

Traditional \(\widehat{R}\) is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is infinite, this approach is not well justified, as we demonstrate here with a Cauchy-distributed example. The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution. Appendix C contains more detailed analysis of different algorithm variants and further Cauchy examples.
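To see why variance-based diagnostics break down here, consider even independent draws from a Cauchy distribution: the running mean never stabilizes, no matter how many draws are taken. A quick base-R illustration (not part of the original analysis):

set.seed(7878)
x <- rcauchy(1e5)
# the running mean of iid Cauchy draws keeps jumping whenever an extreme
# value appears, so variance-based summaries never settle down
plot(cumsum(x) / seq_along(x), type = "l",
     xlab = "number of draws", ylab = "running mean")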

2.1.1 Nominal parameterization of Cauchy

The nominal Cauchy model with direct parameterization is as follows:

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

2.1.1.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

We get HMC-specific diagnostic warnings (Betancourt, 2017) about a very large number of transitions after warmup that exceeded the maximum treedepth, and about a low estimated Bayesian fraction of missing information, both indicating slow mixing, likely due to the very long tails of the Cauchy distribution.
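These diagnostics can also be printed on demand with rstan's built-in checks (a convenience equivalent to the warnings above):

check_hmc_diagnostics(fit_nom)
# or individually, e.g.:
# check_treedepth(fit_nom)  # maximum treedepth exceedances
# check_energy(fit_nom)     # E-BFMI (Bayesian fraction of missing information)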

mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95   Mean    SD  Rhat Bulk_ESS Tail_ESS
x[1]   -5.87   0.02   6.78   0.92 13.64  1.01     1967      360
x[2]   -5.76  -0.03   5.11  -0.27  7.53  1.00     2417      802
x[3]   -6.12   0.00   7.73   0.77 10.28  1.01      448       82
x[4]   -5.30  -0.01   5.41  -0.06  7.19  1.01     2266      697
x[5]  -13.14  -0.04   5.83  -2.78 17.58  1.03      169       45
x[6]   -8.10  -0.06   6.39  -0.92 15.59  1.01     1611      402
x[7]   -7.25  -0.08   7.55  -0.29  9.68  1.01     1227      462
x[8]   -4.79   0.02   6.23   0.57  8.65  1.02     1568      432
x[9]   -4.69   0.01   4.82   0.13  5.43  1.00     2607      784
x[10]  -8.43  -0.04   6.50  -1.92 32.96  1.01     1609      261
x[11]  -6.21   0.02   7.09   0.43 16.21  1.01     2464      474
x[12]  -5.55   0.02   6.62   0.17  6.37  1.00     2813      404
x[13]  -6.16   0.00   6.13   0.44 12.78  1.01     2282      565
x[14]  -5.45   0.02   6.79   0.29  8.44  1.01     3121      647
x[15]  -8.17   0.05   9.55   0.26 30.61  1.01     2126      521
x[16]  -5.83   0.00   6.23   0.31  8.60  1.01     2177      589
x[17] -18.10  -0.07   7.39  -6.89 47.38  1.03      209       67
x[18]  -5.89   0.05   5.90   0.10  5.42  1.01     3336      664
x[19]  -7.04   0.03   6.23  -0.23 12.26  1.01     1434      421
x[20]  -6.58   0.03   7.53  -0.19  7.81  1.02      952      446
x[21]  -8.11   0.01   6.57  -2.16 28.47  1.00     1790      397
x[22]  -5.47   0.04   5.24   0.25  8.30  1.00     2841      712
x[23] -10.04  -0.05   6.55 -11.16 87.97  1.04      318       47
x[24] -13.18  -0.07   5.72  -4.58 30.89  1.01      441       98
x[25]  -6.89  -0.02   5.60  -0.44  8.76  1.02     1892      490
x[26]  -5.33  -0.02   4.69  -0.30  6.55  1.01     3099      732
x[27]  -5.19  -0.04   5.17  -0.26  8.50  1.00     3096      874
x[28]  -6.81  -0.02   6.81   0.15  7.71  1.00     4185      924
x[29]  -9.96  -0.02   6.99  -1.66 18.87  1.01      935      223
x[30]  -6.20   0.02   6.62   0.44 14.44  1.01     3504      559
x[31] -18.84  -0.04   5.77  -4.22 29.48  1.02      210       49
x[32]  -6.33  -0.01   6.21  -0.02  6.73  1.02     3039      748
x[33]  -4.81   0.03   4.93   0.22  7.89  1.01     3505     1008
x[34]  -5.75   0.02   5.75  -0.12  9.23  1.00     1896      661
x[35]  -5.98   0.03   7.16   0.06  8.68  1.01     2706      639
x[36]  -7.37   0.00   9.19   0.08 10.82  1.01     1405      399
x[37]  -5.03   0.02   7.82   2.52 29.21  1.01     2029      357
x[38]  -5.40  -0.01   5.59  -0.06  6.64  1.01     3611      741
x[39]  -4.78  -0.02   5.36   0.27  6.40  1.00     2855      757
x[40]  -6.16   0.00   4.58  -0.91 10.45  1.01     1860      457
x[41]  -5.46   0.07   6.55   0.05  6.06  1.01     2328      620
x[42]  -7.63  -0.05   6.38  -0.75 10.10  1.01     1796      513
x[43]  -7.47  -0.04   6.39  -0.59  8.32  1.01     1876      558
x[44]  -5.22   0.00   5.17   0.11  6.15  1.01     1610      575
x[45]  -4.44   0.04  13.98   3.19 17.67  1.01      290       64
x[46]  -5.04   0.01   5.60   0.22  9.06  1.02     3154      398
x[47]  -5.87   0.03   5.46   0.19 11.73  1.02     2258      497
x[48]  -4.99  -0.03   4.91   0.05 10.22  1.01     2916      724
x[49]  -6.42   0.01  10.78  10.69 83.66  1.02      214       43
x[50]  -6.24   0.02   5.71   0.41 19.28  1.00     2615      788
I       0.00   1.00   1.00   0.51  0.50  1.00      688      688
lp__  -88.54 -68.47 -50.77 -69.10 11.75  1.01      252      453

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])

Several values of Rhat greater than 1.01 and some ESS values less than 400 also indicate convergence problems. Appendix C contains more results with longer chains.

We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[49], which, in this specific run, had the smallest tail-ESS of 43.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures the mixing of an indicator function which indicates whether the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.
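The idea behind these plots can be sketched in a few lines of R. This is only a rough proxy, not the implementation in monitorplot.R; it uses rstan::ess_bulk as a generic chain-aware ESS estimate for the indicator draws:

sims <- as.array(fit_nom)[, , "x[49]"]  # iterations x chains
k <- 20
breaks <- quantile(sims, probs = seq(0, 1, length.out = k + 1))
ess_small_int <- sapply(1:k, function(i) {
  # indicator of falling into the i-th small interval
  I <- sims > breaks[i] & sims <= breaks[i + 1]
  rstan::ess_bulk(array(as.numeric(I), dim(sims)))
})
round(ess_small_int)  # low values at both ends flag poor tail mixing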

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

The efficiency of sampling is low in the tails, which is clearly caused by slow mixing in the long tails of the Cauchy distribution. Orange ticks show iterations that exceeded the maximum treedepth.

An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures the mixing of an indicator function which indicates whether the values are smaller than or equal to the corresponding quantile.
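A rough sketch of this computation, again using rstan::ess_bulk as a stand-in for the ESS estimate used in monitorplot.R:

sims <- as.array(fit_nom)[, , "x[49]"]
ess_quantile <- function(sims, alpha) {
  # mixing of the indicator I(x <= q_alpha) determines how well
  # the alpha-quantile itself can be estimated
  I <- array(as.numeric(sims <= quantile(sims, alpha)), dim(sims))
  rstan::ess_bulk(I)
}
sapply(c(0.05, 0.25, 0.5, 0.75, 0.95), ess_quantile, sims = sims)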

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

Similar to the above, we see that the efficiency of sampling is low in the tails. Orange ticks show iterations that exceeded the maximum treedepth.

We may also investigate how the estimated effective sample sizes change as we use more and more draws; Brooks and Gelman (1998) proposed using a similar graph for \(\widehat{R}\). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.
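A minimal version of this check, computing bulk-ESS on growing prefixes of the chains (plot_change_ess in monitorplot.R produces a more polished version):

sims <- as.array(fit_nom)[, , "x[49]"]
prefixes <- seq(100, nrow(sims), by = 100)
ess_path <- sapply(prefixes, function(n) rstan::ess_bulk(sims[1:n, ]))
plot(prefixes * ncol(sims), ess_path, type = "b",
     xlab = "total number of draws", ylab = "bulk-ESS")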

plot_change_ess(fit = fit_nom, par = which_min_ess)

We can further analyze potential problems using rank plots, which clearly show the mixing problems between chains. In the case of good mixing, all rank plots should be close to uniform.

samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
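Roughly, mcmc_hist_r_scale constructs these plots as follows (a base-R sketch, not the actual implementation): rank all draws jointly across chains and then histogram the ranks separately for each chain.

sims <- samp[, , xmin]                              # iterations x chains
ranks <- matrix(rank(sims), nrow(sims), ncol(sims))
par(mfrow = c(2, 2))
for (j in 1:ncol(ranks))
  hist(ranks[, j], breaks = 20, main = paste("chain", j), xlab = "rank")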

2.1.2 Alternative parameterization of Cauchy

Next, we examine an alternative parameterization of the Cauchy as a scale mixture of Gaussians. The model has two parameters, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that it matters which quantity the diagnostics focus on.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
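As a quick sanity check in plain R (independent of Stan), we can verify that this scale mixture indeed yields standard Cauchy draws: a standard normal divided by the square root of a Gamma(0.5, 0.5) (i.e., \(\chi^2_1\)) variate is a Student-\(t\) with one degree of freedom, which is the Cauchy.

x <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
rbind(sample = quantile(x, c(0.05, 0.5, 0.95)),
      cauchy = qcauchy(c(0.05, 0.5, 0.95)))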

Run the alternative model:

fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster.

mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

            Q5    Q50    Q95   Mean     SD  Rhat Bulk_ESS Tail_ESS
x_a[1]   -1.68  -0.03   1.60  -0.02   1.00     1     4259     3068
x_a[2]   -1.59  -0.01   1.71   0.02   1.00     1     3650     3149
x_a[3]   -1.62   0.00   1.61   0.01   1.00     1     4090     2890
x_a[4]   -1.66  -0.02   1.58  -0.02   0.99     1     4145     3085
x_a[5]   -1.69  -0.01   1.72  -0.01   1.02     1     4571     2518
x_a[6]   -1.67   0.01   1.71   0.00   1.02     1     4279     2994
x_a[7]   -1.69   0.03   1.71   0.02   1.01     1     4070     2909
x_a[8]   -1.61   0.01   1.64   0.01   0.99     1     3998     2889
x_a[9]   -1.71   0.02   1.64   0.01   1.00     1     3866     2989
x_a[10]  -1.63  -0.01   1.60  -0.01   1.00     1     3932     2749
x_a[11]  -1.60  -0.04   1.61  -0.01   0.99     1     4217     3576
x_a[12]  -1.64   0.01   1.64   0.00   1.00     1     3674     2689
x_a[13]  -1.70  -0.01   1.68  -0.01   1.02     1     3887     2907
x_a[14]  -1.69   0.03   1.66   0.02   1.00     1     3960     2728
x_a[15]  -1.63  -0.01   1.67   0.00   1.00     1     3742     3187
x_a[16]  -1.65  -0.01   1.65   0.00   1.00     1     4343     3251
x_a[17]  -1.66  -0.02   1.67   0.00   1.01     1     4344     2881
x_a[18]  -1.56   0.01   1.55   0.00   0.95     1     3917     2635
x_a[19]  -1.61  -0.02   1.65  -0.02   0.99     1     4046     2885
x_a[20]  -1.64   0.01   1.72   0.01   1.01     1     4216     2881
x_a[21]  -1.62  -0.01   1.58  -0.01   0.98     1     4313     2549
x_a[22]  -1.61   0.00   1.61  -0.01   0.98     1     3873     3011
x_a[23]  -1.69  -0.02   1.62  -0.03   1.00     1     3858     3068
x_a[24]  -1.61  -0.02   1.61  -0.01   0.98     1     3459     3014
x_a[25]  -1.65   0.01   1.64   0.00   1.01     1     4311     2984
x_a[26]  -1.60   0.05   1.64   0.03   0.98     1     3960     3115
x_a[27]  -1.64  -0.02   1.66  -0.02   1.01     1     3585     2799
x_a[28]  -1.62   0.01   1.59   0.01   0.99     1     3899     2891
x_a[29]  -1.60   0.02   1.58   0.02   0.97     1     4386     3173
x_a[30]  -1.65   0.02   1.70   0.02   1.01     1     4215     3112
x_a[31]  -1.64   0.00   1.64   0.00   0.98     1     3864     2662
x_a[32]  -1.67  -0.01   1.65   0.00   1.00     1     4340     2693
x_a[33]  -1.65  -0.02   1.59  -0.01   0.99     1     4231     2781
x_a[34]  -1.67   0.02   1.69   0.01   1.00     1     4170     3166
x_a[35]  -1.62   0.00   1.58  -0.01   0.98     1     3704     2632
x_a[36]  -1.68   0.01   1.65   0.00   0.99     1     3897     2909
x_a[37]  -1.60   0.03   1.66   0.02   0.99     1     4486     2500
x_a[38]  -1.69   0.01   1.72   0.01   1.03     1     4030     2710
x_a[39]  -1.64  -0.01   1.66  -0.01   1.01     1     4406     2790
x_a[40]  -1.69  -0.02   1.60  -0.03   1.01     1     4144     2932
x_a[41]  -1.64   0.02   1.65   0.01   1.00     1     4188     3142
x_a[42]  -1.65  -0.02   1.61  -0.03   0.99     1     3666     3173
x_a[43]  -1.62  -0.02   1.64  -0.01   1.00     1     3792     2291
x_a[44]  -1.65   0.02   1.76   0.03   1.03     1     3759     3035
x_a[45]  -1.69   0.04   1.71   0.02   1.03     1     3855     2972
x_a[46]  -1.67  -0.01   1.70   0.01   1.02     1     4204     3089
x_a[47]  -1.62  -0.03   1.65  -0.01   1.01     1     3883     3012
x_a[48]  -1.64   0.01   1.67   0.01   1.01     1     4373     3307
x_a[49]  -1.64   0.02   1.64   0.01   0.99     1     3825     3137
x_a[50]  -1.65  -0.01   1.63  -0.01   1.01     1     3733     2691
x_b[1]    0.00   0.41   3.83   1.00   1.43     1     2376     1408
x_b[2]    0.00   0.47   3.73   1.00   1.39     1     2249     1424
x_b[3]    0.00   0.47   3.76   1.00   1.38     1     2846     1860
x_b[4]    0.01   0.49   3.95   1.02   1.37     1     2759     1601
x_b[5]    0.00   0.49   3.89   1.04   1.48     1     3380     1880
x_b[6]    0.00   0.46   3.81   1.00   1.38     1     2597     1389
x_b[7]    0.01   0.47   4.07   1.02   1.44     1     2836     1441
x_b[8]    0.01   0.45   3.75   1.01   1.42     1     2566     1185
x_b[9]    0.00   0.44   3.83   1.00   1.42     1     2791     1270
x_b[10]   0.00   0.48   3.78   1.01   1.40     1     2695     1392
x_b[11]   0.00   0.45   3.87   0.99   1.36     1     2813     1685
x_b[12]   0.00   0.42   3.94   1.00   1.44     1     2553     1380
x_b[13]   0.00   0.45   3.94   1.02   1.44     1     2508     1294
x_b[14]   0.00   0.44   3.71   0.97   1.39     1     2967     1638
x_b[15]   0.00   0.43   3.98   1.02   1.45     1     2518     1374
x_b[16]   0.01   0.49   3.62   1.00   1.38     1     2403     1412
x_b[17]   0.00   0.47   3.86   1.01   1.39     1     2431     1460
x_b[18]   0.00   0.46   3.79   0.98   1.38     1     2179     1358
x_b[19]   0.00   0.44   3.81   0.99   1.39     1     2985     1732
x_b[20]   0.00   0.48   3.61   0.97   1.33     1     3160     1687
x_b[21]   0.00   0.47   3.74   0.98   1.35     1     2893     1501
x_b[22]   0.00   0.45   3.85   1.02   1.48     1     2446     1468
x_b[23]   0.00   0.46   4.02   1.02   1.44     1     2781     1918
x_b[24]   0.00   0.46   3.79   1.00   1.45     1     2554     1441
x_b[25]   0.00   0.49   3.84   1.01   1.41     1     2515     1295
x_b[26]   0.00   0.44   4.08   1.03   1.51     1     3287     1667
x_b[27]   0.00   0.43   3.75   0.99   1.38     1     2447     1435
x_b[28]   0.00   0.45   3.77   1.00   1.43     1     2636     1292
x_b[29]   0.00   0.42   3.80   0.97   1.40     1     2569     1492
x_b[30]   0.00   0.45   3.82   0.99   1.42     1     2508     1632
x_b[31]   0.00   0.43   3.84   0.99   1.37     1     2815     1816
x_b[32]   0.00   0.45   3.75   1.00   1.41     1     2410     1352
x_b[33]   0.01   0.46   3.81   0.99   1.38     1     2330     1405
x_b[34]   0.00   0.46   3.89   1.01   1.49     1     2841     1492
x_b[35]   0.00   0.41   3.71   0.97   1.42     1     2538     1580
x_b[36]   0.00   0.44   3.60   0.97   1.33     1     2484     1798
x_b[37]   0.00   0.47   3.82   1.01   1.42     1     2795     1643
x_b[38]   0.00   0.47   4.12   1.04   1.49     1     2650     1492
x_b[39]   0.00   0.45   3.96   1.02   1.44     1     2596     1562
x_b[40]   0.00   0.45   3.68   0.99   1.40     1     2462     1482
x_b[41]   0.00   0.48   3.76   1.00   1.38     1     2773     1638
x_b[42]   0.00   0.46   3.95   1.01   1.43     1     2267     1182
x_b[43]   0.00   0.48   3.85   1.02   1.42     1     2523     1266
x_b[44]   0.00   0.46   3.50   0.97   1.32     1     2706     1476
x_b[45]   0.00   0.45   4.01   1.04   1.49     1     2846     1642
x_b[46]   0.00   0.45   3.94   1.00   1.42     1     2487     1658
x_b[47]   0.00   0.49   3.75   1.01   1.40     1     2537     1298
x_b[48]   0.00   0.46   3.79   0.99   1.42     1     2350     1273
x_b[49]   0.00   0.43   3.85   1.00   1.40     1     2502     1449
x_b[50]   0.00   0.46   3.78   1.01   1.44     1     2778     1413
x[1]     -7.64  -0.04   7.33  -0.69  32.90     1     3894     2157
x[2]     -6.02  -0.01   6.41   4.16 362.61     1     3511     1730
x[3]     -6.54   0.00   6.33   0.19  33.51     1     3498     2547
x[4]     -5.52  -0.02   5.74  -3.14 191.65     1     3775     2314
x[5]     -6.20  -0.01   6.29  -0.47  21.58     1     4131     2630
x[6]     -5.55   0.01   6.40   0.19  91.87     1     4172     2475
x[7]     -5.73   0.05   6.63  -0.03  14.48     1     3751     2167
x[8]     -5.80   0.01   5.58  -1.23  53.41     1     3656     2401
x[9]     -5.96   0.03   6.01  -6.47 228.29     1     3569     2184
x[10]    -5.73  -0.01   6.11  -1.49  88.67     1     3612     1971
x[11]    -6.35  -0.05   6.08   0.98  48.99     1     3932     2519
x[12]    -6.59   0.02   5.94  -0.79  30.31     1     3394     2205
x[13]    -6.36  -0.01   6.24  -6.55 221.32     1     3898     2281
x[14]    -6.64   0.04   6.51   3.57 149.72     1     3697     2533
x[15]    -7.28  -0.01   5.90  -0.32  18.93     1     3695     2036
x[16]    -5.65  -0.01   5.28   0.37  32.30     1     3912     2343
x[17]    -6.10  -0.03   6.17  -0.33  26.14     1     3999     2162
x[18]    -5.74   0.02   6.79   3.13  94.50     1     3862     2080
x[19]    -6.01  -0.03   6.04   0.35  18.39     1     3803     2331
x[20]    -5.56   0.01   6.22  -0.07  30.58     1     3877     2340
x[21]    -6.21  -0.01   6.22   0.33  35.50     1     3907     2390
x[22]    -6.23   0.00   6.02  -8.18 356.87     1     3685     1771
x[23]    -6.17  -0.02   6.09  -0.37  14.57     1     3859     2629
x[24]    -6.25  -0.02   5.56  -0.61  81.96     1     3135     2013
x[25]    -6.72   0.02   5.47   1.85 202.62     1     3898     2320
x[26]    -5.61   0.06   5.81  -0.34  34.70     1     3875     2847
x[27]    -7.91  -0.03   6.89   6.55 323.96     1     3694     2566
x[28]    -6.29   0.01   6.80   0.35  31.31     1     3828     2205
x[29]    -6.57   0.02   6.30  -0.41  23.66     1     3797     2358
x[30]    -5.92   0.03   6.31  -0.31  38.53     1     3879     2160
x[31]    -5.80   0.00   6.28   0.36  35.66     1     3700     2567
x[32]    -6.10  -0.02   6.39  -3.81 298.91     1     4374     2447
x[33]    -6.00  -0.03   5.39   1.22  81.47     1     3643     2294
x[34]    -5.87   0.04   6.56   0.09  30.46     1     3854     2569
x[35]    -6.81   0.00   6.06  -0.07  27.08     1     3462     2259
x[36]    -6.12   0.01   6.09  -0.81  40.91     1     3737     2640
x[37]    -5.92   0.03   6.50  -6.16 263.64     1     4235     2372
x[38]    -6.46   0.01   6.00   0.60  58.15     1     3708     2085
x[39]    -6.79  -0.01   6.27  -1.86  64.47     1     4008     2319
x[40]    -6.97  -0.02   5.62   0.25 146.44     1     3776     2198
x[41]    -6.12   0.02   5.98  -0.88  67.65     1     3948     2554
x[42]    -7.48  -0.03   6.42 -18.09 643.89     1     3637     2154
x[43]    -5.99  -0.02   6.57  -0.40  52.47     1     3489     2331
x[44]    -6.09   0.03   5.93   0.44  26.31     1     3772     2410
x[45]    -6.25   0.04   6.65  -0.18  15.75     1     3732     2348
x[46]    -5.88  -0.02   7.59   0.75  28.36     1     3577     2172
x[47]    -6.18  -0.03   5.82   0.24  79.45     1     3776     2193
x[48]    -6.61   0.01   6.03  -3.56 129.17     1     3800     2092
x[49]    -5.88   0.02   6.52  -0.30  46.02     1     3776     2551
x[50]    -7.08   0.00   6.32  -0.71  45.99     1     3489     2385
I         0.00   0.00   1.00   0.49   0.50     1     2907     2907
lp__    -95.29 -81.24 -69.52 -81.62   7.89     1     1520     2679

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[101:150, 'Tail_ESS'])

For all parameters, Rhat is less than 1.01 and ESS exceeds 400, indicating that sampling worked much better with this alternative parameterization. Appendix C has more results using other parameterizations of the Cauchy distribution. The vectors x_a and x_b used to form the Cauchy-distributed x have stable quantile, mean, and sd values. The quantiles of each x_j are stable too, but the mean and variance estimates vary widely.

We can further analyze potential problems using local efficiency estimates and rank plots. For this example, we take a detailed look at x[2], which had the smallest tail-ESS of 1730.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.

plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)

The rank plots also look close to uniform across chains, which is consistent with good mixing.

samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.

2.1.3 Half-Cauchy with nominal parameterization

Half-Cauchy priors for non-negative parameters are common and, at least in Stan, usually specified via the nominal parameterization. In this example, we place independent half-Cauchy distributions on each element of the 50-dimensional vector \(x\), constrained to be positive (in Stan, <lower=0>). Stan then automatically switches to the unconstrained log(x) space, which crucially changes the geometry.
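To see why the constraint helps, note that the implied density of \(u = \log(x)\) (the half-Cauchy density times the Jacobian \(e^u\)) has exponentially decaying tails in both directions, which is far easier to explore than the heavy tails of the Cauchy itself. A quick R illustration of this implied density:

u <- seq(-8, 8, length.out = 200)
# p(u) = 2/pi * exp(u) / (1 + exp(2u)) for x ~ half-Cauchy(0, 1), x = exp(u)
plot(u, 2 / pi * exp(u) / (1 + exp(2 * u)), type = "l",
     xlab = "log(x)", ylab = "density")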

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy with nominal parameterization and positive constraint:

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the Cauchy nominal model without the constraint.

mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95   Mean      SD  Rhat Bulk_ESS Tail_ESS
x[1]    0.08   1.01  14.11   7.41  128.72  1.00     6697     1807
x[2]    0.08   0.99  13.81   9.58  159.58  1.00     5982     1757
x[3]    0.07   0.97  13.53   4.72   30.32  1.00     7459     2290
x[4]    0.09   0.99  11.20   4.99   43.10  1.00     7919     2115
x[5]    0.07   1.01  14.51   5.05   47.10  1.00     7558     2315
x[6]    0.08   1.02  13.91  91.28 5383.88  1.00     7483     2072
x[7]    0.08   0.99  13.46   6.57   93.25  1.00     7699     2010
x[8]    0.09   1.00  11.29   4.84   34.92  1.00     9338     2097
x[9]    0.07   0.99  13.40  10.80  162.43  1.00     7521     1955
x[10]   0.08   1.02  12.53   5.61   49.28  1.00     7445     2245
x[11]   0.09   1.00  10.67   6.78  116.46  1.01     7345     2430
x[12]   0.08   1.01  12.34   5.46   47.09  1.00     8026     2040
x[13]   0.08   1.01  11.84   4.77   62.16  1.00     7166     2076
x[14]   0.08   1.01  11.97   4.99   65.23  1.00     8478     2442
x[15]   0.08   0.98  13.84   6.31   61.42  1.00     8138     2055
x[16]   0.08   0.98  11.59   6.25   67.00  1.00     7628     2347
x[17]   0.08   0.97  11.95   3.92   19.12  1.00     6984     2278
x[18]   0.08   1.00  10.98   4.05   23.61  1.00     6992     2299
x[19]   0.08   1.03  12.08  38.38 2005.23  1.00     7604     2395
x[20]   0.07   1.00  14.24   5.68   39.23  1.00     8350     2060
x[21]   0.08   1.01  11.88   6.11  110.22  1.00     8152     2797
x[22]   0.09   1.02  11.18   7.01  169.77  1.00     7609     2000
x[23]   0.07   1.00  14.22   7.35   94.96  1.00     7406     2101
x[24]   0.07   0.99  12.30   6.63  130.07  1.00     7143     2012
x[25]   0.09   1.01  11.29   9.22  343.88  1.00     7952     2305
x[26]   0.07   1.05  13.90   7.72  130.68  1.00     7070     2414
x[27]   0.07   0.99  14.12   5.75   52.77  1.00     8364     1983
x[28]   0.07   1.00  15.56   7.49   73.53  1.00     8555     2076
x[29]   0.08   1.00  13.28   5.13   37.83  1.00     7542     2031
x[30]   0.06   0.99  14.14   7.69  139.13  1.00     6432     2053
x[31]   0.08   1.04  16.07   6.73   50.89  1.00     7127     2142
x[32]   0.09   1.01  13.23   5.32   43.21  1.00     7636     2527
x[33]   0.08   1.00  12.96   6.26   92.50  1.00     7312     2178
x[34]   0.09   0.97  11.01   3.78   23.01  1.00     7409     2395
x[35]   0.07   1.00  13.56   5.08   30.58  1.00     6836     2458
x[36]   0.07   1.01  13.18   6.52   80.50  1.00     7521     1799
x[37]   0.09   1.03  11.61   4.16   23.07  1.00     7219     2363
x[38]   0.08   0.99  15.73   5.35   58.79  1.00     7492     1877
x[39]   0.07   1.00  13.93  31.05  827.68  1.00     8001     1706
x[40]   0.07   1.03  12.88  10.80  173.36  1.00     7296     1772
x[41]   0.07   0.97  13.42   7.83   96.96  1.00     7199     1964
x[42]   0.09   0.99  11.94   8.61  152.58  1.00     7515     2440
x[43]   0.09   1.01  12.12   4.59   52.98  1.00     7372     2447
x[44]   0.08   1.02  12.66   4.62   34.61  1.00     6424     2012
x[45]   0.07   1.00  13.25  20.04  914.15  1.00     6957     2100
x[46]   0.07   1.03  14.46   4.87   27.22  1.00     7806     2132
x[47]   0.08   1.00  13.13   6.86   99.30  1.00     6564     2709
x[48]   0.07   1.02  12.49   6.80   87.98  1.00     7766     2590
x[49]   0.08   1.00  12.68   5.54   37.20  1.00     7903     2137
x[50]   0.10   0.99  11.95   8.59  155.18  1.00     8049     1715
I       0.00   0.00   1.00   0.50    0.50  1.00     5802     4000
lp__  -80.63 -69.45 -59.66 -69.65    6.45  1.00     1144     2008

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

All values of Rhat are less than 1.01 and ESS exceeds 400 for all parameters, indicating good performance of the sampler despite using the nominal parameterization of the Cauchy distribution. More experiments with the half-Cauchy distribution can be found in Appendix C.

2.2 Hierarchical model: Eight Schools

This section relates to the examples presented in Section 5.2 of the paper.

The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013), which even in its simplicity illustrates the typical problems in inference for hierarchical models. The Stan models are adapted from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix D contains more detailed analysis of different algorithm variants.

2.2.1 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

2.2.1.1 Increased adapt_delta

We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.

eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 28 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems

Still, we observe a lot of divergent transitions, which in itself is already a sufficient indicator of convergence problems. We can also use Rhat and ESS diagnostics to recognize problematic parts of the posterior. These two have the advantage over the divergent-transitions diagnostic that they can be used with all MCMC algorithms, not only with HMC.
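The divergent transitions can also be counted per chain directly from the sampler parameters:

sp <- get_sampler_params(fit_cp, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))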

mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -0.93   4.53  9.85   4.50 3.26  1.01      698     1125
tau        0.71   3.02  9.79   3.86 3.12  1.02      219      232
theta[1]  -1.24   5.88 16.01   6.41 5.52  1.01      993     1595
theta[2]  -2.44   4.99 12.84   5.05 4.73  1.00     1252     2039
theta[3]  -4.99   4.32 12.13   4.06 5.36  1.00     1170     1792
theta[4]  -2.91   4.88 12.73   4.89 4.77  1.00     1032     1910
theta[5]  -3.99   3.95 10.80   3.75 4.59  1.00      957     2017
theta[6]  -3.97   4.38 11.45   4.16 4.71  1.01     1199     1706
theta[7]  -0.91   6.04 15.06   6.44 4.99  1.00      937     1589
theta[8]  -3.08   4.86 13.78   4.99 5.20  1.00     1292     1498
lp__     -24.50 -15.31 -4.61 -15.03 6.07  1.02      211      242

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

See Appendix D for results of longer chains.

Bulk-ESS and Tail-ESS for the between-school standard deviation tau are 219 and 232, respectively. Both are much less than 400, indicating that we should investigate this parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates of tau. These plots may show either quantiles or parameter values on the vertical axis. Red ticks show divergent transitions.

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)

We see that the sampler has difficulties in exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we may assume that we miss a substantial amount of posterior mass and get biased estimates. Red ticks, which show iterations with divergences, are concentrated at small tau values, giving us another indication of problems in exploring small values.

We also examine the sampling efficiency of different quantile estimates. Again, these plots may show either quantiles or parameter values on the vertical axis.

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)

Most of the quantile estimates have worryingly low effective sample size.

Next we examine how the estimated effective sample size changes when we use more and more draws. Here we do not see sudden changes, but both bulk-ESS and tail-ESS are consistently low. See Appendix D for results of longer chains.

plot_change_ess(fit = fit_cp, par = "tau")

In line with the other findings, rank plots of tau clearly show problems in the mixing of the chains. In particular, the rank plot for the first chain indicates that it was unable to explore the lower end of the posterior range, while the spike in the rank plot for chain 2 indicates that it spent too much time stuck at these values.

samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])

2.2.2 Non-centered Eight Schools model

For hierarchical models, the non-centered parameterization often works better than the centered one:

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:

fit_ncp2 <- stan(
  file = 'eight_schools_ncp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)

We get zero divergences and no other warnings, which is a good first sign.

mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                   Q5   Q50   Q95  Mean   SD  Rhat Bulk_ESS Tail_ESS
mu              -1.00  4.41  9.80  4.43 3.29     1     5199     3093
tau              0.24  2.73  9.46  3.56 3.17     1     2672     1967
theta_tilde[1]  -1.35  0.30  1.96  0.31 1.00     1     5589     2535
theta_tilde[2]  -1.45  0.11  1.63  0.10 0.92     1     5605     2956
theta_tilde[3]  -1.70 -0.09  1.60 -0.08 1.00     1     5600     2863
theta_tilde[4]  -1.44  0.09  1.63  0.08 0.93     1     5806     3246
theta_tilde[5]  -1.66 -0.17  1.39 -0.16 0.92     1     5021     3165
theta_tilde[6]  -1.60 -0.07  1.54 -0.06 0.96     1     5051     2866
theta_tilde[7]  -1.31  0.34  1.88  0.33 0.98     1     4559     2754
theta_tilde[8]  -1.50  0.07  1.66  0.08 0.96     1     5634     2990
theta[1]        -1.78  5.66 16.24  6.19 5.70     1     4330     3161
theta[2]        -2.33  4.83 12.35  4.89 4.54     1     5314     3418
theta[3]        -5.07  4.17 11.92  3.99 5.33     1     4353     3211
theta[4]        -2.77  4.76 12.52  4.85 4.76     1     5685     3159
theta[5]        -4.41  3.85 10.59  3.64 4.61     1     4822     3416
theta[6]        -3.82  4.28 11.45  4.07 4.80     1     4863     2816
theta[7]        -1.18  5.86 15.20  6.26 5.01     1     4743     3570
theta[8]        -3.36  4.78 12.93  4.89 5.15     1     5094     3335
lp__           -11.20 -6.67 -3.76 -6.97 2.35     1     1488     2254

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

All Rhat values are below 1.01 and all ESS values are above 400, indicating much better performance of the non-centered parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is good. We may also check this with a finer resolution:

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)

The sampling efficiency for different quantile estimates looks good as well.

plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)

The rank plots of tau show no substantial differences between chains.

samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])

References

Betancourt, M. (2017) ‘A conceptual introduction to Hamiltonian Monte Carlo’, arXiv preprint arXiv:1701.02434.

Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.

Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.

Appendices

The following abbreviations are used throughout the appendices:

  • N = total number of draws
  • Rhat = classic no-split-Rhat
  • sRhat = classic split-Rhat
  • zsRhat = rank-normalized split-Rhat
    • all chains are jointly ranked and z-transformed
    • can detect differences in location and trends
  • zfsRhat = rank-normalized folded split-Rhat
    • all chains are jointly “folded” by computing absolute deviation from median, ranked and z-transformed
    • can detect differences in scales
  • seff = no-split effective sample size
  • reff = seff / N
  • zsseff = rank-normalized split effective sample size
    • estimates the efficiency of mean estimate for rank normalized values
  • zsreff = zsseff / N
  • zfsseff = rank-normalized folded split effective sample size
    • estimates the efficiency of rank normalized mean absolute deviation
  • zfsreff = zfsseff / N
  • tailseff = minimum of rank-normalized split effective sample sizes of the 5% and 95% quantiles
  • tailreff = tailseff / N
  • medsseff = median split effective sample size
    • estimates the efficiency of the median
  • medsreff = medsseff / N
  • madsseff = mad split effective sample size
    • estimates the efficiency of the median absolute deviation
  • madsreff = madsseff / N

Appendix A: Computing the effective sample size

There are no examples or numerical experiments related to Appendix A in the paper.

Appendix B: Normal distributions with additional trend, shift or scaling

This part focuses on diagnostics for

  • all chains having a trend and a similar marginal distribution
  • one of the chains having a different mean
  • one of the chains having a lower marginal variance

To simplify, in this part, independent draws are used as a proxy for very efficient MCMC sampling. First, we sample draws from a standard-normal distribution. We will discuss the behavior for non-normal distributions later. See Appendix A for the abbreviations used.

Adding the same trend to all chains

All chains are drawn from the same Normal(0, 1) distribution, with the same linear trend added to each chain:

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  trend = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  trend <- conds[i, "trend"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r <- r + seq(-trend, trend, length.out = iters)
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)

If we don’t split chains, Rhat misses the trends if all chains still have a similar marginal distribution.

ggplot(data = res, aes(y = Rhat, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Rhat without splitting chains')

Split-Rhat can detect trends, even if the marginals of the chains are similar.

ggplot(data = res, aes(y = zsRhat, x = trend)) + 
  geom_point() + geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.

The effective sample size is based on split-Rhat and within-chain autocorrelation. We plot the relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\). In the plot below, dashed lines indicate the threshold at which we would consider the effective sample size to be sufficient (i.e., \(S_{\rm eff} > 400\)). Since we plot the relative efficiency instead of the effective sample size itself, this threshold is divided by \(S\), which we compute here as the number of iterations per chain (variable iters) times the number of chains (\(4\)); for example, with 1000 iterations per chain the dashed line is at \(400/4000 = 0.1\).

ggplot(data = res, aes(y = zsreff, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: Split-Rhat is more sensitive to trends for small sample sizes, but the effective sample size becomes more sensitive for larger sample sizes (as autocorrelations can be estimated more accurately).

Advice: If in doubt, run longer chains for more accurate convergence diagnostics.

Shifting one chain

Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  shift = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  shift <- conds[i, "shift"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] + shift
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.

ggplot(data = res, aes(y = zsreff, x = shift)) + 
  geom_point() +
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more will lead to very low relative efficiency when the total number of draws increases.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.

iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although Rhat was less than \(1.05\) in this situation, the rank plots clearly show that the first chain behaves differently.

Scaling one chain

Next, we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has lower marginal variance.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  scale = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  scale <- conds[i, "scale"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] * scale
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)

We first look at the Rhat estimates:

ggplot(data = res, aes(y = zsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is not able to detect scale differences between chains.

ggplot(data = res, aes(y = zfsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Folded-split-Rhat')

Result: Folded-Split-Rhat focuses on scales and detects scale differences.

Result: If we use a threshold of \(1.01\), we can detect a chain whose standard deviation is less than \(3/4\) of that of the others. If we use a threshold of \(1.1\), we detect a chain whose standard deviation is less than \(1/4\) of that of the others.

Next, we look at the effective sample size estimates:

ggplot(data = res, aes(y = zsreff, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The bulk effective sample size of the mean does not see a problem as it focuses on location differences between chains.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.

iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although folded Rhat is \(1.06\), the rank plots clearly show that the first chain behaves differently.

Appendix C: More experiments with the Cauchy distribution

The classic split-Rhat is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-Rhat is not well justified. In this section, we will use the Cauchy distribution as an example of such a distribution. Even in cases where the mean and variance are finite, the distribution can be far from Gaussian; especially distributions with very long tails cause instability for variance and autocorrelation estimates. To alleviate these problems, we use split-Rhat on rank-normalized draws.
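The rank normalization itself can be sketched in a few lines of R (a compact illustration of the idea; monitornew.R contains the actual implementation):

z_scale <- function(sims) {
  # sims: iterations x chains matrix; rank jointly over all chains,
  # then map the ranks to normal scores using the Blom-type offset
  r <- rank(sims, ties.method = "average")
  matrix(qnorm((r - 3/8) / (length(r) + 1/4)), nrow(sims), ncol(sims))
}
fold <- function(sims) abs(sims - median(sims))  # for the scale-sensitive variants

The rank-normalized split-Rhat is then the classic split-Rhat applied to z_scale of the split chains, and the folded variant applies the same to fold(sims).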

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.

Nominal parameterization of Cauchy

We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedance and the Bayesian fraction of missing information are dynamic-HMC-specific diagnostics (Betancourt, 2017). We get warnings about a very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.

Trace plots for the first parameter look wild with occasional large values:

samp <- as.array(fit_nom) 
mcmc_trace(samp[, , 1])

Let’s check Rhat and ESS diagnostics.

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

For one parameter, the Rhat values exceed the classic threshold of 1.1. Depending on the Rhat variant, a few others also exceed the threshold of 1.01. The rank-normalized split-Rhat has several values over 1.01. Please note that the classic split-Rhat is not well defined in this example, because the mean and variance of the Cauchy distribution are not finite.

plot_ess(res) 

Both classic and new effective sample size estimates have several very small values, and so the overall sample shouldn’t be trusted.

Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat in detecting problems of slow mixing.

We also check the log posterior value lp__ and find out that the effective sample size is worryingly low.

res <- monitor_extra(samp[, , 51:52]) 
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS =  252 
cat('lp__: Tail-ESS = ', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS =  453 

We can further analyze potential problems using local effective sample size and rank plots. We examine x[49], which has the smallest tail-ESS of 43.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates (see Section Efficiency for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, with \(k=20\)). The small interval efficiency measures the mixing of an indicator function which indicates whether the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in the long tails of the Cauchy distribution). Orange ticks show draws that exceeded the maximum treedepth.

An alternative way to examine the effective sample size in different parts of the posterior is to compute the effective sample size for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures the mixing of an indicator function which indicates whether the values are smaller than or equal to the corresponding quantile.

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in the long tails of the Cauchy distribution). Orange ticks show draws that exceeded the maximum treedepth.

We can further analyze potential problems using rank plots, from which we clearly see differences between chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Default Stan options + increased maximum treedepth

We can try to improve the performance by increasing max_treedepth to \(20\):

fit_nom_td20 <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20)
)
Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

We check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in the scale than in the location between different chains.

plot_ess(res) 

Some classic effective sample sizes are very small. If we didn't realize that the variance is infinite, we might try to run longer chains; but in the case of infinite variance, zero relative efficiency (ESS/S) is the truth, and longer chains won't help with that. However, other quantities can be well defined, which is why it is useful to also look at the rank-normalized version as a generic transformation that achieves finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is, we are able to estimate the median efficiently. However, many tail-ESS values are less than 400, indicating problems in estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing max_treedepth improved the bulk-ESS of x, the efficiency for lp__ didn't change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 158 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 481 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)

It seems that increasing max_treedepth did not much improve the efficiency in the tails. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[8], which has the smallest tail-ESS among the \(x\), indicates clear convergence problems.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__, which has an effective sample size 158, doesn’t look so good either.

mcmc_hist_r_scale(samp[, , "lp__"])

Default Stan options + increased maximum treedepth + longer chains

Let’s try running 8 times longer chains.

fit_nom_td20l <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20), 
  warmup = 1000, iter = 9000
)
Warning: There were 2 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

Let’s check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).

plot_ess(res) 

Most classic ESS’s are close to zero. Running longer chains just made most classic ESS’s even smaller.

The smallest bulk-ESS is around 5000, which is not that bad. The smallest median-ESS is larger than 25000, that is, we are able to estimate the median efficiently. However, the smallest tail-ESS is 833, indicating problems in estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing the number of iterations improved the bulk-ESS of the \(x\), the relative efficiency for lp__ didn't change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1289 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2108 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)

Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with fewer draws from the tails. We see similar problems in the plot of effective sample size for quantiles.

plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

Let’s look at the rank plot visualisation of x[48], which has the smallest tail-ESS of 833 among the \(x\).

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Increasing the number of iterations couldn’t remove the mixing problems in the tails. The mixing problem is inherent to the nominal parameterization of the Cauchy distribution.

First alternative parameterization of the Cauchy distribution

Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that it matters which quantity the diagnostics focus on.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
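
As a quick sanity check (not part of the paper's code), we can verify the scale-mixture representation by direct simulation: if \(x_a \sim \mathrm{normal}(0, 1)\) and \(x_b \sim \mathrm{gamma}(0.5, 0.5)\), then \(x_a / \sqrt{x_b}\) is standard Cauchy.

set.seed(1)
x <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
# Compare sample quantiles against standard Cauchy quantiles.
qqplot(qcauchy(ppoints(200)), quantile(x, ppoints(200)),
       xlab = "Cauchy quantiles", ylab = "sample quantiles")
abline(0, 1)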

We run the alternative model:

fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)

There are no warnings and the sampling is much faster.

samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: Rank normalized ESS's have less variation than the classic ones, which are not well defined for the Cauchy distribution.

We check lp__:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1520 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2679 

The relative efficiencies for lp__ are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)

The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)

We compare the mean relative efficiencies of the underlying parameters in the new parameterization and the actual \(x\) we are interested in.

res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3773.76 
cat('Mean Tail-ESS for x =' , round(mean(res[, 'tailseff']), 2), '\n')
Mean Tail-ESS for x = 2312.1 
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 4032.3 
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2638.76 

Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.

The rank plot visualisation of x[2], which has the smallest tail-ESS of 1730 among the \(x\), looks better than for the nominal parameterization.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Similarly, the rank plot visualisation of lp__ looks better than for the nominal parameterization.

mcmc_hist_r_scale(samp[, , "lp__"])

Another alternative parameterization of the Cauchy distribution

Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).

writeLines(readLines("cauchy_alt_3.stan"))
parameters {
  vector<lower=0, upper=1>[50] x_tilde;
}

transformed parameters {
  vector[50] x = tan(pi() * (x_tilde - 0.5));
}

model {
  // Implicit uniform prior on x_tilde
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
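
A one-line check (base R only) confirms that this transformation is the inverse CDF of the standard Cauchy, so a uniform x_tilde maps exactly to a Cauchy-distributed x:

u <- ppoints(9)
max(abs(tan(pi * (u - 0.5)) - qcauchy(u)))  # numerically zero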

We run the alternative model:

fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the nominal model.

samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats except some folded Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: Rank normalized relative efficiencies have less variation than the classic ones. Some bulk and median relative efficiencies are slightly larger than 1, which is possible for antithetic Markov chains that have negative autocorrelations at odd lags.
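
For the folded Rhats mentioned above, here is a minimal sketch of how they are computed, assuming rstan >= 2.19, which exports a rank-normalized Rhat(): the draws are folded around their median, which makes the diagnostic sensitive to differences in scale rather than location between chains.

x <- samp[, , "x[1]"]                    # iterations x chains
zeta <- abs(x - median(x))               # fold the draws around the median
rstan::Rhat(zeta)                        # rank-normalized split-Rhat of zeta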

We also take a closer look at the lp__ value:

res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1338 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2065 

The effective sample sizes for lp__ are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)

We also examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)

The effective sample size in the tails is worse than for the first alternative parameterization, although it is still better than for the nominal parameterization.

We compare the mean effective sample size of the underlying parameter in the new parameterization and the actual Cauchy-distributed \(x\) we are interested in.

res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4868.4 
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1604.48 
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4868.4 
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1612.3 

The rank plot visualisation of x[23], which has the smallest tail-ESS of 1925 among the \(x\), shows good efficiency, similar to the results for lp__.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

mcmc_hist_r_scale(samp[, , "lp__"])

Half-Cauchy distribution with nominal parameterization

Half-Cauchy priors are common and, in Stan for example, they are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which crucially changes the geometry.
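
To illustrate, here is a small base-R sketch of the change of variables Stan applies under the <lower=0> constraint: on the unconstrained scale \(y = \log(x)\), the density picks up the Jacobian factor \(e^y\), and the heavy polynomial tail of the half-Cauchy turns into exponentially decaying tails that are much easier to explore.

y <- seq(-6, 8, length.out = 400)
dens_y <- 2 * dcauchy(exp(y)) * exp(y)   # half-Cauchy density times Jacobian
plot(y, dens_y, type = "l", xlab = "log(x)", ylab = "density")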

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

We run the half-Cauchy model with nominal parameterization (and positive constraint).

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.

samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res) 

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.

plot_ess(res)  

Result: Rank normalized effective sample sizes have less variation than the classic ones. Some bulk and median relative efficiencies are larger than 1, which is possible for antithetic Markov chains that have negative autocorrelations at odd lags.

Due to the <lower=0> constraint, sampling was done in the log(x) space, and we can also check the performance in that space.

res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res) 

\(\log(x)\) is quite close to Gaussian, and thus the classic effective sample size is also close to the rank normalized ESS, which in turn is exactly the same as for the original \(x\), as rank normalization is invariant to monotone transformations.

Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
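
The invariance is easy to check directly with a small base-R sketch: for positive draws, \(x\) and \(\log(x)\) have identical pooled ranks, so any rank-based diagnostic gives the same value for both.

x <- abs(rcauchy(1000))
all(rank(x) == rank(log(x)))  # TRUE: log is monotone, so ranks are unchanged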

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)

The effective sample size is good overall, with only a small dip in the tails. We can also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[39], which has the smallest tail-ESS of 1706 among \(x\), looks good.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__ reveals some small differences in the scales of the chains, but it is difficult to say whether this small deviation from uniformity is relevant.

mcmc_hist_r_scale(samp[, , "lp__"])

Alternative parameterization of the half-Cauchy distribution

writeLines(readLines("half_cauchy_alt.stan"))
parameters {
  vector<lower=0>[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a .* sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ inv_gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
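
Again as a quick sanity check (not part of the paper's code): with \(x_a\) half-normal(0, 1) and \(x_b\) inverse-gamma(0.5, 0.5), the product \(x_a \sqrt{x_b}\) is half-Cauchy(0, 1).

set.seed(1)
x <- abs(rnorm(1e5)) * sqrt(1 / rgamma(1e5, shape = 0.5, rate = 0.5))
# Half-Cauchy quantiles are Cauchy quantiles evaluated at (1 + p) / 2.
qqplot(qcauchy((1 + ppoints(200)) / 2), quantile(x, ppoints(200)),
       xlab = "half-Cauchy quantiles", ylab = "sample quantiles")
abline(0, 1)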

We run the half-Cauchy model with the alternative parameterization:

fit_half_reparam <- stan(
  file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)

There are no warnings, and the sampling is as fast as for the nominal half-Cauchy model.

samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

plot_ess(res) 

Result: Rank normalized relative efficiencies have less variation than the classic ones, which are not well defined for the half-Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution, as the <lower=0> constraint additionally changes the parameterization.

We also check the lp__ values:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 735 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1511 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)

We also examine the effective sample size for different quantile estimates.

plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)

The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.

The rank plot visualisation of x[13], which has the smallest tail-ESS of 1511 among the \(x\), reveals deviations from uniformity, which is expected with a lower effective sample size.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

A similar result is obtained when looking at the rank plots of lp__.

mcmc_hist_r_scale(samp[, , "lp__"])

The Cauchy distribution with Jags

So far, we have run all models in Stan, but we also want to investigate whether similar problems arise with probabilistic programming languages that use samplers other than variants of Hamiltonian Monte Carlo. Thus, we fit the Cauchy model also with Jags, which uses a dialect of the BUGS language to specify models. Jags uses a clever mix of Gibbs and Metropolis-Hastings sampling. This kind of sampling does not scale well to high-dimensional posteriors with strongly interdependent parameters, but for the relatively simple models discussed in this case study it should work just fine.

The Jags code for the nominal parameterization of the Cauchy distribution looks as follows:

writeLines(readLines("cauchy_nom.bugs"))
model {
  for (i in 1:50) {
    x[i] ~ dt(0, 1, 1)
  }
}

First, we initialize the Jags model for later reuse.

jags_nom <- jags.model(
  "cauchy_nom.bugs",
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 0
   Unobserved stochastic nodes: 50
   Total graph size: 52

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_nom <- coda.samples(
  jags_nom, variable.names = "x",
  n.iter = 1000
)
# Convert the list of mcmc chains into an iterations x chains x parameters
# array, the layout expected by monitor() and the other helper functions.
samp_jags_nom <- aperm(abind(samp_jags_nom, along = 3), c(1, 3, 2))
dimnames(samp_jags_nom)[[2]] <- paste0("chain:", 1:4)

We summarize the model as follows:

mon <- monitor(samp_jags_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

         Q5   Q50  Q95  Mean      SD  Rhat Bulk_ESS Tail_ESS
x[1]  -6.93  0.02 6.51 -0.55   53.25     1     3830     3932
x[2]  -7.21  0.02 5.87 -0.72   60.59     1     3902     3853
x[3]  -6.06  0.04 6.68  0.94   50.81     1     3905     3747
x[4]  -5.71  0.01 6.29  1.45   64.58     1     3837     3893
x[5]  -5.75  0.01 6.50  4.60  265.33     1     3878     3648
x[6]  -5.80 -0.05 6.36  1.58  122.00     1     3947     3933
x[7]  -6.27  0.05 7.02  0.04  101.95     1     4052     3891
x[8]  -6.82  0.02 6.27 61.47 3828.43     1     4121     4045
x[9]  -5.79  0.01 6.28  1.01   58.57     1     3885     3774
x[10] -6.58 -0.01 6.03  0.71   59.11     1     3891     3838
x[11] -6.24  0.01 6.41  0.57   29.92     1     3795     3645
x[12] -6.15 -0.01 5.97 -1.21   36.69     1     4163     4011
x[13] -5.67 -0.01 6.61 -1.56   37.27     1     4113     4056
x[14] -6.42 -0.04 5.93 -1.21  112.06     1     3922     3877
x[15] -6.19  0.00 5.67 -2.54  145.67     1     4244     3839
x[16] -6.06 -0.01 5.78 -0.67   32.09     1     3740     3703
x[17] -6.06  0.02 7.02  0.16   92.19     1     3867     3785
x[18] -6.43 -0.01 5.87 -0.62   32.15     1     3854     3848
x[19] -7.24  0.02 5.98 -1.40   67.31     1     3925     4013
x[20] -6.68 -0.02 6.15 -0.81   45.61     1     3837     3866
x[21] -5.82  0.00 7.09 -1.03   61.63     1     3814     3870
x[22] -5.98  0.05 7.38  0.92   67.15     1     3667     3799
x[23] -6.41 -0.01 6.59  7.93  504.67     1     3850     4036
x[24] -5.67 -0.02 7.09 -0.22   38.16     1     3600     3844
x[25] -5.66  0.05 6.35  0.11   19.05     1     4106     3969
x[26] -6.22  0.01 6.97 -1.43   55.55     1     3588     3986
x[27] -6.75  0.00 6.89  1.40  474.30     1     3715     4094
x[28] -6.33  0.03 6.90  1.18  117.64     1     3718     3865
x[29] -6.03  0.01 6.87 -2.94  281.30     1     4012     3802
x[30] -6.76 -0.04 5.79  0.74   39.61     1     4024     3872
x[31] -5.51  0.01 6.40  0.03   50.73     1     3702     3763
x[32] -6.72  0.00 6.53 -1.34   95.60     1     4084     3865
x[33] -6.65  0.02 5.61 -0.28   34.47     1     3636     3292
x[34] -6.78  0.01 6.92 -0.59   44.36     1     3820     3890
x[35] -6.61 -0.02 6.84  1.06   39.50     1     4129     3787
x[36] -6.19  0.00 6.73  1.51   58.17     1     4245     3789
x[37] -6.22  0.00 5.98  5.07  236.63     1     3999     3931
x[38] -6.86 -0.01 6.40 -0.24   35.80     1     3701     3888
x[39] -6.61 -0.01 6.02  0.54   60.11     1     4140     3882
x[40] -6.25  0.00 5.82 -0.65   27.90     1     3834     3930
x[41] -6.37 -0.01 6.06  1.01  138.14     1     3721     3507
x[42] -6.28  0.04 6.69 -0.42   22.08     1     4102     4015
x[43] -5.71  0.00 5.95  1.18   87.59     1     4171     3730
x[44] -5.76 -0.03 6.01  0.95   31.31     1     4079     3949
x[45] -6.19 -0.05 6.39 -3.06  217.64     1     3655     3847
x[46] -6.70  0.00 7.10  0.69   48.83     1     3912     3775
x[47] -5.67  0.06 7.08  0.52   23.83     1     4000     3855
x[48] -6.56 -0.02 5.85 -0.69   45.41     1     3866     3893
x[49] -5.91  0.03 6.79 -0.30   79.41     1     4116     3971
x[50] -6.24 -0.01 5.96 -1.49   90.43     1     3764     3682

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])

The overall results look very promising with Rhats = 1 and ESS values close to the total number of draws of 4000. We take a detailed look at x[26], which has the smallest bulk-ESS of 3588.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 40)

Rank plots also look rather similar across chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp_jags_nom[, , xmin])

Result: Jags seems to be able to sample from the nominal parameterization of the Cauchy distribution just fine.

Appendix D: A centered eight schools model with very long chains and thinning

We continue our discussion of hierarchical models on the Eight Schools data, which we started in the Eight Schools section. We also analyse the performance of different variants of the diagnostics.

A centered eight schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

In the main text, we observed that the centered parameterization of this hierarchical model did not work well with Stan's default MCMC options even with increased adapt_delta, so we directly try fitting the model with longer chains.

Centered parameterization with longer chains

Low sampling efficiency can sometimes be compensated for by running longer chains. Let's check chains that are 10 times longer.

fit_cp2 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 20000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 736 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -0.99   4.43  9.73   4.39 3.29  1.00     4802    10367
tau        0.49   2.95 10.04   3.81 3.22  1.01      832      366
theta[1]  -1.43   5.69 16.40   6.31 5.68  1.00     6735    12099
theta[2]  -2.43   4.85 12.71   4.93 4.72  1.00     8848    16195
theta[3]  -5.02   4.15 11.97   3.90 5.37  1.00     8249    13861
theta[4]  -2.72   4.75 12.65   4.79 4.82  1.00     7967    15686
theta[5]  -4.45   3.83 10.65   3.56 4.68  1.00     7141    14542
theta[6]  -4.17   4.20 11.73   4.04 4.92  1.00     8512    14927
theta[7]  -0.81   5.92 15.58   6.44 5.15  1.00     5895    13519
theta[8]  -3.40   4.80 13.57   4.86 5.43  1.00     8356    14080
lp__     -25.06 -15.17 -2.25 -14.64 6.82  1.01      849      491

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

           mean se_mean   sd     Q5    Q50   Q95  seff reff sseff zseff zsseff zsreff  Rhat sRhat
mu         4.39    0.05 3.29  -0.99   4.43  9.73  4757 0.12  4808  4752   4802   0.12     1  1.00
tau        3.81    0.07 3.22   0.49   2.95 10.04  2037 0.05  2049   813    832   0.02     1  1.00
theta[1]   6.31    0.07 5.68  -1.43   5.69 16.40  7278 0.18  7292  6706   6735   0.17     1  1.00
theta[2]   4.93    0.05 4.72  -2.43   4.85 12.71  9687 0.24  9788  8531   8848   0.22     1  1.00
theta[3]   3.90    0.05 5.37  -5.02   4.15 11.97  9992 0.25 10062  8107   8249   0.21     1  1.00
theta[4]   4.79    0.05 4.82  -2.72   4.75 12.65  9072 0.23  9194  7989   7967   0.20     1  1.00
theta[5]   3.56    0.05 4.68  -4.45   3.83 10.65  8243 0.21  8217  7117   7141   0.18     1  1.00
theta[6]   4.04    0.05 4.92  -4.17   4.20 11.73  9597 0.24  9573  8412   8512   0.21     1  1.00
theta[7]   6.44    0.07 5.15  -0.81   5.92 15.58  6263 0.16  6345  5802   5895   0.15     1  1.00
theta[8]   4.86    0.05 5.43  -3.40   4.80 13.57 10220 0.26 10312  8301   8356   0.21     1  1.00
lp__     -14.64    0.25 6.82 -25.06 -15.17 -2.25   768 0.02   787   826    849   0.02     1  1.01
         zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu           1   1.00       1    8927    0.22    10367     0.26     4544     0.11     7935     0.20
tau          1   1.01       1    7095    0.18      366     0.01     2258     0.06     6209     0.16
theta[1]     1   1.00       1    4159    0.10    12099     0.30     5832     0.15     7815     0.20
theta[2]     1   1.00       1    7250    0.18    16195     0.40     6006     0.15     7676     0.19
theta[3]     1   1.00       1    8040    0.20    13861     0.35     4841     0.12     7523     0.19
theta[4]     1   1.00       1    7715    0.19    15686     0.39     4902     0.12     7975     0.20
theta[5]     1   1.00       1    9201    0.23    14542     0.36     4300     0.11     8048     0.20
theta[6]     1   1.00       1    9238    0.23    14927     0.37     5131     0.13     7877     0.20
theta[7]     1   1.00       1    6526    0.16    13519     0.34     5464     0.14     7855     0.20
theta[8]     1   1.00       1    6593    0.16    14080     0.35     5677     0.14     7474     0.19
lp__         1   1.01       1    1489    0.04      491     0.01     2087     0.05     4927     0.12

We still get a large number of divergent transitions, so it is clear that the results cannot be trusted even if all other diagnostics looked good. Still, it is worth looking at additional diagnostics to better understand what is happening.

Some rank-normalized split-Rhats are still larger than \(1.01\). The bulk-ESS for tau and lp__ is around 800, which corresponds to a low relative efficiency of about \(2\%\) but is above our recommendation of ESS > 400. In such cases it is useful to also look at the local efficiency estimates (and the large number of divergences is already a clear indication of problems).

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.

plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)

We see that the sampler has difficulties in exploring small tau values. As ESS < 400 for small-probability intervals at small tau values, we may suspect that we miss a substantial amount of posterior mass and obtain biased estimates.

We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)

Several quantile estimates have ESS < 400, which suggests convergence problems and possibly significant bias.

Let's see how the bulk-ESS and tail-ESS change when we use more and more draws.

plot_change_ess(fit = fit_cp2, par = "tau")
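
For reference, a rough sketch of the idea behind plot_change_ess, assuming rstan >= 2.19, which exports ess_bulk() and ess_tail(): recompute bulk- and tail-ESS from the first n draws of each chain for increasing n. If the chains were mixing well, both curves would grow roughly linearly in n.

sims <- as.array(fit_cp2)[, , "tau"]     # iterations x chains
ns <- seq(1000, nrow(sims), by = 1000)   # increasing numbers of draws
ess <- sapply(ns, function(n)
  c(bulk = rstan::ess_bulk(sims[1:n, ]),
    tail = rstan::ess_tail(sims[1:n, ])))
matplot(ns, t(ess), type = "l", lty = 1,
        xlab = "draws per chain", ylab = "ESS")
legend("topleft", c("bulk-ESS", "tail-ESS"), lty = 1, col = 1:2)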