Demonstration of \(K\)-fold cross-validation using simulated data. See Chapter 11 in Regression and Other Stories.
library("MASS") # needed for mvrnorm()
library("rstanarm")
library("loo")
Set the random seed for reproducibility
SEED <- 1754
Simulate a \(60\times 30\) matrix representing 30 predictors that are random but not independent; we draw them from a multivariate normal distribution with pairwise correlations 0.8:
set.seed(SEED)
n <- 60
k <- 30
rho <- 0.8
Sigma <- rho*array(1, c(k,k)) + (1-rho)*diag(k)
X <- mvrnorm(n, rep(0,k), Sigma)
b <- c(c(-1, 1, 2), rep(0,k-3))
y <- X %*% b + rnorm(n)*2
fake <- data.frame(X, y)
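As a quick sanity check (not part of the original demo), the `Sigma` constructed above is the equicorrelation matrix: unit variances on the diagonal and correlation `rho` everywhere else:

```r
# Rebuild Sigma as above and inspect its entries
rho <- 0.8
k <- 30
Sigma <- rho * array(1, c(k, k)) + (1 - rho) * diag(k)
Sigma[1, 1]  # 1 (variance of each predictor)
Sigma[1, 2]  # 0.8 (correlation between any two predictors)
```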
fit_1 <- stan_glm(y ~ ., prior=normal(0, 10, autoscale=FALSE),
data=fake, seed=SEED, refresh=0)
(loo_1 <- loo(fit_1))
Computed from 4000 by 60 log-likelihood matrix
Estimate SE
elpd_loo -153.8 5.4
p_loo 28.7 3.6
looic 307.5 10.7
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 16 26.7% 829
(0.5, 0.7] (ok) 29 48.3% 169
(0.7, 1] (bad) 15 25.0% 21
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
In this case, Pareto smoothed importance sampling LOO (PSIS-LOO) fails, but the diagnostic recognizes this with many high Pareto \(k\) values. We can instead run the slower but more robust K-fold-CV:
kfold_1 <- kfold(fit_1)
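`kfold()` partitions the data into K folds (10 by default), refits the model K times leaving one fold out, and evaluates predictions on the held-out fold. The mechanics can be sketched in base R with `lm()` as a stand-in for the Bayesian model (illustrative only, and with mean squared error in place of expected log predictive density):

```r
# Hand-rolled 10-fold CV on simulated data, lm() standing in for stan_glm()
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(d)))  # random fold assignment
mse <- numeric(K)
for (i in 1:K) {
  fit <- lm(y ~ x, data = d[folds != i, ])          # fit on K-1 folds
  pred <- predict(fit, newdata = d[folds == i, ])   # predict held-out fold
  mse[i] <- mean((d$y[folds == i] - pred)^2)
}
mean(mse)  # cross-validated estimate of out-of-sample error
```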
The regularized horseshoe prior hs() is weakly informative, stating that it is likely that only a small number of predictors are relevant, but we don't know which ones.
k0 <- 2 # prior guess for the number of relevant variables
tau0 <- k0/(k-k0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0, slab_scale=3, slab_df=7)
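With the values used above (a prior guess of 2 relevant variables out of 30, and 60 observations), the global scale works out to a small number, reflecting the expectation that most coefficients are near zero:

```r
# Evaluate the global-scale guess numerically (same values as above)
n <- 60
k <- 30
k0 <- 2
tau0 <- k0 / (k - k0) * 1 / sqrt(n)
round(tau0, 4)  # 0.0092
```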
fit_2 <- stan_glm(y ~ ., prior=hs_prior, data=fake, seed=SEED,
control=list(adapt_delta=0.995), refresh=200)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 4.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.45 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 5.98948 seconds (Warm-up)
Chain 1: 5.81409 seconds (Sampling)
Chain 1: 11.8036 seconds (Total)
Chain 1:
(analogous progress output for chains 2-4 omitted)
(loo_2 <- loo(fit_2))
Computed from 4000 by 60 log-likelihood matrix
Estimate SE
elpd_loo -143.3 7.9
p_loo 10.8 2.9
looic 286.5 15.9
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 55 91.7% 390
(0.5, 0.7] (ok) 4 6.7% 225
(0.7, 1] (bad) 1 1.7% 60
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
PSIS-LOO performs better now, but there is still one bad diagnostic value, so we again run the slower but more robust K-fold-CV:
kfold_2 <- kfold(fit_2)
loo_compare(loo_1,loo_2)
elpd_diff se_diff
fit_2 0.0 0.0
fit_1 -10.5 5.2
loo_compare(kfold_1,kfold_2)
elpd_diff se_diff
fit_2 0.0 0.0
fit_1 -17.7 5.6
As PSIS-LOO fails for the first model, the PSIS-LOO comparison underestimates the difference between the models. The Pareto \(k\) diagnostic correctly identified the problem, and the more robust K-fold-CV shows that by using a better prior we can get better predictions.
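As a rough check of the K-fold comparison above, an elpd difference is considered clearly nonzero when it is large relative to its standard error; here the ratio is well above 2:

```r
# elpd difference and its SE from the K-fold comparison above
elpd_diff <- -17.7
se_diff <- 5.6
abs(elpd_diff) / se_diff  # about 3.2, so the difference is clearly nonzero
```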