Load packages
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(GGally)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED <- 87  # seed for reproducibility
This notebook was inspired by Andrew Tyre’s blog post Does model averaging make sense?. Tyre discusses problems in current statistical practices in ecology, focusing on multicollinearity, model averaging, and measuring the relative importance of variables. Tyre’s post comments on the paper Model averaging and muddled multimodel inferences. In his blog post he uses maximum likelihood and AIC_c. Here we provide a Bayesian approach for handling multicollinearity, model averaging, and measuring the relative importance of variables, using the packages rstanarm, bayesplot, loo, and projpred. We demonstrate the benefits of Bayesian posterior analysis (Gelman et al., 2013) and the projection predictive approach (Piironen and Vehtari, 2017; Piironen, Paasiniemi and Vehtari, 2020).
We generate the data used previously to illustrate multicollinearity problems.
# all this data generation is from Cade 2015
# doesn't matter what this is -- if you use a different number your results will be different from mine.
set.seed(SEED)
df <- tibble(
  pos.tot = runif(200, min = 0.8, max = 1.0),
  urban.tot = pmin(runif(200, min = 0.0, max = 0.02), 1.0 - pos.tot),
  neg.tot = (1.0 - pmin(pos.tot + urban.tot, 1)),
  x1 = pmax(pos.tot - runif(200, min = 0.05, max = 0.30), 0),
  x3 = pmax(neg.tot - runif(200, min = 0.0, max = 0.10), 0),
  x2 = pmax(pos.tot - x1 - x3/2, 0),
  x4 = pmax(1 - x1 - x2 - x3 - urban.tot, 0))
# true model and 200 Poisson observations
mean.y <- exp(-5.8 + 6.3*df$x1 + 15.2*df$x2)
df$y <- rpois(200,mean.y)
ggpairs(df,diag=list(continuous="barDiag"))
Tyre: “So there is a near perfect negative correlation between the things sage grouse like and the things they don’t like, although it gets less bad when considering the individual covariates.”
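The strength of these correlations can also be checked numerically from the generated data (a quick check; output not shown):
# correlation matrix of the composition totals and the covariates
round(cor(df[, c("pos.tot", "neg.tot", "x1", "x2", "x3", "x4")]), 2)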
From this point onwards we switch to a Bayesian approach. The rstanarm package provides the stan_glm function, which accepts the same arguments as glm but performs full Bayesian inference using Stan (mc-stan.org). By default a weakly informative Gaussian prior is used for the weights.
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), seed=SEED)
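The priors that were actually used can be inspected with the prior_summary function, as the summary output below also notes (output not shown):
# show the (weakly informative) default priors used in the fit
prior_summary(fitg)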
Let’s look at the summary:
summary(fitg)
Model Info:
 function:     stan_glm
 family:       poisson [log]
 formula:      y ~ x1 + x2 + x3 + x4
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 200
 predictors:   5

Estimates:
              mean   sd    10%   50%   90%
(Intercept)  2.5    5.7  -4.9   2.5    9.7
x1          -2.0    5.7  -9.2  -2.0    5.5
x2           6.5    5.7  -0.8   6.4   13.9
x3          -8.9    5.7 -16.2  -8.9   -1.5
x4          -8.8    5.9 -16.4  -8.7   -1.0

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 4.3    0.2  4.1   4.3   4.6

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.2  1.0   785
x1            0.2  1.0   786
x2            0.2  1.0   796
x3            0.2  1.0   857
x4            0.2  1.0   779
mean_PPD      0.0  1.0  2660
log-posterior 0.0  1.0  1237

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We didn’t get any divergences, all Rhat values are less than 1.1, and the n_eff values are adequate (see, e.g., RStan workflow). However, when we know that the covariates are correlated, we can get even better sampling performance by using a QR decomposition (see The QR Decomposition For Regression Models).
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), QR=TRUE, seed=SEED)
Let’s look at the summary and plot:
summary(fitg)
Model Info:
 function:     stan_glm
 family:       poisson [log]
 formula:      y ~ x1 + x2 + x3 + x4
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 200
 predictors:   5

Estimates:
              mean   sd    10%   50%   90%
(Intercept)  2.5    6.1  -5.4   2.6   10.2
x1          -2.0    6.1  -9.8  -2.1    6.0
x2           6.5    6.1  -1.4   6.4   14.6
x3          -8.9    6.1 -16.5  -8.9   -1.0
x4          -8.7    6.3 -16.9  -8.8   -0.6

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 4.3    0.2  4.1   4.3   4.6

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.1  1.0  4112
x1            0.1  1.0  4114
x2            0.1  1.0  4116
x3            0.1  1.0  4259
x4            0.1  1.0  3946
mean_PPD      0.0  1.0  4127
log-posterior 0.0  1.0  1742

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Use of the QR decomposition greatly improved the sampling efficiency, and we continue with this model.
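If we want these diagnostics in code rather than by reading the printout, they can be extracted from the summary object. A minimal sketch, assuming the column names used by summary.stanreg:
# extract Monte Carlo diagnostics from the summary matrix
# (column names as printed in the summary output above)
round(summary(fitg)[, c("mcse", "n_eff", "Rhat")], 2)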
mcmc_areas(as.matrix(fitg),prob_outer = .99)
All 95% marginal posterior intervals overlap 0, and it seems we have the same collinearity problem as with the maximum likelihood estimates.
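We can confirm this numerically (output not shown):
# central 95% marginal posterior intervals for the coefficients
round(posterior_interval(fitg, prob = 0.95), 1)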
Looking at the pairwise posteriors, we can see high correlations:
mcmc_pairs(as.matrix(fitg),pars = c("x1","x2","x3","x4"))
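The posterior correlations can also be computed directly from the draws, for example for x1 and x2 (output not shown):
# posterior draws as a matrix, one column per parameter
draws <- as.matrix(fitg)
# posterior correlation between the x1 and x2 coefficients
cor(draws[, "x1"], draws[, "x2"])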
If we look more carefully at one of the subplots, we see that although the marginal posterior intervals overlap 0, the joint posterior is not overlapping 0.
mcmc_scatter(as.matrix(fitg), pars = c("x1", "x2"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)
Based on the joint distributions, all the variables would be relevant. To make predictions we don’t need variable selection; we just integrate over the uncertainty (a kind of continuous model averaging).
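In rstanarm this integration is what posterior_predict provides: each posterior draw of the weights produces its own prediction, and only the predictions are averaged. A minimal sketch:
# posterior predictive draws: one row per posterior draw of the weights
preds <- posterior_predict(fitg)
# averaging the predictions (not the weights) integrates over the uncertainty
pred_means <- colMeans(preds)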
With even more variables, some relevant and some irrelevant, it would be difficult to analyse the joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model:
fitg0 <- stan_glm(y ~ 1, data = df, na.action = na.fail, family=poisson(), seed=SEED)
(loog <- loo(fitg))
Computed from 4000 by 200 log-likelihood matrix
         Estimate   SE
elpd_loo   -383.3 12.1
p_loo         5.3  0.7
looic       766.6 24.2
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
(loog0 <- loo(fitg0))
Computed from 4000 by 200 log-likelihood matrix
         Estimate   SE
elpd_loo   -714.5 43.8
p_loo         4.9  0.8
looic      1428.9 87.5
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
loo_compare(loog0, loog)
      elpd_diff se_diff
fitg     0.0       0.0
fitg0 -331.2      42.7
Based on cross-validation, the covariates together contain significant information that improves the predictions.
We might want to choose a subset of the variables either 1) because we don’t want to observe all of them in the future (e.g. due to measurement cost), or 2) because we want the most relevant variables, which we define here as the minimal set of variables that provides predictions similar to the full model.
Tyre used AIC_c to estimate model performance. In a Bayesian setting we could use Bayesian cross-validation or WAIC, but we don’t recommend them for variable selection, as discussed by Piironen and Vehtari (2017). The reason for not using Bayesian CV or WAIC is that the selection process uses the data twice, and with a large number of variable combinations the selection process overfits and can produce really bad models. Moreover, using the usual posterior inference given the selected variables ignores that the selection is conditional on the selection process, and simply setting some variables to 0 ignores the uncertainty about their relevance.
Piironen and Vehtari (2017) also show that the projection predictive approach can be used for model reduction, that is, for choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem of how to do inference after the selection: the solution is to project the full model posterior onto the restricted subspace. See Piironen, Paasiniemi and Vehtari (2020) for more.
We perform the projection predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach (Vehtari, Gelman and Gabry, 2017) is used to choose the model size.
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO')
We can now look at the estimated predictive performance of smaller models compared to the full model.
plot(fitg_cv, stats = c('elpd', 'rmse'))
And we get a LOO-based recommendation for the model size to choose:
(nsel <- suggest_size(fitg_cv, alpha=0.1))
[1] 2
(vsel <- solution_terms(fitg_cv)[1:nsel])
[1] "x2" "x1"
We see that 2 variables are enough to get the same predictive accuracy as with all 4 variables.
Next we form the projected posterior for the chosen model.
# project the full posterior onto the chosen submodel
projg <- project(fitg_cv, nterms = nsel, ndraws = 4000)
projdraws <- as.matrix(projg)
round(colMeans(projdraws),1)
Intercept        x2        x1
     -6.1      15.2       6.7
round(posterior_interval(projdraws),1)
            5%   95%
Intercept -7.0  -5.3
x2        14.1  16.3
x1         5.8   7.7
This looks good, as the true values are intercept = -5.8, x2 = 15.2, x1 = 6.3.
mcmc_areas(projdraws, pars=c("Intercept",vsel))
Even though we started with a model whose posterior was difficult to interpret due to collinearity, the projected posterior closely matches the true values. The necessary information was in the full model, and with the projection we were able to form the projected posterior that we should use if x3 and x4 are set to 0.
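If the reduced model is then used for prediction, the projected posterior can be pushed through to the predictive distribution; a minimal sketch, assuming projpred’s proj_predict helper (output not shown):
# posterior predictive draws from the projected submodel,
# here illustrated on the observed data (only x1 and x2 are used)
predp <- proj_predict(projg, newdata = df)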
Back to Tyre’s question, “Does model averaging make sense?”. If we are interested just in good predictions, we can do continuous model averaging by using suitable priors and integrating over the posterior. If we are interested in predictions, we don’t first average the weights (i.e. take the posterior mean) but instead use all the weight values to compute the predictions and then average the predictions. All this is automatic in the Bayesian framework.
Tyre also commented on the problems of measuring variable importance. The projection predictive approach above is derived using decision theory and is very helpful for measuring relevancy and choosing relevant variables. Tyre did not comment on inference after selection, although it is also a known problem in variable selection. The projection predictive approach above solves that problem, too.
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.17.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GGally_2.1.2 fivethirtyeight_0.6.2 projpred_2.0.2 bayesplot_1.8.1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4
[9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5
[13] tidyverse_1.3.1 loo_2.4.1 rstanarm_2.21.1 Rcpp_1.0.8
[17] rmarkdown_2.11 knitr_1.37
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.4.1 plyr_1.8.6 igraph_1.2.11
[5] splines_4.1.2 crosstalk_1.2.0 usethis_2.1.5 rstantools_2.1.1
[9] inline_0.3.19 digest_0.6.29 htmltools_0.5.2 rsconnect_0.8.25
[13] fansi_1.0.2 magrittr_2.0.2 checkmate_2.0.0 memoise_2.0.1
[17] tzdb_0.2.0 remotes_2.4.2 modelr_0.1.8 RcppParallel_5.1.5
[21] matrixStats_0.61.0 xts_0.12.1 prettyunits_1.1.1 colorspace_2.0-2
[25] rvest_1.0.2 haven_2.4.3 xfun_0.29 callr_3.7.0
[29] crayon_1.4.2 jsonlite_1.7.3 lme4_1.1-28 survival_3.2-13
[33] zoo_1.8-9 glue_1.6.1 gtable_0.3.0 distributional_0.3.0
[37] pkgbuild_1.3.1 rstan_2.21.3 abind_1.4-5 scales_1.1.1
[41] DBI_1.1.2 miniUI_0.1.1.1 progress_1.2.2 xtable_1.8-4
[45] stats4_4.1.2 StanHeaders_2.21.0-7 DT_0.20 htmlwidgets_1.5.4
[49] httr_1.4.2 threejs_0.3.3 RColorBrewer_1.1-2 posterior_1.2.0
[53] ellipsis_0.3.2 reshape_0.8.8 pkgconfig_2.0.3 farver_2.1.0
[57] sass_0.4.0 dbplyr_2.1.1 utf8_1.2.2 labeling_0.4.2
[61] tidyselect_1.1.1 rlang_1.0.1 reshape2_1.4.4 later_1.3.0
[65] munsell_0.5.0 cellranger_1.1.0 tools_4.1.2 cachem_1.0.6
[69] cli_3.1.1 generics_0.1.2 devtools_2.4.3 broom_0.7.12
[73] ggridges_0.5.3 evaluate_0.14 fastmap_1.1.0 yaml_2.2.2
[77] processx_3.5.2 fs_1.5.2 nlme_3.1-155 mime_0.12
[81] xml2_1.3.3 brio_1.1.3 compiler_4.1.2 shinythemes_1.2.0
[85] rstudioapi_0.13 gamm4_0.2-6 testthat_3.1.2 reprex_2.0.1
[89] bslib_0.3.1 stringi_1.7.6 highr_0.9 ps_1.6.0
[93] desc_1.4.0 lattice_0.20-45 Matrix_1.4-0 nloptr_1.2.2.3
[97] markdown_1.1 shinyjs_2.1.0 tensorA_0.36.2 vctrs_0.3.8
[101] pillar_1.7.0 lifecycle_1.0.1 jquerylib_0.1.4 httpuv_1.6.5
[105] R6_2.5.1 promises_1.2.0.1 gridExtra_2.3 sessioninfo_1.2.2
[109] codetools_0.2-18 boot_1.3-28 colourpicker_1.1.1 MASS_7.3-55
[113] gtools_3.9.2 assertthat_0.2.1 pkgload_1.2.4 rprojroot_2.0.2
[117] withr_2.4.3 shinystan_2.5.0 mgcv_1.8-38 parallel_4.1.2
[121] hms_1.1.1 grid_4.1.2 minqa_1.2.4 shiny_1.7.1
[125] lubridate_1.8.0 base64enc_0.1-3 dygraphs_1.1.1.6