Setup

Load packages

library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(GGally)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED=87

1 Introduction

This notebook was inspired by Andrew Tyre’s blog post Does model averaging make sense?. Tyre discusses problems in current statistical practices in ecology, focusing in multi-collinearity, model averaging and measuring the relative importance of variables. Tyre’s post is commenting a paper Model averaging and muddled multimodel inferences In his blog post he uses maximum likelihood and AIC_c. Here we provide a Bayesian approach for handling multicollinearity, model averaging and measuring relative importance of variables using packages rstanarm, bayesplot, loo and projpred. We demonstrate the benefits of Bayesian posterior analysis (Gelman et al., 2013) and projection predictive approach (Piironen and Vehtari, 2017; Piironen, Paasiniemi and Vehtari, 2020).

2 Data

We generate the data used previously to illustrate multi-collinearity problems.

# all this data generation is from Cade 2015
# doesn't matter what this is -- if you use a different number your results will be different from mine.
set.seed(SEED)
df <- tibble(
  pos.tot = runif(200,min=0.8,max=1.0),
  urban.tot = pmin(runif(200,min=0.0,max=0.02),1.0 - pos.tot),
  neg.tot = (1.0 - pmin(pos.tot + urban.tot,1)),
  x1= pmax(pos.tot - runif(200,min=0.05,max=0.30),0),
  x3= pmax(neg.tot - runif(200,min=0.0,max=0.10),0),
  x2= pmax(pos.tot - x1 - x3/2,0),
  x4= pmax(1 - x1 - x2 - x3 - urban.tot,0))
# true model and 200 Poisson observations
mean.y <- exp(-5.8 + 6.3*df$x1 + 15.2*df$x2)
df$y <- rpois(200,mean.y)

ggpairs(df,diag=list(continuous="barDiag"))

Tyre: “So there is a near perfect negative correlation between the things sage grouse like and the things they don’t like, although it gets less bad when considering the individual covariates.”

3 Bayesian inference

From this point onwards we switch to Bayesian approach. The rstanarm package provides stan_glm function which accepts same arguments as glm, but makes full Bayesian inference using Stan (mc-stan.org). By default a weakly informative Gaussian prior is used for weights.

fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), seed=SEED)

Let’s look at the summary:

summary(fitg)


Model Info:
 function:     stan_glm
 family:       poisson [log]
 formula:      y ~ x1 + x2 + x3 + x4
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 200
 predictors:   5

Estimates:
              mean   sd    10%   50%   90%
(Intercept)   2.5    5.7  -4.9   2.5   9.7
x1           -2.0    5.7  -9.2  -2.0   5.5
x2            6.5    5.7  -0.8   6.4  13.9
x3           -8.9    5.7 -16.2  -8.9  -1.5
x4           -8.8    5.9 -16.4  -8.7  -1.0

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 4.3    0.2  4.1   4.3   4.6  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.2  1.0   785 
x1            0.2  1.0   786 
x2            0.2  1.0   796 
x3            0.2  1.0   857 
x4            0.2  1.0   779 
mean_PPD      0.0  1.0  2660 
log-posterior 0.0  1.0  1237 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

We didn’t get divergences, Rhat’s are less than 1.1 and n_eff’s are useful (see, e.g., RStan workflow). However, when we know that covariats are correlating we can get even better performance by using QR decomposition (see, The QR Decomposition For Regression Models).

fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), QR=TRUE, seed=SEED)

Let’s look at the summary and plot:

summary(fitg)


Model Info:
 function:     stan_glm
 family:       poisson [log]
 formula:      y ~ x1 + x2 + x3 + x4
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 200
 predictors:   5

Estimates:
              mean   sd    10%   50%   90%
(Intercept)   2.5    6.1  -5.4   2.6  10.2
x1           -2.0    6.1  -9.8  -2.1   6.0
x2            6.5    6.1  -1.4   6.4  14.6
x3           -8.9    6.1 -16.5  -8.9  -1.0
x4           -8.7    6.3 -16.9  -8.8  -0.6

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 4.3    0.2  4.1   4.3   4.6  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.1  1.0  4112 
x1            0.1  1.0  4114 
x2            0.1  1.0  4116 
x3            0.1  1.0  4259 
x4            0.1  1.0  3946 
mean_PPD      0.0  1.0  4127 
log-posterior 0.0  1.0  1742 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Use of QR decomposition greatly improved sampling efficiency and we continue with this model.

mcmc_areas(as.matrix(fitg),prob_outer = .99)

All 95% posterior intervals are overlapping 0 and it seems we have the same collinearity problem as with maximum likelihood estimates.

Looking at the pairwise posteriors we can see high correlations

mcmc_pairs(as.matrix(fitg),pars = c("x1","x2","x3","x4"))

If look more carefully on of the subplots, we see that although marginal posterior intervals overlap 0, the joint posterior is not overlapping 0.

mcmc_scatter(as.matrix(fitg), pars = c("x1", "x2"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)

Based on the joint distributions all the variables would be relevant. To make predictions we don’t need to make variable selection, we just integrate over the uncertainty (kind of continuous model averaging).

In case of even more variables with some being relevant and some irrelevant, it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,

fitg0 <- stan_glm(y ~ 1, data = df, na.action = na.fail, family=poisson(), seed=SEED)

(loog <- loo(fitg))


Computed from 4000 by 200 log-likelihood matrix

         Estimate   SE
elpd_loo   -383.3 12.1
p_loo         5.3  0.7
looic       766.6 24.2
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

(loog0 <- loo(fitg0))


Computed from 4000 by 200 log-likelihood matrix

         Estimate   SE
elpd_loo   -714.5 43.8
p_loo         4.9  0.8
looic      1428.9 87.5
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

loo_compare(loog0, loog)

      elpd_diff se_diff
fitg     0.0       0.0 
fitg0 -331.2      42.7

Based on cross-validation covariates together contain significant information to improve predictions.

4 Variable selection

We might want to choose some variables 1) because we don’t want to observe all the variables in the future (e.g. due to the measurement cost), or 2) we want to most relevant variables which we define here as a minimal set of variables which can provide similar predictions to the full model.

Tyre used AIC_c to estimate the model performance. In Bayesian setting we could use Bayesian cross-validation or WAIC, but we don’t recommend thhem for variable selection as discussed by Piironen and Vehtari (2017). The reason for not using Bayesian CV or WAIC is that the selection process uses the data twice, and in case of large number variable combinations the selection process overfits and can produce really bad models. Using the usual posterior inference given the selected variables ignores that the selected variables are conditonal on the selection process and simply setting some variables to 0 ignores the uncertainty related to their relevance.

Piironen and Vehtari (2017) also show that a projection predictive approach can be used to make a model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem how to do inference after the selection. The solution is to project the full model posterior to the restricted subspace. See more by Piironen, Paasiniemi and Vehtari (2020).

We make the projective predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach (Vehtari, Gelman and Gabry, 2017) is used to choose the model size.

fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO')

fitg_cv$vind

NULL

We can now look at the estimated predictive performance of smaller models compared to the full model.

plot(fitg_cv, stats = c('elpd', 'rmse'))

And we get a LOO based recommendation for the model size to choose

(nsel <- suggest_size(fitg_cv, alpha=0.1))

[1] 2

(vsel <- solution_terms(fitg_cv)[1:nsel])

[1] "x2" "x1"

We see that 1 variables is enough to get the same predictive accuracy as with all 4 variables.

Next we form the projected posterior for the chosen model.

projg <- project(fitg_cv, nv = nsel, ns = 4000)
projdraws <- as.matrix(projg)
round(colMeans(projdraws),1)

Intercept        x2        x1 
     -6.1      15.2       6.7

round(posterior_interval(projdraws),1)

            5%  95%
Intercept -7.0 -5.3
x2        14.1 16.3
x1         5.8  7.7

This looks good as the true values are intercept=-5.8, x2=15.2, x1=6.3.

mcmc_areas(projdraws, pars=c("Intercept",vsel))

Even if we started with a model which had due to a co-linearity difficult to interpret posterior, the projected posterior is able to match closely the true values. The necessary information was in the full model and with the projection we were able to form the projected posterior which we should use if x3 and x4 are set to 0.

Back to the Tyre’s question “Does model averaging make sense?”. If we are interested just in good predictions we can do continuous model averaging by using suitable priors and by integrating over the posterior. If we are intersted in predcitions, then we don’t first average weights (ie posterior mean), but use all weight values to compute predictions and do the averaging of the predictions. All this is automatic in Bayesian framework.

Tyre also commented on the problems of measuring variable importance. The projection predictive approach above is derived using decision theory and is very helpful for measuring relevancy and choosing relevant variables. Tyre did not comment about the inference after selection although it is also known problem in variable selection. The projection predictive approach above solves that problem, too.

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2013) Bayesian data analysis, third edition. CRC Press.

Piironen, J., Paasiniemi, M. and Vehtari, A. (2020) ‘Projective inference in high-dimensional problems: Prediction and feature selection’, Electronic Journal of Statistics, 14(1), pp. 2155–2197.

Piironen, J. and Vehtari, A. (2017) ‘Comparison of Bayesian predictive methods for model selection’, Statistics and Computing, 27(3), pp. 711–735. doi: 10.1007/s11222-016-9649-y.

Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing, 27(5), pp. 1413–1432. doi: 10.1007/s11222-016-9696-4.

Licenses

Code for generating the data by Andrew Tyre

Original Computing Environment

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.17.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fi_FI.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GGally_2.1.2          fivethirtyeight_0.6.2 projpred_2.0.2        bayesplot_1.8.1      
 [5] forcats_0.5.1         stringr_1.4.0         dplyr_1.0.8           purrr_0.3.4          
 [9] readr_2.1.2           tidyr_1.2.0           tibble_3.1.6          ggplot2_3.3.5        
[13] tidyverse_1.3.1       loo_2.4.1             rstanarm_2.21.1       Rcpp_1.0.8           
[17] rmarkdown_2.11        knitr_1.37           

loaded via a namespace (and not attached):
  [1] readxl_1.3.1         backports_1.4.1      plyr_1.8.6           igraph_1.2.11       
  [5] splines_4.1.2        crosstalk_1.2.0      usethis_2.1.5        rstantools_2.1.1    
  [9] inline_0.3.19        digest_0.6.29        htmltools_0.5.2      rsconnect_0.8.25    
 [13] fansi_1.0.2          magrittr_2.0.2       checkmate_2.0.0      memoise_2.0.1       
 [17] tzdb_0.2.0           remotes_2.4.2        modelr_0.1.8         RcppParallel_5.1.5  
 [21] matrixStats_0.61.0   xts_0.12.1           prettyunits_1.1.1    colorspace_2.0-2    
 [25] rvest_1.0.2          haven_2.4.3          xfun_0.29            callr_3.7.0         
 [29] crayon_1.4.2         jsonlite_1.7.3       lme4_1.1-28          survival_3.2-13     
 [33] zoo_1.8-9            glue_1.6.1           gtable_0.3.0         distributional_0.3.0
 [37] pkgbuild_1.3.1       rstan_2.21.3         abind_1.4-5          scales_1.1.1        
 [41] DBI_1.1.2            miniUI_0.1.1.1       progress_1.2.2       xtable_1.8-4        
 [45] stats4_4.1.2         StanHeaders_2.21.0-7 DT_0.20              htmlwidgets_1.5.4   
 [49] httr_1.4.2           threejs_0.3.3        RColorBrewer_1.1-2   posterior_1.2.0     
 [53] ellipsis_0.3.2       reshape_0.8.8        pkgconfig_2.0.3      farver_2.1.0        
 [57] sass_0.4.0           dbplyr_2.1.1         utf8_1.2.2           labeling_0.4.2      
 [61] tidyselect_1.1.1     rlang_1.0.1          reshape2_1.4.4       later_1.3.0         
 [65] munsell_0.5.0        cellranger_1.1.0     tools_4.1.2          cachem_1.0.6        
 [69] cli_3.1.1            generics_0.1.2       devtools_2.4.3       broom_0.7.12        
 [73] ggridges_0.5.3       evaluate_0.14        fastmap_1.1.0        yaml_2.2.2          
 [77] processx_3.5.2       fs_1.5.2             nlme_3.1-155         mime_0.12           
 [81] xml2_1.3.3           brio_1.1.3           compiler_4.1.2       shinythemes_1.2.0   
 [85] rstudioapi_0.13      gamm4_0.2-6          testthat_3.1.2       reprex_2.0.1        
 [89] bslib_0.3.1          stringi_1.7.6        highr_0.9            ps_1.6.0            
 [93] desc_1.4.0           lattice_0.20-45      Matrix_1.4-0         nloptr_1.2.2.3      
 [97] markdown_1.1         shinyjs_2.1.0        tensorA_0.36.2       vctrs_0.3.8         
[101] pillar_1.7.0         lifecycle_1.0.1      jquerylib_0.1.4      httpuv_1.6.5        
[105] R6_2.5.1             promises_1.2.0.1     gridExtra_2.3        sessioninfo_1.2.2   
[109] codetools_0.2-18     boot_1.3-28          colourpicker_1.1.1   MASS_7.3-55         
[113] gtools_3.9.2         assertthat_0.2.1     pkgload_1.2.4        rprojroot_2.0.2     
[117] withr_2.4.3          shinystan_2.5.0      mgcv_1.8-38          parallel_4.1.2      
[121] hms_1.1.1            grid_4.1.2           minqa_1.2.4          shiny_1.7.1         
[125] lubridate_1.8.0      base64enc_0.1-3      dygraphs_1.1.1.6

Bayesian version of Does model averaging make sense?

Aki Vehtari

First version 2017-08-09. Last modified 2022-02-21.