Setup

Load packages

library(here)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(projpred)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(corrplot)
library(knitr)
SEED <- 1513306866

1 Introduction

This notebook was inspired by the article Heinze, Wallisch, and Dunkler (2018), “Variable selection – A review and recommendations for the practicing statistician”. They provide “an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof.” I agree that they provide sensible recommendations and warnings for those methods. Similar recommendations and warnings hold for information criterion and naive cross-validation based variable selection in the Bayesian framework, as demonstrated by Piironen and Vehtari (2017a).

Piironen and Vehtari (2017a) also demonstrate the superior stability of projection predictive variable selection (see especially their Figures 4 and 10). In this notebook I demonstrate the projection predictive variable selection method as presented by Piironen, Paasiniemi and Vehtari (2020) and implemented in the R package projpred. I use the same body fat data as used in Section 3.3 of the article by Heinze, Wallisch, and Dunkler (2018). The dataset with the background information is available here, but Heinze, Wallisch, and Dunkler have done some data cleaning, and I have used the same data and some bits of the code they provide in the supplementary material. There are still some strange values, like the one person with zero fat percentage, but I didn’t do additional cleaning.

This notebook was initially created in 2018, but was later extended into the article “Using reference models in variable selection” by Pavone et al. (2020).

The excellent performance of the projection predictive variable selection comes from the following parts:

  1. Bayesian inference using priors and integration over all the uncertainties makes it easy to get good predictive performance with all variables included in the model (Piironen and Vehtari, 2017b, 2018; Piironen, Paasiniemi and Vehtari, 2020).
  2. The projection of the information from the full model to a smaller model is able to include information and uncertainty from the left-out variables, while conditioning the smaller model directly on the data would ignore the left-out variables (Piironen, Paasiniemi and Vehtari, 2020; Pavone et al., 2020); a sketch of the projection is given after this list.
  3. During the search through the model space, comparing the predictive distributions of the projected smaller models to the predictive distribution of the full model greatly reduces the variance in the model comparisons (Piironen and Vehtari, 2017a).
  4. Even with greatly reduced variance in the model comparisons, the selection process slightly overfits to the data, but we can cross-validate this effect using fast Pareto smoothed importance sampling leave-one-out cross-validation (Vehtari, Gelman and Gabry, 2017; Vehtari et al., 2019).
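As a rough sketch of the projection in point 2 (notation loosely following Piironen, Paasiniemi and Vehtari (2020)): given the reference model \(M_*\) fitted to data \(D\), the parameters \(\theta_\perp\) of a submodel \(M_\perp\) are found by minimizing the Kullback–Leibler divergence from the reference model’s predictive distribution to the submodel’s,

\[ \theta_\perp = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\left( p(\tilde{y}_i \mid D, M_*) \,\middle\|\, q(\tilde{y}_i \mid \theta, M_\perp) \right), \]

so that the submodel inherits the information and uncertainty in the full posterior instead of being re-fitted to the data. In practice projpred computes the projection separately for (clusters of) posterior draws from the reference model.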

The excellent performance of projection predictive variable selection compared to other Bayesian variable selection methods was presented by Piironen and Vehtari (2017a). Piironen, Paasiniemi and Vehtari (2020) present further improvements, such as improved model size selection and several options to make the approach faster for a larger number of variables or bigger data sets. Vehtari and Ojanen (2012) present the theoretical justification for projection predictive model selection and inference after selection.

Note that if the goal is only prediction, no variable selection is needed. The projection predictive variable selection can be used to learn which variables are the most useful for making predictions and to potentially reduce future measurement costs. In the bodyfat example, most of the measurements have a time cost, and there is a benefit in finding the smallest set of variables to be used for future predictions.

This notebook presents a linear regression example, but the projection predictive approach can also be used, for example, with generalized linear models (Piironen and Vehtari, 2017a; Piironen, Paasiniemi and Vehtari, 2020), Gaussian processes (Piironen and Vehtari, 2016), and generalized linear and additive multilevel models (Catalina, Bürkner and Vehtari, 2020).

2 Bodyfat data

Load the data and scale it. Heinze, Wallisch, and Dunkler (2018) used unscaled data, but we scale it for easier comparison of the effect sizes. In theory this scaling should make no detectable difference in the predictions; I also ran the analysis without scaling, and in practice there was no detectable difference.

df <- read.table(here("bodyfat.txt"), header = TRUE, sep = ";")
# scale the variables for easier comparison of effect sizes
df[,4:19] <- scale(df[,4:19])
# no-one can have 0% body fat
df <- df[df$siri>0,]
y <- df[,"siri"]
df <- as.data.frame(df)
n <- nrow(df)

Lists of the predictor and target variables, and the model formula.

pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip", 
          "thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- formula(paste("siri ~", paste(pred, collapse = " + ")))
p <- length(pred)

Plot correlation structure

corrplot(cor(df[, c(target,pred)]))

3 Regression model with regularized horseshoe prior

Fit the full Bayesian model. We use a weakly informative regularized horseshoe prior (Piironen and Vehtari, 2017b) to include the prior assumption that some of the variables might be irrelevant. The selected \(p_0\) is the prior guess for the number of relevant variables. See Piironen and Vehtari (2017b) for the uncertainty around the implied prior for the effectively non-zero coefficients.
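As a reminder of where the global scale below comes from (following Piironen and Vehtari (2017b); with the scaled data we take \(\sigma \approx 1\)):

\[ \tau_0 = \frac{p_0}{p - p_0} \frac{\sigma}{\sqrt{n}} \approx \frac{p_0}{p - p_0} \frac{1}{\sqrt{n}}, \]

which is what the code computes for tau0.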

p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=rhs_prior, QR=TRUE, 
                   seed=SEED, refresh=0)
summary(fitrhs)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      siri ~ age + weight + height + neck + chest + abdomen + hip + 
       thigh + knee + ankle + biceps + forearm + wrist
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 250
 predictors:   14

Estimates:
              mean   sd   10%   50%   90%
(Intercept) 19.1    0.3 18.8  19.1  19.5 
age          0.8    0.4  0.3   0.8   1.3 
weight       0.1    1.7 -2.0   0.1   2.2 
height      -0.9    0.5 -1.5  -0.9  -0.2 
neck        -0.7    0.5 -1.4  -0.7   0.0 
chest       -1.4    0.8 -2.4  -1.5  -0.3 
abdomen      8.9    0.9  7.7   8.9  10.0 
hip         -0.6    0.8 -1.7  -0.5   0.3 
thigh        0.5    0.7 -0.4   0.4   1.4 
knee         0.0    0.4 -0.6   0.0   0.5 
ankle        0.3    0.3 -0.1   0.3   0.6 
biceps       0.4    0.4 -0.1   0.4   1.0 
forearm      0.4    0.4  0.0   0.3   0.9 
wrist       -1.5    0.5 -2.2  -1.5  -0.9 
sigma        4.3    0.2  4.0   4.3   4.5 

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 19.2    0.4 18.7  19.2  19.7 

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  4881 
age           0.0  1.0  3894 
weight        0.0  1.0  3593 
height        0.0  1.0  3767 
neck          0.0  1.0  3950 
chest         0.0  1.0  3188 
abdomen       0.0  1.0  4154 
hip           0.0  1.0  3086 
thigh         0.0  1.0  3119 
knee          0.0  1.0  4622 
ankle         0.0  1.0  4605 
biceps        0.0  1.0  3078 
forearm       0.0  1.0  4010 
wrist         0.0  1.0  3185 
sigma         0.0  1.0  5386 
mean_PPD      0.0  1.0  4455 
log-posterior 0.2  1.0  1074 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Make a graphical posterior predictive check to verify that the model predictions make sense.

yrep <- posterior_predict(fitrhs, draws = 50)
ppc_dens_overlay(y, yrep)

The kernel density estimates for the data and the posterior predictive replicates are similar. Although the data do not have any negative \(y\) values, the kernel density estimate smooths over to the negative side. The predictive distribution of the model is not constrained to be positive, but the error is small and we continue with the variable selection.
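As a quick numeric check of how small this error is (a small addition, not part of the original analysis), we can compute the fraction of posterior predictive draws that fall below zero:

# fraction of negative posterior predictive draws; should be small
mean(yrep < 0)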

Plot the marginal posteriors of the coefficients.

mcmc_areas(as.matrix(fitrhs)[,2:14])

We can see that the posterior of the abdomen coefficient is far away from zero, but it’s not as clear which other variables should be included. weight has a wide marginal overlapping zero, which hints at a potentially relevant variable whose coefficient is correlated with others in the joint posterior.

Looking at the marginals has the problem that correlated variables may have marginal posteriors overlapping zero even though the typical set of the joint posterior does not include zero. Compare, for example, the marginals of height and weight above to their joint distribution below.

mcmc_scatter(as.matrix(fitrhs), pars = c("height", "weight")) +
    geom_vline(xintercept = 0) + geom_hline(yintercept = 0)
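A quick numeric complement to the scatter plot (again an addition, not in the original notebook) is the posterior correlation of the two coefficients:

# posterior correlation between the height and weight coefficients
cor(as.matrix(fitrhs)[, "height"], as.matrix(fitrhs)[, "weight"])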

Projection predictive variable selection is easily made with the cv_varsel function, which also computes a LOO-CV estimate of the predictive performance for the best models of each size. Heinze, Wallisch, and Dunkler (2018) “consider abdomen and height as two central IVs [independent variables] for estimating body fat proportion, and will not subject these two to variable selection.” We subject all variables to the selection.

fitrhs_cvvs <- cv_varsel(fitrhs, method = 'forward', cv_method = 'loo',
                         nloo = n, verbose = FALSE)

And the estimated predictive performance of smaller models compared to the full model.

plot(fitrhs_cvvs, stats = c('elpd', 'rmse'), deltas=FALSE)

Based on the plot, two variables and the projected posterior provide practically the same predictive performance as the full model. We can get a PSIS-LOO (Vehtari, Gelman and Gabry, 2017) based recommendation for the model size to choose.

(nsel <- suggest_size(fitrhs_cvvs, alpha=0.1))
[1] 2
(vsel <- solution_terms(fitrhs_cvvs)[1:nsel])
[1] "abdomen" "weight" 

Based on this recommendation we continue with the two variables abdomen and weight. The model selected by Heinze, Wallisch, and Dunkler (2018) had seven variables: height (fixed), abdomen (fixed), wrist, age, neck, forearm, and chest.

Form the projected posterior for the selected model.

projrhs <- project(fitrhs_cvvs, nv = nsel, ns = 4000)

Plot the marginals of the projected posterior.

mcmc_areas(as.matrix(projrhs), pars = vsel)
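For a numeric summary of the projected posterior (a small addition; as.matrix works on projection objects in this projpred version), we can look at the means of the projected draws:

# posterior means of the projected coefficients and sigma
round(colMeans(as.matrix(projrhs)), 1)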

So far we have seen that projpred selected a smaller set of variables with very similar predictive performance as the full model. Next, let’s compare the stability of the approaches. Heinze, Wallisch, and Dunkler (2018) repeated their model selection using 1000 bootstrapped datasets. The top 20 models selected have 5–9 variables, the highest selection frequency is 3.2%, and the cumulative selection frequency for the top 20 models is 29.5%. These results clearly illustrate the instability of the selection method they used.

Before looking at the corresponding bootstrap results, we can look at the stability of the selection process based on the LOO-CV selection paths computed by cv_varsel (the code to make the following plot will be included in the projpred package).

source("projpredpct.R")
rows <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
col <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
pctch <- round(fitrhs_cvvs[["pct_solution_terms_cv"]], 2)
colnames(pctch)[1] <- ".size"
pct <- get_pct_arr(pctch, 13)
col_brks <- get_col_brks()
pct$val_grp <- as.character(sapply(pct$val, function(x) sum(x >= col_brks$breaks)))
if (identical(rows, 0)) rows <- pct$var[1]
pct$sel <- (pct$.size == col) & (pct$var %in% rows)
brks <- sort(unique(as.numeric(pct$val_grp)) + 1)
ggplot(pct, aes_(x = ~.size, y = ~var)) +
    geom_tile(aes_(fill = ~val_grp, color = ~sel),
              width = 1, height = 0.9, size = 1) +
        geom_text(aes_(label = ~val, fontface = ~sel+1)) +
    coord_cartesian(expand = FALSE) +
    scale_y_discrete(limits = rev(levels(pct$var))) +
    scale_x_discrete(limits = seq(1,col)) +
    scale_color_manual(values = c("white", "black")) +
    labs(x = "Model size", y = "",
         title = "Fraction of cv-folds that select the given variable") +
    scale_fill_manual(breaks = brks, values = col_brks$pal[brks]) +
    theme_proj() +
    theme(legend.position = "none",
          axis.text.y = element_text(angle = 45))

For model sizes 1–3 the selection paths in the different LOO-CV cases are always the same: abdomen, weight, and wrist. For larger model sizes there is some small variation, but mostly the order is quite consistent.

Running stan_glm with prior=hs() and cv_varsel does not take much time when run only once, but for a notebook, running them 1000 times would take hours. The code for running the above variable selection procedure for 100 different bootstrapped datasets is as follows.

writeLines(readLines("bodyfat_bootstrap.R"))
library(here)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(projpred)
library(dplyr)

#' Bootstrap iterations for projpred for bodyfat
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
# rename weight_kg (if present) to weight to match the predictor names below
colnames(df)[colnames(df) == "weight_kg"] <- "weight"

pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip", 
          "thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))
p <- length(pred)

p0 <- 2 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)

bootnum <- 100
boot_est <-  boot_se <- matrix(0, ncol = length(pred) + 1, nrow = bootnum,
                               dimnames = list(NULL, c("(Intercept)", pred)))
boot_nvs <- matrix(0, ncol=1, nrow=bootnum)
bbn <- matrix(0, ncol=1, nrow=bootnum)
for (i in 1:bootnum) {
  set.seed(5437854+i)
  data_id <- sample(1:dim(df)[1], replace = T)
  bbn[i,] <- length(unique(data_id))
  fitb <- stan_glm(formula, data = df[data_id, ], 
                      prior=rhs_prior, QR=TRUE, seed=i, refresh=0)
  fitb_cv <- cv_varsel(fitb, method='forward', cv_method='LOO', nloo=n,
                       verbose = FALSE)
  print(nv <- suggest_size(fitb_cv,alpha=0.1))
  boot_nvs[i,] <- nv
  print(fitb_cv$varsel$vind[1:nv])
  projb <- project(fitb_cv, nv = nv, ns = 4000)
  boot_est[i, colnames(as.matrix(projb)[,-(nv+2)])] <- colMeans(as.matrix(projb)[,-(nv+2)])
}
boot_01 <- (boot_est != 0) * 1
boot_inclusion <- data.frame(projpred_incp=(apply(boot_01, 2, function(x) sum(x) / length(x) * 100)))
boot_01 <- data.frame(boot_01)

bn <- data.frame(boot_nvs) %>% group_by_all() %>% count(sort=TRUE)
bd <- boot_01 %>% group_by_at(vars(-X.Intercept.)) %>% count(sort=TRUE)
boot_inclusion <- boot_inclusion %>% tibble::rownames_to_column(var="variable") %>% filter(variable != "X.Intercept.") %>% arrange(-projpred_incp)
boot_inclusion$steplm_incp <- c(100, 28, 98, 100, 85, 63, 51, 48, 34, 43, 54, 41, 18)
boot_inclusion <- boot_inclusion %>% rename(projpred=projpred_incp, steplm=steplm_incp)
cumsum(bd$n)
for (i in 1:20) {
    print(paste(paste0(colnames(bd)[c(as.logical(bd[i,1:13]),FALSE)], collapse=", "),bd$n[i],sep=", "))
}

save(boot_01, boot_inclusion, boot_nvs, bbn, file = "bodyfat_bootstrap.RData")

In theory LOO-CV should have smaller variation than the bootstrap, and in practice we indeed see much more variation in the bootstrap results. A basic bootstrap sample includes on average only 63% of the original observations; in this case the average number of unique observations in a bootstrapped dataset is 159, while the full data has n=251. This explains the increased variability in the variable selection. Out of 100 bootstrap iterations, model size 2 was selected 32 times and model size 3 was selected 28 times.
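The 63% figure is simple arithmetic (a quick check, not part of the original analysis): the probability that a given observation appears at least once in a bootstrap sample is \(1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.632\).

nboot <- 251                              # number of observations
1 - (1 - 1/nboot)^nboot                   # inclusion probability, about 0.632
round(nboot * (1 - (1 - 1/nboot)^nboot))  # about 159 unique observations on average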

The bootstrap inclusion frequencies with projpred and with step(lm(...)) (results from Heinze, Wallisch, and Dunkler (2018)) are shown in the following table.

Bootstrap inclusion probabilities (%).
variable   projpred   steplm
abdomen         100      100
weight           58       28
wrist            45       98
height           35      100
age              10       85
neck             10       63
chest             5       51
thigh             5       48
ankle             5       34
biceps            4       43
forearm           3       54
hip               0       41
knee              0       18

Heinze, Wallisch, and Dunkler (2018) had fixed abdomen and height to always be included. projpred always selects abdomen, but height is included in only 35% of the iterations. The coefficients of weight and height are strongly correlated, as shown above, and thus it is not that surprising that projpred selects weight instead of height. The five variables most often selected by projpred in the bootstrap iterations are the same five, in the same order, as selected by the cv_varsel function with the full data. Overall projpred selects smaller models, and thus the bootstrap inclusion probabilities are smaller.

The following table shows the top 10 projpred models and their model selection frequencies over the bootstrap iterations.

projpred model selection frequencies in the bootstrap
model  variables                                   frequency
1      abdomen, weight                                    37
2      abdomen, wrist                                     10
3      abdomen, height                                    10
4      abdomen, height, wrist                              9
5      abdomen, weight, wrist                              8
6      abdomen, chest, height, wrist                       3
7      abdomen, height, neck, wrist                        2
8      abdomen, age, wrist                                 2
9      abdomen, age, height, neck, thigh, wrist            2
10     abdomen, chest                                      1

The top 10 models selected have 2–5 variables, the highest selection frequency is 37%, and the cumulative selection frequency is 84% for the top 10 models and 94% for the top 20 models. This demonstrates that projpred is much more stable.

Heinze, Wallisch, and Dunkler (2018) focused on which variables were selected and on the coefficient estimates, but they did not consider the predictive performance of the models. We can estimate the predictive performance of the selected model via cross-validation, taking into account the effect of the selection process, too. As part of the variable selection, cv_varsel also computes PSIS-LOO estimates for the full model and all submodels, taking the selection process into account. For Bayesian models we would usually report the expected log predictive density, as it assesses the goodness of the whole predictive distribution. Since we now compare results to lm, we also estimate the root mean square error (RMSE) of the mean predictions.

# RMSE of the PSIS-LOO predictive means for the reference model and the selected submodel
loormse_full <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$ref$mu)^2))
loormse_proj <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$sub[[nsel]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full,1)))
[1] "PSIS-LOO RMSE Bayesian full model:  4.3"
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj,1)))
[1] "PSIS-LOO RMSE selected projpred model:  4.7"

Since we get these cross-validation estimates using PSIS-LOO, we do not need to run K-fold-CV. But since we need to run K-fold-CV for lm + step anyway, we also ran K-fold-CV for the projpred selection process. As this takes some time, here only the code is shown and we load separately run results. The following code computes 20-fold-CV for the full Bayesian model and the projpred-selected model.

writeLines(readLines("bodyfat_kfoldcv.R"))
#' K-fold-CV for projpred for bodyfat

df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
# rename weight_kg (if present) to weight to match the predictor names below
colnames(df)[colnames(df) == "weight_kg"] <- "weight"

pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip", 
          "thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))
p <- length(pred)

p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=rhs_prior, QR=TRUE, 
                   seed=1513306866, refresh=0)

set.seed(5437854)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)

muss <- list()
vsmuss <- list()
vsnvss <- list()
vsnvss2 <- list()
for (k in 1:K) {
    message("Fitting model ", k, " out of ", K)
    omitted <- which(bin == k)
    fit_k <- update(
        object = fitrhs,
        data = df[-omitted,, drop=FALSE],
        weights = NULL,
        refresh = 0
    )
    fit_cvvs_k <- cv_varsel(fit_k, method='forward', cv_method='LOO',
                            nloo=nrow(df[-omitted,, drop=FALSE]))
    nvk <- suggest_size(fit_cvvs_k,alpha=0.1)
    vsnvss[[k]] <- nvk
    proj_k <- project(fit_cvvs_k, nv = nvk, ns = 4000)
    muss[[k]] <-
        colMeans(posterior_linpred(fit_k,
                                   newdata = df[omitted, , drop = FALSE]))
    vsmuss[[k]] <-
        colMeans(proj_linpred(proj_k, xnew = df[omitted, , drop = FALSE]))
}
mus<-unlist(muss)[order(as.integer(names(unlist(muss))))]
vsmus<-unlist(vsmuss)[order(as.integer(names(unlist(vsmuss))))]
vsnvs <- unlist(vsnvss)
rmse_full <- sqrt(mean((df$siri-mus)^2))
rmse_proj <- sqrt(mean((df$siri-vsmus)^2))
save(vsnvs, rmse_full, rmse_proj, file = "bodyfat_kfoldcv.RData")

We load and print the results.

load(file="bodyfat_kfoldcv.RData")
print(paste('20-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
[1] "20-fold-CV RMSE Bayesian full model:  4.4"
print(paste('20-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
[1] "20-fold-CV RMSE Bayesian projpred model:  4.4"

The results are very close to the PSIS-LOO results. Going from the full Bayesian model to the smaller projection predictive model, we get practically the same 20-fold-CV RMSE, which is very good performance considering we dropped from 13 covariates to 2.

Then compute 20-fold-CV for the lm model and the step-selected model (this is fast enough to include in the notebook).

set.seed(SEED)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
    message("Fitting model ", k, " out of ", K)
    omitted <- which(bin == k)
    lmfit_k <- lm(formula, data = df[-omitted,, drop=FALSE],  x=T,y=T)
    lmmuss[[k]] <- predict(lmfit_k, newdata = df[omitted, , drop = FALSE])
    sel_k <- step(lm(formula, data = df[-omitted,, drop=FALSE],  x=T,y=T), 
                direction = "backward",
                scope = list(upper = formula, 
                             lower = formula(siri~abdomen+height)),
                trace = 0)
    lmvsmuss[[k]] <- predict(sel_k, newdata = df[omitted, , drop = FALSE])
    lmvsnvss[[k]] <- length(coef(sel_k))-1
    # compute also a version without fixing abdomen and height
    selnl_k <- step(lm(formula, data = df[-omitted,, drop=FALSE],  x=T,y=T), 
                direction = "backward",
                trace = 0)
    lmvsnlmuss[[k]] <- predict(selnl_k, newdata = df[omitted, , drop = FALSE])
    lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
  }
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
[1] "20-fold-CV RMSE lm full model:  4.4"
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
[1] "20-fold-CV RMSE lm step selected model:  4.5"

We see that the simpler maximum likelihood approach provides similar 20-fold-CV RMSE as the much slower Bayesian inference with a fancy regularized horseshoe prior. The model selection process also did not overfit, and the model selected with step has similar 20-fold-CV RMSE as the projpred result. Neither of these results is surprising, as \(n=251 \gg p=13\). However, projpred has a much more stable selection process and produces much smaller models with the same accuracy.

Heinze, Wallisch, and Dunkler (2018) also write: “In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model.” I strongly disagree with this, as there are many statistical models working with more than a million candidate variables (see, e.g., Peltola, Marttinen and Vehtari (2012)). As the bodyfat dataset proved to be quite easy, in the sense that maximum likelihood performed well compared to the Bayesian approach, let’s make the dataset a bit more challenging.

We add 87 variables of normally distributed random noise, which are thus not related to bodyfat in any way. We then have a total of 100 variables.

set.seed(SEED)
noise <- array(rnorm(87*n), c(n,87))
dfr<-cbind(df,noise=noise)
formula2<-formula(paste(paste("siri~", paste(pred, collapse = "+")),
                        "+",paste(colnames(dfr[,20:106]), collapse = "+")))

Given this new dataset, do the variable selection with the full data and compute 20-fold-CV for the lm model and the step-selected model.

sel2 <- step(lm(formula2, data = dfr,  x=T,y=T),
                direction = "backward",
                scope = list(upper = formula2, 
                lower = formula(siri~abdomen+height)),
                trace = 0)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
  message("Fitting model ", k, " out of ", K)
  omitted <- which(bin == k)
  lmfit_k <- lm(formula2, data = dfr[-omitted,, drop=FALSE],  x=T,y=T)
  lmmuss[[k]] <- predict(lmfit_k, newdata = dfr[omitted, , drop = FALSE])
  sel_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE],  x=T,y=T), 
                direction = "backward",
                scope = list(upper = formula2, 
                lower = formula(siri~abdomen+height)),
                trace = 0)
  lmvsmuss[[k]] <- predict(sel_k, newdata = dfr[omitted, , drop = FALSE])
  lmvsnvss[[k]] <- length(coef(sel_k))-1
  selnl_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE],  x=T,y=T), 
                  direction = "backward",
                  trace = 0)
  lmvsnlmuss[[k]] <- predict(selnl_k, newdata = dfr[omitted, , drop = FALSE])
  lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
}
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
print(names(coef(sel2)))
 [1] "(Intercept)" "age"         "height"      "neck"        "abdomen"     "thigh"      
 [7] "forearm"     "wrist"       "noise.1"     "noise.7"     "noise.8"     "noise.11"   
[13] "noise.13"    "noise.15"    "noise.16"    "noise.17"    "noise.26"    "noise.29"   
[19] "noise.30"    "noise.31"    "noise.34"    "noise.36"    "noise.37"    "noise.54"   
[25] "noise.55"    "noise.59"    "noise.60"    "noise.65"    "noise.75"    "noise.76"   
[31] "noise.84"    "noise.87"   
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
[1] "20-fold-CV RMSE lm full model:  5.4"
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
[1] "20-fold-CV RMSE lm step selected model:  5"

Variable selection with step has selected 31 variables, of which 24 are random noise (counted from the coefficients printed above). The 20-fold-CV RMSEs for the full lm model and the selected model are higher than with the original data.
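These counts can be read directly off the selected coefficients (a quick check, not in the original notebook):

length(coef(sel2)) - 1                    # number of selected variables
sum(grepl("^noise", names(coef(sel2))))   # how many of them are pure noise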

How about the Bayesian model with the regularized horseshoe prior and projection predictive variable selection?

p <- 100
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(global_scale=tau0)
fitrhs2 <- stan_glm(formula2, data = dfr, prior = hs_prior, QR = TRUE, 
                   seed=SEED, refresh=0)
fitrhs2_cvvs <- cv_varsel(fitrhs2, method = 'forward', cv_method = 'loo',
                          nloo=n, verbose = FALSE)
nsel2 <- suggest_size(fitrhs2_cvvs, alpha=0.1)
vsel2 <- solution_terms(fitrhs2_cvvs)[1:nsel2]
loormse_full2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$ref$mu)^2))
loormse_proj2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$sub[[nsel2]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full2,1)))
[1] "PSIS-LOO RMSE Bayesian full model:  4.4"
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj2,1)))
[1] "PSIS-LOO RMSE selected projpred model:  4.7"

Variable selection with projpred has selected 2 variables, of which none are random noise. The PSIS-LOO RMSEs for the full Bayesian model and the selected projpred model are similar to those with the original data.

If you don’t trust PSIS-LOO you can run the following K-fold-CV to get almost the same result.

writeLines(readLines("bodyfat_kfoldcv2.R"))
#' K-fold-CV for projpred for bodyfat

df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
# rename weight_kg (if present) to weight to match the predictor names below
colnames(df)[colnames(df) == "weight_kg"] <- "weight"

pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip", 
          "thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))

set.seed(1513306866)
noise <- array(rnorm(87*n), c(n,87))
dfr<-cbind(df,noise=noise)
formula2<-paste(formula,"+",paste(colnames(dfr[,20:106]), collapse = "+"))
p <- 100

p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)

set.seed(1513306866)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)

fitrhs2 <- stan_glm(formula2, data = dfr, prior=rhs_prior, QR=TRUE, 
                    seed=1513306866, refresh=0)
muss <- list()
vsmuss <- list()
vsnvss <- list()
vsnvss2 <- list()
vsnvss3 <- list()
fitcvs <- list()
for (k in 1:K) {
    message("Fitting model ", k, " out of ", K)
    omitted <- which(bin == k)
    fit_k <- update(
        object = fitrhs2,
        data = dfr[-omitted,, drop=FALSE],
        weights = NULL,
        refresh = 0
    )
    muss[[k]] <-
        colMeans(posterior_linpred(fit_k,
                                   newdata = dfr[omitted, , drop = FALSE]))
    fit_cvvs_k <- cv_varsel(fit_k, method='forward', cv_method='LOO',
                            nloo = length(which(bin != k)), nv_max=10,
                            verbose = FALSE)
    fitcvs[[k]] <- fit_cvvs_k
}
for (k in 1:K) {
    omitted <- which(bin == k)
    fit_cvvs_k <- fitcvs[[k]]
    print(nvk <- suggest_size(fit_cvvs_k, alpha=0.1))
    vsnvss[[k]] <- nvk
    fit_cvvs_k$varsel$vind[1:nvk]
    proj_k <- project(fit_cvvs_k, nv = nvk, ns = 4000)
    vsmuss[[k]] <-
        colMeans(proj_linpred(proj_k, xnew = dfr[omitted, , drop = FALSE]))
}
mus<-unlist(muss)[order(as.integer(names(unlist(muss))))]
vsmus<-unlist(vsmuss)[order(as.integer(names(unlist(vsmuss))))]
vsnvs2 <- unlist(vsnvss)
rmse_full2 <- sqrt(mean((df$siri-mus)^2))
(rmse_proj2 <- sqrt(mean((df$siri-vsmus)^2)))
save(vsnvs2, rmse_full2, rmse_proj2, file = "bodyfat_kfoldcv2.RData")

For this notebook this was run separately, and the saved results are shown below.

load(file="bodyfat_kfoldcv2.RData")
print(paste('10-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
[1] "10-fold-CV RMSE Bayesian full model:  4.4"
print(paste('10-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
[1] "10-fold-CV RMSE Bayesian projpred model:  4.4"


References

Catalina, A., Bürkner, P.-C. and Vehtari, A. (2020) ‘Projection predictive inference for generalized linear and additive multilevel models’, arXiv preprint arXiv:2010.06994.
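Heinze, G., Wallisch, C. and Dunkler, D. (2018) ‘Variable selection – A review and recommendations for the practicing statistician’, Biometrical Journal, 60(3), pp. 431–449. doi: 10.1002/bimj.201700067.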
Pavone, F., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2020) ‘Using reference models in variable selection’, arXiv preprint arXiv:2004.13118.
Peltola, T., Marttinen, P. and Vehtari, A. (2012) ‘Finite adaptation and multistep moves in the Metropolis-Hastings algorithm for variable selection in genome-wide association analysis’, PLoS ONE, 7(11), p. e49445. doi: 10.1371/journal.pone.0049445.
Piironen, J., Paasiniemi, M. and Vehtari, A. (2020) ‘Projective inference in high-dimensional problems: Prediction and feature selection’, Electronic Journal of Statistics, 14(1), pp. 2155–2197.
Piironen, J. and Vehtari, A. (2016) ‘Projection predictive model selection for Gaussian processes’, in 2016 IEEE 26th international workshop on machine learning for signal processing (MLSP), pp. 1–6.
Piironen, J. and Vehtari, A. (2017a) ‘Comparison of Bayesian predictive methods for model selection’, Statistics and Computing, 27(3), pp. 711–735. doi: 10.1007/s11222-016-9649-y.
Piironen, J. and Vehtari, A. (2017b) ‘Sparsity information and regularization in the horseshoe and other shrinkage priors’, Electronic journal of Statistics, 11(2), pp. 5018–5051. doi: 10.1214/17-EJS1337SI.
Piironen, J. and Vehtari, A. (2018) ‘Iterative supervised principal components’, in Storkey, A. and Perez-Cruz, F. (eds) Proceedings of the 21st international conference on artificial intelligence and statistics. (Proceedings of machine learning research), pp. 106–114. Available at: http://proceedings.mlr.press/v84/piironen18a.html.
Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing, 27(5), pp. 1413–1432. doi: 10.1007/s11222-016-9696-4.
Vehtari, A. and Ojanen, J. (2012) ‘A survey of Bayesian predictive methods for model assessment, selection and comparison’, Statistics Surveys, 6, pp. 142–228. doi: 10.1214/12-SS102.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y. and Gabry, J. (2019) ‘Pareto smoothed importance sampling’, arXiv preprint arXiv:1507.02646. Available at: https://arxiv.org/abs/1507.02646v6.

Licenses

  • Code © 2018, Aki Vehtari, licensed under BSD-3.
  • Text © 2018-2021, Aki Vehtari, licensed under CC-BY-NC 4.0.

Original Computing Environment

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.17.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fi_FI.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bridgesampling_1.1-2  ggridges_0.5.3        RColorBrewer_1.1-2    corrplot_0.92        
 [5] here_1.0.1            arm_1.12-2            lme4_1.1-28           Matrix_1.4-0         
 [9] MASS_7.3-55           GGally_2.1.2          fivethirtyeight_0.6.2 projpred_2.0.2       
[13] bayesplot_1.8.1       forcats_0.5.1         stringr_1.4.0         dplyr_1.0.8          
[17] purrr_0.3.4           readr_2.1.2           tidyr_1.2.0           tibble_3.1.6         
[21] ggplot2_3.3.5         tidyverse_1.3.1       loo_2.4.1             rstanarm_2.21.1      
[25] Rcpp_1.0.8            rmarkdown_2.11        knitr_1.37           

loaded via a namespace (and not attached):
  [1] readxl_1.3.1         backports_1.4.1      plyr_1.8.6           igraph_1.2.11       
  [5] splines_4.1.2        crosstalk_1.2.0      usethis_2.1.5        rstantools_2.1.1    
  [9] inline_0.3.19        digest_0.6.29        htmltools_0.5.2      rsconnect_0.8.25    
 [13] fansi_1.0.2          magrittr_2.0.2       checkmate_2.0.0      memoise_2.0.1       
 [17] tzdb_0.2.0           remotes_2.4.2        modelr_0.1.8         RcppParallel_5.1.5  
 [21] matrixStats_0.61.0   xts_0.12.1           prettyunits_1.1.1    colorspace_2.0-2    
 [25] rvest_1.0.2          haven_2.4.3          xfun_0.29            callr_3.7.0         
 [29] crayon_1.4.2         jsonlite_1.7.3       survival_3.2-13      zoo_1.8-9           
 [33] glue_1.6.1           gtable_0.3.0         distributional_0.3.0 pkgbuild_1.3.1      
 [37] rstan_2.21.3         abind_1.4-5          scales_1.1.1         mvtnorm_1.1-3       
 [41] DBI_1.1.2            miniUI_0.1.1.1       progress_1.2.2       xtable_1.8-4        
 [45] stats4_4.1.2         StanHeaders_2.21.0-7 DT_0.20              htmlwidgets_1.5.4   
 [49] httr_1.4.2           threejs_0.3.3        posterior_1.2.0      ellipsis_0.3.2      
 [53] reshape_0.8.8        pkgconfig_2.0.3      farver_2.1.0         sass_0.4.0          
 [57] dbplyr_2.1.1         utf8_1.2.2           labeling_0.4.2       tidyselect_1.1.1    
 [61] rlang_1.0.1          reshape2_1.4.4       later_1.3.0          munsell_0.5.0       
 [65] cellranger_1.1.0     tools_4.1.2          cachem_1.0.6         cli_3.1.1           
 [69] generics_0.1.2       devtools_2.4.3       broom_0.7.12         evaluate_0.14       
 [73] fastmap_1.1.0        yaml_2.2.2           processx_3.5.2       fs_1.5.2            
 [77] nlme_3.1-155         mime_0.12            xml2_1.3.3           brio_1.1.3          
 [81] compiler_4.1.2       shinythemes_1.2.0    rstudioapi_0.13      gamm4_0.2-6         
 [85] testthat_3.1.2       reprex_2.0.1         bslib_0.3.1          stringi_1.7.6       
 [89] highr_0.9            ps_1.6.0             Brobdingnag_1.2-7    desc_1.4.0          
 [93] lattice_0.20-45      nloptr_1.2.2.3       markdown_1.1         shinyjs_2.1.0       
 [97] tensorA_0.36.2       vctrs_0.3.8          pillar_1.7.0         lifecycle_1.0.1     
[101] jquerylib_0.1.4      httpuv_1.6.5         R6_2.5.1             promises_1.2.0.1    
[105] gridExtra_2.3        sessioninfo_1.2.2    codetools_0.2-18     boot_1.3-28         
[109] colourpicker_1.1.1   gtools_3.9.2         assertthat_0.2.1     pkgload_1.2.4       
[113] rprojroot_2.0.2      withr_2.4.3          shinystan_2.5.0      mgcv_1.8-38         
[117] parallel_4.1.2       hms_1.1.1            grid_4.1.2           coda_0.19-4         
[121] minqa_1.2.4          shiny_1.7.1          lubridate_1.8.0      base64enc_0.1-3     
[125] dygraphs_1.1.1.6