Load packages
library(here)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(projpred)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(corrplot)
library(knitr)
SEED=1513306866
This notebook was inspired by the article by Heinze, Wallisch, and Dunkler (2018), "Variable selection – A review and recommendations for the practicing statistician". They provide "an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof." I agree that they provide sensible recommendations and warnings for those methods. Similar recommendations and warnings hold for information criterion and naive cross-validation based variable selection in the Bayesian framework, as demonstrated by Piironen and Vehtari (2017a).
Piironen and Vehtari (2017a) also demonstrate the superior stability of projection predictive variable selection (see especially their Figures 4 and 10). In this notebook I demonstrate the projection predictive variable selection method as presented by Piironen, Paasiniemi and Vehtari (2020) and implemented in the R package projpred. I use the same body fat data as used in Section 3.3 of Heinze, Wallisch, and Dunkler (2018). The dataset with the background information is available here, but Heinze, Wallisch, and Dunkler have done some data cleaning, and I have used the same data and some bits of the code they provide in the supplementary material. There are still some strange values, like the one person with zero fat percentage, but I did not do additional cleaning.
This notebook was initially created in 2018, but was later extended into the article "Using reference models in variable selection" by Pavone et al. (2020).
The excellent performance of projection predictive variable selection compared to other Bayesian variable selection methods was demonstrated by Piironen and Vehtari (2017a). Piironen, Paasiniemi and Vehtari (2020) present further improvements, such as improved model size selection and several options for making the approach faster with a larger number of variables or bigger data sets. Vehtari and Ojanen (2012) present the theoretical justification for projection predictive model selection and for inference after selection.
Note that if the goal is only prediction, no variable selection is needed. Projection predictive variable selection can be used to learn which variables are the most useful for making predictions and thereby potentially reduce future measurement costs. In the bodyfat example, most of the measurements have a time cost, so there is a benefit in finding the smallest set of variables to be used for future predictions.
This notebook presents a linear regression example, but the projection predictive approach can also be used, for example, with generalized linear models (Piironen and Vehtari, 2017a; Piironen, Paasiniemi and Vehtari, 2020), Gaussian processes (Piironen and Vehtari, 2016), and generalized linear and additive multilevel models (Catalina, Bürkner and Vehtari, 2020).
Load the data and scale it. Heinze, Wallisch, and Dunkler (2018) used unscaled data, but we scale it for easier comparison of the effect sizes. In theory this scaling should not make a detectable difference in the predictions, and indeed I also ran the analysis without scaling and there was no detectable difference in practice.
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
# no-one can have 0% body fat
df <- df[df$siri>0,]
y <- df[,"siri"]
df <- as.data.frame(df)
n <- nrow(df)
Lists of predictor and target variables, and the model formula.
pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip",
"thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- formula(paste("siri ~", paste(pred, collapse = " + ")))
p <- length(pred)
Plot correlation structure
corrplot(cor(df[, c(target,pred)]))
Fit the full Bayesian model. We use a weakly informative regularized horseshoe prior (Piironen and Vehtari, 2017b) to encode the prior assumption that some of the variables might be irrelevant. The selected \(p_0\) is the prior mean for the number of relevant variables. See Piironen and Vehtari (2017b) for the uncertainty around the implied prior for the number of effectively non-zero coefficients.
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=rhs_prior, QR=TRUE,
seed=SEED, refresh=0)
summary(fitrhs)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: siri ~ age + weight + height + neck + chest + abdomen + hip +
thigh + knee + ankle + biceps + forearm + wrist
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 250
predictors: 14
Estimates:
mean sd 10% 50% 90%
(Intercept) 19.1 0.3 18.8 19.1 19.5
age 0.8 0.4 0.3 0.8 1.3
weight 0.1 1.7 -2.0 0.1 2.2
height -0.9 0.5 -1.5 -0.9 -0.2
neck -0.7 0.5 -1.4 -0.7 0.0
chest -1.4 0.8 -2.4 -1.5 -0.3
abdomen 8.9 0.9 7.7 8.9 10.0
hip -0.6 0.8 -1.7 -0.5 0.3
thigh 0.5 0.7 -0.4 0.4 1.4
knee 0.0 0.4 -0.6 0.0 0.5
ankle 0.3 0.3 -0.1 0.3 0.6
biceps 0.4 0.4 -0.1 0.4 1.0
forearm 0.4 0.4 0.0 0.3 0.9
wrist -1.5 0.5 -2.2 -1.5 -0.9
sigma 4.3 0.2 4.0 4.3 4.5
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 19.2 0.4 18.7 19.2 19.7
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 4881
age 0.0 1.0 3894
weight 0.0 1.0 3593
height 0.0 1.0 3767
neck 0.0 1.0 3950
chest 0.0 1.0 3188
abdomen 0.0 1.0 4154
hip 0.0 1.0 3086
thigh 0.0 1.0 3119
knee 0.0 1.0 4622
ankle 0.0 1.0 4605
biceps 0.0 1.0 3078
forearm 0.0 1.0 4010
wrist 0.0 1.0 3185
sigma 0.0 1.0 5386
mean_PPD 0.0 1.0 4455
log-posterior 0.2 1.0 1074
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Make a graphical posterior predictive check to see that the model predictions make sense.
yrep <- posterior_predict(fitrhs, draws = 50)
ppc_dens_overlay(y, yrep)
The kernel density estimates for the data and the posterior predictive replicates are similar. Although the data do not have any negative \(y\) values, the kernel density estimate smooths over to the negative side. The predictive distribution of the model is not constrained to be positive, but the error is small and we continue with the variable selection.
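To quantify how small that error is, we can compute the fraction of posterior predictive draws that fall below zero; this is a small additional check, not part of the original analysis.
# fraction of posterior predictive draws below zero; a small value supports continuing
yrepall <- posterior_predict(fitrhs)
round(mean(yrepall < 0), 3)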
Plot the marginal posteriors of the coefficients.
mcmc_areas(as.matrix(fitrhs)[,2:14])
We can see that the posterior of the abdomen coefficient is far from zero, but it is not as clear which other variables should be included. weight has a wide marginal overlapping zero, which hints at a potentially relevant variable whose coefficient is correlated with others in the joint posterior. Looking at the marginals has the problem that correlated variables may have marginal posteriors overlapping zero even though the typical set of the joint posterior does not include zero. Compare, for example, the marginals of height and weight above to their joint distribution below.
mcmc_scatter(as.matrix(fitrhs), pars = c("height", "weight"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)
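As a rough check of that joint dependence (an extra computation, not part of the original workflow), we can look at the posterior correlation of the two coefficients.
# posterior correlation between the height and weight coefficients
round(cor(as.matrix(fitrhs)[, c("height", "weight")]), 2)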
Projection predictive variable selection is easily done with the cv_varsel function, which also computes a LOO-CV estimate of the predictive performance of the best model of each size. Heinze, Wallisch, and Dunkler (2018) "consider abdomen and height as two central IVs [independent variables] for estimating body fat proportion, and will not subject these two to variable selection." We subject all variables to the selection.
fitrhs_cvvs <- cv_varsel(fitrhs, method = 'forward', cv_method = 'loo',
nloo = n, verbose = FALSE)
Next, plot the estimated predictive performance of the smaller models compared to the full model.
plot(fitrhs_cvvs, stats = c('elpd', 'rmse'), deltas=FALSE)
Based on the plot, two variables and the projected posterior provide practically the same predictive performance as the full model. We can get a PSIS-LOO (Vehtari, Gelman and Gabry, 2017) based recommendation for the model size to choose.
(nsel <- suggest_size(fitrhs_cvvs, alpha=0.1))
[1] 2
(vsel <- solution_terms(fitrhs_cvvs)[1:nsel])
[1] "abdomen" "weight"
Based on this recommendation we continue with the two variables abdomen and weight. The model selected by Heinze, Wallisch, and Dunkler (2018) had seven variables: height (fixed), abdomen (fixed), wrist, age, neck, forearm, and chest.
Form the projected posterior for the selected model.
projrhs <- project(fitrhs_cvvs, nv = nsel, ns = 4000)
Plot the marginals of the projected posterior.
mcmc_areas(as.matrix(projrhs), pars = vsel)
So far we have seen that projpred selected a smaller set of variables with very similar predictive performance to the full model. Let's next compare the stability of the approaches. Heinze, Wallisch, and Dunkler (2018) repeated the model selection using 1000 bootstrapped datasets. Their top 20 selected models have 5–9 variables, the highest selection frequency is 3.2%, and the cumulative selection frequency of the top 20 models is 29.5%. These results clearly illustrate the instability of the selection method they used.
Before looking at the corresponding bootstrap results, we can look at the stability of the selection process based on the LOO-CV selection paths computed by cv_varsel (the code to make the following plot will be included in the projpred package).
source("projpredpct.R")
rows <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
col <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
pctch <- round(fitrhs_cvvs[["pct_solution_terms_cv"]], 2)
colnames(pctch)[1] <- ".size"
pct <- get_pct_arr(pctch, 13)
col_brks <- get_col_brks()
pct$val_grp <- as.character(sapply(pct$val, function(x) sum(x >= col_brks$breaks)))
if (identical(rows, 0)) rows <- pct$var[1]
pct$sel <- (pct$.size == col) & (pct$var %in% rows)
brks <- sort(unique(as.numeric(pct$val_grp)) + 1)
ggplot(pct, aes_(x = ~.size, y = ~var)) +
geom_tile(aes_(fill = ~val_grp, color = ~sel),
width = 1, height = 0.9, size = 1) +
geom_text(aes_(label = ~val, fontface = ~sel+1)) +
coord_cartesian(expand = FALSE) +
scale_y_discrete(limits = rev(levels(pct$var))) +
scale_x_discrete(limits = seq(1,col)) +
scale_color_manual(values = c("white", "black")) +
labs(x = "Model size", y = "",
title = "Fraction of cv-folds that select the given variable") +
scale_fill_manual(breaks = brks, values = col_brks$pal[brks]) +
theme_proj() +
theme(legend.position = "none",
axis.text.y = element_text(angle = 45))
For model sizes 1–3, the selection paths in the different LOO-CV folds are always the same: abdomen, weight, and wrist. For larger model sizes there is some small variation, but mostly the order is quite consistent.
Running stan_glm with prior=hs() and cv_varsel does not take much time when run only once, but for a notebook, running them 1000 times would take hours. The code for running the above variable selection procedure for 100 different bootstrapped datasets is as follows.
writeLines(readLines("bodyfat_bootstrap.R"))
library(here)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(projpred)
library(dplyr)
#' Bootstrap iterations for projpred for bodyfat
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
colnames(df[c("weight_kg", "height")]) <- c("weight", "height")
pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip",
"thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))
p <- length(pred)
p0 <- 2 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
bootnum <- 100
boot_est <- boot_se <- matrix(0, ncol = length(pred) + 1, nrow = bootnum,
dimnames = list(NULL, c("(Intercept)", pred)))
boot_nvs <- matrix(0, ncol=1, nrow=bootnum)
bbn <- matrix(0, ncol=1, nrow=bootnum)
for (i in 1:bootnum) {
set.seed(5437854+i)
data_id <- sample(1:dim(df)[1], replace = T)
bbn[i,] <- length(unique(data_id))
fitb <- stan_glm(formula, data = df[data_id, ],
prior=rhs_prior, QR=TRUE, seed=i, refresh=0)
fitb_cv <- cv_varsel(fitb, method='forward', cv_method='LOO', nloo=n,
verbose = FALSE)
print(nv <- suggest_size(fitb_cv,alpha=0.1))
boot_nvs[i,] <- nv
print(fitb_cv$varsel$vind[1:nv])
projb <- project(fitb_cv, nv = nv, ns = 4000)
boot_est[i, colnames(as.matrix(projb)[,-(nv+2)])] <- colMeans(as.matrix(projb)[,-(nv+2)])
}
boot_01 <- (boot_est != 0) * 1
boot_inclusion <- data.frame(projpred_incp=(apply(boot_01, 2, function(x) sum(x) / length(x) * 100)))
boot_01 <- data.frame(boot_01)
bn <- data.frame(boot_nvs) %>% group_by_all() %>% count(sort=TRUE)
bd <- boot_01 %>% group_by_at(vars(-X.Intercept.)) %>% count(sort=TRUE)
boot_inclusion <- boot_inclusion %>% tibble::rownames_to_column(var="variable") %>% filter(variable != "X.Intercept.") %>% arrange(-projpred_incp)
boot_inclusion$steplm_incp <- c(100, 28, 98, 100, 85, 63, 51, 48, 34, 43, 54, 41, 18)
boot_inclusion <- boot_inclusion %>% rename(projpred=projpred_incp, steplm=steplm_incp)
cumsum(bd$n)
for (i in 1:20) {
print(paste(paste0(colnames(bd)[c(as.logical(bd[i,1:13]),FALSE)], collapse=", "),bd$n[i],sep=", "))
}
save(boot_01, boot_inclusion, boot_nvs, bbn, file = "bodyfat_bootstrap.RData")
In theory LOO-CV should have smaller variation than the bootstrap, and indeed in practice we see much more variation in the bootstrap results. In the basic bootstrap, on average only 63% of the original observations are included, that is, in this case the average number of unique observations in a bootstrapped dataset is 159 while the full data has n=251, which explains the increased variability in the variable selection. Out of 100 bootstrap iterations, model size 2 was selected 32 times and model size 3 was selected 28 times.
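The 63% figure follows from simple bootstrap arithmetic; a minimal sketch of the calculation (251 is the number of rows in the bootstrapped data):
# probability that a given observation appears at least once in a bootstrap resample
nboot <- 251
punique <- 1 - (1 - 1/nboot)^nboot  # approaches 1 - exp(-1), about 0.632
round(punique, 3)
round(punique * nboot)              # expected number of unique observations, about 159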
The bootstrap inclusion frequencies with projpred and with step(lm(...)) (results from Heinze, Wallisch, and Dunkler (2018)) are shown in the following table.
variable | projpred | steplm |
---|---|---|
abdomen | 100 | 100 |
weight | 58 | 28 |
wrist | 45 | 98 |
height | 35 | 100 |
age | 10 | 85 |
neck | 10 | 63 |
chest | 5 | 51 |
thigh | 5 | 48 |
ankle | 5 | 34 |
biceps | 4 | 43 |
forearm | 3 | 54 |
hip | 0 | 41 |
knee | 0 | 18 |
Heinze, Wallisch, and Dunkler (2018) fixed abdomen and height to be always included. projpred always selects abdomen, but height is included in only 35% of the iterations. The coefficients of weight and height are strongly correlated, as shown above, and thus it is not that surprising that projpred selects weight instead of height. The five variables most often selected by projpred in the bootstrap iterations are the same five, and in the same order, as selected by the cv_varsel function with the full data. Overall, projpred selects smaller models and thus the bootstrap inclusion probabilities are smaller.
The following table shows the top 10 projpred models and their selection frequencies in the bootstrap iterations.
model | variables | frequency |
---|---|---|
1 | abdomen, weight | 37 |
2 | abdomen, wrist | 10 |
3 | abdomen, height | 10 |
4 | abdomen, height, wrist | 9 |
5 | abdomen, weight, wrist | 8 |
6 | abdomen, chest, height, wrist | 3 |
7 | abdomen, height, neck, wrist | 2 |
8 | abdomen, age, wrist | 2 |
9 | abdomen, age, height, neck, thigh, wrist | 2 |
10 | abdomen, chest | 1 |
The top 10 selected models have 2–5 variables, the highest selection frequency is 37%, and the cumulative selection frequency of the top 10 models is 84% and of the top 20 models 94%. This demonstrates that projpred is much more stable.
Heinze, Wallisch, and Dunkler (2018) focused on which variables were selected and on the coefficient estimates, but they did not consider the predictive performance of the models. We can estimate the predictive performance of the selected model via cross-validation (taking the effect of the selection process into account, too). As part of the variable selection, cv_varsel also computes PSIS-LOO estimates for the full model and all submodels, taking the selection process into account. For Bayesian models we would usually report the expected log predictive density (elpd), as it assesses the goodness of the whole predictive distribution. Since we now compare results to lm, we also estimate the root mean square error (RMSE) of the mean predictions.
loormse_full <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$ref$mu)^2))
loormse_proj <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$sub[[nsel]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full,1)))
[1] "PSIS-LOO RMSE Bayesian full model: 4.3"
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj,1)))
[1] "PSIS-LOO RMSE selected projpred model: 4.7"
Since we get these cross-validation estimates using PSIS-LOO, we do not need to run K-fold-CV, but since we need to run K-fold-CV for lm + step, we also ran K-fold-CV for the projpred selection process. As this takes some time, only the code is shown here and we load separately run results. The following code computes 20-fold-CV for the full Bayesian model and the projpred selected model.
writeLines(readLines("bodyfat_kfoldcv.R"))
#' K-fold-CV for projpred for bodyfat
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
colnames(df[c("weight_kg", "height")]) <- c("weight", "height")
pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip",
"thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))
p <- length(pred)
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=rhs_prior, QR=TRUE,
seed=1513306866, refresh=0)
set.seed(5437854)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)
muss <- list()
vsmuss <- list()
vsnvss <- list()
vsnvss2 <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
fit_k <- update(
object = fitrhs,
data = df[-omitted,, drop=FALSE],
weights = NULL,
refresh = 0
)
fit_cvvs_k <- cv_varsel(fit_k, method='forward', cv_method='LOO',
nloo=nrow(df[-omitted,, drop=FALSE]))
nvk <- suggest_size(fit_cvvs_k,alpha=0.1)
vsnvss[[k]] <- nvk
proj_k <- project(fit_cvvs_k, nv = nvk, ns = 4000)
muss[[k]] <-
colMeans(posterior_linpred(fit_k,
newdata = df[omitted, , drop = FALSE]))
vsmuss[[k]] <-
colMeans(proj_linpred(proj_k, xnew = df[omitted, , drop = FALSE]))
}
mus<-unlist(muss)[order(as.integer(names(unlist(muss))))]
vsmus<-unlist(vsmuss)[order(as.integer(names(unlist(vsmuss))))]
vsnvs <- unlist(vsnvss)
rmse_full <- sqrt(mean((df$siri-mus)^2))
rmse_proj <- sqrt(mean((df$siri-vsmus)^2))
save(vsnvs, rmse_full, rmse_proj, file = "bodyfat_kfoldcv.RData")
We load and print the results.
load(file="bodyfat_kfoldcv.RData")
print(paste('20-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
[1] "20-fold-CV RMSE Bayesian full model: 4.4"
print(paste('20-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
[1] "20-fold-CV RMSE Bayesian projpred model: 4.4"
The results are very close to the PSIS-LOO results. Going from the full Bayesian model to a smaller projection predictive model, we get practically the same 20-fold-CV RMSE, which is very good performance considering we dropped from 13 covariates to 2 covariates.
Then we compute 20-fold-CV for the lm model and the step-selected model (this is fast enough to include in the notebook).
set.seed(SEED)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
lmfit_k <- lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T)
lmmuss[[k]] <- predict(lmfit_k, newdata = df[omitted, , drop = FALSE])
sel_k <- step(lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
scope = list(upper = formula,
lower = formula(siri~abdomen+height)),
trace = 0)
lmvsmuss[[k]] <- predict(sel_k, newdata = df[omitted, , drop = FALSE])
lmvsnvss[[k]] <- length(coef(sel_k))-1
# compute also a version without fixing abdomen and height
selnl_k <- step(lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
trace = 0)
lmvsnlmuss[[k]] <- predict(selnl_k, newdata = df[omitted, , drop = FALSE])
lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
}
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
[1] "20-fold-CV RMSE lm full model: 4.4"
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
[1] "20-fold-CV RMSE lm step selected model: 4.5"
We see that simpler maximum likelihood estimation can provide similar 20-fold-CV RMSE as the much slower Bayesian inference with a fancy regularized horseshoe prior. Also, the model selection process did not overfit, and the model selected with step has similar 20-fold-CV RMSE as the projpred result. Neither of these results is surprising, as \(n=251 \gg p=13\). However, projpred has a much more stable selection process and produced much smaller models with the same accuracy.
Heinze, Wallisch, and Dunkler (2018) also write: "In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model." I strongly disagree with this, as there are many statistical models working with more than a million candidate variables (see, e.g., Peltola, Marttinen and Vehtari (2012)). As the bodyfat dataset proved to be quite easy, in the sense that maximum likelihood performed well compared to the Bayesian approach, let's make the dataset a bit more challenging.
We add 87 variables of normally distributed random noise, which are thus not related to bodyfat in any way. We now have a total of 100 variables.
set.seed(SEED)
noise <- array(rnorm(87*n), c(n,87))
dfr<-cbind(df,noise=noise)
formula2<-formula(paste(paste("siri~", paste(pred, collapse = "+")),
"+",paste(colnames(dfr[,20:106]), collapse = "+")))
Given this new dataset, we do variable selection with the full data and compute 20-fold-CV for the lm model and the step-selected model.
sel2 <- step(lm(formula2, data = dfr, x=T,y=T),
direction = "backward",
scope = list(upper = formula2,
lower = formula(siri~abdomen+height)),
trace = 0)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
lmfit_k <- lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T)
lmmuss[[k]] <- predict(lmfit_k, newdata = dfr[omitted, , drop = FALSE])
sel_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
scope = list(upper = formula2,
lower = formula(siri~abdomen+height)),
trace = 0)
lmvsmuss[[k]] <- predict(sel_k, newdata = dfr[omitted, , drop = FALSE])
lmvsnvss[[k]] <- length(coef(sel_k))-1
selnl_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
trace = 0)
lmvsnlmuss[[k]] <- predict(selnl_k, newdata = dfr[omitted, , drop = FALSE])
lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
}
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
print(names(coef(sel2)))
[1] "(Intercept)" "age" "height" "neck" "abdomen" "thigh"
[7] "forearm" "wrist" "noise.1" "noise.7" "noise.8" "noise.11"
[13] "noise.13" "noise.15" "noise.16" "noise.17" "noise.26" "noise.29"
[19] "noise.30" "noise.31" "noise.34" "noise.36" "noise.37" "noise.54"
[25] "noise.55" "noise.59" "noise.60" "noise.65" "noise.75" "noise.76"
[31] "noise.84" "noise.87"
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
[1] "20-fold-CV RMSE lm full model: 5.4"
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
[1] "20-fold-CV RMSE lm step selected model: 5"
Variable selection with step has selected 31 variables, of which 24 are random noise. The 20-fold-CV RMSEs for the full lm model and the selected model are higher than with the original data.
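These counts can be verified directly from the selected coefficients; a small check:
# count the selected variables and how many of them are pure noise
sel2vars <- setdiff(names(coef(sel2)), "(Intercept)")
length(sel2vars)
sum(grepl("^noise", sel2vars))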
How about then the Bayesian model with the regularized horseshoe prior and projection predictive variable selection?
p <- 100
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(global_scale=tau0)
fitrhs2 <- stan_glm(formula2, data = dfr, prior = hs_prior, QR = TRUE,
seed=SEED, refresh=0)
fitrhs2_cvvs <- cv_varsel(fitrhs2, method = 'forward', cv_method = 'loo',
nloo=n, verbose = FALSE)
nsel2 <- suggest_size(fitrhs2_cvvs, alpha=0.1)
vsel2 <- solution_terms(fitrhs2_cvvs)[1:nsel2]
loormse_full2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$ref$mu)^2))
loormse_proj2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$sub[[nsel2]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full2,1)))
[1] "PSIS-LOO RMSE Bayesian full model: 4.4"
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj2,1)))
[1] "PSIS-LOO RMSE selected projpred model: 4.7"
Variable selection with projpred has selected 2 variables, of which 0 are random noise. The PSIS-LOO RMSEs for the full Bayesian model and the selected projpred model are similar to those with the original data.
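The corresponding check for the projpred selection (again just a small verification sketch):
# count the projpred-selected variables and how many of them are pure noise
length(vsel2)
sum(grepl("^noise", vsel2))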
If you don’t trust PSIS-LOO you can run the following K-fold-CV to get almost the same result.
writeLines(readLines("bodyfat_kfoldcv2.R"))
#' K-fold-CV for projpred for bodyfat
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
df <- as.data.frame(df)
n <- nrow(df)
colnames(df[c("weight_kg", "height")]) <- c("weight", "height")
pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip",
"thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- paste("siri~", paste(pred, collapse = "+"))
set.seed(1513306866)
noise <- array(rnorm(87*n), c(n,87))
dfr<-cbind(df,noise=noise)
formula2<-paste(formula,"+",paste(colnames(dfr[,20:106]), collapse = "+"))
p <- 100
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
set.seed(1513306866)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)
fitrhs2 <- stan_glm(formula2, data = dfr, prior=rhs_prior, QR=TRUE,
seed=1513306866, refresh=0)
muss <- list()
vsmuss <- list()
vsnvss <- list()
vsnvss2 <- list()
vsnvss3 <- list()
fitcvs <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
fit_k <- update(
object = fitrhs2,
data = dfr[-omitted,, drop=FALSE],
weights = NULL,
refresh = 0
)
muss[[k]] <-
colMeans(posterior_linpred(fit_k,
newdata = dfr[omitted, , drop = FALSE]))
fit_cvvs_k <- cv_varsel(fit_k, method='forward', cv_method='LOO',
nloo = length(which(bin != k)), nv_max=10,
verbose = FALSE)
fitcvs[[k]] <- fit_cvvs_k
}
for (k in 1:K) {
omitted <- which(bin == k)
fit_cvvs_k <- fitcvs[[k]]
print(nvk <- suggest_size(fit_cvvs_k, alpha=0.1))
vsnvss[[k]] <- nvk
fit_cvvs_k$varsel$vind[1:nvk]
proj_k <- project(fit_cvvs_k, nv = nvk, ns = 4000)
vsmuss[[k]] <-
colMeans(proj_linpred(proj_k, xnew = dfr[omitted, , drop = FALSE]))
}
mus<-unlist(muss)[order(as.integer(names(unlist(muss))))]
vsmus<-unlist(vsmuss)[order(as.integer(names(unlist(vsmuss))))]
vsnvs2 <- unlist(vsnvss)
rmse_full2 <- sqrt(mean((df$siri-mus)^2))
(rmse_proj2 <- sqrt(mean((df$siri-vsmus)^2)))
save(vsnvs2, rmse_full2, rmse_proj2, file = "bodyfat_kfoldcv2.RData")
For this notebook this was run separately, and the saved results are shown below.
load(file="bodyfat_kfoldcv2.RData")
print(paste('20-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
[1] "20-fold-CV RMSE Bayesian full model: 4.4"
print(paste('20-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
[1] "20-fold-CV RMSE Bayesian projpred model: 4.4"
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.17.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bridgesampling_1.1-2 ggridges_0.5.3 RColorBrewer_1.1-2 corrplot_0.92
[5] here_1.0.1 arm_1.12-2 lme4_1.1-28 Matrix_1.4-0
[9] MASS_7.3-55 GGally_2.1.2 fivethirtyeight_0.6.2 projpred_2.0.2
[13] bayesplot_1.8.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8
[17] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
[21] ggplot2_3.3.5 tidyverse_1.3.1 loo_2.4.1 rstanarm_2.21.1
[25] Rcpp_1.0.8 rmarkdown_2.11 knitr_1.37
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.4.1 plyr_1.8.6 igraph_1.2.11
[5] splines_4.1.2 crosstalk_1.2.0 usethis_2.1.5 rstantools_2.1.1
[9] inline_0.3.19 digest_0.6.29 htmltools_0.5.2 rsconnect_0.8.25
[13] fansi_1.0.2 magrittr_2.0.2 checkmate_2.0.0 memoise_2.0.1
[17] tzdb_0.2.0 remotes_2.4.2 modelr_0.1.8 RcppParallel_5.1.5
[21] matrixStats_0.61.0 xts_0.12.1 prettyunits_1.1.1 colorspace_2.0-2
[25] rvest_1.0.2 haven_2.4.3 xfun_0.29 callr_3.7.0
[29] crayon_1.4.2 jsonlite_1.7.3 survival_3.2-13 zoo_1.8-9
[33] glue_1.6.1 gtable_0.3.0 distributional_0.3.0 pkgbuild_1.3.1
[37] rstan_2.21.3 abind_1.4-5 scales_1.1.1 mvtnorm_1.1-3
[41] DBI_1.1.2 miniUI_0.1.1.1 progress_1.2.2 xtable_1.8-4
[45] stats4_4.1.2 StanHeaders_2.21.0-7 DT_0.20 htmlwidgets_1.5.4
[49] httr_1.4.2 threejs_0.3.3 posterior_1.2.0 ellipsis_0.3.2
[53] reshape_0.8.8 pkgconfig_2.0.3 farver_2.1.0 sass_0.4.0
[57] dbplyr_2.1.1 utf8_1.2.2 labeling_0.4.2 tidyselect_1.1.1
[61] rlang_1.0.1 reshape2_1.4.4 later_1.3.0 munsell_0.5.0
[65] cellranger_1.1.0 tools_4.1.2 cachem_1.0.6 cli_3.1.1
[69] generics_0.1.2 devtools_2.4.3 broom_0.7.12 evaluate_0.14
[73] fastmap_1.1.0 yaml_2.2.2 processx_3.5.2 fs_1.5.2
[77] nlme_3.1-155 mime_0.12 xml2_1.3.3 brio_1.1.3
[81] compiler_4.1.2 shinythemes_1.2.0 rstudioapi_0.13 gamm4_0.2-6
[85] testthat_3.1.2 reprex_2.0.1 bslib_0.3.1 stringi_1.7.6
[89] highr_0.9 ps_1.6.0 Brobdingnag_1.2-7 desc_1.4.0
[93] lattice_0.20-45 nloptr_1.2.2.3 markdown_1.1 shinyjs_2.1.0
[97] tensorA_0.36.2 vctrs_0.3.8 pillar_1.7.0 lifecycle_1.0.1
[101] jquerylib_0.1.4 httpuv_1.6.5 R6_2.5.1 promises_1.2.0.1
[105] gridExtra_2.3 sessioninfo_1.2.2 codetools_0.2-18 boot_1.3-28
[109] colourpicker_1.1.1 gtools_3.9.2 assertthat_0.2.1 pkgload_1.2.4
[113] rprojroot_2.0.2 withr_2.4.3 shinystan_2.5.0 mgcv_1.8-38
[117] parallel_4.1.2 hms_1.1.1 grid_4.1.2 coda_0.19-4
[121] minqa_1.2.4 shiny_1.7.1 lubridate_1.8.0 base64enc_0.1-3
[125] dygraphs_1.1.1.6