Predicting the yields of mesquite bushes. See Chapter 12 in Regression and Other Stories.


Load packages

library("rprojroot")
root<-has_dirname("ROS-Examples")$make_fix_file()
library("foreign")
library("rstanarm")
library("loo")
library("ggplot2")
library("bayesplot")
theme_set(bayesplot::theme_default(base_family = "sans"))

Set random seed for reproducibility

SEED <- 4587

Load data

mesquite <- read.table(root("Mesquite/data","mesquite.dat"), header=TRUE)
summary(mesquite)
      obs        group        diam1           diam2        total_height  
 Min.   : 1.00   ALS:20   Min.   :0.800   Min.   :0.400   Min.   :0.650  
 1st Qu.:12.25   MCD:26   1st Qu.:1.400   1st Qu.:1.000   1st Qu.:1.200  
 Median :23.50            Median :1.950   Median :1.525   Median :1.500  
 Mean   :23.50            Mean   :2.099   Mean   :1.572   Mean   :1.482  
 3rd Qu.:34.75            3rd Qu.:2.475   3rd Qu.:1.900   3rd Qu.:1.700  
 Max.   :46.00            Max.   :5.200   Max.   :4.000   Max.   :3.000  
 canopy_height       density          weight      
 Min.   :0.5000   Min.   :1.000   Min.   :  60.2  
 1st Qu.:0.8625   1st Qu.:1.000   1st Qu.: 219.6  
 Median :1.1000   Median :1.000   Median : 361.9  
 Mean   :1.1107   Mean   :1.674   Mean   : 559.7  
 3rd Qu.:1.3000   3rd Qu.:2.000   3rd Qu.: 688.7  
 Max.   :2.5000   Max.   :9.000   Max.   :4052.0  

Regress weight on all of the predictors

fit_1 <- stan_glm(weight ~ diam1 + diam2 + canopy_height + total_height +
                      density + group, data=mesquite, seed=SEED, refresh=0)
print(fit_1)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ diam1 + diam2 + canopy_height + total_height + density + 
       group
 observations: 46
 predictors:   7
------
              Median  MAD_SD 
(Intercept)   -1089.3   183.1
diam1           193.5   115.3
diam2           365.8   124.7
canopy_height   347.4   214.3
total_height    -93.5   181.6
density         131.9    34.7
groupMCD        359.8   102.5

Auxiliary parameter(s):
      Median MAD_SD
sigma 274.4   29.9 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
(loo_1 <- loo(fit_1))

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo   -334.1 12.3
p_loo        15.8  8.2
looic       668.2 24.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   798       
 (0.5, 0.7]   (ok)        0     0.0%   <NA>      
   (0.7, 1]   (bad)       1     2.2%   95        
   (1, Inf)   (very bad)  2     4.3%   7         
See help('pareto-k-diagnostic') for details.

We get warnings about high Pareto k values, which indicate that the importance-sampling approximation used by loo is unreliable here. We therefore use the more robust K-fold cross-validation.

kfold_1 <- kfold(fit_1, K=10)
kfold_1

Based on 10-fold cross-validation

           Estimate   SE
elpd_kfold   -363.1 40.2
p_kfold          NA   NA
kfoldic       726.2 80.4

Regress log(weight) on all of the log-transformed predictors

fit_2 <- stan_glm(log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) +
                      log(total_height) + log(density) + group,
                  data=mesquite, seed=SEED, refresh=0)
(loo_2 <- loo(fit_2))

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo    -19.3  5.4
p_loo         7.5  1.5
looic        38.5 10.7
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     44    95.7%   1056      
 (0.5, 0.7]   (ok)        2     4.3%   525       
   (0.7, 1]   (bad)       0     0.0%   <NA>      
   (1, Inf)   (very bad)  0     0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Jacobian correction to make the models comparable
A Jacobian correction is needed because model 1 models y while model 2 models log(y).

loo_2_with_jacobian <- loo_2
loo_2_with_jacobian$pointwise[,1] <- loo_2_with_jacobian$pointwise[,1]-
                                     log(mesquite$weight)
(elpd_loo_2_with_jacobian <- sum(loo_2_with_jacobian$pointwise[,1]))
[1] -291.5719
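As a quick base-R sanity check (using a standard lognormal for illustration, not the fitted model), the change-of-variables identity behind this correction can be verified directly:

```r
# Change of variables: if z = log(y), then p_Y(y) = p_Z(log y) / y,
# so log p_Y(y) = log p_Z(log y) - log(y) -- hence the -log(y) term above.
# Check with a standard lognormal (illustrative values only):
y <- 2.5
lhs <- dlnorm(y, meanlog = 0, sdlog = 1, log = TRUE)
rhs <- dnorm(log(y), mean = 0, sd = 1, log = TRUE) - log(y)
all.equal(lhs, rhs)  # TRUE
```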

There will be a warning that the response data are not the same; this is expected and harmless because we have applied the Jacobian correction.

loo_compare(kfold_1, loo_2_with_jacobian)
      elpd_diff se_diff
fit_2   0.0       0.0  
fit_1 -71.5      37.8  
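As a rough gauge of how decisive this comparison is, the difference can be related to its standard error (numbers copied from the loo_compare output above):

```r
# elpd difference and its SE, taken from the loo_compare output
elpd_diff <- -71.5
se_diff <- 37.8
abs(elpd_diff) / se_diff  # about 1.9 standard errors in favor of the log model
```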

Posterior predictive checking for the model on the original scale

yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
sims_display <- sample(n_sims, 100)
ppc_1 <- ppc_dens_overlay(mesquite$weight, yrep_1[sims_display,])

Posterior predictive checking for the model on the log scale

yrep_2 <- posterior_predict(fit_2)
ppc_2 <- ppc_dens_overlay(log(mesquite$weight), yrep_2[sims_display,])
bpg <- bayesplot_grid(
  ppc_1, ppc_2,
  grid_args = list(ncol = 2),
  titles = c("Model for weight", "Model for log(weight)")
)

Display posterior predictive checking plots

bpg

Plot marginal posteriors

mcmc_areas(as.matrix(fit_2), regex_pars = "^log|^gro")