Predicting the yields of mesquite bushes. See Chapter 12 in Regression and Other Stories.


Load packages

library("rprojroot")
root <- has_dirname("ROS-Examples")$make_fix_file()
library("foreign")
library("rstanarm")
library("loo")
library("ggplot2")
library("bayesplot")
theme_set(bayesplot::theme_default(base_family = "sans"))

Set random seed for reproducibility

SEED <- 4587

Load data

mesquite <- read.table(root("Mesquite/data","mesquite.dat"), header=TRUE)
head(mesquite)
  obs group diam1 diam2 total_height canopy_height density weight
1   1   MCD   1.8  1.15         1.30          1.00       1  401.3
2   2   MCD   1.7  1.35         1.35          1.33       1  513.7
3   3   MCD   2.8  2.55         2.16          0.60       1 1179.2
4   4   MCD   1.3  0.85         1.80          1.20       1  308.0
5   5   MCD   3.3  1.90         1.55          1.05       1  855.2
6   6   MCD   1.4  1.40         1.20          1.00       1  268.7
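
Later in this notebook we fit models on the log scale, so it is worth checking that all measurements are strictly positive. A quick base-R check (not part of the original analysis):

summary(mesquite[, c("diam1","diam2","total_height","canopy_height","density","weight")])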

Predict weight given all the predictors

fit_1 <- stan_glm(weight ~ diam1 + diam2 + canopy_height + total_height +
                    density + group, data=mesquite, seed=SEED, refresh=0)
print(fit_1)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ diam1 + diam2 + canopy_height + total_height + density + 
       group
 observations: 46
 predictors:   7
------
              Median  MAD_SD 
(Intercept)   -1094.4   179.6
diam1           192.9   118.4
diam2           370.4   127.1
canopy_height   361.8   210.4
total_height   -100.1   183.8
density         130.5    34.2
groupMCD        360.9   103.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 272.6   31.0 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
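
As a supplement to the median and MAD_SD summaries printed above, posterior uncertainty intervals can be extracted directly; a minimal sketch using rstanarm's posterior_interval():

round(posterior_interval(fit_1, prob=0.90), 1)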
(loo_1 <- loo(fit_1))

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo   -335.0 13.4
p_loo        16.7  9.3
looic       669.9 26.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   818       
 (0.5, 0.7]   (ok)        0     0.0%   <NA>      
   (0.7, 1]   (bad)       2     4.3%   37        
   (1, Inf)   (very bad)  1     2.2%   2         
See help('pareto-k-diagnostic') for details.

We get warnings about high Pareto k values, which indicate that the importance sampling approximation used in loo is unreliable in this case. We thus use the more robust K-fold cross-validation.
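
Before doing that, we can check which observations trigger the warnings; the loo package provides helper functions for the Pareto k diagnostics (a quick inspection sketch):

# indices of observations with Pareto k above 0.7
bad <- pareto_k_ids(loo_1, threshold=0.7)
bad
# the corresponding k values
pareto_k_values(loo_1)[bad]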

kfold_1 <- kfold(fit_1, K=10)
kfold_1

Based on 10-fold cross-validation

           Estimate   SE
elpd_kfold   -349.1 29.6
p_kfold          NA   NA
kfoldic       698.2 59.2

Predict log(weight) given log-transformed predictors

fit_2 <- stan_glm(log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) +
                      log(total_height) + log(density) + group,
                  data=mesquite, seed=SEED, refresh=0)
(loo_2 <- loo(fit_2))

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo    -19.4  5.4
p_loo         7.7  1.6
looic        38.7 10.8
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   871       
 (0.5, 0.7]   (ok)        3     6.5%   370       
   (0.7, 1]   (bad)       0     0.0%   <NA>      
   (1, Inf)   (very bad)  0     0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Jacobian correction to make the models comparable

A Jacobian correction is needed because model 1 models weight while model 2 models log(weight), so the two elpd values are not on the same scale.
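
The correction works as follows: if z = log(y) has density p(z), the implied density of y is p(log y)/y, so log p(y) = log p(log y) - log(y), and we subtract log(weight) from each pointwise elpd value. A small numeric illustration of this change of variables, using arbitrary values rather than the fitted model:

y <- 100; mu <- 4; s <- 1
log_p_log_y <- dnorm(log(y), mu, s, log=TRUE)  # density on the log scale
log_p_y <- dlnorm(y, mu, s, log=TRUE)          # implied density of y itself
all.equal(log_p_y, log_p_log_y - log(y))       # TRUE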

loo_2_with_jacobian <- loo_2
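# subtract log(weight) pointwise: the log-Jacobian of the log transformation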
loo_2_with_jacobian$pointwise[,1] <- loo_2_with_jacobian$pointwise[,1]-
                                     log(mesquite$weight)
(elpd_loo_2_with_jacobian <- sum(loo_2_with_jacobian$pointwise[,1]))
[1] -291.6678

loo_compare() will warn that the y data are not the same across the models; this is expected and OK here because we have applied the Jacobian correction.

loo_compare(kfold_1, loo_2_with_jacobian)
      elpd_diff se_diff
fit_2   0.0       0.0  
fit_1 -57.4      27.3  

Posterior predictive checking for the model on the original scale

yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
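# randomly select 100 posterior draws to overlay in the density plot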
sims_display <- sample(n_sims, 100)
ppc_1 <- ppc_dens_overlay(mesquite$weight, yrep_1[sims_display,]) +
    theme(axis.line.y = element_blank())

Posterior predictive checking for the model on the log scale

yrep_2 <- posterior_predict(fit_2)
ppc_2 <- ppc_dens_overlay(log(mesquite$weight), yrep_2[sims_display,]) +
  theme(axis.line.y = element_blank())
bpg <- bayesplot_grid(
  ppc_1, ppc_2,
  grid_args = list(ncol = 2),
  titles = c("Model for weight", "Model for log(weight)")
)

Display posterior predictive checking plots

bpg