Predicting the yields of mesquite bushes. See Chapter 12 in Regression and Other Stories.

``````library("rprojroot")
root <- has_dirname("ROS-Examples")$make_fix_file()
library("foreign")
library("rstanarm")
library("loo")
library("ggplot2")
library("bayesplot")
theme_set(bayesplot::theme_default(base_family = "sans"))``````

Set random seed for reproducibility.

``SEED <- 4587``

``````mesquite <- read.table(root("Mesquite/data","mesquite.dat"), header=TRUE)
head(mesquite)``````
``````  obs group diam1 diam2 total_height canopy_height density weight
1   1   MCD   1.8  1.15         1.30          1.00       1  401.3
2   2   MCD   1.7  1.35         1.35          1.33       1  513.7
3   3   MCD   2.8  2.55         2.16          0.60       1 1179.2
4   4   MCD   1.3  0.85         1.80          1.20       1  308.0
5   5   MCD   3.3  1.90         1.55          1.05       1  855.2
6   6   MCD   1.4  1.40         1.20          1.00       1  268.7``````

## Predict weight given all the predictors

``````fit_1 <- stan_glm(weight ~ diam1 + diam2 + canopy_height + total_height +
                    density + group, data=mesquite, seed=SEED, refresh=0)
print(fit_1)``````
``````stan_glm
family:       gaussian [identity]
formula:      weight ~ diam1 + diam2 + canopy_height + total_height + density +
group
observations: 46
predictors:   7
------
              Median  MAD_SD
(Intercept)   -1094.4   179.6
diam1           192.9   118.4
diam2           370.4   127.1
canopy_height   361.8   210.4
total_height   -100.1   183.8
density         130.5    34.2
groupMCD        360.9   103.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 272.6   31.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg``````
``(loo_1 <- loo(fit_1))``
``````
Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo   -335.0 13.4
p_loo        16.7  9.3
looic       669.9 26.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   818
 (0.5, 0.7]   (ok)        0     0.0%   <NA>
   (0.7, 1]   (bad)       2     4.3%   37
   (1, Inf)   (very bad)  1     2.2%   2
See help('pareto-k-diagnostic') for details.``````

We get warnings about high Pareto k values, which indicate that the importance-sampling approximation used by loo is unreliable in this case. We therefore use the more robust K-fold cross-validation.

``````kfold_1 <- kfold(fit_1, K=10)
kfold_1``````
``````
Based on 10-fold cross-validation

           Estimate   SE
elpd_kfold   -349.1 29.6
p_kfold          NA   NA
kfoldic       698.2 59.2``````

## Predict log(weight) given log transformed predictors

``````fit_2 <- stan_glm(log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) +
                    log(total_height) + log(density) + group,
                  data=mesquite, seed=SEED, refresh=0)
(loo_2 <- loo(fit_2))``````
``````
Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo    -19.4  5.4
p_loo         7.7  1.6
looic        38.7 10.8
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   871
 (0.5, 0.7]   (ok)        3     6.5%   370
   (0.7, 1]   (bad)       0     0.0%   <NA>
   (1, Inf)   (very bad)  0     0.0%   <NA>

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.``````

#### Jacobian correction to make the models comparable

A Jacobian correction is needed because model 1 models y while model 2 models log(y), so their log predictive densities are on different scales.
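Concretely, this is the change-of-variables formula for densities (a standard derivation, stated here for reference): with $z = \log y$,

$$
p(y) = p(z)\,\left|\frac{dz}{dy}\right| = \frac{p(z)}{y},
\qquad
\log p(y) = \log p(z) - \log y .
$$

Subtracting log(weight) from each pointwise log predictive density of model 2 therefore puts its elpd on the scale of the untransformed weight.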

``````loo_2_with_jacobian <- loo_2
loo_2_with_jacobian$pointwise[,1] <- loo_2_with_jacobian$pointwise[,1] -
  log(mesquite$weight)
(elpd_loo_2_with_jacobian <- sum(loo_2_with_jacobian$pointwise[,1]))``````
``[1] -291.6678``

There will be a warning that the target data are not the same; this is OK because we have applied the Jacobian correction.

``loo_compare(kfold_1, loo_2_with_jacobian)``
``````      elpd_diff se_diff
fit_2   0.0       0.0
fit_1 -57.4      27.3  ``````
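As a quick sanity check (not part of the original analysis; the numbers are copied from the loo_compare output above), we can express the elpd difference in units of its standard error:

```r
# Hypothetical check using the values printed by loo_compare() above
elpd_diff <- -57.4
se_diff <- 27.3
round(abs(elpd_diff) / se_diff, 1)  # about 2 standard errors in favor of fit_2
```

A difference of roughly two standard errors supports preferring the log-scale model.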

#### Posterior predictive checking for model in original scale

``````yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
sims_display <- sample(n_sims, 100)
ppc_1 <- ppc_dens_overlay(mesquite$weight, yrep_1[sims_display,]) +
  theme(axis.line.y = element_blank())``````

#### Posterior predictive checking for model in log scale

``````yrep_2 <- posterior_predict(fit_2)
ppc_2 <- ppc_dens_overlay(log(mesquite$weight), yrep_2[sims_display,]) +
  theme(axis.line.y = element_blank())
bpg <- bayesplot_grid(
  ppc_1, ppc_2,
  grid_args = list(ncol = 2),
  titles = c("Model for weight", "Model for log(weight)")
)``````

#### Display posterior predictive checking plots

``bpg``