Setup

Load packages

library("rprojroot")
root<-has_dirname("modelselection_tutorial")$make_fix_file()
library("loo")
library("rstanarm")
options(mc.cores = parallel::detectCores())
library("ggplot2")
library("bayesplot")
theme_set(bayesplot::theme_default(base_family = "sans"))

1 Introduction

We demonstrate different cross-validation variants for hierarchical models. Related video discusses the examples used here starting at 22:53. For time series specific cross-validation, see Bürkner, Gabry and Vehtari (2019).

2 Data

Here we use a simple grouped data. The example data is taken from section 6 of Gelfand et al. (1990), and concerns 30 young rats whose weights were measured weekly for five week.

Load data

sourceToList = function(file){
  source(file, local = TRUE)
  d = mget(ls())
  d$file = NULL
  d
}
#
rats = sourceToList(root("rats.data.R"))
rats = with(rats, list(
  N = N,
  Npts = length(y),
  rat = rep(1:nrow(y), ncol(y)),
  x = rep(x, each = nrow(y)),
  y = as.numeric(y),
  xbar = xbar
  ))

Create dataframe

dfrats <- with(rats, data.frame(age=x, age_c=x-22, weight=y, rat=rat))
N <- dim(dfrats)[1]

Plot data

pr <- ggplot(data=dfrats, aes(x=age, y=weight)) +
    geom_line(aes(group=rat), color="black", size=0.1) +
    geom_point(color="black", size=2) +
    labs(x="Age (days)", y="Weight (g)", title="Rats data")
pr

Just looking at the data it seems that if the rat growth would be modelled with linear model (up to age 36 days), individual intercept are likely and possibly also individual slopes.

3 Models

We compare three models. One with population effect only, another with additional hierarchical intercept term, and third one with hierarchical intercept and slope terms.

Simple linear model

fit_1 <- stan_glm(weight ~ age, data=dfrats, refresh=0)

Linear model with hierarchical intercept

fit_2 <- stan_glmer(weight ~ age + (1 | rat), data=dfrats, refresh=0)

Linear model with hierarchical intercept and slope

fit_3 <- stan_glmer(weight ~ age + (age | rat), data=dfrats, refresh=0)

4 Leave-one-out cross-validation

In leave-one-out cross-validation, one observation is left out at time and predicted given all the other observations

pr1 <- pr +
    geom_point(data=dfrats[69,], color="red", size=5, shape=1) +
    ggtitle('Leave-one-out')
pr1