Workflow for iterative building of a time series model.
We analyse the relative number of births per day in the USA in 1969-1988 using a Gaussian process time series model with several model components that can explain the long term, seasonal, weekly, day of year, and special floating day variation.
Models for relative number of births
As the relative number of births is positive, it’s natural to model its logarithm. The generic form of the models is \[
y \sim \mbox{normal}(f(x), \sigma),
\] where \(f\) is a different and gradually more complex function in each model, conditional on \(x\), which includes the running day number, day of year, day of week and eventually some special floating US bank holidays.
Model 1: Slow trend
Model 1 is just the slow trend over the years, modelled with a Hilbert space basis function approximated Gaussian process \[
f = \mbox{intercept} + f_1\\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)
\] where the GP has an exponentiated quadratic covariance function.
In this phase the code from Riutort-Mayol et al. (2020) was cleaned and written to be more efficient, but only the one GP component was included to make the testing easier. Although the code was made more efficient, the aim wasn’t to make it the fastest possible, as later model changes may have a bigger effect on the performance (it’s good to avoid premature optimization). We also use quite a small number of basis functions to make the code run faster, and only later examine more carefully whether the number of basis functions is sufficient compared to the posterior of the length scale (see Riutort-Mayol et al., 2020).
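The Hilbert space approximation mentioned above can be sketched in plain R. This is a minimal illustration following the formulas in Riutort-Mayol et al. (2020), not the contents of gpbasisfun_functions1.stan. The function names `basis_eq` and `sqrt_spd_eq`, the input grid `x`, and the values of `M`, `sigma_f` and `lengthscale` are assumptions made up for the example.

```r
# Basis functions phi_j(x) on the interval [-L, L], one column per j = 1..M
basis_eq <- function(x, L, M) {
  sin(outer(x + L, seq_len(M) * pi / (2 * L))) / sqrt(L)
}

# Square root of the exponentiated quadratic spectral density,
# evaluated at the frequencies sqrt(lambda_j) = j * pi / (2 * L)
sqrt_spd_eq <- function(sigma_f, lengthscale, L, M) {
  w <- seq_len(M) * pi / (2 * L)
  sigma_f * sqrt(sqrt(2 * pi) * lengthscale) * exp(-0.25 * (lengthscale * w)^2)
}

x <- seq(-1, 1, length.out = 101)   # hypothetical standardized inputs
c_f <- 1.5                          # boundary factor, same role as c_f1
L <- c_f * max(abs(x))
M <- 40                             # hypothetical number of basis functions
sigma_f <- 1
lengthscale <- 0.3

PHI <- basis_eq(x, L, M)
d <- sqrt_spd_eq(sigma_f, lengthscale, L, M)

# One draw of the approximate GP: f = PHI %*% (d * beta), beta ~ normal(0, 1)
set.seed(1)
f_draw <- as.numeric(PHI %*% (d * rnorm(M)))

# The covariance implied by the basis should be close to the exact kernel
K_approx <- PHI %*% (d^2 * t(PHI))
K_exact <- sigma_f^2 * exp(-0.5 * outer(x, x, "-")^2 / lengthscale^2)
max_abs_err <- max(abs(K_approx - K_exact))
print(max_abs_err)
```

Repeating the comparison with a smaller `M` or a shorter `lengthscale` gives a feel for when the number of basis functions stops being sufficient.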
Compile Stan model gpbf1.stan which includes gpbasisfun_functions1.stan
model1 <- cmdstan_model(stan_file = root("Birthdays", "gpbf1.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata1 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20) # number of basis functions for GP for f1
As the basis function approximation and priors restrict the complexity of GP, we can safely use optimization to get a very quick initial result to check that the model code is computing what we intended. As there are only 14 parameters and 7305 observations it’s likely that the posterior is close to normal (in unconstrained space). In this case the optimization takes less than one second while MCMC sampling with default options would have taken several minutes. Although this result can be useful in a quick workflow, the result should not be used as the final result.
opt1 <- model1$optimize(data = standata1, init=0, algorithm='bfgs')
Check whether parameters have reasonable values
odraws1 <- opt1$draws()
subset(odraws1, variable=c('intercept','sigma_f1','lengthscale_f1','sigma'))
# A draws_matrix: 1 iterations, 1 chains, and 4 variables
variable
draw intercept sigma_f1 lengthscale_f1 sigma
1 -0.048 1.1 0.16 0.81
Compare the model to the data
oEf <- exp(as.numeric(subset(odraws1, variable='f')))
data %>%
mutate(oEf = oEf) %>%
ggplot(aes(x=date, y=births_relative100)) +
geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=oEf), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
After we get the model working using optimization, we can compare the result to short MCMC chains, which also give us additional information on the speed of different code implementations for the same model. We intentionally use just 1/10th of the usually recommended length, as rough results are sufficient during the iterative model building. When testing the code we initially used just one chain, but at this point running four chains on a four core CPU doesn’t add much to the wall clock time, and it gives more information on how easy it is to sample from the posterior and can reveal if there are multiple modes. Although the result from short chains can be useful in a quick workflow, the result should not be used as the final result.
fit1 <- model1$sample(data=standata1, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4, seed=3891)
Depending on the random seed and luck, we sometimes observed that some of the chains got stuck in different modes. We could see this in high Rhat and low ESS diagnostic values.
draws1 <- fit1$draws()
summarise_draws(subset(draws1, variable=c('intercept','sigma_f1','lengthscale_f1','sigma')))
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 0.083 0.096 0.15 0.083 -0.20 0.28 1.7 64. 108.
2 sigma_f1 0.39 0.29 0.22 0.18 0.17 0.75 2.2 5.6 24.
3 lengthscale_f1 1.7 1.7 1.5 2.2 0.19 3.2 1.8 6.3 30.
4 sigma 0.91 0.90 0.093 0.14 0.80 1.0 1.9 6.2 55.
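To see why chains stuck in different modes produce Rhat values like those in the table above, here is a simplified split-Rhat computed in base R. It is only a sketch: it omits the rank normalization used by `summarise_draws()`, and the function name `split_rhat` and the simulated chains are assumptions made for the illustration.

```r
# Simplified split-Rhat (no rank normalization) for an
# iterations x chains matrix of draws of a single parameter
split_rhat <- function(draws) {
  n_half <- floor(nrow(draws) / 2)
  # Split every chain into two halves so that within-chain drift is also detected
  halves <- cbind(draws[seq_len(n_half), , drop = FALSE],
                  draws[n_half + seq_len(n_half), , drop = FALSE])
  W <- mean(apply(halves, 2, var))        # within-chain variance
  B <- n_half * var(colMeans(halves))     # between-chain variance
  var_plus <- (n_half - 1) / n_half * W + B / n_half
  sqrt(var_plus / W)
}

set.seed(3891)
# Four hypothetical chains of 100 draws exploring the same mode
mixed <- matrix(rnorm(400), nrow = 100, ncol = 4)
# Same noise, but two of the chains are stuck around a different mode
stuck <- mixed + matrix(rep(c(0, 0, 3, 3), each = 100), nrow = 100)

rhat_mixed <- split_rhat(mixed)
rhat_stuck <- split_rhat(stuck)
print(c(mixed = rhat_mixed, stuck = rhat_stuck))
```

When the chains disagree about the location of the posterior, the between-chain variance dominates and Rhat rises well above 1.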
Examining the trace plots shows the multimodality clearly.
mcmc_trace(draws1, regex_pars=c('intercept','sigma_f1','lengthscale_f1','sigma'))
In this case it was easy to figure out that some of the chains got stuck in qualitatively much worse modes. We don’t in general recommend starting from the mode, as the mode is usually not a representative point in a hierarchical model posterior or in a high dimensional posterior, but we can use it here to speed up the iterative model building as long as we check that the optimization result is sensible and later do more careful inference. Although the result from short chains can be useful in a quick workflow, the result should not be used as the final result.
init1 <- sapply(c('intercept','sigma_f1','lengthscale_f1','beta_f1','sigma'),
function(variable) {as.numeric(subset(odraws1, variable=variable))})
fit1 <- model1$sample(data=standata1, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init1 })
We now observe better Rhat and ESS diagnostic values, although due to the very short chains they are not yet perfect. We are likely to also observe Hamiltonian Monte Carlo divergences and treedepth exceedances in the dynamic building of the Hamiltonian trajectory, but there is no need to worry about those as long as the model results are qualitatively sensible, as these computational issues can also go away when the model itself is improved. In all the following short MCMC samplings we get some or many divergences and usually a very large number of treedepth exceedances. Divergences indicate possible bias and should eventually be investigated carefully. Treedepth exceedances indicate strong posterior dependencies and slow mixing, and sometimes the posterior can be much improved by changing the parameterization or priors, but as the treedepth exceedances don’t indicate bias there is no need for more careful analysis if the resulting ESS and MCSE values are good for the purpose at hand. We’ll come back later to more careful analysis of the final models.
draws1 <- fit1$draws()
summarise_draws(subset(draws1, variable=c('intercept','sigma_f1','lengthscale_f1','sigma')))
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 0.050 0.060 0.24 0.23 -0.37 0.38 1.0 296. 277.
2 sigma_f1 0.58 0.56 0.12 0.11 0.43 0.81 1.0 218. 336.
3 lengthscale_f1 0.23 0.23 0.039 0.037 0.17 0.29 1.0 224. 231.
4 sigma 0.81 0.81 0.0064 0.0068 0.80 0.82 1.0 400. 232.
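Whether ESS values like these are good enough can be judged from the Monte Carlo standard error. As a rough worked example, the common approximation MCSE = sd / sqrt(ESS) for a posterior mean can be applied to the rounded numbers in the table above. The vector names are made up for this sketch, and the `mcse_mean()` function in the posterior package uses its own ESS estimate, so its values may differ slightly.

```r
# Rounded values copied from the summary table above
post_sd  <- c(intercept = 0.24, sigma_f1 = 0.12, lengthscale_f1 = 0.039, sigma = 0.0064)
ess_bulk <- c(intercept = 296,  sigma_f1 = 218,  lengthscale_f1 = 224,   sigma = 400)

# Approximate Monte Carlo standard error of each posterior mean
mcse_mean <- post_sd / sqrt(ess_bulk)
print(signif(mcse_mean, 2))
```

For sigma, for example, this gives 0.0064 / 20 = 0.00032, which is small compared to the posterior standard deviation, so even these short chains are accurate enough for quick model checking.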
Trace plot shows slow mixing but no multimodality.
mcmc_trace(draws1, regex_pars=c('intercept','sigma_f1','lengthscale_f1','sigma'))
The model result from short MCMC chains looks very similar to the optimization result.
draws1 <- as_draws_matrix(draws1)
Ef <- exp(apply(subset(draws1, variable='f'), 2, median))
data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
If we compare the result from short sampling to the result from optimizing, we don’t see a practical difference in the predictions (although we see more differences between optimization and MCMC later).
data %>%
mutate(Ef = Ef,
oEf = oEf) %>%
ggplot(aes(x=Ef, y=oEf)) + geom_point(color=set1[2]) +
geom_abline() +
labs(x="Ef from short Markov chain", y="Ef from optimizing")
After the first version of this notebook, Nikolas Siccha examined the posterior correlations more carefully and noticed a strong correlation between the intercept and the first basis function. Stan’s dynamic HMC is so efficient that the inference is successful anyway. Nikolas suggested removing the intercept term. The intercept term is not necessarily needed as the data has been centered. We test a model without the explicit intercept term.
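One way to see where such a correlation could come from is to compare a constant column with the basis functions themselves. The sketch below assumes that the running day number is standardized to zero mean and unit standard deviation and that the boundary is set with the factor 1.5. Both are assumptions for the illustration, as is the helper name `basis_eq`.

```r
# Basis functions of the Hilbert space approximation on [-L, L]
basis_eq <- function(x, L, M) {
  sin(outer(x + L, seq_len(M) * pi / (2 * L))) / sqrt(L)
}

# Hypothetical standardized running day number (mean 0, sd 1), 7305 days
id <- seq_len(7305)
x <- (id - mean(id)) / sd(id)
L <- 1.5 * max(abs(x))
PHI <- basis_eq(x, L, M = 20)

# Cosine similarity between a constant (intercept) column and each basis function
ones <- rep(1, length(x))
cos_sim <- as.numeric(crossprod(PHI, ones)) / (sqrt(colSums(PHI^2)) * sqrt(length(x)))
print(round(cos_sim[1:4], 3))
```

The first basis function is positive over the whole data range, so it is nearly collinear with a constant, while the second one is antisymmetric around the centre and therefore orthogonal to it. An intercept and the coefficient of the first basis function can thus trade off against each other.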
Compile Stan model gpbf1b.stan
model1b <- cmdstan_model(stan_file = root("Birthdays", "gpbf1b.stan"),
include_paths = root("Birthdays"))
We sample using the default initialization.
fit1b <- model1b$sample(data=standata1, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4, seed=3891)
The sampling performs better, indicating that the strong posterior correlation in the first model was causing trouble for the adaptation during the short warmup, leading some chains to stay stuck.
draws1b <- fit1b$draws()
summarise_draws(subset(draws1b, variable=c('sigma_f1','lengthscale_f1','sigma')))
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.60 0.58 0.14 0.12 0.41 0.87 1.0 221. 215.
2 lengthscale_f1 0.23 0.23 0.042 0.036 0.15 0.29 1.0 198. 252.
3 sigma 0.81 0.81 0.0068 0.0067 0.80 0.82 1.0 401. 313.
The trace plots don’t show multimodality.
mcmc_trace(draws1b, regex_pars=c('sigma_f1','lengthscale_f1','sigma'))
We drop the global intercept from the rest of the models, but continue using (early stopped) optimization to initialize the sampling.
Model 2: Slow trend + yearly seasonal trend
Model 2 adds a yearly seasonal trend using a GP with a periodic covariance function. \[
f = \mbox{intercept} + f_1 + f_2 \\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)\\
f_2 \sim \mbox{GP}(0,K_2)
\] where the first GP uses the exponentiated quadratic covariance function, and the second one a periodic covariance function. Most years have 365 calendar days and every fourth year (during the data range) has 366 days, and thus we simplify and use a period of 365.25 days for the periodic component.
The first version of model 2, with the added periodic component following Riutort-Mayol et al. (2020), turned out to be very slow. With the default MCMC options the inference would have taken hours, but with the short chains it was possible to infer that something had to be wrong. The model output was sensible, but diagnostics indicated very slow mixing. More careful examination of the model revealed that the periodic component included another intercept term. With two intercept terms their sum was well informed by the data, but individually they were not well informed and thus the posteriors were wide, which led to very slow mixing. This bad model is not shown here, but the optimization, short MCMC chains and sampling diagnostic tools were crucial for fast experimentation and solving the problem.
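The periodic component can be sketched with the cosine series expansion of the periodic covariance function used in Riutort-Mayol et al. (2020). The code below is a base R illustration with a hypothetical `lengthscale` and number of terms `J`, not the contents of gpbf2.stan. Note that the series has a constant j = 0 term. A basis that kept this term with its own coefficient would behave like an intercept, but whether that is exactly what happened in the discarded model is an assumption, as that code is not shown.

```r
# Periodic covariance with unit magnitude:
# k(tau) = exp(-2 * sin(omega * tau / 2)^2 / lengthscale^2)
period <- 365.25
omega <- 2 * pi / period
lengthscale <- 1        # hypothetical value for illustration
J <- 10                 # number of cosine/sine pairs kept

tau <- seq(0, 2 * period, length.out = 200)
k_exact <- exp(-2 * sin(omega * tau / 2)^2 / lengthscale^2)

# Series coefficients; besselI(..., expon.scaled = TRUE) returns exp(-z) * I_j(z)
z <- 1 / lengthscale^2
q2_0 <- besselI(z, 0, expon.scaled = TRUE)   # constant (j = 0) term
q2_j <- 2 * sapply(seq_len(J), function(j) besselI(z, j, expon.scaled = TRUE))

# cos(j*omega*(x - x')) = cos(j*omega*x) * cos(j*omega*x') + sin(j*omega*x) * sin(j*omega*x'),
# so each j contributes one cosine and one sine basis function
k_series <- q2_0 + as.numeric(cos(outer(tau, seq_len(J) * omega)) %*% q2_j)

max_abs_err <- max(abs(k_series - k_exact))
print(c(constant_term = q2_0, max_abs_err = max_abs_err))
```

Dropping `q2_0` removes only a constant offset from the covariance, which corresponds to removing a constant from the periodic function.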
Compile Stan model 2 (the fixed version) gpbf2.stan
model2 <- cmdstan_model(stan_file = root("Birthdays", "gpbf2.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata2 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20) # number of basis functions for periodic f2
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt2 <- model2$optimize(data=standata2, init=0, algorithm='bfgs')
Check whether parameters have reasonable values
odraws2 <- opt2$draws()
subset(odraws2, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 5 variables
variable
draw sigma_f1 sigma_f2 lengthscale_f1 lengthscale_f2 sigma
1 1.4 1.3 0.16 0.087 0.75
Compare the model to the data
Ef <- exp(as.numeric(subset(odraws2, variable='f')))
Ef1 <- as.numeric(subset(odraws2, variable='f1'))
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- as.numeric(subset(odraws2, variable='f2'))
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf / (pf1 + pf2)
Sample short chains using the optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init2 <- sapply(c('lengthscale_f1','lengthscale_f2','sigma_f1','sigma_f2','sigma','beta_f1','beta_f2'),
function(variable) {as.numeric(subset(odraws2, variable=variable))})
fit2 <- model2$sample(data=standata2, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init2 })
Check whether parameters have reasonable values
draws2 <- fit2$draws()
summarise_draws(subset(draws2, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.57 0.56 0.12 0.12 0.39 0.82 1.0 231. 382.
2 sigma_f2 0.29 0.29 0.053 0.044 0.23 0.38 1.0 276. 408.
3 lengthscale_f1 0.21 0.21 0.039 0.036 0.14 0.26 1.0 333. 361.
4 lengthscale_f2 0.24 0.24 0.026 0.028 0.20 0.29 1.0 274. 300.
5 sigma 0.75 0.75 0.0059 0.0055 0.74 0.76 1.0 398. 260.
Compare the model to the data
draws2 <- as_draws_matrix(draws2)
Ef <- exp(apply(subset(draws2, variable='f'), 2, median))
Ef1 <- apply(subset(draws2, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws2, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf / (pf1 + pf2)
The seasonal component has a reasonable fit to the data.
Model 3: Slow trend + yearly seasonal trend + day of week
Based on the quick plotting of the data above, day of week has a clear effect and fewer babies are born on Saturday and Sunday. This can be taken into account with simple additive coefficients. We fix the effect of Monday to 0 and have additional coefficients for the other weekdays. \[
f = \mbox{intercept} + f_1 + f_2 + \beta_{\mbox{day of week}} \\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)\\
f_2 \sim \mbox{GP}(0,K_2)\\
\beta_{\mbox{day of week}} = 0 \quad \mbox{if day of week is Monday}\\
\beta_{\mbox{day of week}} \sim \mbox{normal}(0,1) \quad \mbox{if day of week is not Monday}
\]
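The reference-level coding above can be sketched in R as follows. The six `beta_f3` values are the rounded optimization estimates reported for this model further below. The two-week `day_of_week` vector is hypothetical, and the coding 1 = Monday, ..., 7 = Sunday is assumed from the axis labels used in the plots.

```r
# Rounded estimates for Tuesday..Sunday (Monday is the reference level)
beta_f3 <- c(0.36, 0.12, 0.04, 0.17, -1.1, -1.5)

# Prepend the fixed zero effect of Monday to get one value per weekday
f_day_of_week <- c(0, beta_f3)
names(f_day_of_week) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Hypothetical day-of-week index for two consecutive weeks
day_of_week <- rep(1:7, times = 2)

# Additive contribution of the weekday effect to f for each observation
weekday_term <- f_day_of_week[day_of_week]
print(weekday_term)
```

Fixing one level to zero makes the remaining six coefficients identifiable relative to Monday.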
Compile Stan model 3 gpbf3.stan
model3 <- cmdstan_model(stan_file = root("Birthdays", "gpbf3.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata3 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
day_of_week=data$day_of_week)
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt3 <- model3$optimize(data=standata3, init=0, algorithm='bfgs')
Check whether parameters have reasonable values
odraws3 <- opt3$draws()
subset(odraws3, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 5 variables
variable
draw sigma_f1 sigma_f2 lengthscale_f1 lengthscale_f2 sigma
1 1.6 1.3 0.16 0.087 0.33
subset(odraws3, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 0.36 0.12 0.04 0.17 -1.1 -1.5
Compare the model to the data
Ef <- exp(as.numeric(subset(odraws3, variable='f')))
Ef1 <- as.numeric(subset(odraws3, variable='f1'))
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- as.numeric(subset(odraws3, variable='f2'))
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- as.numeric(subset(odraws3, variable='f_day_of_week'))
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
(pf + pf1) / (pf2 + pf3)
Sample short chains using the optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init3 <- sapply(c('lengthscale_f1','lengthscale_f2','sigma_f1','sigma_f2','sigma',
'beta_f1','beta_f2','beta_f3'),
function(variable) {as.numeric(subset(odraws3, variable=variable))})
fit3 <- model3$sample(data=standata3, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init3 })
Check whether parameters have reasonable values
draws3 <- fit3$draws()
summarise_draws(subset(draws3, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.62 0.59 0.14 0.13 0.44 0.88 1.0 306. 333.
2 sigma_f2 0.28 0.28 0.047 0.040 0.22 0.36 1.0 137. 341.
3 lengthscale_f1 0.21 0.21 0.038 0.034 0.14 0.26 1.0 340. 303.
4 lengthscale_f2 0.21 0.21 0.019 0.019 0.18 0.24 1.0 148. 226.
5 sigma 0.33 0.33 0.0027 0.0026 0.33 0.33 1.0 475. 320.
summarise_draws(subset(draws3, variable=c('beta_f3')))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_f3[1] 0.36 0.36 0.014 0.014 0.33 0.38 1.0 381. 246.
2 beta_f3[2] 0.13 0.13 0.015 0.016 0.10 0.15 1.0 434. 339.
3 beta_f3[3] 0.041 0.042 0.014 0.014 0.018 0.065 1.0 380. 285.
4 beta_f3[4] 0.17 0.17 0.015 0.015 0.15 0.20 1.0 369. 334.
5 beta_f3[5] -1.1 -1.1 0.015 0.014 -1.1 -1.1 1.0 387. 340.
6 beta_f3[6] -1.5 -1.5 0.014 0.015 -1.5 -1.5 1.0 433. 339.
Compare the model to the data
draws3 <- as_draws_matrix(draws3)
Ef <- exp(apply(subset(draws3, variable='f'), 2, median))
Ef1 <- apply(subset(draws3, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws3, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws3, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
(pf + pf1) / (pf2 + pf3)
Weekday effects are easy to estimate as there are about a thousand observations per weekday.
Model 4: long term smooth + seasonal + weekday with increasing magnitude
Looking at the time series of the whole data, we see the dots representing the daily values forming three branches that are getting further away from each other. In a previous analysis (BDA3) we also had a model component allowing a gradually changing effect for day of week, and observed that the effect of Saturday and Sunday got stronger over time. The next model includes a time dependent magnitude component for the day of week effect. \[
f = \mbox{intercept} + f_1 + f_2 + \exp(g_3)\beta_{\mbox{day of week}} \\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)\\
f_2 \sim \mbox{GP}(0,K_2)\\
g_3 \sim \mbox{GP}(0,K_3)\\
\beta_{\mbox{day of week}} = 0 \quad \mbox{if day of week is Monday}\\
\beta_{\mbox{day of week}} \sim \mbox{normal}(0,1) \quad \mbox{if day of week is not Monday}
\] The magnitude of the weekday effect is modelled with \(\exp(g_3)\), where \(g_3\) has a GP prior with zero mean and an exponentiated quadratic covariance function.
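A small R sketch shows what the multiplicative term does. Everything here is hypothetical: `g3` is a simple linear stand-in rather than a draw from the GP, the weekday effects are rounded illustrative values, and the day-of-week coding 1 = Monday, ..., 7 = Sunday is assumed.

```r
# Hypothetical inputs: standardized time over 20 years and a smooth g3(t)
N <- 7305
x <- seq(-1.7, 1.7, length.out = N)
g3 <- 0.15 * x                       # stand-in for a smooth function with a GP prior
day_of_week <- rep_len(1:7, N)       # 1 = Monday, ..., 7 = Sunday (assumed coding)

# Weekday effects with Monday fixed to 0 (rounded values, for illustration only)
f_day_of_week <- c(0, 0.36, 0.12, 0.04, 0.17, -1.1, -1.5)

# exp(g3) is always positive, so it rescales the weekday effects
# over time without flipping their sign
weekday_term <- exp(g3) * f_day_of_week[day_of_week]

sundays <- which(day_of_week == 7)
print(range(weekday_term[sundays]))
```

Because the same magnitude multiplies all weekday coefficients, the branches for weekdays, Saturdays and Sundays spread apart or contract together over time.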
Compile Stan model 4 gpbf4.stan
model4 <- cmdstan_model(stan_file = root("Birthdays", "gpbf4.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata4 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
c_g3=1.5, # factor c of basis functions for GP for g3
M_g3=5, # number of basis functions for GP for g3
day_of_week=data$day_of_week)
As we have increased the complexity of the model, the mode starts to be less and less representative of the posterior. We still use the optimization to check that the code returns something reasonable and as initial values for MCMC, but we now stop the optimization early. By adding the tol_obj=10 argument, the optimization stops when the change in the log posterior density is less than 10, which is likely to happen before reaching the mode.
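The effect of such an objective tolerance can be illustrated with a toy example. The sketch below uses plain gradient ascent on a hypothetical one-parameter log density, not Stan's BFGS implementation. All names (`log_post`, `grad_log_post`, `ascend`) and numbers are made up for the illustration.

```r
# Toy log posterior density with its mode at theta = 3 (hypothetical, one parameter)
log_post <- function(theta) -50 * (theta - 3)^2
grad_log_post <- function(theta) -100 * (theta - 3)

# Plain gradient ascent that stops once the objective changes by less than tol_obj
ascend <- function(theta, tol_obj, step = 0.001, max_iter = 10000) {
  lp <- log_post(theta)
  iterations <- 0
  repeat {
    iterations <- iterations + 1
    theta <- theta + step * grad_log_post(theta)
    lp_new <- log_post(theta)
    converged <- abs(lp_new - lp) < tol_obj
    lp <- lp_new
    if (converged || iterations >= max_iter) break
  }
  list(theta = theta, log_post = lp, iterations = iterations)
}

early <- ascend(theta = 0, tol_obj = 10)     # stops early, away from the mode
full  <- ascend(theta = 0, tol_obj = 1e-8)   # runs until very close to the mode
print(rbind(early = unlist(early), full = unlist(full)))
```

With the loose tolerance the iteration stops after far fewer steps and at a lower log density, at a point that is in the right region but not at the mode itself.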
opt4 <- model4$optimize(data=standata4, init=0, algorithm='bfgs', tol_obj=10)
Check whether parameters have reasonable values
odraws4 <- opt4$draws()
subset(odraws4, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 7 variables
variable
draw sigma_f1 sigma_f2 sigma_g3 lengthscale_f1 lengthscale_f2 lengthscale_g3 sigma
1 0.7 0.69 0.4 0.15 0.24 0.75 0.31
subset(odraws4, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 0.39 0.14 0.06 0.2 -1.3 -1.7
Compare the model to the data
Ef <- exp(as.numeric(subset(odraws4, variable='f')))
Ef1 <- as.numeric(subset(odraws4, variable='f1'))
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- as.numeric(subset(odraws4, variable='f2'))
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- as.numeric(subset(odraws4, variable='f_day_of_week'))
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- as.numeric(subset(odraws4, variable='f3'))
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
pf3b <- data %>%
mutate(Ef3 = Ef3) %>%
ggplot(aes(x=date, y=births_relative100/Ef1/Ef2*100*100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
(pf + pf1) / (pf2 + pf3b)
Sample short chains using the early stopped optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init4 <- sapply(c('lengthscale_f1','lengthscale_f2','lengthscale_g3',
'sigma_f1','sigma_f2','sigma_g3','sigma',
'beta_f1','beta_f2','beta_f3','beta_g3'),
function(variable) {as.numeric(subset(odraws4, variable=variable))})
fit4 <- model4$sample(data=standata4, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init4 }, refresh=10)
Check whether parameters have reasonable values
draws4 <- fit4$draws()
summarise_draws(subset(draws4, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 7 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.64 0.62 0.13 0.11 0.47 0.87 1.0 237. 281.
2 sigma_f2 0.29 0.28 0.043 0.038 0.23 0.37 1.0 256. 338.
3 sigma_g3 0.19 0.18 0.047 0.046 0.13 0.28 1.0 614. 282.
4 lengthscale_f1 0.21 0.21 0.034 0.032 0.14 0.25 1.0 356. 316.
5 lengthscale_f2 0.20 0.20 0.018 0.016 0.18 0.24 1.1 56. 230.
6 lengthscale_g3 0.76 0.76 0.19 0.19 0.44 1.0 1.0 553. 383.
7 sigma 0.31 0.31 0.0026 0.0026 0.31 0.31 1.0 270. 252.
summarise_draws(subset(draws4, variable=c('beta_f3')))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_f3[1] 0.36 0.36 0.042 0.040 0.30 0.43 1.0 245. 334.
2 beta_f3[2] 0.13 0.13 0.021 0.020 0.10 0.17 1.0 317. 313.
3 beta_f3[3] 0.052 0.052 0.015 0.015 0.029 0.075 1.0 402. 368.
4 beta_f3[4] 0.18 0.18 0.024 0.022 0.14 0.22 1.0 290. 303.
5 beta_f3[5] -1.2 -1.1 0.13 0.12 -1.4 -0.96 1.0 218. 326.
6 beta_f3[6] -1.6 -1.5 0.17 0.16 -1.8 -1.3 1.0 220. 318.
Compare the model to the data
draws4 <- as_draws_matrix(draws4)
Ef <- exp(apply(subset(draws4, variable='f'), 2, median))
Ef1 <- apply(subset(draws4, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws4, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws4, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- apply(subset(draws4, variable='f3'), 2, median)
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
pf3b <- data %>%
mutate(Ef3 = Ef3) %>%
ggplot(aes(x=date, y=births_relative100/Ef1/Ef2*100*100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
(pf + pf1) / (pf2 + pf3b)
The model fits the different branches visible in the plotted daily relative number of births well; that is, it is able to model the increasing weekend effect.
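The component curves plotted above are centred on the log scale and shifted to the mean of the log data before exponentiating, so that each component is drawn around the typical level of the data. A minimal sketch of that back-transform on made-up numbers (the vectors `f_component` and `y` and the helper `to_data_scale` are hypothetical, not taken from the fitted model):

```r
# Hypothetical log-scale component and data; not values from the fit
f_component <- c(-0.2, 0.0, 0.1, 0.3)
y <- c(90, 100, 105, 110)

# Centre the component, shift it to the mean of log(y), return to the data scale
to_data_scale <- function(f, y) exp(f - mean(f) + mean(log(y)))

E <- to_data_scale(f_component, y)

# The geometric mean of the result equals the geometric mean of the data
c(exp(mean(log(E))), exp(mean(log(y))))
```

Because the shift happens on the log scale, it is the geometric mean (not the arithmetic mean) of the plotted curve that matches the data.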
Model 5: long term smooth + seasonal + weekday with time dependent magnitude + day of year RHS
The next component to add is the day of year effect. Many bank holidays fall on the same day of year every year, and there might also be other special days that are favored or disfavored.
\[
f = \mbox{intercept} + f_1 + f_2 + \exp(g_3)\beta_{\mbox{day of week}} + \beta_{\mbox{day of year}}\\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)\\
f_2 \sim \mbox{GP}(0,K_2)\\
g_3 \sim \mbox{GP}(0,K_3)\\
\beta_{\mbox{day of week}} = 0 \quad \mbox{if day of week is Monday}\\
\beta_{\mbox{day of week}} \sim \mbox{normal}(0,1) \quad \mbox{if day of week is not Monday}\\
\beta_{\mbox{day of year}} \sim \mbox{RHS}(0,0.1)
\] As we assume that only some days of year are special, we use a regularized horseshoe (RHS) prior for the day of year effects.
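To get a feel for what such a prior implies, one can simulate from it. The sketch below draws from one common parameterization of the regularized horseshoe (Piironen and Vehtari, 2017). The slab settings `nu_slab` and `s_slab` and the function name `rhs_prior_draw` are illustrative assumptions, and the code is not taken from gpbf5.stan.

```r
# One prior draw of D coefficients from a regularized horseshoe (illustrative settings)
rhs_prior_draw <- function(D, scale_global = 0.1, nu_slab = 4, s_slab = 1) {
  lambda <- abs(rcauchy(D))                       # local scales, half-Cauchy(0, 1)
  tau <- abs(rcauchy(1, 0, scale_global))         # global scale, half-Cauchy(0, scale_global)
  # slab variance c^2 ~ inverse-gamma(nu_slab / 2, nu_slab * s_slab^2 / 2)
  c2 <- 1 / rgamma(1, shape = nu_slab / 2, rate = nu_slab * s_slab^2 / 2)
  # regularized local scales: large lambda are capped so that tau * lambda_tilde <= sqrt(c2)
  lambda_tilde <- sqrt(c2 * lambda^2 / (c2 + tau^2 * lambda^2))
  rnorm(D, 0, tau * lambda_tilde)
}

set.seed(1)
beta <- rhs_prior_draw(366)                       # one effect per day of year
quantile(abs(beta), c(0.5, 0.9, 0.99))
```

Most simulated effects sit close to zero while a few can be large, which is the behaviour wanted when only some days of year are assumed to be special.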
At this point the optimization no longer produced a reasonable result as it had earlier, and sampling turned out to be very slow. We assumed the optimization failed because there were so many more parameters with a hierarchical prior. As even short chain sampling would have taken more than an hour, it would have been time consuming to test the model further. As part of the quick iterative model building, it was better to give up on this model for the moment.
Compile Stan model 5 gpbf5.stan
model5 <- cmdstan_model(stan_file = root("Birthdays", "gpbf5.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata5 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
c_g3=1.5, # factor c of basis functions for GP for g3
M_g3=5, # number of basis functions for GP for g3
scale_global=0.1, # global scale for RHS prior
day_of_week=data$day_of_week,
day_of_year=data$day_of_year2) # 1st March = 61 every year
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt5 <- model5$optimize(data=standata5, init=0, algorithm='lbfgs',
history=100, tol_obj=10)
Check whether parameters have reasonable values
odraws5 <- opt5$draws()
subset(odraws5, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 7 variables
variable
draw sigma_f1 sigma_f2 sigma_g3 lengthscale_f1 lengthscale_f2 lengthscale_g3 sigma
1 9.8e-17 0.00014 1 0.00036 0.061 1.3 1
subset(odraws5, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 -8.9 -7.4 -7 -7.9 -3.5 -0.49
Ef4 <- as.numeric(subset(odraws5, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
Compare the model to the data
Ef <- exp(as.numeric(subset(odraws5, variable='f')))
Ef1 <- as.numeric(subset(odraws5, variable='f1'))
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- as.numeric(subset(odraws5, variable='f2'))
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- as.numeric(subset(odraws5, variable='f_day_of_week'))
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef4 <- as.numeric(subset(odraws5, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
pf2b <-data.frame(x=as.Date("1959-12-31")+1:366, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
(pf + pf1) / (pf2 + pf3) / (pf2b)
The quick model fit for model 5 is not good, but as the sampling was very slow it wasn’t easy to figure out what was going wrong.
Model 6: long term smooth + seasonal + weekday + day of year
To simplify the analysis of the day of year effect and make the inference during the exploration faster, we drop the time dependent day of week effect and the RHS prior for the moment and use a normal prior for the day of year effect.
\[
f = \mbox{intercept} + f_1 + f_2 + \beta_{\mbox{day of week}} + \beta_{\mbox{day of year}}\\
\mbox{intercept} \sim \mbox{normal}(0,1)\\
f_1 \sim \mbox{GP}(0,K_1)\\
f_2 \sim \mbox{GP}(0,K_2)\\
\beta_{\mbox{day of week}} = 0 \quad \mbox{if day of week is Monday}\\
\beta_{\mbox{day of week}} \sim \mbox{normal}(0,1) \quad \mbox{if day of week is not Monday}\\
\beta_{\mbox{day of year}} \sim \mbox{normal}(0,0.1)
\]
Compile Stan model 6 gpbf6.stan
model6 <- cmdstan_model(stan_file = root("Birthdays", "gpbf6.stan"),
include_paths = root("Birthdays"))
Data to be passed to Stan
standata6 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
day_of_week=data$day_of_week,
day_of_year=data$day_of_year2) # 1st March = 61 every year
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt6 <- model6$optimize(data=standata6, init=0, algorithm='lbfgs',
history=100, tol_obj=10)
Check whether parameters have reasonable values
odraws6 <- opt6$draws()
subset(odraws6, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw sigma_f1 sigma_f2 sigma_f4 lengthscale_f1 lengthscale_f2 sigma
1 0.73 0.48 0.16 0.095 0.45 0.28
subset(odraws6, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 0.32 0.09 0.0085 0.14 -1.1 -1.6
Ef4 <- as.numeric(subset(odraws6, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
We recognize some familiar structure in the day of year effect and proceed to sampling. Sample short chains using the early stopped optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init6 <- sapply(c('lengthscale_f1','lengthscale_f2',
'sigma_f1','sigma_f2','sigma_f4','sigma',
'beta_f1','beta_f2','beta_f3','beta_f4'),
function(variable) {as.numeric(subset(odraws6, variable=variable))})
fit6 <- model6$sample(data=standata6, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init6 })
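The `sapply()` call above yields a named list because the extracted parameters have different lengths; with equal lengths it would simplify to a vector or matrix instead. A small sketch with a hypothetical stand-in for the optimization result (`pars` is made up), showing that `simplify = FALSE` gives a list in either case:

```r
# Hypothetical stand-in for point estimates: scalars and a vector of different length
pars <- list(sigma = 0.28, lengthscale_f1 = 0.1,
             beta_f3 = c(0.3, 0.1, 0.0, 0.1, -1.1, -1.6))

# Different lengths: sapply() cannot simplify and returns a named list
init_a <- sapply(c("sigma", "beta_f3"), function(v) as.numeric(pars[[v]]))

# Equal lengths: sapply() simplifies to a named numeric vector, not a list
init_b <- sapply(c("sigma", "lengthscale_f1"), function(v) as.numeric(pars[[v]]))

# simplify = FALSE keeps a named list in both cases
init_c <- sapply(c("sigma", "lengthscale_f1"), function(v) as.numeric(pars[[v]]),
                 simplify = FALSE)

str(init_a)
str(init_b)
str(init_c)
```

In the code above the list form arises naturally because vectors such as `beta_f3` are mixed with scalars, so this only matters if the set of extracted parameters changes.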
Check whether parameters have reasonable values
draws6 <- fit6$draws()
summarise_draws(subset(draws6, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.63 0.60 0.14 0.11 0.48 0.95 1.2 20. 44.
2 sigma_f2 0.29 0.29 0.056 0.051 0.20 0.39 1.2 30. 50.
3 sigma_f4 0.17 0.17 0.0077 0.0074 0.16 0.19 1.0 393. 374.
4 lengthscale_f1 0.21 0.21 0.034 0.035 0.16 0.27 1.3 11. 22.
5 lengthscale_f2 0.25 0.25 0.029 0.030 0.21 0.30 1.1 21. 47.
6 sigma 0.29 0.29 0.0025 0.0023 0.28 0.29 1.0 345. 258.
summarise_draws(subset(draws6, variable=c('beta_f3')))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_f3[1] 0.35 0.35 0.012 0.013 0.33 0.37 1.0 321. 298.
2 beta_f3[2] 0.13 0.13 0.011 0.012 0.11 0.15 1.0 428. 400.
3 beta_f3[3] 0.046 0.047 0.012 0.012 0.026 0.065 1.0 410. 338.
4 beta_f3[4] 0.18 0.18 0.012 0.013 0.16 0.20 1.0 481. 305.
5 beta_f3[5] -1.1 -1.1 0.012 0.012 -1.1 -1.1 1.0 400. 415.
6 beta_f3[6] -1.5 -1.5 0.012 0.012 -1.5 -1.5 1.0 337. 448.
draws6 <- as_draws_matrix(draws6)
Ef4 <- apply(subset(draws6, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
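The conversion used for `Ef4` can be read as follows, under the assumption (not visible in this section) that the Stan model works with standardized log births, so that `beta_f4` is in units of the standard deviation of the log data. A toy example with made-up numbers:

```r
sd_log_y <- 0.15                 # hypothetical sd of log(births_relative100)
beta_f4_toy <- c(-1.0, 0.0, 0.5) # hypothetical standardized day of year effects

# Back to the log scale, then to a multiplicative effect in percent (100 = no effect)
Ef4_toy <- exp(beta_f4_toy * sd_log_y) * 100
round(Ef4_toy, 1)
```

A standardized effect of zero maps to exactly 100, negative effects fall below 100 and positive effects rise above it.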
Compare the model to the data
draws6 <- as_draws_matrix(draws6)
Ef <- exp(apply(subset(draws6, variable='f'), 2, median))
Ef1 <- apply(subset(draws6, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws6, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws6, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef4 <- apply(subset(draws6, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4[360]-1.5,label="Christmas") +
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3) / pf2b
The short sampling result looks reasonable, and thus the problem is not in adding the day of year effect itself. In the bottom plot, the circles mark the 13th day of each month. The results look similar to our previous analyses, so it seems the day of year effect model component is working as it should, but there was some problem with our RHS implementation. As there is more variation in the day of year effects than we would hope, we did some additional experiments with different priors for the day of year effect (double exponential, Cauchy, and Student’s t with unknown degrees of freedom as models 6b, 6c, and 6d), but decided it’s better to add other components before investigating that part more thoroughly.
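The alternative priors mentioned above differ mainly in how much mass they leave for large effects. A rough base R comparison of tail probabilities, assuming a common scale of 0.1 and 4 degrees of freedom for Student’s t purely for illustration (the actual settings of models 6b, 6c and 6d are not shown here):

```r
prior_scale <- 0.1
threshold <- 0.5   # an effect five scale units away from zero

# P(|beta| > threshold) under each prior with the same scale parameter
tail_prob <- c(
  normal             = 2 * pnorm(-threshold / prior_scale),
  double_exponential = exp(-threshold / prior_scale),   # Laplace: P(|x| > a) = exp(-a / b)
  student_t4         = 2 * pt(-threshold / prior_scale, df = 4),
  cauchy             = 2 * pcauchy(-threshold / prior_scale)
)
signif(tail_prob, 2)
```

The normal prior makes such an effect very unlikely a priori, while the heavier-tailed alternatives leave noticeably more room for a few large special-day effects.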
Model 7: long term smooth + seasonal + weekday + day of year normal + floating special days
We can see in the model 6 results that the day of year effects have some dips in the relative number of births that are spread over a week. From previous analyses we know these correspond to holidays that are not on a specific day of year, but are, for example, on the last Monday of May. We call these floating special days and include Memorial day (last Monday of May), Labor day (first Monday of September; we also include the following Tuesday), and Thanksgiving (fourth Thursday of November; we also include the following Friday).
Compile Stan model 7 gpbf7.stan
model7 <- cmdstan_model(stan_file = root("Birthdays", "gpbf7.stan"),
include_paths = root("Birthdays"))
Floating special days
# Memorial day
memorial_days <- with(data,which(month==5&day_of_week==1&day>=25))
# Labor day
labor_days <- with(data,which(month==9&day_of_week==1&day<=7))
labor_days <- c(labor_days, labor_days+1)
# Thanksgiving
thanksgiving_days <- with(data,which(month==11&day_of_week==4&day>=22&day<=28))
thanksgiving_days <- c(thanksgiving_days, thanksgiving_days+1)
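The index rules above can be cross-checked against calendar dates. A small base R sketch, using a hypothetical helper `floating_days()` that is not part of the original code:

```r
# Dates of the three floating holidays in a given year (hypothetical helper)
floating_days <- function(year) {
  wday <- function(d) as.POSIXlt(d)$wday   # 0 = Sunday, 1 = Monday, ..., 4 = Thursday
  may <- seq(as.Date(sprintf("%d-05-25", year)), by = "day", length.out = 7)  # 25-31 May
  sep <- seq(as.Date(sprintf("%d-09-01", year)), by = "day", length.out = 7)  # 1-7 Sep
  nov <- seq(as.Date(sprintf("%d-11-22", year)), by = "day", length.out = 7)  # 22-28 Nov
  list(memorial     = may[wday(may) == 1],   # last Monday of May
       labor        = sep[wday(sep) == 1],   # first Monday of September
       thanksgiving = nov[wday(nov) == 4])   # fourth Thursday of November
}

floating_days(1988)
```

For 1988 this gives 30 May, 5 September and 24 November. Each seven-day window contains exactly one matching weekday, which is why the `day>=25`, `day<=7` and `day>=22&day<=28` conditions pick a single day per year.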
Data to be passed to Stan
standata7 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
day_of_week=data$day_of_week,
day_of_year=data$day_of_year2, # 1st March = 61 every year
memorial_days=memorial_days,
labor_days=labor_days,
thanksgiving_days=thanksgiving_days)
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt7 <- model7$optimize(data=standata7, init=0, algorithm='lbfgs',
history=100, tol_obj=10)
Check whether parameters have reasonable values
odraws7 <- opt7$draws()
subset(odraws7, variable=c('intercept','sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw sigma_f1 sigma_f2 sigma_f4 lengthscale_f1 lengthscale_f2 sigma
1 0.83 0.44 0.16 0.063 0.43 0.26
subset(odraws7, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 0.28 0.056 0.0099 0.12 -1.2 -1.6
Ef4 <- as.numeric(subset(odraws7, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
Sample short chains using the early stopped optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init7 <- sapply(c('lengthscale_f1','lengthscale_f2',
'sigma_f1','sigma_f2','sigma_f4','sigma',
'beta_f1','beta_f2','beta_f3','beta_f4','beta_f5'),
function(variable) {as.numeric(subset(odraws7, variable=variable))})
fit7 <- model7$sample(data=standata7, iter_warmup=100, iter_sampling=100, chains=4, parallel_chains=4,
init=function() { init7 }, refresh=10)
Check whether parameters have reasonable values
draws7 <- fit7$draws()
summarise_draws(subset(draws7, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.73 0.72 0.14 0.12 0.53 1.0 1.5 7.9 22.
2 sigma_f2 0.28 0.28 0.038 0.028 0.23 0.36 1.2 16. 27.
3 sigma_f4 0.17 0.17 0.0079 0.0079 0.16 0.19 1.0 407. 271.
4 lengthscale_f1 0.22 0.22 0.031 0.025 0.15 0.26 1.6 7.1 12.
5 lengthscale_f2 0.31 0.30 0.039 0.041 0.24 0.38 1.5 7.7 30.
6 sigma 0.26 0.26 0.0021 0.0020 0.26 0.27 1.0 484. 234.
summarise_draws(subset(draws7, variable=c('beta_f3')))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_f3[1] 0.31 0.31 0.012 0.012 0.29 0.32 1.0 510. 329.
2 beta_f3[2] 0.074 0.074 0.011 0.011 0.056 0.092 1.0 475. 333.
3 beta_f3[3] 0.029 0.029 0.011 0.011 0.012 0.047 1.0 399. 262.
4 beta_f3[4] 0.14 0.14 0.011 0.012 0.12 0.16 1.0 461. 326.
5 beta_f3[5] -1.2 -1.2 0.011 0.011 -1.2 -1.1 1.0 462. 358.
6 beta_f3[6] -1.6 -1.6 0.011 0.011 -1.6 -1.6 1.0 471. 411.
Compare the model to the data
draws7 <- as_draws_matrix(draws7)
Ef <- exp(apply(subset(draws7, variable='f'), 2, median))
Ef1 <- apply(subset(draws7, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws7, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws7, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef4 <- apply(subset(draws7, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
Efloats <- apply(subset(draws7, variable='beta_f5'), 2, median)*sd(log(data$births_relative100))
Efloats <- exp(Efloats)*100
floats1988<-c(memorial_days[20], labor_days[c(20,40)], thanksgiving_days[c(20,40)])-6939
Ef4float <- Ef4
Ef4float[floats1988] <- Ef4float[floats1988]*Efloats[c(1,2,2,3,3)]/100
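The constant 6939 above converts row indices of the full data into day of year indices for 1988. Assuming the data start on 1969-01-01 with one row per day (as the 1969-1988 range suggests), it is the number of days before 1988-01-01:

```r
# Days from the assumed first row (1969-01-01) up to, but not including, 1988-01-01
offset_1988 <- as.numeric(as.Date("1988-01-01") - as.Date("1969-01-01"))
offset_1988

# Memorial day 1988 (30 May) as a row index of the full data, then as a day of 1988
row_memorial <- as.numeric(as.Date("1988-05-30") - as.Date("1969-01-01")) + 1
row_memorial - offset_1988
```

The second value is 151, which agrees with the index used for the Memorial day label in the plot code.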
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef), color=set1[1], alpha=0.75) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4float)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4float) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4float[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4float[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4float[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4float[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4float[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4float[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4float[360]-2,label="Christmas") +
annotate("text",x=as.Date("1988-05-30"),y=Ef4float[151]-1.5,label="Memorial day") +
annotate("text",x=as.Date("1988-09-05"),y=Ef4float[249]-1.5,label="Labor day") +
annotate("text",x=as.Date("1988-11-24"),y=Ef4float[329]-1,label="Thanksgiving")+
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3) / (pf2b)
The day of year and floating special day effects are shown for year 1988 (which is also a leap year) and the results seem reasonable.
Model 8: long term smooth + seasonal + weekday with time dependent magnitude + day of year + special
As the day of year and floating day effects work well, we’ll add the time dependent day of week effect back to the model.
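As in the model 5 formula, the time dependent day of week effect multiplies fixed weekday coefficients by \(\exp(g_3)\). A toy illustration with made-up numbers (the coefficients and the linear `g3` are hypothetical; in the model \(g_3\) is a GP):

```r
# Hypothetical weekday coefficients on the log scale, Monday fixed to zero
beta_dow <- c(Mon = 0, Tue = 0.3, Wed = 0.1, Thu = 0.05, Fri = 0.15, Sat = -1.2, Sun = -1.6)

n_days <- 28
day_of_week <- rep(1:7, length.out = n_days)   # 1 = Monday, ..., 7 = Sunday
g3 <- seq(-0.3, 0.3, length.out = n_days)      # stand-in for a smooth GP draw

# Time dependent weekday effect: common magnitude exp(g3) times the weekday coefficient
f3 <- exp(g3) * unname(beta_dow[day_of_week])

# Monday stays at zero while the Sunday dip deepens as g3 grows
f3[day_of_week == 1]
f3[day_of_week == 7]
```

A single function \(g_3\) scales all weekdays together, so the weekday pattern keeps its shape while its magnitude changes over time.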
Compile Stan model 8 gpbf8.stan
model8 <- cmdstan_model(stan_file = root("Birthdays", "gpbf8.stan"),
include_paths = root("Birthdays"))
Floating special days
# Memorial day
memorial_days <- with(data,which(month==5&day_of_week==1&day>=25))
# Labor day
labor_days <- with(data,which(month==9&day_of_week==1&day<=7))
labor_days <- c(labor_days, labor_days+1)
# Thanksgiving
thanksgiving_days <- with(data,which(month==11&day_of_week==4&day>=22&day<=28))
thanksgiving_days <- c(thanksgiving_days, thanksgiving_days+1)
Data to be passed to Stan
standata8 <- list(x=data$id,
y=log(data$births_relative100),
N=length(data$id),
c_f1=1.5, # factor c of basis functions for GP for f1
M_f1=20, # number of basis functions for GP for f1
J_f2=20, # number of basis functions for periodic f2
c_g3=1.5, # factor c of basis functions for GP for g3
M_g3=5, # number of basis functions for GP for g3
day_of_week=data$day_of_week,
day_of_year=data$day_of_year2, # 1st March = 61 every year
memorial_days=memorial_days,
labor_days=labor_days,
thanksgiving_days=thanksgiving_days)
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt8 <- model8$optimize(data=standata8, init=0.1, algorithm='lbfgs',
history=100, tol_obj=10)
Check whether parameters have reasonable values
odraws8 <- opt8$draws()
subset(odraws8, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE)
# A draws_matrix: 1 iterations, 1 chains, and 8 variables
variable
draw sigma_f1 sigma_f2 sigma_g3 sigma_f4 lengthscale_f1 lengthscale_f2 lengthscale_g3 sigma
1 0.51 0.57 0.4 0.18 0.18 0.45 0.83 0.23
subset(odraws8, variable=c('beta_f3'))
# A draws_matrix: 1 iterations, 1 chains, and 6 variables
variable
draw beta_f3[1] beta_f3[2] beta_f3[3] beta_f3[4] beta_f3[5] beta_f3[6]
1 0.36 0.1 0.055 0.18 -1.4 -1.8
Ef4 <- as.numeric(subset(odraws8, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
Compare the model to the data
Ef <- exp(as.numeric(subset(odraws8, variable='f')))
Ef1 <- as.numeric(subset(odraws8, variable='f1'))
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- as.numeric(subset(odraws8, variable='f2'))
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- as.numeric(subset(odraws8, variable='f_day_of_week'))
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- as.numeric(subset(odraws8, variable='f3'))
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
Ef4 <- as.numeric(subset(odraws8, variable='beta_f4'))*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
Efloats <- as.numeric(subset(odraws8, variable='beta_f5'))*sd(log(data$births_relative100))
Efloats <- exp(Efloats)*100
floats1988<-c(memorial_days[20], labor_days[c(20,40)], thanksgiving_days[c(20,40)])-6939
Ef4float <- Ef4
Ef4float[floats1988] <- Ef4float[floats1988]*Efloats[c(1,2,2,3,3)]/100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef), color=set1[1], alpha=0.2) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Day of week", y="Relative number of births of week")
N=length(data$id)
pf3b <- data %>%
mutate(Ef3 = Ef3*Ef1/100) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births") +
annotate("text",x=as.Date("1989-08-01"),y=(Ef3*Ef1/100)[c((N-5):(N-4), N, N-6)],label=c("Mon","Tue","Sat","Sun"))
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4float)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4float) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4float[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4float[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4float[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4float[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4float[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4float[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4float[360]-2,label="Christmas") +
annotate("text",x=as.Date("1988-05-30"),y=Ef4float[151]-2,label="Memorial day") +
annotate("text",x=as.Date("1988-09-05"),y=Ef4float[249]-1.5,label="Labor day") +
annotate("text",x=as.Date("1988-11-24"),y=Ef4float[329]-1,label="Thanksgiving")+
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3b) / (pf2b)
Sample short chains using the early stopped optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init8 <- sapply(c('lengthscale_f1','lengthscale_f2','lengthscale_g3',
'sigma_f1','sigma_f2','sigma_g3','sigma_f4','sigma',
'beta_f1','beta_f2','beta_f3','beta_g3','beta_f4','beta_f5'),
function(variable) {as.numeric(subset(odraws8, variable=variable))})
fit8 <- model8$sample(data=standata8, iter_warmup=100, iter_sampling=100, chains=4, parallel_chains=4,
init=function() { init8 }, refresh=10)
Check whether parameters have reasonable values
draws8 <- fit8$draws()
summarise_draws(subset(draws8, variable=c('sigma_','lengthscale_','sigma'), regex=TRUE))
# A tibble: 8 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.73 0.70 0.13 0.12 0.59 0.97 1.6 7.0 23.
2 sigma_f2 0.30 0.30 0.035 0.040 0.24 0.35 1.3 14. 32.
3 sigma_g3 0.18 0.17 0.043 0.030 0.13 0.28 1.4 9.6 28.
4 sigma_f4 0.17 0.17 0.0077 0.0078 0.16 0.19 1.0 788. 238.
5 lengthscale_f1 0.22 0.23 0.031 0.038 0.17 0.27 1.9 6.2 30.
6 lengthscale_f2 0.31 0.31 0.029 0.029 0.26 0.36 1.3 13. 70.
7 lengthscale_g3 0.62 0.61 0.17 0.22 0.36 0.86 1.2 13. 118.
8 sigma 0.23 0.23 0.0020 0.0021 0.23 0.24 1.0 202. 163.
summarise_draws(subset(draws8, variable=c('beta_f3')))
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_f3[1] 0.31 0.31 0.023 0.025 0.28 0.35 1.7 6.7 20.
2 beta_f3[2] 0.082 0.082 0.011 0.011 0.065 0.10 1.1 21. 62.
3 beta_f3[3] 0.043 0.042 0.011 0.011 0.026 0.061 1.0 145. 368.
4 beta_f3[4] 0.15 0.15 0.014 0.014 0.13 0.17 1.3 11. 46.
5 beta_f3[5] -1.2 -1.2 0.078 0.091 -1.3 -1.1 2.1 5.8 24.
6 beta_f3[6] -1.6 -1.6 0.10 0.12 -1.8 -1.4 2.1 5.7 20.
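Several Rhat values in the summaries above are clearly above 1 and the bulk ESS values are small, which is in line with the caveat that these short chains are not final results. As a reminder of what Rhat measures, here is a simplified split-Rhat in base R. It omits the rank normalization used by `summarise_draws()`, so its values will differ somewhat from the tables; the function name `split_rhat` is an illustrative choice.

```r
# Simplified split-Rhat for one parameter; draws is an iterations x chains matrix
split_rhat <- function(draws) {
  n <- nrow(draws) %/% 2
  # split every chain into a first and a second half
  halves <- cbind(draws[1:n, , drop = FALSE],
                  draws[(nrow(draws) - n + 1):nrow(draws), , drop = FALSE])
  W <- mean(apply(halves, 2, var))   # within-chain variance
  B <- n * var(colMeans(halves))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
mixed <- matrix(rnorm(400), nrow = 100, ncol = 4)   # four chains exploring the same target
stuck <- mixed + rep(c(0, 0, 3, 3), each = 100)     # two chains shifted away from the others

split_rhat(mixed)
split_rhat(stuck)
```

Values near 1 indicate that the chains agree with each other; clearly larger values, as for the shifted chains here, indicate that they have not yet mixed.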
Compare the model to the data
draws8 <- as_draws_matrix(draws8)
Ef <- exp(apply(subset(draws8, variable='f'), 2, median))
Ef1 <- apply(subset(draws8, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws8, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws8, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- apply(subset(draws8, variable='f3'), 2, median)
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
Ef4 <- apply(subset(draws8, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
Efloats <- apply(subset(draws8, variable='beta_f5'), 2, median)*sd(log(data$births_relative100))
Efloats <- exp(Efloats)*100
floats1988<-c(memorial_days[20], labor_days[c(20,40)], thanksgiving_days[c(20,40)])-6939
Ef4float <- Ef4
Ef4float[floats1988] <- Ef4float[floats1988]*Efloats[c(1,2,2,3,3)]/100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef), color=set1[1], alpha=0.2) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of week")
N=length(data$id)
pf3b <- data %>%
mutate(Ef3 = Ef3*Ef1/100) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births") +
annotate("text",x=as.Date("1989-08-01"),y=(Ef3*Ef1/100)[c((N-5):(N-4), N, N-6)],label=c("Mon","Tue","Sat","Sun"))
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4float)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4float) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4float[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4float[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4float[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4float[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4float[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4float[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4float[360]-2,label="Christmas") +
annotate("text",x=as.Date("1988-05-30"),y=Ef4float[151]-2,label="Memorial day") +
annotate("text",x=as.Date("1988-09-05"),y=Ef4float[249]-1.5,label="Labor day") +
annotate("text",x=as.Date("1988-11-24"),y=Ef4float[329]-1,label="Thanksgiving")+
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3b) / (pf2b)
The inference for the model works fine, which hints that our RHS implementation for model 5 was wrong or had a very difficult posterior. Before testing RHS again, we’ll use an easier to implement Student’s \(t\) prior to test whether a long tailed prior for the day of year effect is reasonable. These experiments also help to find out whether the day of year effect is sensitive to the prior choice.
Model 8+t_nu: day of year effect with Student’s t prior
Compile Stan model 8 + t_nu gpbf8tnu.stan
model8tnu <- cmdstan_model(stan_file = root("Birthdays", "gpbf8tnu.stan"),
include_paths = root("Birthdays"))
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt8tnu <- model8tnu$optimize(data=standata8, init=0.1, algorithm='lbfgs',
history=100, tol_obj=10)
odraws8tnu <- opt8tnu$draws()
Sample short chains using the early stopped optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init8tnu <- sapply(c('lengthscale_f1','lengthscale_f2','lengthscale_g3',
'sigma_f1','sigma_f2','sigma_g3','sigma_f4','nu_f4','sigma',
'beta_f1','beta_f2','beta_f3','beta_g3','beta_f4','beta_f5'),
function(variable) {as.numeric(subset(odraws8tnu, variable=variable))})
fit8tnu <- model8tnu$sample(data=standata8, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init8tnu }, refresh=10)
Check whether parameters have reasonable values
draws8tnu <- fit8tnu$draws()
summarise_draws(subset(draws8tnu, variable=c('intercept','sigma_','lengthscale_','sigma','nu_'), regex=TRUE))
# A tibble: 9 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.71 0.70 0.16 0.18 0.51 1.0 1.2 19. 83.
2 sigma_f2 0.29 0.28 0.046 0.039 0.23 0.40 1.2 27. 15.
3 sigma_g3 0.21 0.21 0.050 0.043 0.14 0.31 1.1 70. 122.
4 sigma_f4 0.0039 0.0034 0.0016 0.0017 0.0019 0.0069 2.1 5.7 40.
5 lengthscale_f1 0.22 0.22 0.034 0.026 0.15 0.26 1.1 22. 75.
6 lengthscale_f2 0.21 0.20 0.014 0.013 0.18 0.23 1.3 10. 56.
7 lengthscale_g3 0.75 0.75 0.21 0.22 0.40 1.1 1.1 25. 98.
8 sigma 0.23 0.23 0.0021 0.0021 0.23 0.24 1.0 315. 171.
9 nu_f4 0.74 0.72 0.11 0.11 0.58 0.92 1.3 11. 78.
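To put the nu_f4 row of the summary above in context, the following minimal sketch compares upper quantiles of a normal, a Cauchy (Student's t with 1 degree of freedom), and a Student's t with degrees of freedom equal to the posterior mean 0.74. It uses only base R. The value 0.74 is copied from the table, and the unit scale is an assumption made for illustration; the fitted model uses its own scale sigma_f4.

```r
# Upper 97.5% quantiles of unit-scale distributions (illustration only;
# the unit scale is an assumption, the fitted model has scale sigma_f4).
nu_post  <- 0.74                      # posterior mean of nu_f4 from the table above
q_normal <- qnorm(0.975)              # normal distribution
q_cauchy <- qt(0.975, df = 1)         # Cauchy = Student's t with 1 degree of freedom
q_t_post <- qt(0.975, df = nu_post)   # Student's t with less than 1 degree of freedom
tails <- c(normal = q_normal, cauchy = q_cauchy, t_nu = q_t_post)
print(round(tails, 1))
```

The smaller the degrees of freedom, the further out the quantile. Such a prior lets most day of year effects stay near zero while still allowing a few special days to have very large effects.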
Posterior of degrees of freedom nu_f4 is concentrated close to 1 (posterior mean 0.74, 90% interval from 0.58 to 0.92), and thus the distribution has tails at least as thick as Cauchy. This is strong evidence that the distribution of day of year effects is far from normal.
Compare the model to the data
draws8 <- as_draws_matrix(draws8tnu)
Ef <- exp(apply(subset(draws8, variable='f'), 2, median))
Ef1 <- apply(subset(draws8, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws8, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws8, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- apply(subset(draws8, variable='f3'), 2, median)
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
Ef4 <- apply(subset(draws8, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
Efloats <- apply(subset(draws8, variable='beta_f5'), 2, median)*sd(log(data$births_relative100))
Efloats <- exp(Efloats)*100
floats1988<-c(memorial_days[20], labor_days[c(20,40)], thanksgiving_days[c(20,40)])-6939
Ef4float <- Ef4
Ef4float[floats1988] <- Ef4float[floats1988]*Efloats[c(1,2,2,3,3)]/100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef), color=set1[1], alpha=0.2) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of week")
N=length(data$id)
pf3b <- data %>%
mutate(Ef3 = Ef3*Ef1/100) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births") +
annotate("text",x=as.Date("1989-08-01"),y=(Ef3*Ef1/100)[c((N-5):(N-4), N, N-6)],label=c("Mon","Tue","Sat","Sun"))
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4float)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4float) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4float[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4float[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4float[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4float[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4float[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4float[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4float[360]-2,label="Christmas") +
annotate("text",x=as.Date("1988-05-30"),y=Ef4float[151]-2,label="Memorial day") +
annotate("text",x=as.Date("1988-09-05"),y=Ef4float[249]-1.5,label="Labor day") +
annotate("text",x=as.Date("1988-11-24"),y=Ef4float[329]-1,label="Thanksgiving")+
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3b) / (pf2b)
The other effects seem to be quite similar to those of the previous model, but the day of year effects are clearly different, with most days having a non-detectable effect. Some effects that seemed quite clear in the normal prior model, such as the 13th day of month effect, are not visible anymore. As the posterior of degrees of freedom nu_f4 was concentrated close to 1, it’s likely that the normal prior for the day of year effect can’t be the best. So far we hadn’t used model comparison such as leave-one-out cross-validation (LOO-CV), as each added component had a qualitatively big and reasonable effect. Now that the day of year effect is sensitive to the prior choice, but it’s not clear how much better the \(t_\nu\) prior distribution is, we use LOO-CV to compare the models.
loo8 <- fit8$loo()
loo8tnu <- fit8tnu$loo()
loo_compare(list(`Model 8 normal`=loo8,`Model 8 Student\'s t`=loo8tnu))
elpd_diff se_diff
Model 8 Student's t 0.0 0.0
Model 8 normal -116.4 16.4
As we could have expected based on the posterior of nu_f4, Student’s \(t\) prior on day of year effects is better. As the low degrees of freedom indicate that a thick tailed distribution for the day of year effect is needed, we decided to test the RHS prior again.
Model 8+RHS: day of year effect with RHS prior
Model 5 had an RHS prior, but the problem was that the optimization result wasn’t even close to sensible and MCMC was very slow. Given the other models, we now know that the problem is not in adding the day of year effect or in combining it with the time dependent magnitude for the day of week effect. It was now easier to focus on figuring out the problem in RHS. Since RHS is presented as a scale mixture of normals involving a hierarchical prior, it is common to use the non-centered parameterization for the RHS prior. The non-centered parameterization is useful when the information from the likelihood is weak and the prior dependency dominates the posterior dependency. RHS is often used when there are fewer observations than unknowns. In this problem each unknown (one day of year effect) is informed by several observations from different years, and then the centered parameterization might be better. This turned out to be true: the inference for model 8 with a centered parameterization RHS prior on the day of year effect worked much better than for model 5. (In Stan it was easy to test the switch from the non-centered to the centered parameterization by removing the multiplier from one of the parameter declarations.)
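As a sketch of the two parameterizations discussed above, the regularized horseshoe for the day of year effects \(\beta_j\) can be written in its usual generic form as follows. Here \(\lambda_j\) and \(\tau\) are assumed to correspond to lambda_f4 and tau_f4, \(\tau_0\) to scale_global, and \(c\) is the slab scale. The exact hyperpriors in gpbf8rhs.stan are not shown in this case study text and may differ from this generic form. \[
\beta_j \mid \lambda_j, \tau, c \sim \mbox{normal}(0, \tau \tilde{\lambda}_j), \quad
\tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}\\
\lambda_j \sim \mbox{half-Cauchy}(0, 1), \quad
\tau \sim \mbox{half-Cauchy}(0, \tau_0)
\] In the centered parameterization the sampler works directly with \(\beta_j\). In the non-centered parameterization it works with a standardized variable \(z_j\) and the effect is obtained by scaling: \[
z_j \sim \mbox{normal}(0, 1), \quad \beta_j = \tau \tilde{\lambda}_j z_j
\] Both define the same prior for \(\beta_j\); only the posterior geometry seen by the sampler differs.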
Compile Stan model 8 + RHS gpbf8rhs.stan
model8rhs <- cmdstan_model(stan_file = root("Birthdays", "gpbf8rhs.stan"),
include_paths = root("Birthdays"))
Add a global scale for RHS prior
standata8 <- c(standata8,
scale_global=0.1) # global scale for RHS prior
Optimizing is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
opt8rhs <- model8rhs$optimize(data=standata8, init=0.1, algorithm='lbfgs',
history=100, tol_obj=10)
odraws8rhs <- opt8rhs$draws()
Sample short chains using the optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
init8rhs <- sapply(c('lengthscale_f1','lengthscale_f2','lengthscale_g3',
'sigma_f1','sigma_f2','sigma_g3','sigma_f4','sigma',
'beta_f1','beta_f2','beta_f3','beta_g3','beta_f4','beta_f5',
'tau_f4','lambda_f4','caux_f4'),
function(variable) {as.numeric(subset(odraws8rhs, variable=variable))})
fit8rhs <- model8rhs$sample(data=standata8, iter_warmup=100, iter_sampling=100,
chains=4, parallel_chains=4,
init=function() { init8rhs }, refresh=10)
Check whether parameters have reasonable values
draws8rhs <- fit8rhs$draws()
summarise_draws(subset(draws8rhs, variable=c('sigma_','lengthscale_','sigma','nu_'), regex=TRUE))
# A tibble: 8 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sigma_f1 0.71 0.72 0.11 0.13 0.52 0.88 1.6 7.1 50.
2 sigma_f2 0.26 0.27 0.038 0.044 0.21 0.33 1.7 7.0 22.
3 sigma_g3 0.19 0.19 0.043 0.041 0.12 0.26 1.3 11. 35.
4 sigma_f4 0.084 0.062 0.060 0.047 0.023 0.21 1.7 6.6 31.
5 lengthscale_f1 0.22 0.22 0.031 0.031 0.15 0.26 1.8 6.3 18.
6 lengthscale_f2 0.24 0.23 0.031 0.037 0.20 0.29 2.2 5.6 22.
7 lengthscale_g3 0.66 0.65 0.16 0.16 0.42 0.93 1.3 11. 76.
8 sigma 0.23 0.23 0.0019 0.0022 0.23 0.24 1.0 265. 200.
Compare the model to the data
draws8 <- as_draws_matrix(draws8rhs)
Ef <- exp(apply(subset(draws8, variable='f'), 2, median))
Ef1 <- apply(subset(draws8, variable='f1'), 2, median)
Ef1 <- exp(Ef1 - mean(Ef1) + mean(log(data$births_relative100)))
Ef2 <- apply(subset(draws8, variable='f2'), 2, median)
Ef2 <- exp(Ef2 - mean(Ef2) + mean(log(data$births_relative100)))
Ef_day_of_week <- apply(subset(draws8, variable='f_day_of_week'), 2, median)
Ef_day_of_week <- exp(Ef_day_of_week - mean(Ef_day_of_week) + mean(log(data$births_relative100)))
Ef3 <- apply(subset(draws8, variable='f3'), 2, median)
Ef3 <- exp(Ef3 - mean(Ef3) + mean(log(data$births_relative100)))
Ef4 <- apply(subset(draws8, variable='beta_f4'), 2, median)*sd(log(data$births_relative100))
Ef4 <- exp(Ef4)*100
Efloats <- apply(subset(draws8, variable='beta_f5'), 2, median)*sd(log(data$births_relative100))
Efloats <- exp(Efloats)*100
floats1988<-c(memorial_days[20], labor_days[c(20,40)], thanksgiving_days[c(20,40)])-6939
Ef4float <- Ef4
Ef4float[floats1988] <- Ef4float[floats1988]*Efloats[c(1,2,2,3,3)]/100
pf <- data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef), color=set1[1], alpha=0.2) +
labs(x="Date", y="Relative number of births")
pf1 <- data %>%
mutate(Ef1 = Ef1) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_line(aes(y=Ef1), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births")
pf2 <- data %>%
mutate(Ef2 = Ef2) %>%
group_by(day_of_year2) %>%
summarise(meanbirths=mean(births_relative100), meanEf2=mean(Ef2)) %>%
ggplot(aes(x=as.Date("1987-12-31")+day_of_year2, y=meanbirths)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(aes(y=meanEf2), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year")
pf3 <- ggplot(data=data, aes(x=day_of_week, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
scale_x_continuous(breaks = 1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
geom_line(data=data.frame(x=1:7,y=Ef_day_of_week), aes(x=x, y=Ef_day_of_week), color=set1[1]) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of week")
N=length(data$id)
pf3b <- data %>%
mutate(Ef3 = Ef3*Ef1/100) %>%
ggplot(aes(x=date, y=births_relative100)) + geom_point(color=set1[2], alpha=0.2) +
geom_point(aes(y=Ef3), color=set1[1], size=0.1) +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births") +
annotate("text",x=as.Date("1989-08-01"),y=(Ef3*Ef1/100)[c((N-5):(N-4), N, N-6)],label=c("Mon","Tue","Sat","Sun"))
f13 <- data %>% filter(year==1988)%>%select(day,date)%>%mutate(y=Ef4float)%>%filter(day==13)
pf2b <-data.frame(x=as.Date("1988-01-01")+0:365, y=Ef4float) %>%
ggplot(aes(x=x,y=y)) + geom_line(color=set1[1]) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_hline(yintercept=100, color='gray') +
labs(x="Date", y="Relative number of births of year") +
annotate("text",x=as.Date("1988-01-01"),y=Ef4float[1]-1,label="New year") +
annotate("text",x=as.Date("1988-02-14"),y=Ef4float[45]+1.5,label="Valentine's day") +
annotate("text",x=as.Date("1988-02-29"),y=Ef4float[60]-2.5,label="Leap day") +
annotate("text",x=as.Date("1988-04-01"),y=Ef4float[92]-1.5,label="April 1st") +
annotate("text",x=as.Date("1988-07-04"),y=Ef4float[186]-1.5,label="Independence day") +
annotate("text",x=as.Date("1988-10-31"),y=Ef4float[305]-1.5,label="Halloween") +
annotate("text",x=as.Date("1988-12-24"),y=Ef4float[360]-2,label="Christmas") +
annotate("text",x=as.Date("1988-05-30"),y=Ef4float[151]-2,label="Memorial day") +
annotate("text",x=as.Date("1988-09-05"),y=Ef4float[249]-1.5,label="Labor day") +
annotate("text",x=as.Date("1988-11-24"),y=Ef4float[329]-1,label="Thanksgiving")+
geom_point(data=f13,aes(x=date,y=y), size=3, shape=1)
(pf + pf1) / (pf2 + pf3b) / (pf2b)
Visually we get quite a similar result as with the \(t_\nu\) prior. When we compare the models with LOO-CV, there is not much difference between these priors.
loo8rhs<-fit8rhs$loo()
loo_compare(list(`Model 8 Students t`=loo8tnu,`Model 8 RHS`=loo8rhs))
elpd_diff se_diff
Model 8 Students t 0.0 0.0
Model 8 RHS -8.5 5.1
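One hedged way to read the two loo_compare outputs above is to express each elpd difference in units of its standard error. The sketch below only copies the printed numbers into a data frame; the column name ratio is introduced here for illustration, and the ratio is a rough rule-of-thumb summary rather than a formal test.

```r
# elpd differences and standard errors copied from the loo_compare outputs above
cmp <- data.frame(
  comparison = c("normal vs Student's t", "RHS vs Student's t"),
  elpd_diff  = c(-116.4, -8.5),
  se_diff    = c(16.4, 5.1)
)
# Difference measured in units of its standard error (illustrative summary)
cmp$ratio <- cmp$elpd_diff / cmp$se_diff
print(cmp, digits = 2)
```

The normal prior is worse than the Student's t prior by about seven standard errors. The RHS prior is within two standard errors of the Student's t prior, which agrees with the conclusion that there is not much difference between these two priors.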
Further improvements for the day of year effect
It’s unlikely that the day of year effect would be unstructured with some distribution like RHS, and thus instead of trying to find a prior distribution that would improve LOO-CV, it would make more sense to add further structural information. For example, it would be possible to add more known special days and to take into account that a special day effect and a weekend effect are probably not additive. Furthermore, if there are fewer births on some day, the births need to happen on some other day, and it can be assumed that there would be a corresponding excess of births before or after a bank holiday. This ringing around days with fewer births is not simple, as it is also affected by whether the previous and following days are weekend days. This all gets more complicated than we want to include in this case study, but the reader can see how similar gradual model building could be done by adding additional components. Eventually there is likely to be a worry of overfitting, but integration over the unknowns alleviates that, and looking at predictive performance estimates such as LOO-CV can help to decide when the additional model components don’t improve the predictive performance or can’t be well identified.
Residual analysis
We can get further ideas for how to improve the model also by looking at the residuals.
draws8 <- as_draws_matrix(draws8tnu)
Ef <- exp(apply(subset(draws8, variable='f'), 2, median))
data %>%
mutate(Ef = Ef) %>%
ggplot(aes(x=date, y=log(births_relative100/Ef))) + geom_point(color=set1[2]) +
geom_hline(yintercept=0, color='gray') +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
theme(panel.grid.major.x=element_line(color='gray',size=1))
We can see some structure: in years 1969–1978 the residual has a negative peak in the middle of the year, while in years 1981–1988 the residual has a positive peak in the middle of the year. This kind of pattern appears because we use the same seasonal effect for all years, while the magnitude of the seasonal effect is changing. It would be possible to modify the model to include a gradually changing seasonal effect, but we leave it out of this case study.
The best model so far explains already 94% of the variance (LOO-R2).
draws8 <- as_draws_matrix(draws8tnu)
f <- exp(subset(draws8, variable='f'))
loo8tnu <- fit8tnu$loo(save_psis=TRUE)
Efloo <- E_loo(f, psis_object=loo8tnu$psis_object)$value
LOOR2 <- 1-var(log(data$births_relative100/Efloo))/var(log(data$births_relative100))
print(LOOR2, digits=2)
[1] 0.94
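Written out, the quantity computed by the code above is \[
R^2_{\mbox{LOO}} = 1 - \frac{\mbox{Var}\left(\log y_i - \log \hat{y}_{\mbox{loo},i}\right)}{\mbox{Var}\left(\log y_i\right)},
\] where \(y_i\) is the relative number of births on day \(i\) (births_relative100) and \(\hat{y}_{\mbox{loo},i}\) is the leave-one-out predictive mean of \(\exp(f_i)\) (Efloo in the code). The symbols \(y_i\) and \(\hat{y}_{\mbox{loo},i}\) are notation introduced here for illustration. The variances are computed on the log scale, the same scale on which the model is fitted.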
Although it seems we could still improve the model by adding more structure and a time varying seasonal effect, the variability in the number of births from day to day is already quite well predictable. Of course a big part of the variation is due to planned induced births and c-sections, and thus hospitals already control the number of births per day and there is no really practical use for the result. However, there are plenty of similar time series, for example in consumer behavior, that are affected by special days.
More accurate inference
During all the iterative model building we favored optimization and short MCMC chains. In the end we also ran with a higher adapt_delta to reduce the probability of divergences, a higher maximum treedepth to ensure a higher effective sample size per iteration (ESS per second doesn’t necessarily improve), and much longer chains, but didn’t see practical differences in plots or LOO-CV values. As running these longer chains can take hours, they are not run as part of this notebook. An example of how to reduce the probability of divergences and increase the maximum treedepth is shown below. There is rarely a need to increase adapt_delta above 0.95, and if there are still divergences with adapt_delta equal to 0.99, the posterior has serious problems and it should be considered whether re-parameterization, better data or more informative priors could help.
## fit8tnu <- model8tnu$sample(data=standata8, chains=4, parallel_chains=4,
## adapt_delta=0.95, max_treedepth=15)
