library(rprojroot)
root <- has_file(".Bayesian-Workflow-root")$make_fix_file()
library(scales)
options(digits = 2)
library(cmdstanr)
library(posterior)
options(mc.cores = 4)
set.seed(1123)
print_stan_file <- function(file) {
code <- readLines(file)
if (isTRUE(getOption("knitr.in.progress")) &&
identical(knitr::opts_current$get("results"), "asis")) {
# In render: emit as-is so Pandoc/Quarto does syntax highlighting
block <- paste0("```stan", "\n", paste(code, collapse = "\n"), "\n", "```")
knitr::asis_output(block)
} else {
writeLines(code)
}
}
Using a fitted model for decision analysis: Mixture model for time series competition
This notebook includes the code for the Bayesian Workflow book Chapter 20 Using a fitted model for decision analysis: Mixture model for time series competition.
1 Introduction
A few years ago someone sent us an email about a “Global Climate Challenge” that he had seen online, and which was introduced as follows:
It has often been claimed that alarm about global warming is supported by observational evidence. I have argued that there is no observational evidence for global-warming alarm: rather, all claims of such evidence rely on invalid statistical analyses.
Some people, though, have asserted that the statistical analyses are valid. Those people assert, in particular, that they can determine, via statistical analysis, whether global temperatures have been increasing more than would be reasonably expected by random natural variation. Those people do not present any counter to my argument, but they make their assertions anyway.
In response to that, I am sponsoring a contest: the prize is $100,000. In essence, the prize will be awarded to anyone who can demonstrate, via statistical analysis, that the increase in global temperatures is probably not due to random natural variation.
How to win the money?
The file Series1000.txt contains 1000 time series. Each series has length 135: the same as that of the most commonly studied series of global temperatures (which span 1880–2014). The 1000 series were generated as follows. First, 1000 random series were obtained (via a trendless statistical model fit for global temperatures). Then, some randomly-selected series had a trend added to them. Some trends were positive; the others were negative. Each individual trend was 1°C/century (in magnitude)—which is greater than the trend claimed for global temperatures.
A prize of $100,000 (one hundred thousand U.S. dollars) will be awarded to the first person who submits an entry that correctly identifies at least 900 series: which series were generated by a trendless process and which were generated by a trending process.
But also this:
Each entry must be accompanied by a payment of $10.
2 Data
OK, now it’s time to get to work. We start by downloading and graphing the data.
series <- matrix(
scan(root("timeseries", "data", "Series1000.txt")),
nrow = 1000,
ncol = 135,
byrow = TRUE
)
T <- 135
N <- 1000
par(mar = c(3, 3, 2, 0), tck = -0.01, mgp = c(1.5, 0.5, 0))
plot(c(1, T), range(series),
xaxs = "i", bty = "l", type = "n",
xlab = "Time", ylab = "y")
for (n in 1:N) {
lines(1:T, series[n, ], lwd = .5, col = alpha("black", 0.2))
}
3 Separate linear regression
Aha! The lines are fanning out from a common starting point. We’ll fit a regression to each line and then summarize each line by its average slope. This is not necessarily the most efficient way to estimate a slope from correlated data, but we will not be concerned about that, as our purpose here is to demonstrate decision analysis in the context of a fitted model, rather than to fit the model in an optimal way.
slope <- rep(NA, N)
se <- rep(NA, N)
for (n in 1:N) {
data <- series[n, ]
time <- 1:T
fit <- lm(data ~ time)
coefs <- summary(fit)$coefficients
slope[n] <- 100 * coefs[2, "Estimate"]
se[n] <- 100 * coefs[2, "Std. Error"]
}
We multiplied the slopes (and standard errors) by 100 to put them on a per-century scale to match the above instructions.
Next we plot the estimated slopes and their standard errors:
par(mar = c(3, 3, 2, 0), tck = -.01, mgp = c(1.5, .5, 0))
plot(slope, se,
ylim = c(0, 1.05 * max(se)), yaxs = "i", bty = "l",
xlab = "Estimated slope", ylab = "SE", pch = 20, cex = .5)
OK, not much information in the standard errors. How about a histogram of the estimated slopes?
par(mar = c(3, 3, 2, 0), tck = -.01, mgp = c(1.5, .5, 0))
hist(
slope,
xlab = "Estimated slope",
breaks = seq(floor(10 * min(slope)), ceiling(10 * max(slope))) / 10,
main = ""
)
Based on the problem description, I’d expect to see distributions centered at 0, -1, and 1. It looks like this might be the case.
4 Stan mixture model
So let’s fit a mixture model. That’s easy. Here’s the Stan model:
print_stan_file(root("timeseries", "mixture.stan"))
data {
int K;
int N;
array[N] real y;
array[K] real mu;
}
parameters {
simplex[K] theta;
real<lower=0> sigma;
}
model {
array[K] real ps;
sigma ~ exponential(0.1);
for (n in 1:N) {
for (k in 1:K) {
ps[k] = log(theta[k]) + normal_lpdf(y[n] | mu[k], sigma);
}
target += log_sum_exp(ps);
}
}
We sample from the posterior:
y <- slope
K <- 3
mu <- c(0, -1, 1)
data <- list(y = y, K = K, N = N, mu = mu)
mod <- cmdstan_model(root("timeseries", "mixture.stan"))
fit_mix <- mod$sample(data = data, refresh = 0)
Posterior summary:
print(fit_mix)
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -1175.22 -1174.89 1.28 1.03 -1177.78 -1173.84 1.00 2182 2994
theta[1] 0.54 0.54 0.02 0.02 0.50 0.58 1.00 3127 2961
theta[2] 0.24 0.24 0.02 0.02 0.21 0.27 1.00 3051 2990
theta[3] 0.22 0.22 0.02 0.02 0.20 0.25 1.00 3683 3051
sigma 0.40 0.40 0.02 0.02 0.38 0.44 1.00 2693 2485
Convergence is fine: $\widehat{R}$ is close to 1 for everything. The estimated weights of the three mixture components are approximately 0.5, 0.25, 0.25. Given that the problem was made up, I’m guessing the weights of the underlying data-generating process are exactly 1/2, 1/4, and 1/4. The standard deviation of the slopes within each component is estimated at 0.4, or close to it. We could also try fitting a model where the standard deviations of the three components differ, but we won’t, partly because the description given with the simulated data described the change as adding a trend, and partly because the above histogram doesn’t seem to show any variation in the widths of the mixture components.
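As a quick plausibility check, we can simulate estimated slopes from this guessed generative process. All the numbers below (weights 1/2, 1/4, 1/4; component means 0, -1, +1; within-component sd 0.4) are the eyeballed values above, not known truths:

```r
# Sketch of the guessed data-generating process for the estimated
# slopes: a three-component normal mixture. The weights, means, and
# sd below are the eyeballed values from the text, not known truths.
set.seed(1)
n_sim <- 1e5
weights <- c(0.5, 0.25, 0.25)
mu_guess <- c(0, -1, 1)
component <- sample(1:3, n_sim, replace = TRUE, prob = weights)
slope_sim <- rnorm(n_sim, mean = mu_guess[component], sd = 0.4)
# The component proportions recover the mixture weights up to
# Monte Carlo error
round(prop.table(table(component)), 2)
```

A histogram of `slope_sim` shows the same three overlapping bumps as the histogram of estimated slopes above.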
OK, now we’re getting somewhere. To make predictions, we need to know, for each series, the probability of it being in each of the three components. We’ll compute these probabilities by adding a generated quantities block to the Stan program:
generated quantities {
matrix[N, K] p;
for (n in 1:N) {
vector[K] p_raw;
for (k in 1:K) {
p_raw[k] = theta[k] * exp(normal_lpdf(y[n] | mu[k], sigma));
}
for (k in 1:K) {
p[n,k] = p_raw[k] / sum(p_raw);
}
}
}
We then re-fit the model, extract the p’s, and average them over the posterior simulations. This is made simpler using the posterior R package’s draws_rvars format.
mod <- cmdstan_model(root("timeseries", "mixture_2.stan"))
fit_mix <- mod$sample(data = data, refresh = 0)
prob_sims <- as_draws_rvars(fit_mix$draws())
prob <- mean(prob_sims$p)
We now have a $1000 \times 3$ matrix of probabilities. Let’s take a look at the first ten rows:
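For intuition, the normalization in the generated quantities block can be replicated in plain R for a single posterior draw and a single series. The values of `theta` and `sigma` below are illustrative (close to the posterior means above), and `y_n` is a hypothetical estimated slope:

```r
# Membership probabilities for one series under one draw of the
# parameters. theta and sigma are illustrative values; y_n is a
# hypothetical estimated slope.
mu <- c(0, -1, 1)
theta <- c(0.54, 0.24, 0.22)
sigma <- 0.40
y_n <- 0.9
p_raw <- theta * dnorm(y_n, mean = mu, sd = sigma)  # theta_k * N(y | mu_k, sigma)
p_n <- p_raw / sum(p_raw)                           # normalize over components
round(p_n, 2)
```

This is just Bayes’ rule over the three components; the probabilities sum to 1, and a slope near 1 puts most of the mass on the sloping-upward component.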
round(prob[1:10,], 2)
[,1] [,2] [,3]
[1,] 0.08 0.00 0.92
[2,] 0.40 0.60 0.00
[3,] 0.93 0.01 0.06
[4,] 0.83 0.17 0.00
[5,] 0.82 0.17 0.00
[6,] 0.95 0.01 0.05
[7,] 0.74 0.00 0.26
[8,] 0.86 0.14 0.00
[9,] 0.11 0.00 0.89
[10,] 0.87 0.00 0.13
So, the first series is probably drawn from the sloping-upward model; the second might be from the null model or it might be from the sloping-downward model; the third, fourth, fifth, sixth, seventh, and eighth are probably from the null model; the ninth is probably from the sloping-upward model; and so forth.
We’ll now program this: for each of the 1000 series in the dataset, we’ll pick which of the three mixture components has the highest probability. We’ll save the probability and also which component is being picked.
max_prob <- apply(prob, 1, max)
choice <- apply(prob, 1, which.max)
And now we can sum this over the 1000 series. We’ll compute the number of series assigned to each of the three choices:
print(table(choice))
choice
1 2 3
559 232 209
The guesses are not quite in proportion 0.5, 0.25, 0.25. There seem to be too many guesses of zero slope and not enough of positive and negative slopes. But that makes sense given the decision problem: we want to maximize the number of correct guesses, so we end up disproportionately guessing the most common category. That’s fine; it’s simply how the optimal decision rule behaves.
And we can compute the expected number of correct guesses (based on the posterior distribution we have here), and the standard deviation of the number of correct guesses (based on the reasonable approximation of independence of the 1000 series conditional on the model). And then we’ll print all these numbers:
expected_correct <- sum(max_prob)
sd_correct <- sqrt(sum(max_prob * (1 - max_prob)))
round(c(expected_correct, sd_correct), 1)
[1] 854 10
Interesting. The expected number of correct guesses is 855. Not quite the 900 that’s needed to win! The standard error of the number of correct guesses is 10.3, so 900 is more than 4 standard errors away from our expected number correct. That’s bad news!
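The formulas above are the mean and standard deviation of a Poisson-binomial distribution: conditional on the probabilities, the number correct is a sum of independent Bernoulli trials with unequal success probabilities. A quick simulation check, with made-up probabilities standing in for `max_prob`:

```r
# Check sum(p) and sqrt(sum(p * (1 - p))) against direct simulation.
# p here is a made-up stand-in for max_prob.
set.seed(2)
p <- runif(1000, 0.5, 1)
sims <- replicate(2000, sum(rbinom(1000, 1, p)))
round(c(formula_mean = sum(p), sim_mean = mean(sims)), 1)
round(c(formula_sd = sqrt(sum(p * (1 - p))), sim_sd = sd(sims)), 1)
```

The formula values and the simulation agree up to Monte Carlo error.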
How bad is it? We can use the normal cumulative distribution function to approximate the probability of at least 900 successes; that’s
pnorm(expected_correct, 900, sd_correct)
[1] 5e-06
That’s a small number; here’s its reciprocal:
recip <- 1 / pnorm(expected_correct, 900, sd_correct)
recip
[1] 2e+05
That’s only about a 1-in-200,000 chance of winning the big prize!
But we only have to get at least 900. So we can do the continuity correction and evaluate the probability of at least 899.5 successes, which, when inverted, yields:
1 / pnorm(expected_correct, 899.5, sd_correct)
[1] 159336
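If we don’t trust the normal approximation this far into the tail, the Poisson-binomial tail probability can be computed exactly by convolution. Here is a sketch with illustrative equal probabilities (with equal probabilities the result must match the ordinary binomial, which makes a convenient check); in the notebook you would pass in `max_prob` and the threshold 900:

```r
# Exact Poisson-binomial tail by dynamic programming: dist[k + 1]
# holds P(exactly k correct so far). The probabilities p_demo are
# illustrative, not the notebook's max_prob.
poisson_binom_tail <- function(p, threshold) {
  dist <- 1                                 # P(0 correct) before any series
  for (pi in p) {
    dist <- c(dist * (1 - pi), 0) + c(0, dist * pi)
  }
  sum(dist[(threshold + 1):length(dist)])   # P(number correct >= threshold)
}
p_demo <- rep(0.85, 100)
exact <- poisson_binom_tail(p_demo, 90)
approx <- pnorm(sum(p_demo), 90 - 0.5, sqrt(sum(p_demo * (1 - p_demo))))
round(c(exact = exact, normal_approx = approx), 4)
```

The normal approximation with continuity correction is close here, but for extreme tails the exact computation is cheap insurance.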
Nope, still no good. For the bet to be worth it, even in the crudest sense of expected monetary value, the probability of winning would have to be at least 1 in 10,000. (Recall that the prize is $100,000 but the cost of entry is $10.) And that’s all conditional on the designer of the study doing everything exactly as he said, and not playing with multiple seeds for the random number generator, etc. After all, he could well have first chosen a seed and generated the series, then performed something like the above analysis and checked that the most natural estimate gave only 850 correct or so; in the very unlikely event that the natural estimate gave 900 or close to it, he could just re-run with a new seed. We have no reason to think that the creator of this challenge did anything like that; our point here is only that, even if he did his simulation in a completely clean way, our odds of winning are about 1 in 200,000—about 1/20th what we’d need for this to be a fair game.
There is one more thing, though: the data are highly autocorrelated, so least-squares regression may not be the most efficient way to estimate these slopes. If we can estimate the slopes more precisely, we can get more discrimination in our predictions. Maybe there is a way to win the game by extracting more information from each series, but it won’t be easy.
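One hypothetical way to extract more information is to model the errors directly, for example fitting the trend with AR(1) errors via `arima()` rather than `lm()`. The series below is simulated, and its parameters (trend 0.01 per step, AR coefficient 0.6, innovation sd 0.1) are made up for illustration:

```r
# Sketch: trend estimation with AR(1) errors via arima(), compared
# with ordinary least squares. The simulated series and all its
# parameters are made-up illustrative values.
set.seed(3)
T_len <- 135
trend <- 1:T_len
y_sim <- 0.01 * trend + arima.sim(list(ar = 0.6), n = T_len, sd = 0.1)
ols <- summary(lm(y_sim ~ trend))$coefficients["trend", ]
ar1 <- arima(y_sim, order = c(1, 0, 0), xreg = cbind(trend))
c(ols_slope = unname(ols["Estimate"]),
  ols_se = unname(ols["Std. Error"]),
  ar1_slope = unname(ar1$coef["trend"]),
  ar1_se = sqrt(diag(ar1$var.coef))[["trend"]])
```

With positively autocorrelated errors, the OLS standard error is too optimistic; the AR(1) model’s reported standard error is typically larger and more honest, and the generalized-least-squares-style slope estimate is more efficient. Whether that extra efficiency is enough to push the expected number correct past 900 is another question.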
You could say that the above all demonstrates the designer’s point, that you can’t identify a trend in a time series of this length. But we don’t think it would make sense to draw that conclusion from this exercise. After all, you can just tweak the parameters in the problem a bit—or simply set the goal to 800 correct instead of 900—and the game becomes easy to win. Or, had the game been winnable as initially set up, you could just up the threshold to 950, and again it would become essentially impossible to win. Conversely, if the designer of the challenge had messed up his calculations and set the threshold to 800, and someone had sent in a winning entry, it wouldn’t disprove his claims about climate science, it would just mean he hadn’t been careful enough in setting up his bet.
Licenses
- Code © 2022–2025, Andrew Gelman, licensed under BSD-3.
- Text © 2022–2025, Andrew Gelman, licensed under CC-BY-NC 4.0.