library(rprojroot)
root <- has_file(".Bayesian-Workflow-root")$make_fix_file()
library(cmdstanr)
options(mc.cores = 4)
options(digits = 2)
## options(htmltools.dir.version = FALSE)
set.seed(1123)
# utility functions
fround <- function(x, digits) {
format(round(x, digits), nsmall = digits)
}
print_stan_file <- function(file) {
code <- readLines(file)
if (isTRUE(getOption("knitr.in.progress")) &&
identical(knitr::opts_current$get("results"), "asis")) {
# In render: emit as-is so Pandoc/Quarto does syntax highlighting
block <- paste0("```stan", "\n", paste(code, collapse = "\n"), "\n", "```")
knitr::asis_output(block)
} else {
writeLines(code)
}
}

Declining exponentials
This notebook includes the code for the Bayesian Workflow book Section 12.4 Using modeling ideas to address computing problems.
1 Introduction
Declining exponential: A nonlinear model that comes up in many science and engineering applications, also illustrating the use of bounds in parameter estimation.
Sum of declining exponentials: An important class of nonlinear models that can be surprisingly difficult to estimate from data. This example illustrates how Stan can fail, and also the way in which default prior information can add stability to an otherwise intractable problem.
Load packages and set options
2 Declining exponential
Let’s fit the following simple model: y = ae^{-bx} + \mathrm{error}. Given data (x_i,y_i): y_i = ae^{-bx_i} + \epsilon_i, \text{ for } i=1,\dots,N. We shall assume the errors are independent and normally distributed: \epsilon_i \sim \operatorname{normal}(0,\sigma).
Here is the model in Stan:
print_stan_file(root("declining_exponentials", "exponential.stan"))

data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real a;
real b;
real<lower=0> sigma;
}
model {
y ~ normal(a * exp(-b * x), sigma);
a ~ normal(0, 10);
b ~ normal(0, 10);
sigma ~ normal(0, 10);
}

We have given the parameters a, b, and \sigma normal prior distributions centered at 0 with standard deviation 10. In addition, the parameter \sigma is constrained to be positive. The purpose of the prior distributions is to keep the parameter estimates within a reasonable range. If we were working on a problem in which we thought that a, b, or \sigma could be much greater than 10, we would want to use a weaker prior distribution.
Another point about the above Stan program: the model for y is vectorized and could instead have been written more explicitly as a loop:
for (i in 1:N) {
y[i] ~ normal(a*exp(-b*x[i]), sigma);
}
We prefer the vectorized version as it is more compact and it also runs faster in Stan, for reasons discussed elsewhere in this book.
2.1 Simulated data
To demonstrate our exponential model, we fit it to fake data. We’ll simulate N=100 data points with predictors x uniformly distributed between 0 and 10, from the above model with a=5, b=0.3, \sigma=0.5.
a <- 5
b <- 0.3
sigma <- 0.5
N <- 100
x <- runif(N, 0, 10)
y <- rnorm(N, a * exp(-b * x), sigma)

Here is a graph of the true curve and the simulated data:
curve(a * exp(-b * x), from = 0, to = 10, ylim = range(x, y),
xlab = "x", ylab = "y", bty = "l",
main = "Data and true model", cex.main = 1)
points(x, y, pch = 20, cex = 0.2)

And then we fit the model:
data_1 <- list(N = N, x = x, y = y)
mod_1 <- cmdstan_model(root("declining_exponentials", "exponential.stan"))
fit_1 <- mod_1$sample(data = data_1)

Posterior summary:
print(fit_1)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ 20.60 20.91 1.25 0.99 18.25 21.92 1.00 1690 2189
a 5.15 5.15 0.19 0.19 4.85 5.48 1.00 1487 1445
b 0.30 0.30 0.02 0.02 0.28 0.33 1.00 1670 1575
sigma 0.49 0.49 0.04 0.04 0.44 0.55 1.00 2407 2550
Recall that the true parameter values were a=5.0, b=0.3, \sigma=0.5. Here the model is simple enough and the data are clean enough that we can estimate all three of these parameters with reasonable precision from the data, as can be seen from the 90% intervals above.
Alternatively, we might want to say ahead of time that we are fitting a declining exponential curve that starts positive and descends to zero. We would thus want to constrain the parameters a and b to be positive, which we can do in the parameters block:
real<lower=0> a;
real<lower=0> b;
Otherwise we leave the model unchanged. In this case the results turn out to be very similar:
mod_1b <- cmdstan_model(root("declining_exponentials","exponential_positive.stan"))
fit_1b <- mod_1b$sample(data = data_1)

Posterior summary:
print(fit_1b)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ 21.08 21.41 1.24 0.99 18.75 22.39 1.00 1794 2181
a 5.15 5.15 0.18 0.18 4.85 5.46 1.00 1849 1911
b 0.30 0.30 0.02 0.01 0.28 0.33 1.00 1786 2002
sigma 0.49 0.49 0.04 0.03 0.44 0.56 1.00 2628 2442
2.2 Small simulated data
With weaker data, though, the constraints could make a difference. We could experiment on this by doing the same simulation but with just N=10 data points:
N <- 10
x <- runif(N, 0, 10)
y <- rnorm(N, a * exp(-b * x), sigma)

curve(a * exp(-b * x), from = 0, to = 10, ylim = range(x, y),
xlab = "x", ylab = "y", bty = "l",
main = "Just 10 data points", cex.main = 1)
points(x, y, pch = 20, cex = 0.2)

First we fit the unconstrained model:
data_2 <- list(N = N, x = x, y = y)
fit_2 <- mod_1$sample(data = data_2)

Posterior summary:
print(fit_2)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ -0.80 -0.04 2.41 1.32 -5.73 1.24 1.02 241 92
a 6.00 5.94 2.20 1.06 4.21 8.49 1.02 1122 426
b 0.53 0.39 0.94 0.08 0.27 0.79 1.02 311 90
sigma 0.70 0.62 0.32 0.17 0.41 1.31 1.01 213 106
Then we fit the constrained model:
fit_2b <- mod_1b$sample(data = data_2)

Posterior summary:
print(fit_2b)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ 0.16 0.71 1.93 1.33 -3.88 2.06 1.00 661 326
a 6.13 5.95 1.85 1.01 4.20 8.21 1.00 1075 862
b 0.48 0.39 0.78 0.08 0.27 0.67 1.00 890 408
sigma 0.68 0.61 0.27 0.18 0.40 1.16 1.00 730 518
Different things can happen with different sets of simulated data, but the inference with the positivity constraints will typically be much more stable. Of course, we would only want to constrain the model in this way if we knew that the positivity restriction is appropriate.
2.3 Positive simulated data
Now suppose that the data are also restricted to be positive. Then we need a different error distribution, as the above model with additive normal errors can yield negative data.
Let’s try a multiplicative error, with a lognormal distribution: y_i = ae^{-bx_i} \cdot \epsilon_i, \text{ for } i=1,\dots,N\\ \log\epsilon_i \sim \operatorname{normal}(0,\sigma), \text{ for } i=1,\dots,N.
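The multiplicative formulation is equivalent to an additive normal error on the log scale, which is exactly what the lognormal distribution encodes. Here is a quick simulation check in R (the values mu = 2 and sigma = 0.5 here are arbitrary illustration choices):

```r
# The multiplicative error model, y = mu * epsilon with log(epsilon) ~ normal(0, sigma),
# is the same as y ~ lognormal(log(mu), sigma), which is how the Stan program writes it.
set.seed(1)
mu <- 2
sigma <- 0.5
n <- 1e5
y_mult <- mu * exp(rnorm(n, 0, sigma))   # multiplicative form
y_logn <- exp(rnorm(n, log(mu), sigma))  # lognormal form
# Both should have log-scale mean near log(mu) and log-scale sd near sigma
c(mean(log(y_mult)), mean(log(y_logn)))
c(sd(log(y_mult)), sd(log(y_logn)))
```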
Here is the model in Stan:
print_stan_file(root("declining_exponentials", "exponential_positive_lognormal.stan"))

data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real<lower=0> a;
real<lower=0> b;
real<lower=0> sigma;
}
model {
vector[N] y_pred = a * exp(-b * x);
y ~ lognormal(log(y_pred), sigma);
a ~ normal(0, 10);
b ~ normal(0, 10);
sigma ~ normal(0, 10);
}

As before, we can simulate fake data from this model:
a <- 5
b <- 0.3
sigma <- 0.5
N <- 100
x <- runif(N, 0, 10)
epsilon <- exp(rnorm(N, 0, sigma))
y <- a * exp(-b * x) * epsilon

curve(a * exp(-b * x), from = 0, to = 10, ylim = range(x, y),
xlab = "x", ylab = "y", bty = "l",
main = "Data and true model", cex.main = 1)
points(x, y, pch = 20, cex = 0.2)

We can then fit the model to the simulated data and check that the parameters are approximately recovered:
data_3 <- list(N = N, x = x, y = y)
mod_3 <- cmdstan_model(root("declining_exponentials", "exponential_positive_lognormal.stan"))
fit_3 <- mod_3$sample(data = data_3)

Posterior summary:
print(fit_3)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ -71.40 -71.08 1.22 1.01 -73.83 -70.07 1.00 1504 2068
a 4.95 4.93 0.49 0.47 4.20 5.80 1.00 1393 1680
b 0.30 0.30 0.02 0.02 0.27 0.33 1.00 1341 1654
sigma 0.50 0.49 0.04 0.03 0.44 0.56 1.00 2112 2232
3 Sum of declining exponentials
From the numerical analysis literature, here is an example of an inference problem that appears simple but can be surprisingly difficult. The challenge is to estimate the parameters of a sum of declining exponentials: y = a_1e^{-b_1x} + a_2e^{-b_2x}. This is also called an inverse problem, and it can be challenging to decompose these two declining functions.
This expression, and others like it, arise in many examples, including in pharmacology, where x represents time and y could be the concentration of a drug in the blood of someone who was given a specified dose at time 0. In a simple two-compartment model, the total concentration will look like a sum of declining exponentials.
To set this up as a statistics problem, we add some noise to the system. We want the data to always be positive so our noise will be multiplicative: y_i = (a_1e^{-b_1x_i} + a_2e^{-b_2x_i}) * \epsilon_i, \text{ for } i=1,\dots,N, with lognormally-distributed errors \epsilon.
Here is the model in Stan:
print_stan_file(root("declining_exponentials", "sum_of_exponentials.stan"))

data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
vector<lower=0>[2] a;
positive_ordered[2] b;
real<lower=0> sigma;
}
model {
vector[N] y_pred = a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x);
y ~ lognormal(log(y_pred), sigma);
}

The coefficients a and the residual standard deviation \sigma are constrained to be positive. The parameters b are also positive—these are supposed to be declining, not increasing, exponentials—and are also constrained to be ordered, so that b_1 < b_2. We need this to keep the model identified: without some sort of restriction, there would be no way to tell from the data which component is labeled 1 and which is 2. So we arbitrarily label the component with the lower value of b—that is, the one that declines more slowly—as the first, and the component with the higher value of b as the second. We added the positive_ordered type to Stan because this sort of identification problem comes up fairly often in applications.
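The non-identifiability that the ordering constraint removes can be seen directly in R: swapping the labels of the two components leaves the curve, and hence the likelihood, exactly unchanged.

```r
# Without an ordering constraint the two components are exchangeable:
# relabeling (a[1], b[1]) <-> (a[2], b[2]) gives exactly the same curve,
# so the data alone cannot tell the two labelings apart.
f <- function(x, a, b) a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x)
x_grid <- seq(0, 10, length = 50)
y_orig <- f(x_grid, a = c(1, 0.8), b = c(0.1, 2))
y_swap <- f(x_grid, a = c(0.8, 1), b = c(2, 0.1))
max(abs(y_orig - y_swap))  # 0: the curves are identical
```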
We’ll try out our Stan model by simulating fake data from a model where the two curves should be cleanly distinguished, setting b_1=0.1 and b_2=2.0, a factor of 20 apart in scale. We’ll simulate 1000 data points where the predictors x are uniformly spaced from 0 to 10, and, somewhat arbitrarily, set a_1=1.0, a_2=0.8, and \sigma=0.2. We can then simulate from the lognormal distribution to generate the data y.
a <- c(1, 0.8)
b <- c(0.1, 2)
N <- 1000
x <- seq(0, 10, length = N)
sigma <- 0.2
epsilon <- exp(rnorm(N, 0, sigma))
y <- (a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x)) * epsilon

Here is a graph of the true curve and the simulated data:
data_graph <- function(a, b, sigma, x, y) {
curve(a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x),
from = min(x), to = max(x),
ylim = c(0, 1.05 * max(y)), xlim = c(0, max(x)),
xlab = "x", ylab = "y", xaxs = "i", yaxs = "i", bty = "l",
main = "Data and true model", cex.main = 1)
points(x, y, pch = 20, cex = 0.2)
text(max(x), 0.5 * max(y),
paste("y = ", fround(a[1], 1), "*exp(", fround(-b[1], 1), "*x) + ",
fround(a[2], 1), "*exp(", fround(-b[2], 1), "*x)", sep = ""), adj = 1)
}
data_graph(a, b, sigma, x, y)

And then we fit the model:
data_4 <- list(N = N, x = x, y = y)
mod_4 <- cmdstan_model(root("declining_exponentials", "sum_of_exponentials.stan"))
fit_4 <- mod_4$sample(data = data_4)

Posterior summary:
print(fit_4)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ 203.09 203.38 1.55 1.40 200.18 205.02 1.00 1624 2626
a[1] 0.99 0.99 0.02 0.02 0.96 1.02 1.00 1882 1911
a[2] 0.86 0.86 0.09 0.09 0.72 1.01 1.00 2657 2382
b[1] 0.10 0.10 0.00 0.00 0.09 0.10 1.00 1915 2000
b[2] 2.25 2.23 0.32 0.31 1.76 2.82 1.00 1951 2151
sigma 0.20 0.20 0.00 0.00 0.19 0.20 1.00 2858 2492
The parameters are recovered well, with the only difficulty being b_2, where the posterior median is 2.23 while the true value is 2.0—but that is well within the posterior uncertainty. Stan worked just fine on this nonlinear model.
But now let’s make the problem just slightly more difficult. Instead of setting the two parameters b to 0.1 and 2.0, we’ll make them 0.1 and 0.2, so now only a factor of 2 separates the scales of the two declining exponentials.
b <- c(0.1, 0.2)
y_2 <- (a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x)) * epsilon
data_5 <- list(N = N, x = x, y = y_2)

This should still be easy to fit in Stan, right? Wrong:
fit_5 <- mod_4$sample(data = data_5)

Posterior summary:
print(fit_5)

 variable     mean   median    sd      mad       q5      q95 rhat ess_bulk ess_tail
 lp__      9.1e+02  9.1e+02  1.78  1.6e+00  9.1e+02  9.1e+02 1.05       88      627
 a[1]      1.7e+00  1.7e+00  0.02  2.0e-02  1.7e+00  1.8e+00 1.01      284      598
 a[2]      5.7e-01  4.7e-01  0.41  4.0e-01  1.0e-01  1.4e+00 1.27       11       72
 b[1]      1.3e-01  1.3e-01  0.00  0.0e+00  1.3e-01  1.3e-01 1.01      314      706
 b[2]     8.9e+307 8.7e+307   Inf 6.5e+307 8.1e+306 1.7e+308 1.01      348      585
 sigma     2.0e-01  2.0e-01  0.00  0.0e+00  1.9e-01  2.0e-01 1.01      310      449
What happened?? It turns out that these two declining exponentials are very hard to detect. Look: here’s a graph of the two-component model for the expected data, y=1.0e^{-0.1x}+0.8e^{-0.2x}:
curve(a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x),
from = 0, to = 10, ylim = c(0, 1.9), xlim = c(0, 10),
xlab = "x", ylab = "y", xaxs = "i", yaxs = "i", bty = "l",
main = "True model", cex.main = 1)
text(5.6, 0.7, "y = 1.0*exp(-0.1x) + 0.8*exp(-0.2x)", adj = 1)

And now we’ll overlay a graph of a particular one-component model, y=1.8e^{-0.135x}:
curve(a[1] * exp(-b[1] * x) + a[2] * exp(-b[2] * x), from = 0, to = 10,
ylim = c(0, 1.9), xlim = c(0, 10),
xlab = "x", ylab = "y", xaxs = "i", yaxs = "i", bty = "l",
main = "True model and scarily-close one-component approximation", cex.main = 1)
curve(1.8 * exp(-0.135 * x), from = 0, to = 10, add = TRUE)
text(5.6, 0.7, "y = 1.0*exp(-0.1x) + 0.8*exp(-0.2x)", adj = 1)
text(6.1, 1.3, "y = 1.8*exp(-0.135x)", adj = 1)

The two lines are strikingly close, and it would be essentially impossible to tell them apart based on noisy data, even with 1000 measurements. So Stan had trouble recovering the true parameters from the data.
Still, if the parameters are difficult to fit, this should just result in high posterior uncertainty. Why did the Stan fit explode? The problem in this case is that, since only one term in the model was required to fit these data, the second term was completely free—and the parameter b_2 was unbounded: there was nothing stopping it from being estimated as arbitrarily large. This sort of unbounded posterior distribution is called improper (see Bayesian Data Analysis for a more formal definition), and there is no way of drawing simulations from such a distribution, hence Stan does not converge. The simulations drift off to infinity, as there is nothing in the prior or likelihood to keep them from doing so.
To fix the problem, we can add some prior information. Here we shall use our default, which is independent \operatorname{normal}(0,1) prior densities on all the parameters; thus, we add these lines to the model block in the Stan program:
a ~ normal(0, 1);
b ~ normal(0, 1);
sigma ~ normal(0, 1);
For this particular example, all we really need is a prior on b (really, just b_2 because of the ordering), but to demonstrate the point we shall assign default priors to everything. The priors are in addition to the rest of the model; that is, they go on top of the positivity and ordering constraints. So, for example, the prior for \sigma is the positive half of a normal, which is sometimes written as \operatorname{normal}^+(0,1).
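To get a feel for the \operatorname{normal}^+(0,1) prior on \sigma, we can simulate from it in R by taking absolute values of standard normal draws:

```r
# The positive half of a normal(0, 1): the positivity constraint plus the normal prior.
set.seed(1)
sigma_draws <- abs(rnorm(1e5))
mean(sigma_draws)            # close to sqrt(2/pi), about 0.80
quantile(sigma_draws, 0.95)  # about 1.96: most of the prior mass is below 2
```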
We now can fit this new model to our data, and the results are much more stable:
mod_5_with_priors <- cmdstan_model(root("declining_exponentials", "sum_of_exponentials_with_priors.stan"))
fit_5_with_priors <- mod_5_with_priors$sample(data = data_5)

Posterior summary:
print(fit_5_with_priors)

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__ 198.96 199.31 1.80 1.58 195.62 201.14 1.01 766 765
a[1] 1.35 1.56 0.44 0.23 0.40 1.73 1.05 113 452
a[2] 0.43 0.24 0.42 0.24 0.03 1.36 1.05 114 445
b[1] 0.11 0.12 0.02 0.01 0.07 0.13 1.03 336 588
b[2] 0.58 0.31 0.56 0.24 0.14 1.80 1.03 273 547
sigma 0.20 0.20 0.00 0.00 0.19 0.20 1.00 1340 1381
The fit is far from perfect—compare to the true parameter values, a_1=1.0, a_2=0.8, b_1=0.1, b_2=0.2—but we have to expect that. As explained above, the data at hand do not identify the parameters, so all we can hope for in a posterior distribution is some summary of uncertainty.
The question then arises, what about those prior distributions? We can think about them in a couple different ways.
From one direction, we can think of scaling. We are using priors centered at 0 with a scale of 1; this can be reasonable if the parameters are on “unit scale,” meaning that we expect them to be of order of magnitude around 1. Not all statistical models are on unit scale. For example, in the above model, if the data y_i are on unit scale, but the data values x_i take on values in the millions, then we’d probably expect the parameters b_1 and b_2 to be roughly on the scale of 10^{-6}. In such a case, we’d want to rescale x so that the coefficients b are more interpretable. Similarly, if the values of y ranged in the millions, then the coefficients a would have to be of order 10^6, and, again, we would want to rescale the data or the model so that a would be on unit scale. By using unit scale priors, we are implicitly assuming the model has been scaled.
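As a concrete illustration of the rescaling (the numbers here are hypothetical): dividing x by a scale factor s while multiplying b by s leaves the curve unchanged and puts b on unit scale.

```r
# Hypothetical rescaling example: if x ranges in the millions, divide by s = 1e6.
# The curve a * exp(-b * x) is unchanged if we use x_std = x / s together with b_std = b * s.
s <- 1e6
b_raw <- 2.5e-7            # coefficient on the raw scale: far from unit scale
x_raw <- 4e6               # a raw predictor value
x_std <- x_raw / s
b_std <- b_raw * s         # 0.25: now on unit scale
exp(-b_raw * x_raw) - exp(-b_std * x_std)  # 0: identical predictions
```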
From the other direction, instead of adapting the model to the prior distribution, we could adapt the prior to the model. That would imply an understanding of a reasonable range of values for the parameters, based on the context of the problem. In any particular example this could be done by simulating parameter vectors from the prior distribution and graphing the corresponding curves of expected data, and seeing if these could plausibly cover the possible cases that might arise in the particular problem being studied.
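Here is a minimal sketch of that prior predictive simulation for the sum-of-exponentials model, drawing the parameters from their \operatorname{normal}^+(0,1) priors (sorting the b’s to respect the positive_ordered constraint):

```r
# Prior predictive sketch: draw parameter vectors from the priors and
# look at the implied curves of expected data.
set.seed(1)
x_grid <- seq(0, 10, length = 100)
n_sims <- 20
curves <- matrix(NA, n_sims, length(x_grid))
for (s in 1:n_sims) {
  a <- abs(rnorm(2, 0, 1))        # normal+(0,1) draws for a
  b <- sort(abs(rnorm(2, 0, 1)))  # normal+(0,1) draws for b, ordered like positive_ordered
  curves[s, ] <- a[1] * exp(-b[1] * x_grid) + a[2] * exp(-b[2] * x_grid)
}
matplot(x_grid, t(curves), type = "l", lty = 1, col = "gray40",
        xlab = "x", ylab = "y", bty = "l",
        main = "Curves simulated from the prior", cex.main = 1)
```

If the simulated curves look wildly implausible for the application at hand, that is a sign the priors need tightening or rescaling.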
No matter how it’s done, inference has to come from somewhere, and if the data are weak, you need to put in prior information if your goal is to make some statement about possible parameter values, and from there to make probabilistic predictions and decisions.
Licenses
- Code © 2022–2025, Andrew Gelman, licensed under BSD-3.
- Text © 2022–2025, Andrew Gelman, licensed under CC-BY-NC 4.0.