Challenge of multimodality: Differential equation for planetary motion

Author

Charles C. Margossian, Andrew Gelman and Aki Vehtari

Published

2020-11-09

Modified

2026-04-07

This notebook includes the code for the Bayesian Workflow book Chapter 30 Challenge of multimodality: Differential equation for planetary motion.

1 Introduction

As developers of statistical software1, we realize that we cannot fully automate modeling. Practitioners need to take bespoke steps to fit, evaluate, and improve their models. At the same time, the more modeling we do, the better prepared we usually are for the next project we undertake. It’s not uncommon to apply hard-learned lessons from a past project to a new problem. Articles on modeling workflows aim to formalize this experience, building and deconstructing heuristics, developing methods, and motivating new theory (e.g. Blei 2014; Gabry et al. 2019; Betancourt 2020; Gelman et al. 2020). We believe this effort must be complemented by case studies, such as this one, which illustrate in great details the concepts that arise in workflows, provide usable code, and also raise new questions. The specific concepts we encounter in this article include why our inference fails way more slowly than it succeeds, why the behavior of an ODE changes wildly across the parameter space, and what to do when chains drift into “unreasonable” modes of the posterior distribution.

Given the recorded position of a planet over time, we want to estimate the physical properties of a star-planet system. This includes position, momentum, and gravitational interaction. We fit the model with Stan (Carpenter et al. 2017), using a Hamiltonian Monte Carlo (HMC) sampler, a gradient-based Markov chain Monte Carlo (MCMC) algorithm; for a thorough introduction to HMC, we recommend the article by Betancourt (2018). Initially, we meant the planetary motion problem to be a simple textbook example for ODE-based models; but it turns out many interesting challenges arise when we do a Bayesian analysis on this model. We discuss how to diagnose and fix these issues. The techniques we deploy involve a combination of mathematical considerations that are specific to the system at hand and statistical principles we can generalize. This case study examines the model in the controlled setting of simulated data.

We begin by constructing our model, using elementary notions of classical mechanics. After failing to fit this complete model, we develop a simpler model and build our way back to the original model. This step-by-step expansion allows us to isolate issues that make Bayesian inference challenging. Classic tools we use involve: running multiple chains, checking convergence diagnostics, and simulating data from the fitted model. There are several ways to visualize simulated data and an agile use of plots proves extremely helpful. Running simulations for multiple parameter values allows us to understand how the parameters interact with the likelihood, and identify local modes in the posterior. In our presentation, we try to distinguish generalizable methods, problem-specific steps, and shortcoming in existing defaults.

Load packages

library(rprojroot)
root <- has_file(".Bayesian-Workflow-root")$make_fix_file()
library(cmdstanr)
library(posterior)
library(ggplot2)
library(dplyr)
library(plyr)
library(tidyr)
library(boot)
library(latex2exp)
dir.create(root("planetary_motion","saved_fit"))
source(root("planetary_motion", "tools.R"))
library(bayesplot)
bayesplot::color_scheme_set("viridisC")
theme_set(bayesplot::theme_default(base_family = "sans"))
set.seed(1954)

print_stan_code <- function(code) {
  if (isTRUE(getOption("knitr.in.progress")) &
        identical(knitr::opts_current$get("results"), "asis")) {
    # In render: emit as-is so Pandoc/Quarto does syntax highlighting
    block <- paste0("```stan", "\n", paste(code, collapse = "\n"), "\n", "```")
    knitr::asis_output(block)
  } else {
    writeLines(code)
  }
}

2 Building the model

Consider a simple star-planet system. We assume the star is much more massive than the planet and approximate the position of the star as fixed. We would like to estimate the following quantities:

  • The gravitational force between the two objects. We assume the gravitational constant is known, G = 1.0 \times 10^{-3} in some unit, and aim to evaluate the star-planet mass ratio. To do this, we set the planetary mass to m = 1. It remains to evaluate the solar mass, M.
  • The initial position vector, q_0, and the initial momentum vector, p_0, of the planet.
  • The subsequent position vector, q(t), of the planet over time.
  • The position vector of the star, q_*.

2.1 Equations of motion

A more natural treatment of the problem would be using polar coordinates, but for simplicity, let’s use Cartesian coordinates. The motion of the planet is described by Newton’s law of motion, which is a second-order differential equation. For convenience, we use Hamilton’s formulation of the law, which is based on a system of two first-order differential equations, \begin{align*} \frac{\mathrm d q}{\mathrm d t} & = & \frac{p}{m}, \\ \frac{\mathrm d q}{\mathrm d t} & = & - \frac{k}{r^3} (q - q_*), \end{align*} where k = GmM = 10^{-3}M, and r = \sqrt{(q - q_*)^T(q - q_*)} is the distance between the planet and the star.

Our measurements are made with a normally distributed error: q_\mathrm{obs} \sim \mathrm{Normal}(q, \sigma I), with \sigma the standard deviation and I the 2 \times 2 identity matrix.

2.2 Simulation

The simulation can be done in Stan, using one of its numerical integrators. At t = 0, q_0 = (1, 0) and p_0 = (0, 1), and we set k = 1.

mod_sim <- cmdstan_model(root("planetary_motion", "planetary_motion_sim.stan"))
print_stan_code(mod_sim$code())
// It's assumed that the planet orbits a star at position (0, 0),
// and that the mass of the planet is negligable next to that of
// the star (meaning the star doesn't move).

functions {
  vector ode(real t, vector y, real m, real k) {
    vector[2] q = y[1:2];
    vector[2] p = y[3:4];
    real r_cube = pow(dot_self(q), 1.5);
    
    vector[4] dydt;
    dydt[1:2] = p / m;
    dydt[3:4] = -k * q / r_cube;
    
    return dydt;
  }
}
data {
  int N;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
}
transformed data {
  // intial state at t = 0
  real t0 = 0;
  int n_coord = 2;
  vector[n_coord] q0 = to_vector({1.0, 0.0});
  vector[n_coord] p0 = to_vector({0.0, 1.0});
  vector[n_coord * 2] y0 = append_row(q0, p0);
  
  // model parameters
  real m = 1.0;
  real k = 1.0;
  
  // exact motion
  array[N] real t;
  for (n in 1 : N) {
    t[n] = (n * 1.0) / 10;
  }
  array[2] real x_r = {m, k};
  array[N] vector[n_coord * 2] y = ode_rk45(ode, y0, t0, t, m, k);
}
parameters {
  real phi;
}
model {
  phi ~ normal(0, 1);
}
generated quantities {
  array[N, 2] real q_obs;
  
  q_obs[ : , 1] = normal_rng(y[ : , 1], sigma_x);
  q_obs[ : , 2] = normal_rng(y[ : , 2], sigma_y);
}
N <- 40
sigma <- 0.01
sim <- mod_sim$sample(
  data = list(N = N, sigma_x = sigma, sigma_y = sigma),
  chains = 1,
  iter_warmup = 1,
  iter_sampling = 2,
  seed = 123
)
simulation <- as.vector(sim$draws(variables = "q_obs")[1, , ])

q_obs <- array(NA, c(N, 2))
q_obs[, 1] <- simulation[1:40]
q_obs[, 2] <- simulation[41:80]
ggplot(data = data.frame(q_x = q_obs[, 1], 
                         q_y = q_obs[, 2], 
                         time = 1:40), 
       aes(x = q_x, y = q_y, label = time)) +
  geom_point() +
  annotate(geom = "text", x = q_obs[1, 1] - 0.05, y = q_obs[1, 2], 
           label = "t=1", hjust = 1) +
  annotate(geom = "text", x = q_obs[40, 1] + 0.05, y = q_obs[40, 2], 
           label = "t=40", hjust = 0)
Figure 1

3 Fitting a simple model and diagnosing inference

A first attempt at fitting the complete model (planetary_motion_star.stan) fails spectacularly: the chains do not converge and take a long time to run. This is an invitation to start with a simpler model. A useful simplification is more manageable but still exhibits the challenges we encounter with the complete model. The hope is that a fix in this simple context translates into a fix in a more sophisticated setting. This turns out to be true here where a simple one-parameter model allows us to understand the fundamentally multimodal nature of the model.

There are many ways to simplify a model. A general approach is to fix some of the parameters, which we can easily do when working with simulated data. We fix all the parameters, except for k, and writhe the file planetary_motion.stan. We now want to characterize the posterior distribution, p(k \mid q_\mathrm{obs}), by fitting the model with Stan. Since k is a scalar, we could use quadrature but our goal is to understand the pathologies that frustrate our inference algorithm, so we stick to MCMC. As a prior, we use k \sim \mathrm{Normal}^+(0, 1).

We fit 8 chains in parallel.

stan_data1 <- list(N = N, q_obs = q_obs)
chains <- 8
mod1 <- cmdstan_model(root("planetary_motion", "planetary_motion.stan"))
print_stan_code(mod1$code())
functions {
  vector ode(real t, vector y, real k, real m) {
    vector[2] q = y[1:2];
    vector[2] p = y[3:4];
    
    real r_cube = pow(dot_self(q), 1.5);
    
    vector[4] dydt;
    dydt[1:2] = p / m;
    dydt[3:4] = -k * q / r_cube;
    
    return dydt;
  }
}
data {
  int N;
  array[N, 2] real q_obs;
}
transformed data {
  real t0 = 0;
  int N_coord = 2;
  vector[N_coord] q0 = to_vector({1.0, 0.0});
  vector[N_coord] p0 = to_vector({0.0, 1.0});
  vector[N_coord * 2] y0 = append_row(q0, p0);
  
  real m = 1.0;
  
  array[N] real t;
  for (n in 1:N) {
    t[n] = n * 1.0 / 10;
  }
  
  real<lower=0> sigma_x = 0.01;
  real<lower=0> sigma_y = 0.01;
  
  // ODE solver parameters
  real rel_tol = 1e-6;
  real abs_tol = 1e-6;
  int max_steps = 1000;
}
parameters {
  real<lower=0> k;
}
transformed parameters {
  array[N] vector[N_coord * 2] y =
    ode_bdf_tol(ode, y0, t0, t, rel_tol, abs_tol, max_steps, k, m);
}
model {
  k ~ normal(0, 1);
  
  q_obs[ : , 1] ~ normal(y[ : , 1], sigma_x);
  q_obs[ : , 2] ~ normal(y[ : , 2], sigma_y);
}
generated quantities {
  array[N] real qx_pred;
  array[N] real qy_pred;
  
  qx_pred = normal_rng(y[ : , 1], sigma_x);
  qy_pred = normal_rng(y[ : , 2], sigma_y);
}
fit1 <- mod1$sample(
  data = stan_data1,
  chains = chains,
  parallel_chains = chains,
  iter_warmup = 500,
  iter_sampling = 500,
  seed = 123,
  save_warmup = TRUE
)
fit1$save_object(file = root("planetary_motion/saved_fit", "fit1.RDS"))

The inference takes a while to run, so we read in the saved output.

fit1 <- readRDS(root("planetary_motion/saved_fit", "fit1.RDS"))
print(fit1$time(), digits = 2)
$total
[1] 4498

$chains
  chain_id  warmup sampling   total
1        1  113.51   215.90  329.40
2        2    5.56     5.94   11.49
3        3    0.83     0.63    1.45
4        4    0.63     0.61    1.24
5        5 2241.97  2256.23 4498.20
6        6    5.98     4.23   10.21
7        7    0.49     0.39    0.88
8        8 1142.58   776.14 1918.72

The first notable pathology is that some of the chains take a much longer time to run. The difference is not subtle.

Let’s examine the summary.

fit1$summary(
  variables = c("lp__", "k"), 
  "mean", "sd", "rhat", "ess_bulk", "ess_tail"
)
# A tibble: 2 × 6
  variable       mean        sd  rhat ess_bulk ess_tail
  <chr>         <dbl>     <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -230866.   179038.    2.71     9.46     24.5
2 k              4.02      2.98  3.15     9.31     NA  

We see that \hat R \gg 1. Wow, these numbers are dramatic! Clearly the chains are not mixing and we can visualize this using trace plots.

mcmc_trace(fit1$draws(), pars = c("lp__", "k"))
Figure 2

Notice that chains 3 and 4, which ran for longer than 1000 seconds, sample around the largest values of k and also produce the lowest log posterior distribution. It’s not uncommon for issues to come in bulks. By contrast, chains that sample around k = 1 run in ~2 seconds, and produce a high log posterior.

At this point, we may formulate several hypothesis as to what may be happening:

  • The model is degenerate and cannot resolve the correct value of k.
  • The chains get stuck at local modes and cannot escape, even if the log posterior is much higher near k = 1.
  • The chains get stuck on flat surfaces, and the gradient doesn’t guide the chain back to k \approx 1.
  • The numerical integrator struggles to accurately solve the ODEs for large values of k. This means the ODE integrator is slower or even inaccurate. In the latter case, this also means our gradient calculations, and computation of HMC could be wrong.

3.1 Is the model degenerate?

Bearing a slight abuse of language, we use “degenerate” to mean that wildly different values of k roughly produce the same data generating process. We can check for degeneracy by looking at the posterior predictive checks, split across chains. We plot q_x against q_y, and for each chain, compute the median estimate for q_\mathrm{pred}, obtained using the generated quantities block. Note that since we fixed \sigma = 0.01, we expect the confidence interval to be very narrow.

data_pred <- data.frame(q_obs, 1:N)
names(data_pred) <- c("qx", "qy", "t")
ppc_plot2D(fit1, data_pred = data_pred)
Figure 3

Clearly, only the chains that landed close to k = 1 (chains 2, 5, and 7) are able to fit the data. This is consistent with the much higher log posterior density these chains produce. Degeneracy therefore doesn’t drive the lack of convergence because if it did the different chains would produce roughly the same predictions.

At this point, we have taken “standard” steps to diagnose issues with our inference. To fully grasp what prevents the chains from mixing, we require a more bespoke analysis. We summarize our reasoning, noting it involves unmentioned trials and errors, and long moments of pondering.

3.2 The wanderers: how do the chains even find these presumed modes?

Given our prior, k \sim \mathrm{normal}^+(0, 1), and the very strong log posterior density around k = 1, we may wonder: how did the chains drift to these distant modes? Based on the trace plots, the chains appear to be relatively static during the sampling phase. We extend the trace plots to include the warmup phase.

mcmc_trace(fit1$draws(inc_warmup = TRUE), pars = c("lp__", "k"),
           n_warmup = 500)
Figure 4

It is now clear that the chain’s final position is mostly driven by its initial point. It then only takes a few iterations for the chain to be trapped inside a mode. We further note that the initial points, which are based on Stan’s defaults, are inconsistent with our prior. Currently, the starting point are sampled from a uniform distribution over (-2, 2), over the unconstrained space. That is \log k^{(0)} \sim \mathrm{uniform}(-2, 2). The bounds of k are thence (\sim 0.13, 7.4), which is a rather large range. Inevitably, some of the chains start in the far tails of p(k \mid y) and cannot make it back.

3.3 Confirming the existence of local modes

Because the parameter space is one-dimensional, we can “cheat” a bit – well, we’ve earned it by setting up a simplified problem – and compute the joint distribution (i.e. unnormalized posterior) across a grid of values for k to check that the modes indeed exist.

ks <- seq(from = 0.2, to = 9, by = 0.01)
q0 <- c(1.0, 0)
p0 <- c(0, 1.0)
dt <- 0.001
m <- 1
n_obs <- nrow(q_obs)
ts <- 1:n_obs / 10
sigma_x <- 0.01
sigma_y <- 0.01
lk <- rep(NA, length(ks))
for (i in 1:length(ks)) {
  k <- ks[i]
  q_sim <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts)
  lk[i] <- 
    sum(dnorm(q_obs[, 1], q_sim[, 1], sigma_x, log = T)) +
    sum(dnorm(q_obs[, 2], q_sim[, 2], sigma_y, log = T)) +
    dnorm(k, 0, 1, log = T)
}
ggplot(data = data.frame(ks = ks, lk = lk),
       aes(x = ks, y = lk)) +
  geom_line() +
  theme(text = element_text(size = 18)) +
  ylab("log joint") +
  xlab("k")
Figure 5

There is a strong mode at k = 1 and a “wiggly” tail for larger values of k, with some notable local modes.

3.4 Elliptical motion induces multimodality

To understand how these modes arise, we may reason about the log likelihood as a function that penalizes large distances between q_\mathrm{obs} and q(k), the positions we obtain when we simulate trajectories for a certain value of k. Indeed \log p(q_\mathrm{obs} \mid k) = C - \frac{1}{2\sigma^2} ||q_\mathrm{obs} - q(k))||^2_2, where C is a constant which doesn’t depend on k.

Let us now simulate trajectories for various values of k.

k <- 0.5
q_050 <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts)
k <- 1.6
q_160 <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts)
k <- 2.16
q_216 <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts)
k <- 3
q_300 <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts)
q_plot <- rbind(q_obs, q_050, q_160, q_216, q_300)
k_plot <- rep(c("1.0", "0.5", "1.6", "2.16", "3.00"), each = 40)
plot_data <- data.frame(q_plot, k_plot, obs = 1:40)
names(plot_data) <- c("qx", "qy", "k", "obs")
comp_point <- 35
select <- 1:(40 * 5)
ggplot() +
  geom_path(data = plot_data[select, ], 
            aes(x = qx, y = qy, color = k)) + 
  geom_point(aes(x = q_216[comp_point, 1], y = q_216[comp_point, 2]),
            shape = 3) +
  geom_point(aes(x = q_160[comp_point, 1], y = q_160[comp_point, 2]),
             shape = 3) +
  geom_point(aes(x = q_obs[comp_point, 1], y = q_obs[comp_point, 2])) +
  # Add segments to compare distances.
  geom_segment(aes(x = q_obs[comp_point, 1], y = q_obs[comp_point, 2],
                   xend = q_216[comp_point], 
                   yend = q_216[comp_point, 2]),
               linetype = "dashed") +
  geom_segment(aes(x = q_obs[comp_point, 1], y = q_obs[comp_point, 2],
                   xend = q_160[comp_point, 1], 
                   yend = q_160[comp_point, 2]),
               linetype = "dashed") +
  xlab(TeX("$q_x$")) +
  ylab(TeX("$q_y$"))
Figure 6

As k, and therefore the gravitational force, increases, the orbit becomes shorter. The orbit eventually gets so short that for high enough values of k, the planet undergoes multiple orbits in the observed time. We next note that for k < 1, the trajectory can drift arbitrarily far away from the observed ellipsis. On the other hand, for k > 1, the simulated ellipsis must be contained inside the observed ellipsis, which bounds the distance between q_\mathrm{obs} and q. Finally, as we change k and “rotate” the ellipsis, some of the observed and simulated positions become more aligned, which induces the wiggles in the tail of the likelihood and creates the local modes. This can be seen with the 35^\mathrm{th} observation, where the observed trajectory at k = 1 is closer to the trajectory simulated when k = 2.16 than when k = 1.6.

Clearly the local modes are a mathematical artifact caused by the interaction between our measurement model and the observed elliptical motion: they do not describe a latent phenomenon of interest. It also clear, from the trace and likelihood plot, that the minor modes contribute a negligible probability mass to the posterior distribution. This means that any chain which doesn’t explore the dominant mode wastes our precious computational resources.

3.5 Bad Markov chain, slow Markov chain?

Not only are the samples produced by the misbehaving chains essentially useless when computing summary quantities, such as expectation values and quantiles; they also take much longer to run! Let’s elucidate why that is.

A priori, the ODE we solve is fairly simple. It certainly is when k = 1. But the problem becomes numerically more difficult for large values of k, because for each step \Delta t, the planet travels a longer trajectory; this means a greater change in q and p. Hence, in order to achieve the same precision, a numerical integrator must take smaller step sizes, leading to longer integration times.

There is wisdom is this anecdote: an easy deterministic problem can become difficult in a Bayesian analysis. Indeed Bayesian inference requires us to solve the problem across a range of parameter values, which means we must sometimes confront unsuspected versions of the said problem. In our experience, notably with differential equation based models in pharmacology and epidemiology, we sometime require a more computationally expensive stiff solver to tackle difficult ODEs generated during the warmup phase; on the other hand, the problem behaves better during the sampling phase.

Other times, slow computation can alert us that our inference is allowing for absurd parameter values and that we need either better priors or more reasonable initial points.

Ideally, we would want our algorithm to fail fast – it does the opposite. We however note that, waiting for the chains to undergo 1,000 iterations was unnecessary. The observed failure could have been diagnosed by running the algorithm for a shorter time, which is an important perspective to keep in mind in the early stages of model development.

3.6 Improving the inference

Equipped with a sound understanding of the pathology at hand, we can try to improve our inference. Four candidate solutions come to mind.

3.6.1 Build stronger priors

One option is to build a more informative prior to reflect our belief that a high value of k is implausible; or that any data generating process that suggests the planet undergoes several orbits over the observation time is unlikely. When such information is available, stronger priors can indeed improve computation. This is unfortunately not the case here. A stronger prior would reduce the density at the mode, but the wiggles in the tail of the posterior would persist. Given MCMC uses an unnormalized posterior density to explore the parameter space, a change in the magnitude of the log density will have no effect, even though a change in the gradient can affect HMC. Paradoxically, with more data, the tail wiggles become stronger: the model is fundamentally multimodal. Note further that our current prior, k \sim \mathrm{normal}^+(0, 1), is already inconsistent with the values k takes at the minor modes. In principle we could go a step further and add a hard constraint on orbital time or velocity to remove the modes.

3.6.2 Reweight draws from each chain

One issue is that the Markov chains fail to transition from one mode to the other, meaning some chains sample over a region with a low probability mass. We can correct our Monte Carlo estimate using a re-weighting scheme, such as stacking (Yao, Vehtari, and Gelman 2018). This strategy likely gives us reasonable Monte Carlo estimates, but: (i) we will not comprehensively explore all the modes with 8 chains, so stacking should really be treated as discarding the chains stuck at local modes and (ii) we still pay a heavy computational price, as the chains in minor modes take up to \sim 1000 times longer to run.

3.6.3 Tune the starting points

Stan’s default initial points produce values which are inconsistent with our prior (and our domain expertise). In a non-asymptotic regime, the Markov chain doesn’t always “forget” its starting point, and it is unlikely to do so here even if we run the chain for many many more iterations. We can therefore not ignore this tuning parameter in our algorithm. What then constitutes an appropriate starting point?

There are various considerations. In order to assess whether our chains converge, notably with \hat R, our starting points should be overdispersed, relative to our posterior (e.g. Vehtari et al. 2020). But clearly, for this and other examples, too much dispersion can prevent certain chains from exploring the relevant regions of the parameter space. One heuristic is to sample the starting point from the prior distribution, or potentially from an overdispered prior.

Another perspective is to simply admit that there is no one-size-fits-all solution. This is very much true of other tuning parameters of our algorithm, such as the length of the warmup or the target acceptance rate of HMC. While defaults exist, a first attempt at fitting the model can motivate adjustments. In this sense, we can justify using a tighter distribution to draw the starting points after examining the behavior of the Markov chains with a broad starting distribution.

It took \sim 2,000 seconds to discover the fit failed though. This is not ideal! As mentioned before, we don’t need to wait this long to diagnose a failure in our inference. The analysis presented here could have been made with 100 MCMC draws.

4 Pathfinder to the rescue!

Pathfinder (Zhang et al. 2022) uses fast deterministic quasi-Newton L-BFGS optimization algorithm and chooses the best approximate normal distribution along the optimization path according to the stochastic Kullback-Leibler divergence estimate. Running many pathfinders parallely is fast and multi-Pathfinder uses many normal approximations as a mixture importance sampling proposal distribution from which we can obtain approximate posterior draws. If minor modes have negligible density, the drwas from those modes will be discarded. The Pathfinder’s approximate draws can be used to initialize MCMC. Pathfinder has its limitations, too, but in those cases it ``fails fast’’.

We run the Pathfinder with 40 paths to find many modes. We usually need fewer than 100 L-BFGS iterations. To illustrate the robustness of the Pathfinder algorithm we use Stan’s default initialization which above was shown to be bad for MCMC.

pth1p <- mod1$pathfinder(
  data = stan_data1,
  num_paths = 40,
  single_path_draws = 25,
  draws = 1000,
  max_lbfgs_iters = 100,
  refresh = 0
)

Pathfinder reports that some of the paths failed, but it doesn’t matter as they failed fast, and the results from the rest of the paths are diagnosed to be good.

We initialize MCMC with Pathfinder results

fit1p <- mod1$sample(
  data = stan_data1,
  init = pth1p,
  chains = chains,
  parallel_chains = chains,
  iter_warmup = 500,
  iter_sampling = 500,
  seed = 123,
  save_warmup = TRUE,
  refresh = 0
)
Running MCMC with 8 parallel chains...
Chain 5 finished in 1.1 seconds.
Chain 1 finished in 1.3 seconds.
Chain 4 finished in 1.3 seconds.
Chain 2 finished in 1.5 seconds.
Chain 3 finished in 1.5 seconds.
Chain 6 finished in 1.5 seconds.
Chain 7 finished in 1.5 seconds.
Chain 8 finished in 1.5 seconds.

All 8 chains finished successfully.
Mean chain execution time: 1.4 seconds.
Total execution time: 1.7 seconds.

Convergence diagnostics look good

fit1p$summary(
  variables = c("lp__", "k"), 
  "mean", "sd", "rhat", "ess_bulk", "ess_tail"
)
# A tibble: 2 × 6
  variable    mean       sd  rhat ess_bulk ess_tail
  <chr>      <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -29.9   0.702     1.00    1851.    2197.
2 k          1.000 0.000329  1.00    1593.    2020.
ppc_plot2D(fit1p, data_pred = data_pred)
Figure 7

Everything now looks good. The chains converge near k = 1, and simulate predictions that are consistent with the data.

5 Inference for the full model

5.1 Using conditional likelihoods to understand modes

Once again, we need to reason about whether these local modes are mere mathematical artifacts or the manifestation of a latent phenomenon of interest. We now have some intuition that elliptical observations allow for local modes, because of rotating orbits and chance alignments between observed and generated data.

Conceptually, tweaking q_* means we can move the star closer to the planet and thus increase the gravitational interaction. This is not unlike tweaking k, except we are affecting the r term in \frac{\mathrm d p}{\mathrm d t} = - \frac{k}{r^3}(q - q_*). We may expect the same pathology to manifest itself.

This is more difficult to visualize because the parameters are now 7-dimensional, rather than 1-dimensional as was the case in our first simplified model. Unfortunately we cannot compute the likelihood across a 7-dimensional grid and we wouldn’t quite know how to visualize it. Our proposition is to look at conditional likelihoods, that is fix some of the parameters and examine the remaining ones. This, from a certain point of view, very much amounts to studying a simplification of our model. To begin, we fix all parameters (based on the correct values or equivalently estimates from the well-behaving Markov chains), except for q_*^x.

star_x <- seq(from = -0.5, to = 0.8, by = 0.01)
star_s <- array(NA, dim = c(length(star_x), 2))
star_s[, 1] <- star_x
star_s[, 2] <- rep(0, length(star_x))
k <- 1
q0 <- c(1.0, 0)
p0 <- c(0, 1.0)
dt <- 0.001
m <- 1
n_obs <- nrow(q_obs)
lk <- rep(NA, nrow(star_s))
for (i in 1:nrow(star_s)) {
  q_sim <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts, star_s[i, ])
  lk[i] <- 
    sum(dnorm(q_obs[, 1], q_sim[, 1], sigma_x, log = T)) +
    sum(dnorm(q_obs[, 2], q_sim[, 2], sigma_y, log = T))
}
ggplot(data = data.frame(star_x = star_x, lk = lk),
       aes(x = star_x, y = lk)) + 
  geom_line() +
  ylab("Conditional log likelihood") +
  xlab(TeX("$q_*^x$"))
Figure 8

This is the type of profile we expect. We can extend this to a heat map, with both coordinates of q_* varying, and observe “ripples” that suggest the presence of local modes.

star_x <- seq(from = -0.5, to = 0.8, by = 0.05)
star_y <- seq(from = -0.5, to = 0.5, by = 0.05)
star_s <- array(NA, c(length(star_x), length(star_y), 2))
for (i in 1:length(star_x)) {
  for (j in 1:length(star_y)) star_s[i, j, ] = c(star_x[i], star_y[j])
}
star_data <- array(NA, c(length(star_s) / 2, 2))
for (i in 1:length(star_x)) {
  for (j in 1:length(star_y)) {
    index <- (j - 1) * length(star_x) + i
    star_data[index, ] <- star_s[i, j, ]
  }
}
lk <- rep(NA, nrow(star_data)) 
for (i in 1:nrow(star_data)) {
  q_sim <- solve_trajectory(q0, p0, dt, k, m, n_obs, ts, star_data[i, ])
  
  lk[i] <- 
    sum(dnorm(q_obs[, 1], q_sim[, 1], sigma_x, log = T)) +
    sum(dnorm(q_obs[, 2], q_sim[, 2], sigma_y, log = T))
}
ggplot(data = data.frame(star_x = star_data[, 1],
                         star_y = star_data[, 2],
                         lk = lk), 
       aes(x = star_x, y = star_y, fill = lk)) +
  geom_tile() +
  xlab(TeX("$q_*^x$")) +
  ylab(TeX("$q_*^y$")) +
  labs(fill = "log likelihood")
Figure 9

These figures support our conjecture and, along with the log posterior and the posterior predictive checks, suggest the modes are mathematical artifacts, much like to ones we have previously probed.

5.2 Fitting the model

We run the Pathfinder with 40 paths to find many modes. We usually need fewer than 100 L-BFGS iterations. To illustrate the robustness of the Pathfinder algorithm we use Stan’s default initialization which above was shown to be bad for MCMC.

mod2 <- cmdstan_model(root("planetary_motion", "planetary_motion_star.stan"))
print_stan_code(mod2$code())
functions {
  vector ode(real t, vector y, real k, vector star, real m) {
    vector[2] q = y[1:2];
    vector[2] p = y[3:4];
    
    real r_cube = pow(dot_self(q - star), 1.5);
    
    vector[4] dydt;
    dydt[1:2] = p / m;
    dydt[3:4] = -k * (q - star) / r_cube;
    
    return dydt;
  }
}
data {
  int N;
  array[N, 2] real q_obs;
  array[N] real time;
  
  real<lower=0> sigma;
}
transformed data {
  real t0 = 0;
  int N_coord = 2;
  real m = 1.0;
  
  real<lower=0> sigma_x = sigma;
  real<lower=0> sigma_y = sigma;
  
  // ODE solver parameters
  real rel_tol = 1e-6;
  real abs_tol = 1e-6;
  int max_steps = 1000;
}
parameters {
  real<lower=0> k;
  vector[N_coord] q0;
  vector[N_coord] p0;
  vector[N_coord] star;
}
transformed parameters {
  vector[N_coord * 2] y0 = append_row(q0, p0);
  array[N] vector[N_coord * 2] y =
    ode_rk45_tol(ode, y0, t0, time, rel_tol, abs_tol, max_steps, k, star, m);
}
model {
  k ~ normal(1, 0.001); // prior derive based on solar system 
  // (still pretty uninformative)
  
  p0[1] ~ normal(0, 1);
  p0[2] ~ lognormal(0, 1); // impose p0 to be positive.
  q0 ~ normal(0, 1);
  star ~ normal(0, 0.5);
  
  q_obs[ : , 1] ~ normal(y[ : , 1], sigma_x);
  q_obs[ : , 2] ~ normal(y[ : , 2], sigma_y);
}
generated quantities {
  array[N] real qx_pred;
  array[N] real qy_pred;
  
  qx_pred = normal_rng(y[ : , 1], sigma_x);
  qy_pred = normal_rng(y[ : , 2], sigma_y);
}
N_select <- 40
time <- (1:N_select) / 10
stan_data2 <- list(N = N_select, q_obs = q_obs, time = time, sigma = sigma)
pth2 <- mod2$pathfinder(
  data = stan_data2,
  num_paths = 40,
  single_path_draws = 25,
  draws = 1000,
  max_lbfgs_iters = 100,
  refresh = 0
)

We get informative messages that most of the 40 pathfinders failed, but there are enough that succeeded, and this took only 3s.

We initialize MCMC with Pathfinder results.

fit2p <- mod2$sample(
  data = stan_data2,
  init = pth2,
  chains = chains,
  parallel_chains = chains,
  iter_warmup = 500,
  iter_sampling = 500,
  seed = 123,
  save_warmup = TRUE
)

Convergence diagnostics look good

fit2p$summary(
  variables = c("lp__", "k", "q0", "p0", "star"), 
  "mean", "sd", "rhat", "ess_bulk", "ess_tail"
)
# A tibble: 8 × 6
  variable      mean       sd  rhat ess_bulk ess_tail
  <chr>        <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -32.2     1.89      1.00    1364.    1695.
2 k          1.00    0.000996  1.01    4275.    2473.
3 q0[1]      1.00    0.00371   1.00    1751.    2026.
4 q0[2]     -0.00341 0.00449   1.00    1276.    1963.
5 p0[1]      0.00495 0.00612   1.00    1149.    1659.
6 p0[2]      1.000   0.00440   1.00    1314.    1463.
7 star[1]    0.00124 0.00545   1.00    1413.    1406.
8 star[2]    0.00670 0.00543   1.00    1979.    2036.

Posterior predictive checks look good for all chains!

ppc_plot2D(fit2p, data_pred = data_pred, plot_star = TRUE)
Figure 10

6 Discussion and lessons learned

When we fail to fit a model, examining a simplified model can help us understand the challenges that frustrate our inference algorithm. In practice it is difficult to find a simplification which is manageable and still exhibits the pathology we wish to understand. A straightforward way of simplifying is to fix some of the model parameters. When the model has more structure, such as in multilevel models, more elaborate simplifications can be considered.

In the planetary motion example, we are confronted with a multimodal posterior distribution. This geometry prevents our chains from cohesively exploring the parameter space and leads to biased Monte Carlo estimates. It is important to understand how these local modes arise and what their contribution to the posterior probability mass might be. We do so using posterior predictive checks. It is not uncommon for minor modes, with negligible probability mass, to trap a Markov chain. The possibility of such ill-fitting modes implies we should always run multiple chains, perhaps more than our current default of 4.

This case study also raises the question of what role starting points may play. Ideally a Markov chain forgets its initial value but in a non-asymptotic regime this may not be the case. Just as there is no universal default prior, there is no universal default initial point. Modelers often need to depart from defaults to ensure a numerically stable evaluation of the joint density and improve MCMC computation. At the same time we want dispersed initial points in order to have reliable convergence diagnostics and to potentially explore all the relevant modes. As with other tuning parameters of an inference algorithm, picking starting points can be an iterative process, with adjustments made after a first attempt at fitting the model.

We do not advocate mindlessly discarding misbehaving chains. It is important to analyze where this poor behavior comes from, and whether it hints at serious flaws in our model and in our inference.

The Pathfinder algorithm can be used to find many modes and obtain approximate posterior draws. If the Pareto-\hat{k} diagnostic for the Pathfinder approximation looks good, then the after the importance sampling those draws would be good enough and there would not be need to run MCMC. Pathfinder provides a great way to fit and fail fast. Further posterior draws can be obtained using MCMC with Pathfinder initialization.

Acknowledgement

We thank Ben Bales, Matthew West, Martin Modrák, and Jonah Gabry for helpful discussion.

References

Betancourt, M. 2018. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv:1701.02434v2.
———. 2020. “Towards a Principled Bayesian Workflow.” betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.
Blei, David M. 2014. “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application 1. https://doi.org/https://doi.org/10.1146/annurev-statistics-022513-115657.
Carpenter, Bob, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddel. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software. https://doi.org/10.18637/jss.v076.i01.
Gabry, J, D Simpson, A Vehtari, M Betancourt, and A Gelman. 2019. “Visualization inBayesian Workflow (with Discussion and Rejoinder).” Journal of the Royal Statistical Society A 182: 389–441.
Gelman, A, D Simpson, A Vehtari, C C Margossian, Bob Carpenter, Y Yao, B Bales, L Kennedy, P-C Bürkner, and M Modák. 2020. “Bayesian Workflow.” In Preperation.
Vehtari, A, A Gelman, D Simpson, B Carpenter, and P-C Bürkner. 2020. “Rank-Normalization, Folding, and Localization: An Improved \hat R for Assessing Convergence of MCMC.” Bayesian Analysis. https://doi.org/10.1214/20-BA1221.
Yao, Y, A Vehtari, and A Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions (with Discussion).” Bayesian Analysis 13: 917–1003.
Zhang, L., B. Carpenter, A. Gelman, and A. Vehtari. 2022. “Pathfinder: Parallel Quasi-Newton Variational Inference.” Journal of Machine Learning Research 23 (306).

Licenses

  • Code © 2020–2026, Charles C. Margossian, Andrew Gelman and Aki Vehtari, licensed under BSD-3.
  • Text © 2020–2026, Charles C. Margossian, Andrew Gelman and Aki Vehtari, licensed under CC-BY-NC 4.0.

Footnotes

  1. One of the software we work on is Stan, see mc-stan.org.↩︎