Here are some answers by Aki Vehtari to frequently asked questions about cross-validation and the loo package. If you have further questions, please ask them in the Stan Discourse thread named Cross-validation FAQ.
Cross-validation is a family of techniques that try to estimate how well a model would predict previously unseen data by using fits of the model to a subset of the data to predict the rest of the data.
Cross-validation can be used to:
Even if the goal of the model is not to make predictions, a model that makes bad or badly calibrated predictions is less likely to provide useful insights into the phenomenon being studied.
Two basic reasons to use cross-validation for a single model are:
More about these cases:
1) For example, Vehtari and Lampinen (2002) describe a model for predicting concrete quality based on the amount of cement, sand properties (see Kalliomäki, Vehtari and Lampinen (2005)), and additives. One of the quality measurements is compressive strength 3 months after casting. For example, when constructing bridges, it is very useful to be able to predict the compressive strength before casting the concrete. Vehtari and Lampinen (2002) estimated the 90% quantile of the absolute error for new castings, that is, they reported that in 90% of cases the difference between the prediction and the actual measurement 3 months after casting is less than the given value (other quantiles were also reported to the concrete experts). This way it was possible to assess whether the prediction model was accurate enough to have practical relevance.
2a) Even if we are not interested in predicting the actual future, a model that can make good predictions has learned something useful from the data. For example, if a regression model is not able to predict better than a null model (a model only for the marginal distribution of the data), then it has not learned anything useful from the predictors. Correspondingly, for time series models the predictors for the next time step can be the observed values at previous time steps.
2b) Instead of considering predictions for the future, we can consider whether we can generalize from some observations to others. For example, in social science we might make a model explaining poll results with demographic data. To test the model, instead of considering future polls, we could test whether the model can predict for a new state. If we have observed data from all states in the USA, then there are no new states (or it may take an unpredictably long time before there are), but we can simulate this situation by leaving out the data from one state and checking whether we can generalize from the other states to the left-out state. This is a sensible approach when we assume that the states are exchangeable conditional on the information available (see, e.g., Gelman et al. (2013), Chapter 5, for exchangeability). The ability to generalize from one entity (a person, state, etc.) to another similar entity tells us that the model has learned something useful. It is very important to think about the level at which generalization is most interesting. For example, in cognitive science and psychology it would be more interesting to generalize from one person to another than from one trial to another trial for the same person. In cognitive science and psychology studies it is common that the study population consists of young university students, and thus there are limitations on what we can say about generalization to the whole human population. In the polling data from all US states, the whole population of US states has been observed, but there are limitations on how we can generalize to other countries or future years.
2c) In addition to assessing predictive accuracy and generalizability, it is useful to assess how well calibrated the uncertainty quantification of the predictive distribution is. Cross-validation is useful when we don't trust that the model is well specified, although many bad mis-specifications can also be diagnosed with simpler posterior predictive checking. See, for example, the case study roaches.
Three basic reasons to use cross-validation to compare many models are:
More about these cases:
1) Using cross-validation to select the model with the best predictive performance is relatively safe if there is a small or moderate number of models and there is a lot of data compared to the model complexity, or if the best model is clearly best (Piironen and Vehtari, 2017; Sivula, Magnusson and Vehtari, 2020). See also the Section How to use cross-validation for model selection?.
2a) Cross-validation is especially useful when there are posterior dependencies between parameters and examining the marginal posterior of a parameter is not very useful for determining whether the component related to that parameter is relevant. This happens, for example, in the case of collinear predictors. See, for example, the case studies collinear, mesquite, and bodyfat.
2b) Cross-validation is less useful for simple models with no posterior dependencies, assuming the simple model is not mis-specified. In that case the marginal posterior is less variable, as it includes the modeling assumptions (which are assumed to not be mis-specified), while cross-validation uses a non-model-based approximation of the future data distribution, which increases the variability. See, for example, the case study betablockers.
2c) Cross-validation can provide a quantitative measure, which should only complement, not replace, an understanding of the qualitative patterns in the data (see, e.g., Navarro (2019)).
3 ) See more in How to use cross-validation for model averaging?.
See also the next Section “When not to use cross-validation?”, How is cross-validation related to overfitting?, and How to use cross-validation for model selection?.
In general there is no need to do any model selection (see more in How is cross-validation related to overfitting? and How to use cross-validation for model selection?). The best approach is to build a rich model that includes all the uncertainties, do model checking, and make possible model adjustments.
Cross-validation does not directly answer the question “Do the data provide evidence for some effect being non-zero?” Using cross-validation to compare a model with an additional term to a model without that term is a kind of null hypothesis testing. Cross-validation can tell whether that extra term can improve the predictive accuracy. The improvement in predictive accuracy is a function of the signal-to-noise ratio, the size of the actual effect, and how much the effect is correlated with other included effects. If cross-validation prefers the simpler model, it is not necessarily evidence for an effect being exactly zero: it is possible that the effect is too small to make a difference, or that due to dependencies it does not provide additional information compared to what is already included in the model. Often it makes more sense to just fit the larger model and explore the posterior of the relevant coefficient. Analysing the posterior can, however, be difficult if there are strong posterior dependencies.
Using cross-validation to select one model among a large number of models (see How to use cross-validation for model selection?) can suffer from the selection induced bias (see, e.g., McLatchie and Vehtari (2023)).
It is important to separate
Different partitions of data are held out in different kinds of cross-validation.
Which unit is systematically left out determines the predictive task that cross-validation assesses model performance on (see more in When is cross-validation valid?). CV, LOO, LFO, LOGO, and other cross-validation approaches do not yet specify the utility or loss, or how the computation is made, except that it involves estimating cross-validated predictive densities or probabilities. If the goal is to estimate the predictive performance of a single model used for a specific prediction task, it is often natural to match the data partitioning in cross-validation to the actual prediction task, but sometimes better accuracy can be obtained with a different partitioning (possibly slightly increasing the bias, but greatly reducing the variance).
First we need to define the utility or loss function which compares predictions to observations. These predictions can be considered to be for future observations, or for other exchangeable entities (see more in What is cross-validation?). Some examples:
These are examples of utility and loss functions for using the model to predict future data and then observing that data. Other utility and loss functions could also be used. See more in Can other utilities or losses be used than log predictive density?, the Scoring rule article in Wikipedia, and Gneiting and Raftery (2007).
The values of the loss functions necessarily depend on the data we observe next. We can, however, try to estimate an expectation of the loss (a summary of average predictive performance over several predictions, or the expected predictive performance for one prediction) under the assumption that both the covariates and responses we currently have are representative of those we will observe in the future.
Similarly we can have expected RMSE, ACC, \(R^2\), etc.
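As a concrete illustration of estimating such quantities from posterior draws, here is a minimal Python sketch. All data, draws, and names are simulated and hypothetical; this is not the loo package API, just the generic computation of pointwise log predictive densities and RMSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all quantities simulated): N held-out observations and S posterior
# draws of predictive means from a normal observation model with known sigma.
N, S = 100, 4000
y = rng.normal(0.0, 1.0, size=N)                  # held-out observations
mu_draws = rng.normal(0.0, 0.1, size=(S, N))      # posterior draws of the mean
sigma = 1.0

# Pointwise log predictive density: the log of the posterior-averaged density,
# lpd_i = log (1/S) sum_s p(y_i | theta_s), computed stably with logsumexp.
log_dens = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu_draws) / sigma) ** 2
lpd = np.logaddexp.reduce(log_dens, axis=0) - np.log(S)
elpd = lpd.sum()          # summary of expected log predictive density

# RMSE uses the posterior predictive mean as the point prediction.
y_pred = mu_draws.mean(axis=0)
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
```

The same pointwise structure carries over to ACC, \(R^2\), and other summaries: compute a per-observation quantity, then aggregate.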
In the papers and in the loo package, the following notation has been used
We can use similar notation for other data divisions and for other utility and loss functions. For example, when using the LOO data division
These terms do not yet specify possible computational approximations.
The choice of partitions to leave out or metric of model performance is independent of the computational method (e.g. PSIS or K-fold-CV). Different computational methods can be used to make the computation faster:
We could write elpd_{psis-loo}, but we often drop the specific computational method and report diagnostic information only if the computation may be unreliable.
When discussing, for example, properties of elpd_loo computed with PSIS-LOO, we can separately discuss the limitations of
- ELPD: Is this useful to know in the specific modeling task? Is the log score the most appropriate utility or loss function in this modeling task?
- LOO: Is LOO valid? Does LOO match the interesting predictive task? Is the inherent variance of the LOO estimate problematic?
- PSIS: Is IS failing? Is PSIS failing? Is the additional variability due to the computational approximation of LOO problematic?
Statistical folklore says we need cross-validation to avoid overfitting. Overfitting can refer to different things:
More about these cases:
If we condition the model on data, then make predictions for that same data, and finally compute the expected utility or loss by comparing the predictions to that same data, the estimate is overoptimistic, as the model has been fitted to the data. We want the model to fit the data, but it's not clear how much fitting is overfitting. Cross-validation gives a better estimate of the predictive performance for future or otherwise unseen data (or exchangeable entities) and can also be used to assess how much the model has fitted to the data.
Overfitting of bigger, more complex models is a bigger problem when using weaker inference methods. For example, bigger models fitted using maximum likelihood can have much worse predictive performance than simpler models. Overfitting of bigger models is a smaller problem in Bayesian inference because of a) integration over the posterior and b) the use of priors. The impact of integration over the posterior is often underestimated compared to the impact of priors. See a video demonstrating how integration over the posterior is also regularizing and reducing overfitting. It is still possible to make Bayesian models overfit by using priors which put much more probability mass on over-complex solutions than on simple solutions. The combination of (accurate) integration and sensible priors makes it common that, for example, adding more predictors doesn't decrease the predictive performance of bigger models even if the number of predictors is much higher than the number of observations (which would be a big problem with maximum likelihood).
In (2), it was mentioned that maximum likelihood tends to overfit more easily. In Bayesian inference, using approximate integration, for example variational inference with a normal distribution with diagonal covariance, can overfit more than using accurate integration. If accurate Bayesian inference is used for each model, but model selection is made using, for example, cross-validation, then the model selection process can overfit badly (Piironen and Vehtari, 2017).
Summary

As a quick rule: If the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small (Sivula, Magnusson and Vehtari, 2020). If the elpd difference is larger than 4, then compare that difference to the standard error of elpd_diff (provided e.g. by the loo package) (Sivula, Magnusson and Vehtari, 2020). See also the Section How to interpret the Standard error (SE) of elpd difference (elpd_diff)?. If a large number of models are compared, there is a possibility of non-negligible overfitting in model selection.
Thus, if there is a very large number of models to be compared, either methods that can estimate the selection-induced bias (McLatchie and Vehtari, 2023) should be used, or more elaborate approaches are recommended, such as projection predictive variable selection (Piironen and Vehtari, 2017; Piironen et al., 2020; McLatchie et al., 2023).
See more in tutorial videos on using cross-validation for model selection
If one of the models in the model selection is not clearly the best, it may be better to average over many models.
Based on the experiments by Yao et al. (2018), Bayesian stacking has better performance than LOO-weights and LOO-BB-weights.
This question often focuses on whether a specific data partition (see What are the parts of cross-validation?) matches the predictive task. The folklore says that for valid cross-validation the data partition should match the predictive task. Matching the data partition to the predictive task is assumed to minimize the bias, but there are cases where an alternative partition may have negligible bias and much smaller variance (see, e.g., Cooper et al. (2023)). Thus different partitions can be valid even if they don't match the predictive task.
LOO and cross-validation in general do not require independence or conditional independence. Exchangeability is sufficient. Even if we are using models with a conditional independence structure, we don't require that the true data-generating mechanism has such structure; due to exchangeability and the data collection process we can proceed as if assuming conditional independence. See more in Chapter 5 of BDA3 (Gelman et al., 2013). Cross-validation can also be used when the model doesn't have a conditional independence structure. In time series, the observations \(y_1,\ldots,y_T\) are not exchangeable, as the index carries additional information about the similarity in time. If we have a model \(p(y_t \mid f_t)\) with latent values \(f_t\), then the pairs \((y_1,f_1),\ldots,(y_T,f_T)\) are exchangeable (see Chapter 5 of BDA3 (Gelman et al., 2013)) and we can factorize the likelihood trivially. We can usually present time series models with explicit latent values \(f_t\), but sometimes we integrate them out analytically for computational reasons and then get a non-factorized likelihood for exactly the same model.
If we want to evaluate the goodness of the model part \(p(y_t \mid f_t)\), LOO is fine. If we want to evaluate the goodness of the time series model part \(p(f_1,\ldots,f_T)\), we may be interested in 1) the goodness of predicting missing data in the middle (think about audio restoration of recorded music with missing parts, e.g. due to scratches in the medium) or 2) predicting the future (think about stock markets or disease transmission models).
If the likelihood is factorizable (and if it's not, we can in some cases make it factorizable; see Can cross-validation be used for time series?), then this shows in the Stan code as a sum of log-likelihood terms. It is then possible to define entities which are sums of those individual log-likelihood components. If the sums are related to exchangeable parts, we may use terms like leave-one-observation-out (LOO), leave-one-subject-out, leave-one-time-point-out, etc. If we additionally want to restrict the information flow (for example, in time series, the constraint that if \(y_t\) is not observed then \(y_{t+1},\ldots,y_{T}\) are not observed either), we can use leave-future-out (LFO).
How do we then choose the level of what to leave out in cross-validation? It depends on which level of the model is interesting, and if many levels are interesting, then we can do cross-validation at different levels.
If we want to claim that a scientific hypothesis generalizes outside the specific observations we have, we need to define what is scientifically interesting. We are limited by the range of the data observed (see Vehtari and Lampinen (2002), and Vehtari and Ojanen (2012)), but if we can't generalize even within that range, there is no hope of generalizing outside of it. For example, in brain signal analysis it is useful to know whether the time series model for the brain signal is good, but it is scientifically more interesting to know whether the model learned from a set of brains works well also for new brains not included in the data used to learn the posterior (the training set in ML terms). Here we are limited to assessing generalization in the subject population, for example, young university students. If we can't generalize from one brain to another even in that limited population, there is no hope of generalizing to the brains of non-young-university-students.
The short answer is “Yes”. A hierarchical model is useful, for example, if there are several subjects and several trials for each subject. As discussed in When is cross-validation valid?, it is useful to think of the prediction task or generalizability over different exchangeable entities. We can use different types of cross-validation to choose the focus. This means that different forms of cross-validation are also valid for hierarchical models.
See also
The short answer is “Yes” (see, e.g., Bürkner, Gabry and Vehtari (2020) and Cooper et al. (2023)). If there is a model \(p(y_i \mid f_i,\phi)\) and a joint time series prior for \((f_1,...,f_T)\), then the \(y_i\) can be considered independent given \(f_i\) and \(\phi\), and the likelihood is factorizable. It is often true that the past values are informative about future values, but conditional on knowing \(f_i\), the past values do not provide additional information. This should not be confused with the case where we don't know \(f_i\) and integrate over the posterior of \((f_1,...,f_T)\): then the \(y_i\) are not conditionally independent given \(\phi\). They are also no longer exchangeable, as the time ordering carries additional information. \(M\)-step-ahead prediction (see Bürkner, Gabry and Vehtari (2020)) is more about the usual interest in predicting the future and evaluating the time series model for \((f_1,...,f_T)\), but leave-one-out cross-validation is valid for assessing the conditional part \(p(y_i \mid f_i)\). Even if we are interested in predicting the future, K-fold-CV with a joint log score can have better model selection performance for time series models than LFO, due to smaller variance (Cooper et al., 2023).
See also
The short answer is “Yes”. This is closely related to the question about time series. If there is a model \(p(y_i \mid f_i,\phi)\) and a joint spatial prior for \((f_1,...,f_T)\), then the \(y_i\) can be considered independent given \(f_i\) and \(\phi\), and the likelihood is factorizable. It is often true that observations in nearby regions are correlated, but conditional on knowing \(f_i\), the nearby observations do not provide additional information. This should not be confused with the case where we don't know \(f_i\) and integrate over the posterior of \((f_1,...,f_T)\): then the \(y_i\) are not conditionally independent given \(\phi\). They are also no longer exchangeable, as the spatial arrangement carries additional information.
See also
The short answer is “Yes”. Vehtari, Gelman and Gabry (2017) state: “Instead of the log predictive density \(\log p(\tilde{y}_i \mid y)\), other utility (or loss) functions \(u(p(\tilde{y}_i \mid y), \tilde{y})\) could be used, such as classification error.” See also Vehtari and Ojanen (2012).
A vignette for the loo package about other utility and loss functions is a work in progress, but there are examples elsewhere:
The loo package has functions (http://mc-stan.org/loo/reference/E_loo.html) for computing the necessary expectations. We have a plan for adding more on other utilities and loss functions (see a GitHub issue).
We recommend the log predictive density (log score) for model comparison in general, as it measures the goodness of the whole predictive distribution, including the tails (see also Vehtari and Ojanen (2012)). We also recommend using application-specific utility and loss functions, which can indicate whether the predictive accuracy is good enough in practice according to application expertise. It is possible that one model is better than the others, but still not useful in practice. We are happy to get feedback on utility and loss functions other than log score, RMSE, ACC and \(R^2\) that would be even more application specific.
See also
Log densities and log probabilities can be transformed to densities and probabilities, which have an intrinsic interpretation, although most people are not well calibrated for these values, as they are not used to thinking in densities and probabilities, and even less in log densities and log probabilities.
The log probabilities are easier. For example, Guido Biele had a problem computing elpd_loo with a beta-binomial model for data with 22 categories. The computed individual elpd_loo values for observations were around -461. For a discrete model with uniform probability over 22 categories, the log probabilities would be \(\log(1/22)\approx -3.1\), and thus there was a two-orders-of-magnitude error on the log scale. With the fixed code, the individual elpd_loo values were about \(-2.3>-3.1\), that is, the model was beating the uniform distribution.
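The uniform baseline check above is easy to reproduce; here is a short Python sketch (the -461 and -2.3 values are the ones reported in the anecdote):

```python
import math

# Log probability of a uniform distribution over 22 discrete categories:
# the weakest sensible baseline for a 22-category model.
baseline = math.log(1 / 22)      # roughly -3.1

# The buggy fit gave pointwise elpd_loo values around -461, two orders of
# magnitude below the baseline on the log scale, so something had to be wrong.
buggy, fixed = -461.0, -2.3
assert buggy < baseline < fixed  # the fixed model beats the uniform baseline
```

Any pointwise log probability far below the uniform baseline is a useful red flag for a bug in the log-likelihood computation.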
The log densities are more difficult, as they require knowing possible scalings or transformations of the data. See more in Can cross-validation be used to compare different observation models / response distributions / likelihoods?.
Although ELPD is good for model comparison, as it measures the goodness of the whole predictive distribution, the difference in ELPD is difficult to interpret without some practice, and thus we recommend also using application-specific utility or loss functions. See more in Can other utility and loss functions be used than log predictive density?.
As a quick rule: If the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small (Sivula, Magnusson and Vehtari, 2020). If the elpd difference is larger than 4, then compare that difference to the standard error of elpd_diff (provided e.g. by the loo package) (Sivula, Magnusson and Vehtari, 2020). The value for deciding what is small or large can be based on the connection to Pseudo-BMA+ weights (Yao et al., 2018). See also How to interpret the Standard error (SE) of elpd difference (elpd_diff)?.
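The quick rule above can be written down as a small helper. This is a hypothetical illustration, not part of the loo package; the factor of 2 used for the SE comparison is an illustrative choice, not a rule stated by the references.

```python
def interpret_elpd_diff(elpd_diff: float, se_diff: float) -> str:
    """Hypothetical helper sketching the quick rule: elpd differences below 4
    are small; larger differences are compared to their standard error
    (here using 2 * SE as an illustrative cutoff)."""
    if abs(elpd_diff) < 4:
        return "small difference: models have very similar predictive performance"
    if abs(elpd_diff) > 2 * se_diff:
        return "large difference relative to its SE"
    return "large difference, but uncertain relative to its SE"

print(interpret_elpd_diff(2.0, 1.5))
print(interpret_elpd_diff(12.0, 3.0))
```

The point of the helper is only to make the two-step logic explicit: first check the magnitude, then check it against the SE.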
The short answer is “Yes”. First, to make the terms clearer: \(p(y \mid \theta)\) as a function of \(y\) is an observation model, and \(p(y \mid \theta)\) as a function of \(\theta\) is a likelihood. It is thus better to ask “Can cross-validation be used to compare different observation models?”
rstanarm and brms check that the hash of \(y\) is the same. If \(y\) is transformed, then the Jacobian of that transformation needs to be included. There is an example of this in the mesquite case study. The likelihood is a function with respect to the parameters, and a discrete observation model can have a continuous likelihood function, and a continuous observation model can have a discrete likelihood function. For example, Stan doesn't allow discrete parameters unless they are integrated out by summing, and thus in Stan you can mix only discrete and continuous observation models which have continuous likelihood functions.
First we need to think about which utility or loss functions make sense for different data types. The log score can be used for both discrete and continuous data. Second, we need to be careful with how continuous data is scaled: in the case of the log score, for example, the scaling affects the log densities, and then the log probabilities and log densities of arbitrarily scaled data are not comparable, and their contributions would have arbitrary weights in the combined expected utility or loss.
Scaling the data doesn't change the probabilities in a discrete observation model, but it does change the probability densities in a continuous observation model. People often scale continuous data before modeling, for example, to have a standard deviation of 1. The same holds for other transformations: e.g., people might compare a Poisson model for discrete counts to a normal model for log counts, and then the results are not comparable. When the probabilities don't change but the densities do, the relative weights of the components change. So we need to be careful, either by explicitly discretizing the continuous distribution to probabilities (see “Can cross-validation be used to compare different observation models / response distributions / likelihoods?”) or by keeping the scale such that the densities correspond directly to a sensible discretization.
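The effect of scaling on log densities is easy to verify numerically. In this sketch (toy numbers, no real model), rescaling the data by a factor \(c\) shifts the log density by exactly the log-Jacobian \(\log c\), which is why log densities of differently scaled data are not directly comparable:

```python
import math

# Normal log density, written out so no external libraries are needed.
def norm_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# An observation and its rescaled version (e.g. standardizing by sd c).
y, c = 2.0, 10.0
z = y / c

# Model for the raw data: y ~ normal(0, 10).
# The same model after scaling: z ~ normal(0, 1).
lp_y = norm_logpdf(y, 0.0, 10.0)
lp_z = norm_logpdf(z, 0.0, 1.0)

# The two log densities differ exactly by the log-Jacobian log(c); without
# this correction, elpd values computed on different scalings of the same
# data are not comparable.
assert abs((lp_z - lp_y) - math.log(c)) < 1e-12
```

Including the Jacobian term restores comparability, which is what the mesquite case study mentioned above demonstrates for a log transformation.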
We can also report the performance for discrete and continuous data separately, by summing the individual pointwise results for discrete and continuous separately.
As we have only a finite number \(N\) of observations, which cross-validation uses as a proxy for the future data, there is uncertainty in the LOO estimate.
As \(\widehat{\mathrm{elpd}}_\mathrm{loo}\) is defined in Equation (4) of Vehtari, Gelman and Gabry (2017) as a sum and not as a mean, we multiply the standard deviation of the individual terms by \(\sqrt{N}\) instead of dividing it by \(\sqrt{N}\).
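The sum-versus-mean distinction can be checked with a short sketch. The pointwise values here are simulated stand-ins for the per-observation elpd_loo contributions; this is not the loo package implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pointwise elpd_loo values for N observations.
N = 200
pointwise = rng.normal(-1.2, 0.4, size=N)

elpd_loo = pointwise.sum()

# Because elpd_loo is a sum of N terms, its standard error is the standard
# deviation of the pointwise values multiplied by sqrt(N), equivalently
# sqrt(N * var); dividing by sqrt(N) would be the SE of the mean instead.
se_elpd_loo = np.sqrt(N) * pointwise.std(ddof=1)
assert np.isclose(se_elpd_loo, np.sqrt(N * pointwise.var(ddof=1)))
```

The same construction gives the SE of elpd_diff by applying it to the pointwise differences between two models.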
The SE assumes that the normal approximation describes well the uncertainty related to the expected difference. Because the cross-validation folds are not independent, the SE tends to be underestimated, especially if the number of observations is small or the models are badly misspecified. The whole normal approximation tends to fail if the models are very similar or badly misspecified. More about the failure modes, diagnostics, and recommendations is available in the paper by Sivula, Magnusson and Vehtari (2020).
tl;dr: When the difference (elpd_diff) is larger than 4, the number of observations is larger than 100, and the model is not badly misspecified, then the normal approximation and SE are a quite reliable description of the uncertainty in the difference. Differences smaller than 4 are small, and then the models have very similar predictive performance, and it doesn't matter if the normal approximation fails or the SE is underestimated (Sivula, Magnusson and Vehtari, 2020).
This is about Pareto-\(\hat{k}\) (khat) diagnostic for PSIS-LOO.
The Pareto-\(\hat{k}\) is a diagnostic for Pareto smoothed importance sampling (PSIS) (Vehtari, Gelman and Gabry, 2017), which is used to compute the components of elpd_loo. In importance-sampling LOO (where the full posterior distribution is used as the proposal distribution), the Pareto-\(\hat{k}\) diagnostic estimates how far an individual leave-one-out distribution is from the full distribution. If leaving out an observation changes the posterior too much, then importance sampling is not able to give a reliable estimate. If \(\hat{k}<0.5\), the corresponding component of elpd_loo is estimated with high accuracy. If \(0.5<\hat{k}<0.7\), the accuracy is lower, but still OK. If \(\hat{k}>0.7\), importance sampling is not able to provide a useful estimate for that component/observation. Pareto-\(\hat{k}\) is also useful as a measure of the influence of an observation. Highly influential observations have high \(\hat{k}\) values. Very high \(\hat{k}\) values often indicate model misspecification, outliers, or mistakes in data processing. See Section 6 of Gabry et al. (2019) for an example.
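To make the importance-sampling step concrete, here is a sketch of plain IS-LOO without the Pareto smoothing step (the smoothing of the ratio tail is exactly what PSIS adds and what \(\hat{k}\) diagnoses). The log-likelihood draws are simulated; this is not the loo package implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Plain importance-sampling LOO: with the full posterior as proposal, the
# importance ratios for observation i are r_s = 1 / p(y_i | theta_s), so
#   p(y_i | y_-i) ~= sum_s r_s p(y_i | theta_s) / sum_s r_s
#                  = S / sum_s (1 / p(y_i | theta_s)).
# PSIS replaces the raw ratios with Pareto-smoothed ones to stabilize this.
S, N = 4000, 50
log_lik = rng.normal(-1.0, 0.3, size=(S, N))   # hypothetical pointwise log-lik draws

log_r = -log_lik                               # log importance ratios
# Log of the self-normalized estimate, computed stably:
# log(sum_s r_s * p(y_i|theta_s)) - log(sum_s r_s) = log S - logsumexp(-log_lik).
elpd_loo_i = np.log(S) - np.logaddexp.reduce(log_r, axis=0)
elpd_loo = elpd_loo_i.sum()
```

When the ratio distribution has a heavy right tail (high \(\hat{k}\)), this raw estimator is dominated by a few draws and becomes unreliable, which is why the diagnostic matters.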
If there are high \(\hat{k}\) values, we can gain additional information by looking at p_loo, reported, e.g., by the loo package. p_loo is a measure of the effective number of parameters (see more in What is the interpretation of p_loo?).
If \(\hat{k} > 0.7\) then we can also look at the p_loo estimate for some additional information about the problem:
Although high \(\hat{k}\) values indicate problems with the model or the computation, we don't always need to fix these problems. If we get high \(\hat{k}\) values but the model is clearly bad (based on PPC or a much, much worse elpd), we can just discard that model without fixing the computation. If we get a small number of high \(\hat{k}\) values in the initial part of the workflow, we may proceed without fixing the issues, but if in the final stages we are comparing models that have similar performance, there are some high \(\hat{k}\) values, and we want to minimize the probability of wrong conclusions, it's best to fix the problems. There is no threshold for how many high \(\hat{k}\) values would be acceptable.
For more information see
Moment matching LOO can be used to reduce the number of high Pareto \(k\)’s faster than by refitting all problematic cases.
Yes, but you are likely to get many high Pareto \(k\)'s if the prior is weak or if there are parameters which see information from only one observation each (e.g. “random” effect models). See an example in the Section Poisson model with “random effects” in the Roaches cross-validation demo.
p_loo is called the effective number of parameters and can be computed as the difference between elpd_loo and the non-cross-validated log posterior predictive density (Equations (4) and (3) in Vehtari, Gelman and Gabry (2017)). It is not needed for elpd_loo, but it has diagnostic value. It describes how much more difficult it is to predict future data than the observed data. Asymptotically, under certain regularity conditions, p_loo can be interpreted as the effective number of parameters. In well-behaving cases p_loo \(< N\) and p_loo \(< p\), where \(p\) is the total number of parameters in the model. p_loo \(> N\) or p_loo \(> p\) indicates that the model has very weak predictive capability. This can happen even in the case of a well-specified model (as demonstrated in Figure 1 of Vehtari, Gelman and Gabry (2017)), but it may also indicate severe model misspecification. See more in “Interpreting p_loo when Pareto-\(\hat{k}\) is large” in the LOO Glossary.
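The definition of p_loo as a difference of two sums can be sketched directly. The pointwise values below are simulated stand-ins (not the loo package output) for the quantities in Equations (3) and (4):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pointwise quantities for N observations:
#   lpd_i : non-cross-validated log pointwise predictive density (Eq. 3)
#   loo_i : leave-one-out log predictive density (Eq. 4)
N = 100
lpd = rng.normal(-1.0, 0.3, size=N)
loo = lpd - np.abs(rng.normal(0.05, 0.02, size=N))  # LOO is never more optimistic

# p_loo is the difference between the two sums: how much harder it is to
# predict left-out data than data the model has already seen.
p_loo = lpd.sum() - loo.sum()
assert p_loo > 0
```

In a real analysis one would then compare p_loo to \(N\) and to the total number of parameters \(p\), as described above.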
See for some limitations, for example, Limitations of “Limitations of Bayesian Leave-one-out Cross-Validation for Model Selection” and Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.
LOO is a cross-validation approach which can be used to estimate expected utility and loss functions. If the log score is used, LOO can be used to estimate the ELPD, and we write elpd_loo.
WAIC is a computational method to estimate the ELPD, and we could write elpd_waic. Watanabe used the log score, but as WAIC has often been presented as an alternative to DIC, which used -2 times the log score, WAIC is sometimes used to estimate -2*ELPD. See also How are LOOIC and elpd_loo related? Why is LOOIC -2*elpd_loo?.
In theory, the computational method used in WAIC, which corresponds to a truncated Taylor series approximation of leave-one-out cross-validation, could be used with other smooth utility and loss functions than the log score (Vehtari and Ojanen, 2012), but we are not aware of anyone doing that, and we don't recommend it, as PSIS-LOO has better diagnostics.
All the limitations on when LOO is valid or sensible also hold for WAIC (see When is cross-validation valid?, Can cross-validation be used for hierarchical/multilevel models?, and Can cross-validation be used for time series?). Thinking in terms of LOO cross-validation, it is easier to move to other cross-validation data division schemes.
Vehtari, Gelman and Gabry (2017) show that PSIS-LOO usually has a smaller error in estimating the ELPD than WAIC. The exception is when p_loo \(\ll N\), as then WAIC tends to have a slightly smaller error, but in that case both PSIS-LOO and WAIC have very small errors and it doesn’t matter which computational approximation is used. On the other hand, for flexible models WAIC fails more easily, has significant bias, and is harder to diagnose for failures. WAIC has been included in the loo package only for comparison purposes and to make it easy to replicate the results in Vehtari, Gelman and Gabry (2017).
Historically, some of the information criterion papers used -2 * log score instead of the simple log score. The reason for the -2 dates back to the time when maximum likelihood was commonly used, as for a Gaussian model with known variance, -2 * log score is equal to the squared error. Also, asymptotically, when using maximum likelihood for estimation and the likelihood ratio test for null hypothesis testing within nested GLMs, there is a connection to the Chi^2 distribution.
The historical -2 was carried over to DIC, which still used point estimates. Watanabe did not use -2 in his WAIC paper. However, when people started using WAIC instead of DIC, some thought it would be useful to keep the same scale for comparison. This is what happened also in BDA3, but later, for example, Vehtari, Gelman and Gabry (2017) do not use -2 anymore, as the above-mentioned connections do not hold in general for Bayesian models in the finite case, and there is no benefit in multiplying by -2. Future printings of BDA3 also recommend not using -2.
If you prefer minimizing losses instead of maximizing utilities, multiply by -1.
The benefit of not using the multiplier 2 is that elpd_loo and p_loo are then on the same scale, which makes comparing models and using p_loo for diagnostics easier.
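Converting between the scales is just multiplication; for instance (the numbers below are made up):

```python
# Hypothetical elpd_loo estimate and its standard error
elpd_loo, se_elpd_loo = -412.7, 15.3

looic = -2 * elpd_loo        # LOOIC on the historical deviance scale: 825.4
se_looic = 2 * se_elpd_loo   # the standard error scales by |-2|
```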
For a longer answer, see Vehtari and Ojanen (2012). Akaike’s original idea for an information criterion (AIC) was to estimate future predictive performance. There are many other information criteria, which make different assumptions about the model, inference, and prediction task (Vehtari and Ojanen, 2012). For the most common ones, the differences can be summarised as follows:
Assuming a regular and true model, these are asymptotically (with \(N \rightarrow \infty\)) the same. In the finite case and for singular models, WAIC and Bayesian LOO are the most sensible for Bayesian modeling. Bayesian LOO has benefits over WAIC, as discussed in How are LOO and WAIC related?.
LOO-CV estimates the predictive performance given \(N-1\) observations. The Bayes factor can be presented as a ratio of predictive performance estimates given \(0\) observations. Alternatively, the Bayes factor can be interpreted as choosing the maximum a posteriori model.

- The Bayes factor can be sensible when the models are well specified and there is a lot of data compared to the number of parameters, so that the maximum a posteriori estimate is fine and the result is not sensitive to the priors.
- If there is not a lot of data compared to the number of parameters, the Bayes factor can be much more sensitive to the prior choice than LOO-CV.
- If the models are not very close to the true model, the Bayes factor can be more unstable than cross-validation (Yao et al., 2018; Oelrich et al., 2020).
- Computation of the Bayes factor is more challenging. For example, if computed from an MCMC sample, usually sample sizes several orders of magnitude bigger are needed for the Bayes factor than for LOO-CV.
- If the models are well specified, regular, and there is a lot of data compared to the number of parameters (\(n \gg p\)), then the Bayes factor may have smaller variance than LOO-CV.

If the models are nested, instead of the Bayes factor, it is also possible to look directly at the posterior of the interesting parameters (see also 2b in Using cross-validation for many models).
LOO-PIT is a form of posterior predictive checking (PPC). In the usual PPC, the marginal predictive distribution is compared to all data (sometimes in groups). See, for example, a bayesplot vignette.
If we want to focus on conditional predictive distributions, we often have only one observation for each conditional predictive distribution. Using the probability integral transformation (PIT, which corresponds to the cumulative probability) we can transform the comparison of a conditional predictive distribution and one observation into values in [0,1] which jointly have a close-to-uniform distribution if the conditional predictive distributions are well calibrated. This type of checking can be more sensitive in revealing model misspecifications that are not visible in the usual marginal PPC.
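To make the construction concrete, here is a minimal Python sketch of LOO-PIT for a simple normal model with invented data. It uses plain (unsmoothed) importance weights, whereas the loo package’s loo_pit() uses PSIS-smoothed weights:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical normal(mu, 1) model: S posterior draws, N observations.
N, S = 200, 4000
y = rng.normal(0.0, 1.0, N)
mu = rng.normal(y.mean(), 1 / np.sqrt(N), S)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2

# Importance weights proportional to 1/p(y_i|theta_s), normalized over draws
# (unsmoothed here for brevity; PSIS smoothing makes this robust in harder cases).
logw = -log_lik
w = np.exp(logw - np.logaddexp.reduce(logw, axis=0))

# LOO-PIT_i = weighted probability that a replicate y_rep is <= y_i,
# using one y_rep draw per posterior draw.
y_rep = rng.normal(mu[:, None], 1.0, (S, N))
pit = np.sum(w * (y_rep <= y[None, :]), axis=0)

# If the model is well calibrated, pit should be close to uniform on [0, 1].
```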
For models with a small number of parameters and a large number of observations, the posterior predictive and LOO predictive distributions are likely to be very similar, and we could do conditional predictive distribution PIT checking without LOO. For more flexible models, smaller numbers of observations, or highly influential observations (due to model flexibility or misspecification), using LOO predictive distributions can be beneficial (especially when examining the conditional predictive distributions). See some examples in the loo package documentation.
LOO-PIT (or posterior predictive PIT) values are, in the finite case, only close to uniform and not exactly uniform, even if the model includes the true data generating distribution. With small to moderate data sets the difference can be so small that we can’t see it, but that is why above we wrote “close to uniform” instead of “uniform”. The difference can be illustrated with a simple normal model. Assume that the data come from a normal distribution and consider a model \(\mathrm{normal}(\mu, \sigma)\) with classic uninformative priors. The posterior predictive distribution can then be computed analytically and is a Student’s \(t\) distribution. PIT values from comparing a Student’s \(t\) distribution to the normally distributed data are not uniformly distributed, although with increasing data size the predictive distribution converges towards the true data generating distribution and the PIT value distribution converges toward uniform. Thus, in theory, with finite data we can see a slight deviation from uniformity, but it can be assumed to be small compared to what would be observed in case of bad model misspecification.
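This can be checked numerically with a small Python simulation (sample sizes are arbitrary). For each replication, normal data are drawn, the analytic Student’s \(t\) posterior predictive is formed, and the posterior predictive PIT values of the same observations used for fitting are recorded:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Posterior predictive PIT for the normal model with classic uninformative
# priors: the predictive is a location-scale Student-t with n-1 degrees of
# freedom, evaluated at the same observations used for fitting.
n, reps = 10, 5000
pits = []
for _ in range(reps):
    y = rng.normal(size=n)
    scale = y.std(ddof=1) * np.sqrt(1 + 1 / n)
    pits.append(stats.t.cdf(y, df=n - 1, loc=y.mean(), scale=scale))
pits = np.concatenate(pits)

# Because each observation also influenced the fit, the PIT values pile up
# slightly toward 0.5: their variance falls below the 1/12 of an exact uniform.
print(pits.var() < 1 / 12)
```

With n = 10 the deviation from uniformity is visible but small, illustrating the point above.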
At the moment, the loo package LOO-PIT functions don’t yet support PIT values for discrete target distributions, but it is on the to-do list of the package maintainers.
Unbiasedness has a special role in statistics, and too often there are dichotomous comments that something is not valid or is inferior because it’s not unbiased. However, the non-zero bias is often negligible, and by modifying the estimator we may even increase the bias but reduce the variance a lot, providing overall improved performance.
In CV the goal is to estimate the predictive performance for unobserved data given the observed data of size \(n\). CV has a pessimistic bias due to using fewer than \(n\) observations to fit the models. For LOO-CV this bias is usually small and negligible. For \(K\)-fold-CV with a small \(K\), the bias can be non-negligible, but if the effective number of parameters of the model is much less than \(n\), then with \(K>10\) the bias is usually negligible compared to the variance.
There is a bias correction approach by Burman (1989) (see also Fushiki (2011)) that reduces the CV bias, but even in cases with non-negligible bias reduction, the variance tends to increase so much that there is no real benefit (see, e.g., Vehtari and Lampinen (2002)).
For time series, when the task is to predict the future (there are other possibilities, such as missing data imputation), there are specific CV methods, such as leave-future-out (LFO), that have lower bias than LOO-CV or \(K\)-fold-CV (Bürkner, Gabry and Vehtari, 2020). It is sometimes claimed that LOO-CV and \(K\)-fold-CV are invalid for time series. Although they tend to have a bigger bias than LFO, they are still valid and can be useful, especially in model comparison, where the bias can cancel out.
Cooper et al. (2023) demonstrate how, in time series model comparison, the variance is likely to dominate, so it is more important to reduce the variance than the bias, and leaving out a few observations and using the joint log score is better than using LFO. The problem with LFO is that the data sets used for fitting the models are smaller, which increases the variance.
Bengio and Grandvalet (2004) proved that there is no unbiased estimate of the variance of CV in general, which has later been used as an argument that there is no hope. Instead of dichotomizing into unbiased or biased, Sivula, Magnusson and Vehtari (2020) consider whether the variance estimates are useful and how to diagnose when the bias is likely to be non-negligible (Sivula, Magnusson and Vehtari (2023) also prove a special case where an unbiased variance estimate does exist).
CV tends to have high variance because the sample reuse does not make any modeling assumptions (this holds also for information criteria such as WAIC). Not making modeling assumptions is good when we don’t trust our models, but if we do trust them, we can get reduced variance in model comparison, for example, by examining the posterior directly or by using reference models to filter out noise in the data (see, e.g., Piironen, Paasiniemi and Vehtari (2018) and Pavone et al. (2020)).
When using CV (or information criteria such as WAIC) for model selection, the performance estimate for the selected model has an additional selection-induced bias. With a small number of models this bias is usually negligible, that is, smaller than the standard deviation of the estimate or smaller than what is practically relevant. With a negligible bias we may choose a suboptimal model, but the difference to the performance of the oracle model is small. With a large number of models the selection-induced bias can be non-negligible, but it can be estimated using, for example, nested CV or the order statistics approach by McLatchie and Vehtari (2023).
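The selection-induced bias is easy to see in a toy Python simulation (made-up numbers): if many candidate models are truly equal and each CV estimate is just noise, the apparent best model’s estimate is optimistically biased, and the bias grows with the number of candidates:

```python
import numpy as np

rng = np.random.default_rng(4)

# M candidate models, all with true elpd difference 0 to a baseline; each CV
# estimate adds independent noise with sd 1. Picking the maximum estimate
# inflates the reported performance of the winner.
reps = 10000
biases = []
for M in (2, 10, 100):
    est = rng.normal(0.0, 1.0, size=(reps, M))  # noisy CV estimates, truth is 0
    biases.append(est.max(axis=1).mean())       # average estimate of the "best"
    print(M, round(biases[-1], 2))
```

The printed bias is positive for every M and increases with M, even though no model is actually better than the baseline.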
Ambroise, C. and McLachlan, G. J. (2002) ‘Selection bias in gene extraction on the basis of microarray gene-expression data’, Proceedings of the National Academy of Sciences, 99(10), pp. 6562–6566.
Bengio, Y. and Grandvalet, Y. (2004) ‘No unbiased estimator of the variance of \(k\)-fold cross-validation’, Journal of Machine Learning Research, 5, pp. 1089–1105.
Bürkner, P.-C., Gabry, J. and Vehtari, A. (2020) ‘Approximate leave-future-out cross-validation for Bayesian time series models’, Journal of Statistical Computation and Simulation, 90, pp. 2499–2523.
Burman, P. (1989) ‘A comparative study of ordinary cross-validation, \(v\)-fold cross-validation and the repeated learning-testing methods’, Biometrika, 76(3), pp. 503–514.
Burnham, K. P. and Anderson, D. R. (2002) Model selection and multi-model inference: A practical information-theoretic approach. 2nd edn. Springer.
Cawley, G. C. and Talbot, N. L. C. (2010) ‘On over-fitting in model selection and subsequent selection bias in performance evaluation’, Journal of Machine Learning Research, 11, pp. 2079–2107.
Cooper, A., Simpson, D., Kennedy, L., Forbes, C. and Vehtari, A. (2023) ‘Cross-validatory model selection for Bayesian autoregressions with exogenous regressors’, arXiv preprint arXiv:2301.08276.
Fushiki, T. (2011) ‘Estimation of prediction error by using k-fold cross-validation’, Statistics and Computing, 21, pp. 137–146.
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M. and Gelman, A. (2019) ‘Visualization in Bayesian workflow’, Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2), pp. 389–402.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2013) Bayesian data analysis, third edition. CRC Press.
Gelman, A., Goodrich, B., Gabry, J. and Vehtari, A. (2019) ‘R-squared for Bayesian regression models’, The American Statistician, 73(3), pp. 307–309.
Kalliomäki, I., Vehtari, A. and Lampinen, J. (2005) ‘Shape analysis of concrete aggregates for statistical quality modeling’, Machine Vision and Applications, 16(3), pp. 197–201.
McLatchie, Y., Rögnvaldsson, S., Weber, F. and Vehtari, A. (2023) ‘Robust and efficient projection predictive inference’, arXiv preprint arXiv:2306.15581.
McLatchie, Y. and Vehtari, A. (2023) ‘Efficient estimation and correction of selection-induced bias with order statistics’, arXiv preprint arXiv:2309.03742.
Merkle, E. C., Furr, D. and Rabe-Hesketh, S. (2019) ‘Bayesian comparison of latent variable models: Conditional versus marginal likelihoods’, Psychometrika, 84(3), pp. 802–829.
Oelrich, O., Ding, S., Magnusson, M., Vehtari, A. and Villani, M. (2020) ‘When are Bayesian model probabilities overconfident?’, arXiv preprint arXiv:2003.04026.
Paananen, T., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2021) ‘Implicitly adaptive importance sampling’, Statistics and Computing, 31(16).
Pavone, F., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2020) ‘Using reference models in variable selection’, arXiv preprint arXiv:2004.13118.
Piironen, J., Paasiniemi, M. and Vehtari, A. (2018) ‘Projective inference in high-dimensional problems: Prediction and feature selection’, arXiv preprint arXiv:1810.02406. Available at: https://arxiv.org/abs/1810.02406.
Piironen, J. and Vehtari, A. (2017) ‘Comparison of Bayesian predictive methods for model selection’, Statistics and Computing, 27(3), pp. 711–735. doi: 10.1007/s11222-016-9649-y.
Reunanen, J. (2003) ‘Overfitting in making comparisons between variable selection methods’, Journal of Machine Learning Research, 3, pp. 1371–1382.
Sivula, T., Magnusson, M. and Vehtari, A. (2020) ‘Uncertainty in Bayesian leave-one-out cross-validation based model comparison’, arXiv:2008.10296.
Sivula, T., Magnusson, M. and Vehtari, A. (2023) ‘Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance’, Communications in Statistics - Theory and Methods, 52(16), pp. 5877–5899.
Tosh, C., Greengard, P., Goodrich, B., Gelman, A., Vehtari, A. and Hsu, D. (2021) ‘The piranha problem: Large effects swimming in a small pond’, arXiv preprint arXiv:2105.13445.
Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing, 27(5), pp. 1413–1432. doi: 10.1007/s11222-016-9696-4.
Vehtari, A. and Lampinen, J. (2002) ‘Bayesian model assessment and comparison using cross-validation predictive densities’, Neural Computation, 14(10), pp. 2439–2468.
Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T. and Winther, O. (2016) ‘Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models’, The Journal of Machine Learning Research, 17(1), pp. 3581–3618.
Vehtari, A. and Ojanen, J. (2012) ‘A survey of Bayesian predictive methods for model assessment, selection and comparison’, Statistics Surveys, 6, pp. 142–228. doi: 10.1214/12-SS102.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y. and Gabry, J. (2019) ‘Pareto smoothed importance sampling’, arXiv preprint arXiv:1507.02646. Available at: https://arxiv.org/abs/1507.02646v6.
Yao, Y., Vehtari, A., Simpson, D. and Gelman, A. (2018) ‘Using stacking to average Bayesian predictive distributions (with discussion)’, Bayesian Analysis, 13(3), pp. 917–1003. doi: 10.1214/17-BA1091.
We thank Jonah Gabry, Ravin Kumar, Martin Modrák, and Erling Rognli for useful feedback on the draft of this FAQ, and all who have been asking questions and discussing answers.