Here are some answers by Aki Vehtari to frequently asked questions about cross-validation and the `loo` package. If you have further questions, please ask them in the Stan Discourse thread named Cross-validation FAQ.

Cross-validation is a family of techniques that try to estimate how well a model would predict previously unseen data by using fits of the model to a subset of the data to predict the rest of the data.

Cross-validation can be used to:

- Assess the predictive performance of a single model
- Assess model misspecification or calibration of the predictive distribution of a single model
- Compare multiple models
- Select a single model from multiple candidates
- Combine the predictions of multiple models

Even if the goal of the model is not to make predictions, a model that makes bad or badly calibrated predictions is less likely to provide useful insights into the phenomenon being studied.

Two basic reasons to use cross-validation for a single model are:

- We want to know how well the model can predict future or otherwise unseen observations.
- We want to know whether the model describes the observed data well, even though we are not going to make any predictions for the future.

More about these cases:

1 ) For example, Vehtari and Lampinen (2002) describe a model for predicting concrete quality based on the amount of cement, sand properties (see Kalliomäki, Vehtari and Lampinen (2005)), and additives. One of the quality measurements is compressive strength 3 months after casting. When constructing bridges, for example, it is very useful to be able to predict the compressive strength before casting the concrete. Vehtari and Lampinen (2002) estimated the 90% quantile of the absolute error for new castings; that is, they reported that in 90% of cases the difference between the prediction and the actual measurement 3 months after casting is less than the given value (other quantiles were also reported to the concrete experts). This way it was possible to assess whether the prediction model was accurate enough to have practical relevance.

2a) Even if we are not interested in predicting the actual future, a model that can make good predictions has learned something useful from the data. For example, if a regression model is not able to predict better than a null model (a model only for the marginal distribution of the data), then it has not learned anything useful from the predictors. Correspondingly, for time series models the predictors for the next time step can be the observed values at previous time steps.
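
As a toy sketch of this comparison (plain Python with simulated data, not the `loo` package workflow), we can compute exact leave-one-out log scores for a simple linear regression and for an intercept-only null model; when the predictor carries information, the regression should score higher:

```python
import math
import random

random.seed(1)
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]  # true slope is 2

def normal_logpdf(v, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

def ols_fit(xs, ys):
    # least-squares slope, intercept, and residual sd
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx)**2 for a in xs)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / sxx
    a0 = my - b * mx
    sd = math.sqrt(sum((c - (a0 + b * a))**2 for a, c in zip(xs, ys)) / len(xs))
    return a0, b, sd

def loo_logscore(use_x):
    # exact LOO: refit on n-1 points, score the held-out point
    total = 0.0
    for i in range(n):
        xs, ys = x[:i] + x[i+1:], y[:i] + y[i+1:]
        if use_x:
            a0, b, sd = ols_fit(xs, ys)
            mu = a0 + b * x[i]
        else:  # null model: marginal mean and sd only
            mu = sum(ys) / len(ys)
            sd = math.sqrt(sum((c - mu)**2 for c in ys) / len(ys))
        total += normal_logpdf(y[i], mu, sd)
    return total

print(loo_logscore(use_x=True) > loo_logscore(use_x=False))  # regression should win
```

Here the LOO predictive distributions are plug-in Gaussians rather than Bayesian posterior predictive distributions, but the logic (hold one point out, refit, score) is the same.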

2b) Instead of considering predictions for the future, we can consider whether we can generalize from some observations to others. For example, in social science we might build a model explaining poll results with demographic data. To test the model, instead of considering future polls, we could test whether the model can predict for a new state. If we have observed data from all states in the USA, then there are no new states (or it can take an unpredictably long time before there are), but we can simulate that situation by leaving out the data from one state and checking whether we can generalize from the other states to the left-out state. This is a sensible approach when we assume that states are exchangeable conditional on the information available (see, e.g., Gelman *et al.* (2013) Chapter 5 for exchangeability). The ability to generalize from one entity (a person, state, etc.) to another similar entity tells us that the model has learned something useful. It is very important to think about the level at which generalization is most interesting. For example, in cognitive science and psychology it would be more interesting to generalize from one person to another than from one trial to another trial for the same person. In cognitive science and psychology studies it is common that the study population consists of young university students, and thus there are limits to what we can say about generalization to the whole human population. In polling data from all US states, the whole population of US states has been observed, but there are limits to how we can generalize to other countries or future years.

2c) In addition to assessing predictive accuracy and generalizability, it is useful to assess how well calibrated the uncertainty quantification of the predictive distribution is. Cross-validation is useful when we don’t trust that the model is well specified, although many bad mis-specifications can also be diagnosed with simpler posterior predictive checking. See, for example, the roaches case study.

Three basic reasons to use cross-validation for multiple models are:

- We want to use the model with the best predictive performance.
- We want to use the model that has learned the most from the data and generalizes best between the entities of interest.
- We want to combine the predictions of many models, weighted by the estimated predictive performance of each model.

More about these cases:

1 ) Using cross-validation to select the model with the best predictive performance is relatively safe if there is a small or moderate number of models, and there is a lot of data compared to the model complexity or the best model is clearly the best (Piironen and Vehtari, 2017; Sivula et al., 2020). See also the section How to use cross-validation for model selection?.

2a) Cross-validation is especially useful when there are posterior dependencies between parameters and examining the marginal posterior of a parameter is not very useful for determining whether the model component related to that parameter is relevant. This happens, for example, in the case of collinear predictors. See, for example, the collinear, mesquite, and bodyfat case studies.

2b) Cross-validation is less useful for simple models with no posterior dependencies, assuming the simple model is not mis-specified. In that case the marginal posterior is less variable, as it includes the modeling assumptions (which we assume are not mis-specified), while cross-validation uses a non-model-based approximation of the future data distribution, which increases variability. See, for example, the betablockers case study.

2c) Cross-validation can provide a quantitative measure, which should complement, but not replace, an understanding of the qualitative patterns in the data (see, e.g., Navarro (2019)).

3 ) See more in How to use cross-validation for model averaging?.

See also the next section “When not to use cross-validation?”, [How is cross-validation related to overfitting?](#overfitting), and How to use cross-validation for model selection?.

In general there is no need to do any model selection (see more in How is cross-validation related to overfitting? and How to use cross-validation for model selection?). The best approach is to build a rich model that includes all the uncertainties, do model checking, and make possible model adjustments.

Cross-validation cannot directly answer the question “Do the data provide evidence for some effect being non-zero?” Using cross-validation to compare a model with an additional term to a model without that term is a kind of null hypothesis testing. Cross-validation can tell whether that extra term improves the predictive accuracy. The improvement in predictive accuracy is a function of the signal-to-noise ratio, the size of the actual effect, and how much the effect correlates with other included effects. If cross-validation prefers the simpler model, this is not necessarily evidence for the effect being exactly zero; it is possible that the effect is too small to make a difference, or that, due to dependencies, it doesn’t provide additional information beyond what is already included in the model. Often it makes more sense to just fit the larger model and explore the posterior of the relevant coefficient. Analysing the posterior can, however, be difficult if there are strong posterior dependencies.

Cross-validation is not good for selecting a model from a large number of models (see How to use cross-validation for model selection?).

- Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. *Statistics and Computing*, 27(5), 1413–1432. online.
- LOO glossary
- Model selection video lectures and Bayesian Data Analysis lectures 8.2, 9.1, 9.2.
- The decision-theoretic background of Bayesian cross-validation can be found in the article A survey of Bayesian predictive methods for model assessment, selection and comparison (Vehtari and Ojanen, 2012).

It is important to separate

- the way the data are divided in cross-validation, e.g., leave-one-out (LOO), leave-one-group-out (LOGO), and leave-future-out (LFO),
- the utility or loss, e.g., expected log predictive density (ELPD), root mean square error (RMSE), explained variance (\(R^2\)),
- the computational method used to compute the leave-one-out predictive distributions, e.g., K-fold-CV, Pareto smoothed importance sampling (PSIS),
- and the estimate obtained by combining these.

Different partitions of data are held out in different kinds of cross-validation.

- CV: cross-validation approach (no specific partition defined)
- LOO or LOO-CV: leave-one-out cross-validation approach (single observation)
- LFO: leave-future-out cross-validation approach (all future observations). See more in Can cross-validation be used for time series?.
- LOGO: leave-one-group-out cross-validation approach (a group of observations). See more in Can cross-validation be used for hierarchical / multilevel models?.
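
As a concrete illustration (a plain Python sketch with illustrative names, not the `loo` interface), the held-out index sets under these schemes for ten observations in two groups look like this:

```python
n = 10
groups = [0] * 5 + [1] * 5  # illustrative grouping of the observations

# LOO: each single observation is held out in turn
loo_folds = [[i] for i in range(n)]

# LFO: all observations after time t are held out (here t = 7)
lfo_fold = list(range(7, n))

# LOGO: all observations in one group are held out in turn
logo_folds = [[i for i in range(n) if groups[i] == g]
              for g in sorted(set(groups))]

print(loo_folds[0])   # [0]
print(lfo_fold)       # [7, 8, 9]
print(logo_folds[1])  # [5, 6, 7, 8, 9]
```

In each case the model is refit without the held-out set and its predictions are evaluated on that set.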

Which unit is systematically left out determines the predictive task on which cross-validation assesses model performance (see more in When is cross-validation valid?). CV, LOO, LFO, LOGO, and other cross-validation approaches do not yet specify the utility or loss, or how the computation is done, except that it involves estimating cross-validated predictive densities or probabilities.

First we need to define the utility or loss function which compares predictions to observations. These predictions can be considered to be for future observations, or for other exchangeable entities (see more in What is cross-validation?). Some examples:

- LPD or LPPD: Log pointwise predictive density for a new observation. For simplicity the LPD acronym is also used for expected log pointwise predictive probabilities for discrete models. Often the shorter term log score is used.
- RMSE: Root mean square error.
- ACC: Classification accuracy.
- \(R^2\): Explained variance (see, e.g., Gelman *et al.* (2019))
- 90% quantile of absolute error (see, e.g., Vehtari and Lampinen (2002))

These are examples of utility and loss functions for using the model to predict future data and then observing that data. Other utility and loss functions could also be used. See more in Can other utilities or losses be used than log predictive density?, the Scoring rule article in Wikipedia, and Gneiting and Raftery (2007).
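
As a plain-Python sketch (toy numbers, no real model; the Gaussian predictive sd is an assumption made only so the log score can be evaluated), some of these utilities and losses can be computed from point predictions and observations as follows:

```python
import math

y_obs  = [2.1, 1.4, 3.3, 0.7, 2.8, 1.9]  # made-up observations
y_pred = [2.0, 1.6, 3.0, 1.0, 2.5, 2.0]  # made-up point predictions
sigma  = 0.5                             # assumed predictive sd for the log score

# log score: sum of pointwise log predictive densities (Gaussian here)
lpd = sum(-0.5 * math.log(2 * math.pi * sigma**2) - (o - p)**2 / (2 * sigma**2)
          for o, p in zip(y_obs, y_pred))

# RMSE: root mean square error of the point predictions
rmse = math.sqrt(sum((o - p)**2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

# 90% quantile of absolute error (simple order-statistic estimate)
abs_err = sorted(abs(o - p) for o, p in zip(y_obs, y_pred))
q90 = abs_err[min(len(abs_err) - 1, math.ceil(0.9 * len(abs_err)) - 1)]

print(round(lpd, 3), round(rmse, 3), round(q90, 3))
```

In cross-validation, `y_pred` (or the full predictive density) would come from a model fit without the corresponding observation.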

The value of the loss functions necessarily depends on the data we observe next. We can however try to estimate an *expectation* of the loss (a summary of average predictive performance over several predictions or expected predictive performance for one prediction) under the assumption that both the covariates and responses we currently have are representative of those we will observe in the future.

- ELPD: The theoretical expected log pointwise predictive density for new observations (or other exchangeable entities) (Eq 1 in Vehtari, Gelman and Gabry (