Spatial regression models capture spatial dependence through assumptions of spatially correlated errors and/or covariates. Once the model is fitted, its performance is usually assessed using spatial cross-validation techniques, which also serve to estimate the prediction error.
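As an illustration of the setting (a hedged sketch only; the abstract does not fix a specific model), a spatial regression with Gaussian errors whose correlation decays with distance between locations can be written as

    Y(s_i) = X(s_i)^\top \beta + \varepsilon(s_i), \qquad
    \varepsilon = (\varepsilon(s_1), \dots, \varepsilon(s_n))^\top \sim \mathcal{N}(0, \Sigma), \qquad
    \Sigma_{ij} = C(\| s_i - s_j \|),

where C is a spatial covariance function evaluated at the sampling locations s_1, ..., s_n of the data set S_n.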
Spatial cross-validation methods rely on the key idea of "point separation": the training set is separated from the validation set by a "buffer zone" so as to achieve (approximate) independence between the two sets; otherwise, classical cross-validation techniques can yield poor evaluations under spatial dependence. The core issue in spatial cross-validation is the choice of the buffer size h: if h is too small, spatial dependence remains between training and validation points; if h is too large, too many training points are lost, possibly leading to poor estimation.
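As a concrete illustration, here is a minimal Python sketch of leave-one-out folds with a buffer zone (the helper name buffered_loo_folds and the use of plain Euclidean distances are assumptions for illustration, not part of the method discussed here):

    import numpy as np

    def buffered_loo_folds(coords, h):
        """Leave-one-out folds with point separation: for each held-out
        location, every training location within distance h is dropped."""
        coords = np.asarray(coords, dtype=float)
        n = len(coords)
        # pairwise Euclidean distances between the n sampling locations
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        for i in range(n):
            train = np.where(dist[i] > h)[0]  # keep only points beyond the buffer
            yield train, np.array([i])

    # usage sketch: coords is an (n, 2) array of locations, h the buffer radius
    # for train_idx, test_idx in buffered_loo_folds(coords, h=0.1): ...

The trade-off described above is visible here: a larger h removes more indices from train, while a smaller h leaves nearby, correlated points in the training set.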
It is natural to ask whether one can select a buffer size h that balances bias and variance. Chavez-Chong (2025) addresses this question by defining the discrepancy L_h between the cross-validation error and an "ideal" error; the optimal buffer size is the value of h that minimizes L_h. However, her results rest on empirical versions of the risks obtained by simulation. This work establishes theoretical results for the function L_h without estimating the ideal risk directly.
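To fix ideas (an illustrative form only; the precise definition is that of Chavez-Chong (2025)), the discrepancy can be thought of as

    L_h = \big( \widehat{R}_{\mathrm{cv}}(h) - R_{\mathrm{ideal}} \big)^2,
    \qquad h^{\star} = \arg\min_{h} L_h,

where \widehat{R}_{\mathrm{cv}}(h) denotes the cross-validation risk computed with buffer size h and R_{\mathrm{ideal}} the ideal prediction risk.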
Our strategy follows the line of work of Fermín and Ludena (2008) and Loubes and Ludena (2010), where model selection is based on a careful analysis of bias and variance. We consider the projection of the full data set S_n onto the training set and denote by W_h the matrix associated with this projection. First, we decompose L_h into two components, a variance term and a squared bias term; this decomposition clarifies how the choice of the buffer size h affects the quality of the prediction-error estimate. We then derive non-asymptotic upper bounds in probability, established under the assumption of Gaussian noise using isoperimetric concentration inequalities. These bounds depend on the covariance matrix of the spatial noise and on the projection matrices associated with the training set and with S_n. Finally, since estimation of the covariance matrix can be unstable under spatial dependence, we propose bootstrap-based corrections to improve the control and robustness of the theoretical bounds, and we introduce a corrected version of the cross-validation risk.
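As a hedged sketch of the kind of decomposition involved (the notation is illustrative; the exact terms are those derived in the work), write the observations as Y = f + \varepsilon with \varepsilon \sim \mathcal{N}(0, \Sigma). A generic bias-variance decomposition then reads

    L_h = \underbrace{\| (I - W_h) f \|^2}_{\text{squared bias}}
        \; + \; \underbrace{\operatorname{tr}\big( W_h \Sigma W_h^{\top} \big)}_{\text{variance}},

which makes explicit why the resulting upper bounds in probability depend on the noise covariance \Sigma and on the projection matrices, and why the bootstrap-based corrections target the terms involving \Sigma.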
- Poster