Same-All Cross-validation (SAC)
- Same-All Cross-validation (SAC) is a method that compares models trained on individual subsets (SAME) versus pooled data (ALL) to assess cross-subset similarity in non-i.i.d. settings.
- SAC employs nested K-fold cross-validation and paired t-tests to statistically evaluate whether data pooling improves model performance.
- SAC indicates when pooling heterogeneous data, such as data from different time periods or geographic regions, improves predictive accuracy and when it leads to negative transfer.
Same-All Cross-validation (SAC) is a principled approach for quantifying the similarity of learnable or predictable patterns across distinct data subsets in supervised learning. SAC evaluates whether model performance on a given test subset improves or deteriorates when training is performed on pooled data from all subsets (the ALL split) compared to training only on the target subset (the SAME split). This methodology serves as a key component of SOAK (Same/Other/All K-fold cross-validation), providing statistically rigorous insight into when subset pooling is advantageous versus harmful, particularly in contexts of non-i.i.d. train/test distributions due to temporal, geographic, or otherwise labeled heterogeneity (Hocking et al., 2024).
1. Definition and Position within SOAK
SAC operates by comparing two K-fold cross-validation models for each subset $s$: one trained exclusively on data from $s$ (SAME) and another trained on the union of all subsets (ALL). By focusing on the SAME versus ALL comparison, SAC isolates the effect of pooling on predictive error for each subset, omitting the “OTHER” model considered in the full SOAK framework.
In settings where the traditional i.i.d. assumption between train and test samples does not hold—such as temporal drift, spatial clustering, or other categorical splitting—SAC directly addresses the principal question: does pooling data across subsets yield improved predictive performance on new, potentially dissimilar, target subsets? If performance is enhanced by pooling, shared learnable structure is inferred; if degraded, this suggests qualitative differences between subsets leading to negative transfer (Hocking et al., 2024).
2. Algorithmic Procedure
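SAC utilizes a nested K-fold cross-validation over all subset–fold pairs. For each subset $s \in \{1, \dots, S\}$ and fold $k \in \{1, \dots, K\}$, the following definitions and model fits apply:
- Test set: $\mathcal{T}_{s,k}$, the samples of subset $s$ assigned to fold $k$.
- SAME training set: $\mathcal{D}^{\text{same}}_{s,k}$, the samples of subset $s$ outside fold $k$.
- ALL training set: $\mathcal{D}^{\text{all}}_{s,k} = \bigcup_{s'=1}^{S} \mathcal{D}^{\text{same}}_{s',k}$, the samples of every subset outside fold $k$.
On each fold:
- Train $f^{\text{same}}_{s,k}$ on $\mathcal{D}^{\text{same}}_{s,k}$.
- Train $f^{\text{all}}_{s,k}$ on $\mathcal{D}^{\text{all}}_{s,k}$.
- Evaluate $f^{\text{same}}_{s,k}$ and $f^{\text{all}}_{s,k}$ on $\mathcal{T}_{s,k}$, obtaining errors $E^{\text{same}}_{s,k}$ and $E^{\text{all}}_{s,k}$ respectively.
After iterating over all folds, for each $s$:
- Compute mean error across folds: $\bar{E}^{\text{same}}_{s} = \frac{1}{K} \sum_{k=1}^{K} E^{\text{same}}_{s,k}$, $\bar{E}^{\text{all}}_{s}$ analogously.
- Compute per-fold difference: $D_{s,k} = E^{\text{all}}_{s,k} - E^{\text{same}}_{s,k}$.
- Perform a paired $t$-test on $\{D_{s,k}\}_{k=1}^{K}$.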
This workflow yields for each subset a direct, statistically tested measure of the gain or loss accrued by pooling during training.
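The fold loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the reference implementation: the learner is a toy that predicts the training-set mean, the error metric is mean squared error, fold ids are assigned within each subset, and the names `fit_mean`, `mse`, and `sac_errors` are hypothetical.

```python
import random
from statistics import mean


def fit_mean(train_rows):
    """Toy learner (an assumption): predict the mean label of the training rows."""
    return mean(y for _, y in train_rows)


def mse(prediction, test_rows):
    """Mean squared error of a constant prediction on the test rows."""
    return mean((y - prediction) ** 2 for _, y in test_rows)


def sac_errors(data, n_folds=5, seed=0):
    """Return {subset: [(same_error, all_error), ...]} with one pair per fold.

    data maps each subset name to a list of (x, y) rows; every subset
    needs at least n_folds rows so that no test fold is empty.
    """
    rng = random.Random(seed)
    # Assign fold ids within each subset, so each subset appears in every fold.
    fold_ids = {}
    for subset, rows in data.items():
        ids = [i % n_folds for i in range(len(rows))]
        rng.shuffle(ids)
        fold_ids[subset] = ids

    errors = {subset: [] for subset in data}
    for k in range(n_folds):
        # SAME training rows of each subset: everything outside fold k.
        same_train = {
            s: [row for row, f in zip(rows, fold_ids[s]) if f != k]
            for s, rows in data.items()
        }
        # ALL training rows: the union of the SAME training rows of every subset.
        all_train = [row for s in data for row in same_train[s]]
        all_model = fit_mean(all_train)
        for s, rows in data.items():
            test = [row for row, f in zip(rows, fold_ids[s]) if f == k]
            same_model = fit_mean(same_train[s])
            errors[s].append((mse(same_model, test), mse(all_model, test)))
    return errors


# Two hypothetical subsets whose labels differ sharply (made-up data).
toy = {
    "A": [(i, 0.0) for i in range(20)],
    "B": [(i, 10.0) for i in range(20)],
}
result = sac_errors(toy)
```

In this toy case the SAME model recovers each subset's constant label exactly, while the ALL model predicts the pooled mean, so the ALL error exceeds the SAME error in every fold. A real analysis would swap `fit_mean` for an actual learner and `mse` for the error metric of interest.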
3. Mathematical Formulation and Similarity Score
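The central quantitative output of SAC is the Same-All similarity score for each subset $s$:
- Fold-wise difference: $D_{s,k} = E^{\text{all}}_{s,k} - E^{\text{same}}_{s,k}$, where $E^{\text{same}}_{s,k}$ and $E^{\text{all}}_{s,k}$ are the test errors of the SAME and ALL models on fold $k$ of subset $s$.
- Mean Same-All similarity: $\Delta_s = \frac{1}{K} \sum_{k=1}^{K} D_{s,k} = \bar{E}^{\text{all}}_{s} - \bar{E}^{\text{same}}_{s}$
The sign of $\Delta_s$ encodes the relevance of pooling:
- If $\Delta_s < 0$ ($\bar{E}^{\text{all}}_{s} < \bar{E}^{\text{same}}_{s}$), pooling reduces error, indicating high cross-subset similarity.
- If $\Delta_s > 0$, pooling raises error, implying predictive dissimilarity and possible negative transfer.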
This similarity score provides a standardized metric by which to judge the cohesiveness of patterns underlying labeled data groupings.
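As a small worked example, the score can be computed directly from per-fold errors. The sketch below assumes the sign convention used in this article (ALL error minus SAME error, so negative values favor pooling); the function name `same_all_score` and the error values are made up for illustration.

```python
from statistics import mean


def same_all_score(same_errors, all_errors):
    """Return (fold-wise differences, mean difference), taken as ALL minus SAME."""
    if len(same_errors) != len(all_errors):
        raise ValueError("need one SAME and one ALL error per fold")
    diffs = [a - s for s, a in zip(same_errors, all_errors)]
    return diffs, mean(diffs)


# Hypothetical per-fold test errors for one subset (K = 5 folds).
same = [0.20, 0.22, 0.18, 0.21, 0.19]
pooled = [0.15, 0.18, 0.16, 0.17, 0.14]

diffs, score = same_all_score(same, pooled)
if score < 0:
    verdict = "pooling helps"
elif score > 0:
    verdict = "pooling harms"
else:
    verdict = "neutral"
```

Here every fold-wise difference is negative and the mean is about -0.04, so the verdict is that pooling helps this hypothetical subset.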
4. Statistical Inference and Confidence Estimation
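Statistical significance of observed differences is assessed via a paired $t$-test over the $K$ values of $D_{s,k}$ for each subset $s$. Assuming approximate normality of these fold-level contrasts, the test statistic is
$$t_s = \frac{\Delta_s}{\hat{\sigma}_s / \sqrt{K}}, \qquad \hat{\sigma}_s^2 = \frac{1}{K-1} \sum_{k=1}^{K} \left(D_{s,k} - \Delta_s\right)^2,$$
where $D_{s,k}$ is the ALL error minus the SAME error on fold $k$ and $\Delta_s$ is its mean over folds,
with $K-1$ degrees of freedom. A two-sided $p$-value is reported to assess the null hypothesis $H_0: \mathbb{E}[D_{s,k}] = 0$. The $100(1-\alpha)\%$ confidence interval for the mean difference is:
$$\Delta_s \pm t_{1-\alpha/2,\,K-1} \, \frac{\hat{\sigma}_s}{\sqrt{K}}$$
Confidence intervals that lie entirely below zero indicate a significant benefit to pooling ($\Delta_s < 0$); intervals entirely above zero indicate harm from pooling.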
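The test statistic and confidence interval can be computed with the standard library alone, as in the sketch below. It assumes the fold-wise differences are ALL error minus SAME error, and it takes the Student-t critical value as an argument because the standard library has no t-distribution; the function name `paired_t_summary` and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_summary(diffs, t_crit):
    """Paired t statistic and confidence interval for fold-wise differences.

    diffs: per-fold differences (ALL error minus SAME error).
    t_crit: two-sided critical value of Student's t with len(diffs) - 1
            degrees of freedom, supplied by the caller.
    """
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two folds")
    d_bar = mean(diffs)
    std_err = stdev(diffs) / sqrt(k)  # stdev uses the K - 1 denominator
    if std_err == 0:
        raise ValueError("differences have zero variance; t is undefined")
    half_width = t_crit * std_err
    return {
        "mean": d_bar,
        "t": d_bar / std_err,
        "df": k - 1,
        "ci": (d_bar - half_width, d_bar + half_width),
    }


# Hypothetical differences for K = 5 folds; 2.776 is the 0.975 quantile
# of Student's t with 4 degrees of freedom (95% two-sided interval).
summary = paired_t_summary([-0.05, -0.04, -0.02, -0.04, -0.05], t_crit=2.776)
pooling_helps = summary["ci"][1] < 0
```

For these made-up values the interval lies entirely below zero, so pooling would be judged significantly beneficial. If SciPy is available, `scipy.stats.ttest_rel(all_errors, same_errors)` is one way to obtain the two-sided p-value for the same test.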
5. Empirical Examples and Interpretations
Empirical studies using SAC have addressed datasets with meaningful partitioning by geography, time, or other categorical features. Key findings include:
| Dataset | Subset Type | SAC Outcome |
|---|---|---|
| CanadaFiresA/D | Satellite fires | Positive $\Delta_s$ (pooling harms) |
| FishSonar_river | Rivers | Positive $\Delta_s$ (pooling harms) |
| aztrees3/aztrees4 | Geographic quadrants | Positive $\Delta_s$ (pooling harms) |
| NSCH_autism | Survey years (2019/2020) | Small negative $\Delta_s$ (pooling aids) |
Interpretation of these findings:
- Strongly positive $\Delta_s$ and test statistic $t_s$ are indicative of low inter-subset similarity; pooling degrades predictivity, suggesting distinct underlying generative mechanisms per subset.
- Negative and significant $\Delta_s$ values suggest learnable structure is sufficiently shared that pooling supports generalization (Hocking et al., 2024).
- Cases with $\Delta_s$ near zero are interpreted as neutral with respect to pooling.
A summary across all subsets, using the min, max, and mean of $\Delta_s$ and their associated $p$-values, yields a granular view of transferability.
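Such a cross-subset summary might look like the following sketch. It assumes one score per subset (mean ALL error minus mean SAME error, so negative favors pooling) and one two-sided p-value per subset; the function name `summarize_sac`, the subset names, the significance threshold, and all numbers are hypothetical.

```python
from statistics import mean


def summarize_sac(scores, p_values, alpha=0.05):
    """Summarize per-subset scores and group subsets by significant direction."""
    values = list(scores.values())
    significant = {s for s in scores if p_values[s] < alpha}
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "pooling_helps": sorted(s for s in significant if scores[s] < 0),
        "pooling_harms": sorted(s for s in significant if scores[s] > 0),
        "neutral": sorted(s for s in scores if s not in significant),
    }


# Made-up scores and p-values for three hypothetical subsets.
scores = {"north": 0.031, "south": -0.012, "east": 0.002}
p_values = {"north": 0.004, "south": 0.030, "east": 0.710}
report = summarize_sac(scores, p_values)
```

With these illustrative inputs, `north` is flagged as harmed by pooling, `south` as helped, and `east` as neutral.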
6. Practical Considerations and Methodological Limitations
Several operational and theoretical issues may influence SAC outcomes:
- Choice of $K$: Larger $K$ (e.g., 10) decreases bias in $\Delta_s$ but increases computational burden linearly. Sufficiently large SAME training sets are required to ensure stable model fitting.
- Computational burden: Requires $O(SK)$ model fits for $S$ subsets and $K$ folds. Training ALL models is especially computationally intensive, often motivating use of regularized linear learners or parallelization strategies.
- Data heterogeneity: SAC assumes non-i.i.d. effects strictly from subset membership; within subset–fold cells, ordinary CV exchangeability must generally hold.
- Interpretational caution: While SAC identifies whether pooling is beneficial or detrimental, it does not reveal causal mechanisms, such as concept drift, covariate shift, or label noise. Diagnostic discriminability across subsets may arise from any of these, or other, latent sources.
These considerations frame the appropriate deployment of SAC and enable informed interpretation of its outcomes in practice.
7. Synthesis and Role within Data Science Methodology
SAC provides a crucial statistical diagnostic for pattern-sharing across labeled data partitions commonly encountered in modern data science. By formalizing a robust, fold-wise comparison of training strategies, SAC enables practitioners to empirically adjudicate between pooling and non-pooling training regimes for each target subset. This approach has particular relevance in settings with known or suspected non-i.i.d. structure—such as evolving time series, multi-region studies, or tiered population surveys—serving as an evidence-based guide for the principled combination (or separation) of data subsets to optimize predictive performance and generalization (Hocking et al., 2024).