Zoo of 35 Models: Ensemble for OoD
- The Zoo of 35 Models is a curated collection of diverse pre-trained models, together with fusion strategies built on them, designed for robust out-of-distribution (OoD) generalization in computer vision.
- The approach leverages a leave-one-domain-out cross-validation metric that integrates inter-class discriminability and inter-domain stability for effective model ranking.
- A variational EM algorithm is applied for feature selection and ensemble fusion, significantly reducing redundant features while boosting accuracy on multiple benchmarks.
A "Zoo of 36 Models" refers to the large, systematically curated collection of pre-trained models (PTMs) assembled and analyzed within the ZooD paradigm, targeting the problem of out-of-distribution (OoD) generalization in computer vision. This ensemble, consisting of 35 distinct PTMs alongside the resulting fusion strategies, encompasses a spectrum of architectures, pre-training methods, and data regimes, enabling comprehensive empirical study of model selection, ranking, and ensemble approaches to robust generalization. The ZooD approach introduces efficient ranking metrics and variational feature selection, providing both a methodological framework and a substantive model resource for OoD research (Dong et al., 2022).
1. Construction and Composition of the Model Zoo
The Zoo comprises 35 PTMs categorized by model architecture, pre-training data, and training methodology. Models include canonical convolutional neural networks (CNNs), vision transformer (ViT) variants, and ensembles emerging from both standard empirical risk minimization (ERM) and alternative training schemes. Table 1 exemplifies this diversity, grouping the PTMs as follows:
| Group | Representative Models | Pre-training/Data Source |
|---|---|---|
| G1 | ResNet/ResNeXt/DenseNet/EfficientNet/Swin | Standard ImageNet-1K, ERM |
| G2 | ResNet-50 (robust/self-supervised) | Adv. (ℓ₂, ℓ∞), BYOL, MoCo-v2, SwAV |
| G3 | ResNets/ViTs on large-scale/weak supervision | YFCC-100M, IG-1B, WebImageText, ImageNet-22K |
Notable models include Swin-T/B and BEiT-based ViT models, CLIP models trained on WebImageText, and robust/self-supervised ResNet-50 variants (e.g., Adv. ℓ₂ with ε=0.5 and Adv. ℓ∞ with ε=4).
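The grouping in Table 1 might be organized programmatically as a simple registry. A minimal sketch follows; the identifiers are illustrative placeholders rather than actual checkpoint names, and each group is truncated relative to the full zoo of 35 models:

```python
# Hypothetical registry mirroring the three groups of Table 1.
# Model identifiers are illustrative, not real checkpoint names,
# and each group lists only a few representatives.
MODEL_ZOO = {
    "G1_standard_erm": [
        "resnet50", "resnext50", "densenet121",
        "efficientnet_b0", "swin_t", "swin_b",
    ],
    "G2_robust_selfsup_resnet50": [
        "resnet50_adv_l2_eps0.5", "resnet50_adv_linf_eps4",
        "resnet50_byol", "resnet50_mocov2", "resnet50_swav",
    ],
    "G3_largescale_weaksup": [
        "resnet50_yfcc100m", "resnext101_ig1b",
        "clip_vit_b32_wit", "vit_b16_in22k", "beit_b16_in22k",
    ],
}

def list_models(zoo):
    """Flatten the registry into a single list of model identifiers."""
    return [name for group in zoo.values() for name in group]
```

Such a registry makes the later ranking and selection steps a matter of iterating over `list_models(MODEL_ZOO)` and extracting features per model.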
2. Model Ranking Metric Based on Cross-Domain Evaluation
To effectively leverage this repository of PTMs for OoD tasks, ZooD proposes a quantitative ranking strategy premised on leave-one-domain-out cross-validation. For each held-out domain d, the method computes a model score S_d as the sum of:
- Inter-class discriminability D_cls^(d), quantifying the conditional likelihood p(y | F) under a linear-Gaussian model with a Laplace approximation, where the prior and noise precisions α and β are optimized as evidence hyperparameters.
- Inter-domain stability D_dom^(d), evaluating the likelihood of the held-out domain's features under a Gaussian fitted to the features of the remaining training domains, to penalize domain-specific overfitting and reward stable features.
The average over all D domains yields the ZooD model score:

S = (1/D) · Σ_{d=1}^{D} (D_cls^(d) + D_dom^(d))

This construction ensures that models exhibiting both high inter-class separability and cross-domain invariance are favored.
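The two terms and their leave-one-domain-out average can be sketched numerically. This is a simplification, not the paper's implementation: the discriminability term below uses a class-conditional diagonal-Gaussian proxy in place of the Laplace-approximated evidence, and all function names are illustrative:

```python
import numpy as np

def domain_stability(feats_train, feats_heldout, eps=1e-6):
    """Mean log-likelihood of held-out features under a diagonal
    Gaussian fitted to the training domains' features."""
    mu = feats_train.mean(axis=0)
    var = feats_train.var(axis=0) + eps
    ll = -0.5 * (np.log(2 * np.pi * var) + (feats_heldout - mu) ** 2 / var)
    return ll.sum(axis=1).mean()

def class_discriminability(feats, labels, eps=1e-6):
    """Simplified proxy: mean log-likelihood of each sample under a
    diagonal Gaussian fitted to its own class (the paper instead uses
    the Laplace-approximated evidence of a linear-Gaussian model)."""
    classes = np.unique(labels)
    score = 0.0
    for c in classes:
        fc = feats[labels == c]
        mu, var = fc.mean(axis=0), fc.var(axis=0) + eps
        ll = -0.5 * (np.log(2 * np.pi * var) + (fc - mu) ** 2 / var)
        score += ll.sum(axis=1).mean()
    return score / len(classes)

def zood_score(domains):
    """Leave-one-domain-out average of discriminability + stability.
    `domains` is a list of (features, labels) pairs, one per domain."""
    scores = []
    for d, (feats_d, _) in enumerate(domains):
        rest = np.vstack([f for i, (f, _) in enumerate(domains) if i != d])
        rest_y = np.concatenate([y for i, (_, y) in enumerate(domains) if i != d])
        scores.append(class_discriminability(rest, rest_y)
                      + domain_stability(rest, feats_d))
    return float(np.mean(scores))
```

Features drawn from the same distribution as the training domains receive a higher stability term than shifted features, which is what lets the score penalize representations that drift across domains.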
3. Variational EM Algorithm for Feature Selection
Following selection of the top-K models by ZooD score, their features are concatenated into a joint representation F = [F_1, …, F_K]. To suppress spurious and non-informative features, ZooD employs a spike-and-slab Bayesian linear model with binary feature-selection variables s_j ∈ {0, 1}. The variational EM procedure updates the posterior approximations for the weights w, the selection indicators s, and the associated hyper-parameters via a mean-field factorization; stochastic variational inference, sub-sampling a subset of examples per iteration, keeps the procedure computationally scalable.
Features whose posterior inclusion probability q(s_j = 1) falls below a threshold τ (e.g., τ = 0.5) are dropped, retaining only informative components. This denoising step is shown to discard a majority of redundant features on several benchmarks, with measurable gains in downstream accuracy.
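The final pruning step can be illustrated as follows. The inclusion probabilities `q_incl` are taken as given, standing in for the converged mean-field posterior q(s_j = 1) (computing them requires the full variational EM loop); the names and values are illustrative:

```python
import numpy as np

def prune_features(features, q_incl, tau=0.5):
    """Drop feature columns whose posterior inclusion probability
    q(s_j = 1) falls below the threshold tau (0.5 in the text).
    `q_incl` stands in for the converged mean-field posterior."""
    keep = q_incl >= tau
    return features[:, keep], keep

# Toy usage: 6 features, of which only two have high inclusion probability.
F = np.arange(18.0).reshape(3, 6)
q = np.array([0.9, 0.1, 0.2, 0.8, 0.05, 0.3])
F_reduced, keep = prune_features(F, q)
```

With this `q`, only columns 0 and 3 survive, so the 6-dimensional representation is reduced to 2 informative components per sample.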
4. Ensemble Fusion and Final Classifier Construction
The overall fusion procedure consists of:
- Ranking all PTMs with the ZooD score, then selecting the top-K (typically K = 3).
- Concatenating the feature matrices from the selected models.
- Applying variational EM-based feature selection to the amalgamated representation.
- Training a logistic or linear classifier on the reduced feature set.
This ensemble and denoising strategy is designed to optimize both the discriminative and robustness properties inherent in the model zoo, without inducing excessive noise that typically plagues naïve feature-level model fusion.
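The four-step fusion procedure can be sketched end to end. This is a minimal illustration under stated assumptions: the logistic fit is a plain gradient-descent stand-in for the paper's final classifier, and the scores and feature-selection mask are supplied as inputs rather than computed:

```python
import numpy as np

def fuse_and_classify(feats_per_model, labels, scores, k=3, keep_mask=None,
                      lr=0.1, steps=300):
    """Sketch of the fusion pipeline: rank models by score, keep the
    top-k, concatenate their features, optionally apply a feature-
    selection mask, then fit a logistic classifier by gradient descent."""
    labels = np.asarray(labels, dtype=float)
    top = np.argsort(scores)[::-1][:k]                # 1. rank, select top-k
    F = np.hstack([feats_per_model[i] for i in top])  # 2. concatenate features
    if keep_mask is not None:                         # 3. feature selection
        F = F[:, keep_mask]
    w, b = np.zeros(F.shape[1]), 0.0                  # 4. logistic classifier
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        grad = p - labels                             # gradient of the log-loss
        w -= lr * F.T @ grad / len(labels)
        b -= lr * grad.mean()

    def predict(X):
        """Predict labels for already-concatenated top-k features."""
        if keep_mask is not None:
            X = X[:, keep_mask]
        return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
    return predict
```

On toy data where the higher-scoring "models" carry more class signal, the fused classifier separates the classes well, mirroring the intended effect of ranking before fusing.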
5. Out-of-Distribution Generalization Benchmarks and Performance
ZooD is evaluated on seven OoD generalization benchmarks under the leave-one-domain-out protocol: PACS, VLCS, Office-Home, TerraIncognita, DomainNet, and NICO (Animals and Vehicles). In comparison with prior baselines and state-of-the-art (SOTA) methods such as ERM, IRM, MixStyle, and SWAD, the following are observed:
- The best single PTM (ZooD Single) exceeds SOTA by +14 percentage points on Office-Home, +7.9 on PACS, and +1.7 on DomainNet.
- The top-3 PTM ensemble (ZooD Ensemble) improves further on the single-model average, especially on the more challenging domains.
- Feature selection via variational EM improves the average accuracy from 66.9% (SWAD) to 71.0%, retaining only a fraction of features (as little as 24.3% on PACS, up to 99.8% on DomainNet).
- Specifically, on DomainNet, accuracy improves from 46.5% (SWAD) to 50.6% (ZooD Feature Sel.), and computationally the approach is 1,000–10,000× faster than brute-force fine-tuning, requiring 0.3–11 GPU hours as opposed to 2.7k–17k GPU hours.
| Method | PACS | VLCS | Office-Home | TerraInc. | DomainNet | Avg |
|---|---|---|---|---|---|---|
| ZooD (Single) | 96.0 | 79.5 | 84.6 | 37.3 | 48.2 | 69.1 |
| ZooD (Ensemble) | 95.5 | 80.1 | 85.0 | 38.2 | 50.5 | 69.9 |
| ZooD (Feature Sel.) | 96.3 | 80.6 | 85.1 | 42.3 | 50.6 | 71.0 |
6. Categorization and Sources of Pre-Trained Models
The 35 PTMs span a wide array of publicly available models and research groups, including but not limited to:
- Standard ERM-trained CNNs and ViTs (ResNet, DenseNet, Swin, Inception, EfficientNet; mostly via PyTorch and Liu et al.)
- Robust and self-supervised ResNets (as in Salman et al., Ericsson et al.)
- Large-scale or weakly-supervised PTMs (trained on YFCC-100M, IG-1B, WebImageText, ImageNet-22K; from Yalniz et al., Radford et al., Wolf et al.)
This broad coverage ensures diversity in pre-training objectives, data domains, and architectural inductive biases relevant for robust cross-domain transfer.
7. Significance and Implications
The construction of a model zoo comprising 35 diverse PTMs, along with the ZooD ranking and feature selection pipeline, provides an efficient infrastructure for systematic OoD generalization research. The demonstrated improvement in accuracy and drastic reduction in computational overhead enable broader access and more extensive empirical evaluation of model selection and fusion hypotheses for robust machine learning. The approach penalizes overfitted, domain-specific features, favoring models whose learned representations are both discriminative and invariant, a central desideratum in OoD learning. The availability of fine-tuning results for all models on seven datasets establishes a reference for future research and benchmarking in model zoo-based generalization (Dong et al., 2022).