
Domain-Incremental Benchmarks

Updated 28 November 2025
  • Domain-incremental benchmarks are standardized evaluation suites for continual learning where the label space remains fixed while the input distribution shifts.
  • They assess performance via metrics such as accuracy, forgetting, and adaptation on non-i.i.d. data across vision, audio, graph, and multimodal tasks.
  • These benchmarks drive innovations in parameter isolation, replay, and distillation techniques to enhance robust lifelong learning.

Domain-incremental benchmarks (DIL benchmarks, also termed "domain-IL" in some literature) are standardized evaluation suites for continual learning under shifting input distributions, with the defining property that the label space is held fixed across tasks while the input domain changes. DIL is now a central scenario in incremental learning beyond classic class-incremental protocols, and is used to systematically measure the stability–plasticity tradeoff, parameter-isolation efficacy, and memory dynamics of learning algorithms under realistic, non-i.i.d. non-stationarity. This entry surveys foundational definitions, protocol design, canonical datasets, evaluation metrics, modeling challenges, and pivotal empirical results across vision, audio, graph, and multimodal domains.

1. Formal Definition and General Problem Statement

Let $\mathcal{C}$ denote a fixed set of $K$ classes. In the DIL setting, a sequential task stream $\mathcal{T}_1,\ldots,\mathcal{T}_N$ is provided, where each task $\mathcal{T}_t$ consists of data sampled from a distinct input distribution $P_t(x)$, but with $y\in\mathcal{C}$ for all samples. At IL step $t$, the model is trained only on $\mathcal{T}_t$ (previous data is unavailable), and is evaluated on all domains $\mathcal{T}_{1:t}$ to assess both plasticity (adaptation to new domains) and stability (retention of performance on prior domains). The model's output layer is not expanded across tasks, as the class semantics remain identical.
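To make the protocol concrete, the sketch below implements the train-on-$\mathcal{T}_t$-only, evaluate-on-$\mathcal{T}_{1:t}$ loop in PyTorch. It is a minimal illustration, assuming hypothetical `domains["train"]`/`domains["test"]` lists of per-domain data loaders and an arbitrary fixed-head classifier `model`; it is not any specific benchmark's harness.

```python
import torch
import torch.nn.functional as F

def dil_protocol(model, domains, epochs=1, device="cpu"):
    """Sequential DIL loop: train only on task t, then test on domains 1..t."""
    acc = {}  # acc[(t, i)] = accuracy on domain i after training through task t
    for t, train_loader in enumerate(domains["train"]):
        model.train()
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(epochs):
            for x, y in train_loader:              # only task t's data is visible
                opt.zero_grad()
                loss = F.cross_entropy(model(x.to(device)), y.to(device))
                loss.backward()
                opt.step()
        model.eval()
        for i, test_loader in enumerate(domains["test"][: t + 1]):
            correct = total = 0
            with torch.no_grad():
                for x, y in test_loader:
                    correct += (model(x.to(device)).argmax(1) == y.to(device)).sum().item()
                    total += y.numel()
            acc[(t, i)] = correct / total          # fixed head: no output expansion
    return acc
```

The returned dictionary is exactly the accuracy matrix from which the stability and plasticity metrics of Section 3 are computed.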

Crucially, domain-incremental evaluation differs from class-incremental evaluation (where new classes are added per step) and from task-incremental evaluation (where both class and domain may change, and the task label is often provided at inference) (Ko et al., 2022). DIL benchmarks expose a model to potentially extreme domain shifts—such as sensor, viewpoint, style, weather, and generator differences—while requiring pooled predictions over all classes seen so far in all domains.

2. Canonical Benchmarks and Task Construction Methods

Benchmark construction in DIL demands careful curation to isolate domain shift effects. Standard datasets and task splits include:

  • Image Classification DIL (DomainNet, iDigits, CORe50): Six-domain DomainNet (clipart, infograph, painting, quickdraw, real, sketch; 345 classes) is the de facto DIL benchmark in vision (Geng et al., 18 Nov 2025, Wang et al., 29 May 2025, Park et al., 17 Sep 2024). The iDigits protocol (MNIST, USPS, SVHN, SynMNIST; 10 classes) and CORe50 (eight sessions, 50 classes) provide complementary low-resolution and object-centric scenarios (Park et al., 17 Sep 2024).
  • Object Detection DIL: D-RICO (Domain Realistic Incremental Object Detection), composed of 15 tasks from 14 driving and surveillance datasets, curates fixed-class detection across strong shifts (sensor, nocturnal, adverse weather, synthetic, event) (Neuwirth-Trapp et al., 19 Aug 2025). VOC/Clipart/Watercolor/Comic (Pascal VOC series) and BDD100K/Cityscape/Rainy Cityscape are other DIL splits for detection (Wang et al., 29 May 2025).
  • Semantic Segmentation DIL: Cityscapes → BDD100K → IDD enables evaluation of segmentation algorithms under geodiverse urban scenes, with fully overlapping or partially disjoint label spaces (Garg et al., 2021).
  • Audio DIL: Multi-city acoustic scene benchmarks spanning European cities (Lisbon, Lyon, Prague) and Korea, with 10 scene classes, and AudioSet → FSD50K for multi-label event classification (Mulimani et al., 23 Dec 2024).
  • Graph DIL: Node, link, and graph DIL is instantiated on OGBN-Proteins (species-based domain splits); Wiki-CS (edge splits by subfield); ogbg-molhiv (scaffold-based graph domains) (Ko et al., 2022).
  • Specialized DIL: Hard deepfake detection (CDDB-Hard; multiple forgery generators) (Wang et al., 29 May 2025), video human action recognition with user/scene/hybrid splits (Hu et al., 22 Dec 2024), and 30-domain ImageNet-Mix (style and corruption from ImageNet-R and -C) (Geng et al., 18 Nov 2025).

DIL construction principles include mutual exclusivity of domain splits, maintaining a fixed label set per task, and in some protocols, controlling class intersection across domains for diagnostic flexibility (Park et al., 17 Sep 2024, Xie et al., 2022).
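As an illustration of these principles, the following sketch assembles an iDigits-style stream from the three of its domains available in torchvision (the synthetic-digits domain is omitted, as it has no torchvision loader); the transforms harmonize resolution and channels so that only the input domain, never the label set, changes across tasks.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Unify channels and resolution so the only shift between tasks is the domain.
tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

# One dataset per domain; the label space (digits 0-9) is identical throughout.
domain_stream = [
    datasets.MNIST("data", train=True, download=True, transform=tf),
    datasets.USPS("data", train=True, download=True, transform=tf),
    datasets.SVHN("data", split="train", download=True, transform=tf),
]
train_loaders = [DataLoader(d, batch_size=128, shuffle=True) for d in domain_stream]
```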

Table 1: Representative Domain-Incremental Benchmarks

| Domain | Dataset/Task | Domains (#) | Classes | Reference |
|---|---|---|---|---|
| Vision-Cls | DomainNet | 6 | 345 | (Geng et al., 18 Nov 2025; Park et al., 17 Sep 2024) |
| Vision-Det | D-RICO | 15 | 3 | (Neuwirth-Trapp et al., 19 Aug 2025) |
| Audio-Cls | Europe–Korea | 5 | 10 | (Mulimani et al., 23 Dec 2024) |
| Segmentation | CS, BDD, IDD | 3 | 17–26 | (Garg et al., 2021) |
| Graph-NC | PROTEINS | 8 | 2 | (Ko et al., 2022) |
| Multimodal-VL | MTIL | 11 | 1,201 | (Wang et al., 24 Jun 2025) |

3. Protocols, Evaluation Metrics, and Inference Considerations

Protocols: In a typical DIL protocol, models are trained on each task $\mathcal{T}_t$ sequentially, without access to previous domain data or replay memory unless explicitly allowed (e.g., D-RICO includes replay variants (Neuwirth-Trapp et al., 19 Aug 2025)). Both domain-aware and domain-agnostic inference modes are reported; in the latter, the test-time domain identity is not provided to the model.

Metrics: Evaluation emphasizes both retention and adaptation across domains. Common metrics include average accuracy (or mIoU, mAP for segmentation/detection), average forgetting, intransigence (difficulty in acquiring new knowledge), and, when relevant, forward transfer (benefit on unseen domains).

Formally, let $a_{t,i}$ denote accuracy on domain $i$'s test set after training through task $t$. Post-task $t$, average accuracy and average forgetting are

$$A_t = \frac{1}{t}\sum_{i=1}^{t} a_{t,i}, \qquad F_t = \frac{1}{t-1}\sum_{i=1}^{t-1}\left(\max_{j \le t} a_{j,i} - a_{t,i}\right).$$

Domain selection accuracy is also a focus for parameter-isolation approaches: $S_T = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}(\hat d_i = d_i)$ (Wang et al., 29 May 2025).
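A direct NumPy transcription of these metrics, operating on the accuracy matrix produced by a DIL run (rows indexed by training step, columns by evaluation domain; a minimal sketch using 0-based indices):

```python
import numpy as np

def average_accuracy(A, t):
    """A[j, i] = accuracy on domain i after training through task j (0-indexed)."""
    return A[t, : t + 1].mean()

def average_forgetting(A, t):
    """Mean drop from each earlier domain's best past accuracy to its accuracy after task t."""
    if t == 0:
        return 0.0
    return float(np.mean([A[: t + 1, i].max() - A[t, i] for i in range(t)]))

def domain_selection_accuracy(pred_domains, true_domains):
    """S_T: fraction of test samples routed to their true domain."""
    return float(np.mean(np.asarray(pred_domains) == np.asarray(true_domains)))

# Toy example: three tasks, with accuracy degrading slightly on older domains.
A = np.array([[0.90, 0.00, 0.00],
              [0.85, 0.88, 0.00],
              [0.80, 0.84, 0.91]])
print(average_accuracy(A, 2))    # 0.85
print(average_forgetting(A, 2))  # 0.07
```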

4. Modeling Strategies and Empirical Findings in DIL

Parameter Isolation: Approaches such as adapters, prompts, or domain-specific batchnorm allow per-domain specialization while maintaining a shared backbone (Geng et al., 18 Nov 2025, Park et al., 17 Sep 2024, Wang et al., 29 May 2025, Wang et al., 24 Jun 2025). For example, S-Prompts and PINA partition trainable prompt embeddings or adapters per domain, activated via a learned domain predictor (Park et al., 17 Sep 2024, Geng et al., 18 Nov 2025, Wang et al., 29 May 2025). Resulting systems can reach 84–89% accuracy (CORe50, iDigits) and minimize forgetting (Park et al., 17 Sep 2024).
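The routing pattern common to these methods can be sketched as follows: a bank of per-domain residual adapters over a shared backbone's features, with a separately trained domain predictor choosing which adapter to apply. All class and module names here are illustrative placeholders, not the APIs of S-Prompts or PINA.

```python
import torch
import torch.nn as nn

class AdapterBank(nn.Module):
    """One lightweight residual adapter per domain, sharing a single fixed head."""
    def __init__(self, dim, n_domains, n_classes):
        super().__init__()
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim))
            for _ in range(n_domains)
        ])
        self.head = nn.Linear(dim, n_classes)  # DIL: label space fixed, head never grows

    def forward(self, feats, domain_id):
        # Route features through the adapter chosen by the (predicted) domain id.
        return self.head(feats + self.adapters[domain_id](feats))

class DomainRouter(nn.Module):
    """Trainable domain predictor used at inference when the domain is unknown."""
    def __init__(self, dim, n_domains):
        super().__init__()
        self.clf = nn.Linear(dim, n_domains)

    def forward(self, feats):
        return self.clf(feats).argmax(dim=1)
```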

Replay/Reservoir Methods: Storing a fraction of prior samples sharply reduces forgetting and nearly matches joint-training performance in object detection (1–10% replay yields mAP ≈ 43, close to the joint-training mAP of 44 on D-RICO) (Neuwirth-Trapp et al., 19 Aug 2025).
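A minimal reservoir buffer of the kind used as a replay baseline; this is the generic reservoir-sampling algorithm, not D-RICO's specific memory implementation.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer in which every sample seen so far has equal retention probability."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0

    def add(self, sample):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = random.randrange(self.n_seen)  # uniform over all samples seen
            if j < self.capacity:
                self.buffer[j] = sample        # evict a random resident

    def draw(self, k):
        """Mini-batch of stored samples to interleave with the current domain's data."""
        return random.sample(self.buffer, min(k, len(self.buffer)))
```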

Distillation: Applied to discourage drift in shared weights or to regularize new adapters (e.g., KLD constraint in segmentation (Garg et al., 2021), KL/CAST in image classifiers (Park et al., 17 Sep 2024)). Distillation from “weak teachers” (i.e., past models trained on very different domains) is less effective, especially under strong shifts (Neuwirth-Trapp et al., 19 Aug 2025).
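The shared core of these schemes is a temperature-softened KL term between the previous model's logits and the current model's. The sketch below gives only this generic form, since the cited works add task-specific weighting and feature-level terms on top of it.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher and student output distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```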

Domain Identification: In parameter-isolation frameworks, inference requires accurate domain-label selection for adapter/prompt routing. Trainable selectors (multi-level feature fusion, Gaussian mixture compressors as in SOYO) outperform fixed KNN or nearest mean, especially as the number of domains grows (Wang et al., 29 May 2025).
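For reference, the fixed nearest-mean selector that such trainable routers are compared against can be written in a few lines (an illustrative baseline, not SOYO itself):

```python
import torch

class NearestMeanDomainID:
    """Assign each test feature to the domain with the closest stored prototype."""
    def __init__(self):
        self.means = []  # one feature-space prototype per seen domain

    def fit_domain(self, feats):
        """feats: (N, D) backbone features from one domain's training data."""
        self.means.append(feats.mean(dim=0))

    def predict(self, feats):
        """feats: (B, D) test features; returns predicted domain index per sample."""
        protos = torch.stack(self.means)      # (n_domains, D)
        return torch.cdist(feats, protos).argmin(dim=1)
```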

Hybrid and Task-Agnostic Modes: Task-agnostic DIL, where no domain label is available, requires domain prediction via entropy or clustering; stability and accuracy degrade with increased domain mismatch (Mulimani et al., 23 Dec 2024, Wang et al., 29 May 2025).
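One common heuristic for this label-free routing is entropy-based selection: run every domain expert and keep, per sample, the expert whose prediction is most confident. This is a generic sketch of the idea, not the exact mechanism of the cited methods.

```python
import torch

def entropy_route(logits_per_expert):
    """logits_per_expert: list of (B, C) logit tensors, one per domain expert.
    Returns, per sample, the index of the lowest-entropy (most confident) expert."""
    entropies = []
    for logits in logits_per_expert:
        p = torch.softmax(logits, dim=1)
        entropies.append(-(p * p.clamp_min(1e-12).log()).sum(dim=1))  # (B,)
    return torch.stack(entropies).argmin(dim=0)
```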

5. Cross-Domain Synthesis and DIL Benchmarks in Other Modalities

DIL is now standard across diverse modalities:

  • Vision–Language: In multi-domain CLIP adaptation (ChordPrompt MTIL benchmark), synthetic, real, and textured domains are incrementally introduced. Zero-shot accuracy is preserved due to frozen backbones and cross-modal prompt fusion, with only 9.5M prompt parameters updated vs. 211M for full fine-tuning (Wang et al., 24 Jun 2025).
  • Speech/Audio: For sequential noisy-acoustic conditions or crowdsourced environments, DIL benchmarks reveal that classic fine-tuning forgets previous domains, while domain-aware residual reparameterization achieves average accuracy up to 83% on severe cross-lingual and cross-urban shifts (Mulimani et al., 23 Dec 2024).
  • Graph: BeGin introduces node-level (OGBN-Proteins, 8 tasks), link-level (Wiki-CS, 54 tasks), and graph-level (OGBG-ppa, 11 tasks) DIL protocols that decouple domain and class evolution, supporting evaluation of replay and regularization methods (Ko et al., 2022).

The DIL protocol supports class-overlap, domain-overlap, and highly nonstationary splits (e.g., iCIFAR-20 ND) (Xie et al., 2022), and evaluates intra-class domain confusion and intra-domain class difficulty via the ICON and IC measures (Park et al., 17 Sep 2024).

6. Key Empirical Results, Limitations, and Future Directions

  • DIL methods based on prompt/adapters (e.g., S-Prompts, LAE) and CAST regularizers show 5–15 pp boosts in average accuracy and 50% reduction in forgetting over prior baselines (Park et al., 17 Sep 2024).
  • Replay is a robust baseline in detection and segmentation, outperforming all distillation-only schemes unless memory constraints are binding (Neuwirth-Trapp et al., 19 Aug 2025).
  • Parameter-isolation scales to many domains, but domain-ID confusion and increasing storage cost (linear in domain count) remain open issues (Wang et al., 29 May 2025).
  • DIL “weak teacher” failures, where old models are themselves suboptimal on new domains, undermine conventional knowledge distillation (Neuwirth-Trapp et al., 19 Aug 2025).
  • DomainNet, ImageNet-R/C/Mix, CORe50, and iDigits now serve as the gold-standard task sequences for protocol comparability across studies (Geng et al., 18 Nov 2025).
  • Most DIL methods (including state-of-the-art) remain below the upper bound of individual per-domain models, suggesting fundamental limitations with unified architectures (Neuwirth-Trapp et al., 19 Aug 2025, Wang et al., 29 May 2025).

7. Benchmark Design Practices and Recommendations

Domain-incremental benchmarks now constitute a rigorous and diverse set of protocols for evaluating continual learning in the presence of realistic domain shifts. Their adoption across vision, audio, graph, and multimodal settings is critical for measuring true generalization under lifelong nonstationarity.
