Cross-Task VICL Dataset Benchmark

Updated 27 November 2025
  • The dataset introduces a novel one-shot cross-task transfer protocol using implicit prompts to benchmark VLM generalization in low-level vision tasks.
  • It covers 12 tasks across restoration, removal, and enhancement with 19 directed transfer scenarios evaluated by pixelwise and perceptual quality metrics.
  • Advanced prompt generation via text embeddings and k-means ensures a diverse and semantically accurate set of demonstrations for effective task adaptation.

The Cross-Task VICL Dataset is the first systematic benchmark targeting visual in-context learning (VICL) across distinct low-level vision tasks. Unlike traditional VICL benchmarks, which constrain prompts and queries to the same transformation, the Cross-Task VICL Dataset probes the ability of large vision-language models (VLMs) to generalize by conditioning on demonstrations from a different but related image-to-image task, using only implicit text descriptions to bridge the transfer (Xia et al., 20 Nov 2025).

1. Problem Setting and Motivation

Visual in-context learning (VICL) refers to the capability of a VLM to perform novel visual tasks by conditioning on a small number of demonstration (input, output) image pairs, without any parameter updates. The Cross-Task VICL scenario asks whether a model can effectively solve a target task B (e.g., dehazing), when prompted solely with an in-context demonstration from a distinct source task A (e.g., deblurring), with the only additional information being an implicit natural language prompt explaining how the two tasks differ. This setup is motivated by the need to evaluate not just within-task adaptability, but the ability to reason across task boundaries and transfer manipulation skills under minimal supervision (Xia et al., 20 Nov 2025).
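
To make the protocol concrete, the sketch below expresses a single cross-task episode and query in Python; the names (CrossTaskEpisode, vicl_query, the vlm callable) are illustrative assumptions, not part of the released benchmark code.

from dataclasses import dataclass
from typing import Callable
from PIL import Image

@dataclass
class CrossTaskEpisode:
    """One-shot cross-task VICL episode: a demonstration from task A, a query from task B."""
    demo_input: Image.Image    # I_A_in: degraded source-task image
    demo_output: Image.Image   # I_A_gt: restored source-task image
    implicit_prompt: str       # contrastive text bridging tasks A and B
    query_input: Image.Image   # I_B_in: degraded target-task image

def vicl_query(vlm: Callable[..., Image.Image], episode: CrossTaskEpisode) -> Image.Image:
    # Condition the frozen VLM on the demonstration pair and the implicit prompt;
    # no parameter updates are performed.
    return vlm(images=[episode.demo_input, episode.demo_output, episode.query_input],
               text=episode.implicit_prompt)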

A dedicated dataset is necessary to enable systematic evaluation of such transfer. Prior VICL datasets provide demonstrations and queries from the same task family; hence, no resource existed to assess one-shot learning or prompt-based cross-task transfer in low-level vision.

2. Task Inventory and Dataset Composition

The Cross-Task VICL Dataset covers 12 canonical low-level vision tasks spanning restoration, removal, and generation/enhancement.

Task Categories and Example Tasks

  • Restoration: Deblurring, Dehazing, Demoiréing, Denoising, Deraining
  • Removal: Reflection removal, Shadow removal
  • Generation/Enhancement: Colorization, Harmonization, Inpainting, Light enhancement, Style transfer

From these, 19 directed cross-task transfer scenarios are constructed: 9 designated as “top-tier” (joint fidelity and perceptual metric improvement) and 10 as “second-tier” (strong perceptual gain, competitive pixel metrics). Each transfer scenario defines a direction A→B (e.g., Deblurring→Deraining) and contains 2,000 four-image examples: one input-output pair from the source task, one input-output pair from the target task, and an implicit prompt that describes the contrast between the two tasks without naming them (Xia et al., 20 Nov 2025).

The total dataset size is 38,000 examples, with ∼2,000 examples per scenario, split 70/30 (train/held-out); official public splits are inherited when available, otherwise random.

3. Data Collection, Annotation, and Format

Raw images and ground-truth pairs are sourced from established public datasets for each low-level task:

  • Deblurring: GOPRO
  • Dehazing: D-HAZY
  • Demoiréing: UHDM
  • Denoising: SIDD/NTIRE
  • Deraining: RainCityscapes
  • Harmonization: iHarmony4
  • Inpainting, Colorization: DIV2K
  • Light enhancement: LoL
  • Reflection removal: SIR²
  • Shadow removal: ISTD
  • Style transfer: Night2Day

No new pixelwise annotation is required; instead, each cross-task example is formed by randomly sampling input/ground-truth pairs from A and B. Each example includes an “implicit prompt,” a natural language text automatically generated by prompting a 32B-scale VLM (Qwen2.5-VL-32B-Instruct) to compare the two tasks in terms of goal, input artifact, and visual change—while avoiding explicit task names (Xia et al., 20 Nov 2025).

Prompts are filtered for diversity using 384-dim sentence embeddings (all-MiniLM), with K-means clustering (K≈1,000), retaining maximally representative prompts per cluster until 2,000 prompts/scenario remain. Each final example is stored as a JSON object specifying image paths, task directions, and the selected prompt.
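
A minimal sketch of this filtering stage, assuming the sentence-transformers and scikit-learn libraries; the cluster count and the round-robin selection rule are illustrative choices, not the released pipeline:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def filter_prompts(prompts, n_clusters=1000, n_keep=2000):
    """Retain a diverse, representative subset of candidate implicit prompts."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")          # 384-dim sentence embeddings
    emb = encoder.encode(prompts, normalize_embeddings=True)
    n_clusters = min(n_clusters, len(prompts))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(emb)

    # Rank prompts within each cluster by distance to the centroid (most central first).
    ranked = {c: [] for c in range(n_clusters)}
    for idx, label in enumerate(km.labels_):
        dist = float(np.linalg.norm(emb[idx] - km.cluster_centers_[label]))
        ranked[label].append((dist, idx))
    for c in ranked:
        ranked[c].sort()

    # Round-robin over clusters, keeping the most representative prompts first,
    # until the target number of prompts per scenario is reached.
    kept, depth = [], 0
    target = min(n_keep, len(prompts))
    while len(kept) < target:
        for c in range(n_clusters):
            if depth < len(ranked[c]) and len(kept) < target:
                kept.append(prompts[ranked[c][depth][1]])
        depth += 1
    return kept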

Example entry (abridged):

{
  "id": "Deblur2Demoire_0456",
  "A_task": "deblurring",
  "B_task": "demoireing",
  "I_A_in": "images/0456_A_in.png",
  "I_A_gt": "images/0456_A_gt.png",
  "I_B_in": "images/0456_B_in.png",
  "I_B_gt": "images/0456_B_gt.png",
  "implicit_prompt": "The first pair corrects motion blur to restore crisp contours and edges, while the second pair tackles moiré ripples caused by pixel interference. One sharpens smeared regions; the other smooths repeating pattern artifacts without losing fine texture."
}
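
A sketch of how one such entry might be assembled, assuming per-task lists of (input, ground-truth) path pairs and a pre-filtered prompt pool for the scenario; the helper name and argument names are hypothetical:

import json
import random

def build_entry(example_id, a_task, b_task, a_pairs, b_pairs, prompt_pool):
    """Randomly pair one source-task and one target-task sample with an implicit prompt."""
    a_in, a_gt = random.choice(a_pairs)    # e.g., ("images/0456_A_in.png", "images/0456_A_gt.png")
    b_in, b_gt = random.choice(b_pairs)
    return {
        "id": example_id,
        "A_task": a_task,
        "B_task": b_task,
        "I_A_in": a_in,
        "I_A_gt": a_gt,
        "I_B_in": b_in,
        "I_B_gt": b_gt,
        "implicit_prompt": random.choice(prompt_pool),
    }

# Usage (pools are placeholders):
# entry = build_entry("Deblur2Demoire_0456", "deblurring", "demoireing",
#                     deblur_pairs, demoire_pairs, scenario_prompts)
# print(json.dumps(entry, indent=2))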

4. Prompt Generation and Inference Protocol

Implicit prompts are generated by meta-prompting the teacher VLM to provide textual comparisons along specified axes (goal, input artifacts, visual outcome), then semantically filtered for coverage and diversity. At inference, K candidate prompts can be generated with a fine-tuned student VLM (Qwen2.5-VL-3B); the final prompt is selected by maximizing a composite score:

S(p; A \to B) = \alpha \cdot \mathrm{CosSim}\big(E_{\text{text}}(p), E_{\text{text}}(D_{\text{ref}})\big) + \beta \cdot \mathrm{Metric}_{\text{IQA}}\big(M_{\text{out}}(p), I_B^{\text{gt}}\big)

where $E_{\text{text}}$ is a text encoder, $M_{\text{out}}(p)$ is the model output conditioned on $p$, and $\mathrm{Metric}_{\text{IQA}}$ is a full-reference image metric such as PSNR or SSIM. Weightings (e.g., $\alpha = 0.6$, $\beta = 0.4$) are empirically tuned. This protocol enforces that a prompt is semantically consistent with an “oracle” teacher prompt and yields high-fidelity image generation (Xia et al., 20 Nov 2025).
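
A minimal sketch of this selection rule; the text encoder, the IQA metric, and the candidate-generation call are stand-ins (encode, run_vlm, and iqa are hypothetical callables supplied by the caller):

import numpy as np

def select_prompt(candidates, ref_prompt, encode, run_vlm, iqa, gt_image,
                  alpha=0.6, beta=0.4):
    """Pick the candidate prompt p maximizing S(p; A -> B) = alpha*CosSim + beta*IQA."""
    ref_emb = np.asarray(encode(ref_prompt))
    best_prompt, best_score = None, -np.inf
    for p in candidates:
        emb = np.asarray(encode(p))
        cos = float(np.dot(emb, ref_emb) /
                    (np.linalg.norm(emb) * np.linalg.norm(ref_emb)))
        output = run_vlm(p)                # image produced when conditioning on prompt p
        quality = iqa(output, gt_image)    # full-reference score, e.g., SSIM (or PSNR rescaled to a comparable range)
        score = alpha * cos + beta * quality
        if score > best_score:
            best_prompt, best_score = p, score
    return best_prompt, best_score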

5. Evaluation Metrics, Baselines, and Findings

Two main evaluation axes are used:

  • Pixelwise Fidelity: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).
  • Task-aware Perceptual Quality: VIEScore [Ku et al., ACL ’24], which combines Semantic Consistency (SC) and Perceptual Quality (PQ) as $O = \sqrt{SC \times PQ}$, with $SC = \min \alpha_i$ and $PQ = \min \beta_j$, each in $[0, 10]$ (a computation sketch follows this list).
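
A brief computation sketch for these metrics, assuming scikit-image for the pixelwise scores; the VIEScore sub-scores (the alpha_i semantic-consistency and beta_j perceptual-quality ratings) are taken as given, e.g., produced by an MLLM judge as in the VIEScore protocol:

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def pixel_metrics(pred, gt):
    """Full-reference fidelity between prediction and ground truth (uint8 RGB arrays)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim

def viescore(alpha_scores, beta_scores):
    """Combine sub-scores, each in [0, 10]: SC = min alpha_i, PQ = min beta_j, O = sqrt(SC * PQ)."""
    sc = min(alpha_scores)
    pq = min(beta_scores)
    return float(np.sqrt(sc * pq))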

Baselines include a Gemini VLM prompted with a generic instruction (“apply same transformation to the query”) versus the T2T-VICL implicit-prompt protocol. In the 9 top-tier transfer scenarios, the implicit-prompt protocol outperforms the fixed prompt by 1–2 dB in PSNR and by more than 1.0 VIEScore; for the remaining 10 scenarios, the VIEScore improvement remains robust (0.5–1.5) even when pixel metrics are slightly inferior (≤ 2 dB) (Xia et al., 20 Nov 2025).

6. Insights, Limitations, and Research Directions

Experiments indicate that large VLMs are able to utilize implicit, contrastive descriptions to guide one-shot cross-task image-to-image transfer, though transfer success is task-pair dependent (e.g., deblurring→demoiréing is significantly more tractable than dehazing→deraining). Perceptual/task-aware metrics (VIEScore) are more reflective of true generalization than pixelwise scores (Xia et al., 20 Nov 2025).

Limitations include the restriction to 12 low-level tasks (no classification or high-level reasoning scenarios are currently included) and variable prompt quality for rare or edge-case task pairs due to automated generation. The dataset does not address zero-shot transfer without a visual prompt, nor does it support multi-step (A→B→C) transfers. This suggests the need for follow-up work targeting mid-/high-level tasks, chains of transformations, and human-in-the-loop prompt curation.

A plausible implication is that extending this protocol to semantic segmentation, depth, or multimodal VQA tasks—possibly by leveraging resources such as ViLCo-Bench (Tang et al., 19 Jun 2024) or TraVLR (Chow et al., 2021)—would further clarify the scope and limits of VLMs’ in-context cross-task adaptability.

The Cross-Task VICL Dataset addresses a distinct experimental regime compared to prior continual learning (e.g., ViLCo-Bench (Tang et al., 19 Jun 2024), which focuses on continual adaptation across language/video tasks on 10-minute egocentric clips), cross-modal reasoning (e.g., TraVLR (Chow et al., 2021), which targets transfer between text and images for scene reasoning), or multi-task semantic consistency (e.g., COCOCon (Maharana et al., 2023)). Its core novelty lies in the one-shot, prompt-based transfer protocol for image-to-image tasks and the systematic implicit prompt construction.

While HoloVIC (Ma et al., 5 Mar 2024) and ViLCo-Bench enable cross-task and multi-task experiments in their domains (detection/tracking and video-language continual learning, respectively), neither provides explicit support for prompt-driven, zero-/few-shot transfer between heterogeneous low-level image transformations with perceptual metrics at scale.

References:

  • "T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs" (Xia et al., 20 Nov 2025)
  • "ViLCo-Bench: VIdeo Language COntinual learning Benchmark" (Tang et al., 19 Jun 2024)
  • "TraVLR: Now You See It, Now You Don't!" (Chow et al., 2021)
  • "Exposing and Addressing Cross-Task Inconsistency in Unified Vision-LLMs" (Maharana et al., 2023)
  • "HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative" (Ma et al., 5 Mar 2024)