Demo-Driven Video In-Context Learning

Updated 23 March 2026

Demo-driven Video In-Context Learning is a paradigm that enables models to adapt rapidly by incorporating demonstration videos or video-text pairs at inference time.
It leverages techniques such as similarity-based demo retrieval, iterative refinement, and multimodal prompt formatting to execute few-shot video tasks effectively.
Reported results show accuracy gains over zero-shot and demo-free baselines, along with improved procedural reasoning, across applications such as video narration, classification, and UI control.

Demo-driven video in-context learning (VICL) refers to the class of methods that enable models to acquire, transfer, or generalize task knowledge from explicit demonstration examples provided at inference time (most critically, demonstration videos or video+text pairs), rather than relying solely on parameters learned offline. This paradigm leverages the in-context learning abilities that have emerged in large multimodal models, bridging the gap between static pre-trained knowledge and rapid adaptation to novel, often low-resource, domains.

1. Formalization and Motivating Tasks

In demo-driven VICL, the model receives as input a query (video segment and/or textual prompt) and a small set of demonstration exemplars, each itself a video or aligned sequence (possibly with associated text, actions, or labels). The task is to conditionally complete, classify, narrate, generate, or otherwise reason about the query by leveraging information distilled from the demos.

Different works instantiate the paradigm with formal objectives reflecting distinct capabilities. For example, Demo-ICL frames procedural acquisition as learning $P(A \mid V_{\text{test}}[0:t_1], Q, D; \theta)$, where $D = \{D_1, \ldots, D_k\}$ are video or text demonstrations and $A$ is the desired output (e.g., what happens next in $V_{\text{test}}$) (Dong et al., 9 Feb 2026). By contrast, VIOLA expresses VICL as prediction over $\mathcal{M}(x_{\text{test}}, \mathcal{C})$, with $\mathcal{C}$ a retrieved demo context, further emphasizing few-shot label efficiency (Fujii et al., 22 Jan 2026).
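
One way to read the Demo-ICL objective operationally, as a sketch under assumptions not stated in the source: the prediction is the highest-probability output given the observed video prefix, the query $Q$, and the demonstrations $D$, and, if the model decodes autoregressively, that probability factorizes over output tokens. The symbols $\hat{A}$, $a_j$, and $T$ (the output length) are introduced here for illustration only.

$$\hat{A} = \arg\max_{A} P(A \mid V_{\text{test}}[0:t_1], Q, D; \theta), \qquad P(A \mid V_{\text{test}}[0:t_1], Q, D; \theta) = \prod_{j=1}^{T} P(a_j \mid a_{<j}, V_{\text{test}}[0:t_1], Q, D; \theta)$$

Under this reading, the demonstrations enter only through the conditioning context, while $\theta$ stays fixed at inference time.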

Canonical tasks include:

Few-shot video narration and procedural reasoning (Yu et al., 2023, Dong et al., 9 Feb 2026)
Video classification, captioning, or QA in new domains (Kim et al., 2024, Fujii et al., 22 Jan 2026)
Computer-use/action trajectory imitation for agentic UI control (Liu et al., 6 Nov 2025, Song et al., 6 Oct 2025)
Video-to-video or image-to-video generation via demo-driven control (Liu et al., 2024, Sun et al., 2024, Fei et al., 2024, Zhang et al., 2024)

This generality reflects the core motivation: to reduce annotation and re-training costs for specialized, rare, or rapidly changing video domains, enabling models to adapt or generalize with only a handful of user-provided exemplars.

2. Key Methodological Elements in VICL Pipelines

While instantiations vary, most demo-driven VICL pipelines adhere to the following stages:

Demonstration Pool Construction: Assemble a candidate set of demos from various sources (curated video banks (Dong et al., 9 Feb 2026), online tutorial videos (Song et al., 6 Oct 2025, Liu et al., 6 Nov 2025), instructional datasets, or user-supplied clips).
Demo Selection and Ranking: Employ similarity-based retrieval in joint video-text embedding spaces, e.g., ranking candidates by cosine similarity of video and/or text features, optionally with learned weighting (Kim et al., 2024, Fujii et al., 22 Jan 2026). Confidence or label-reliability signals may also influence selection (Fujii et al., 22 Jan 2026, Kim et al., 2024).
Prompt Formatting and Context Construction: Concatenate or otherwise interleave demonstrations and queries into a model-compatible prompt (video tokens and textual queries/answers) (Dong et al., 9 Feb 2026, Zhang et al., 2024, Yu et al., 2023). Modalities may be indicated via positional or special tokens.
Model Conditioning and Inference: Inject the demonstration context into the model, utilizing cross-attention, context tokens, or diffusion-based implicit conditioning (e.g., action latents in $\delta$-Diffusion (Sun et al., 2024), LoRA-based adapters (Fei et al., 2024), or action prism features (Liu et al., 2024)). The model produces outputs conditioned on both the demos and query.
Iterative/Ensemble Refinement (as needed): For context windows too small for all demos, iterative retrieval and confidence-based refinement can extend effective demonstration count (Kim et al., 2024). Ensemble schemes aggregate multiple pseudo-labels from ICL batches for consensus (Xu et al., 2024).
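
The selection and prompt-formatting stages above can be illustrated with a minimal sketch. It assumes precomputed video and text embeddings for each demo, a fixed mixing weight `alpha` standing in for the "learned weighting", and `<video>` tags as illustrative modality markers. The `Demo` class, the function names, and the prompt layout are all hypothetical and not taken from any of the cited systems.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    video_emb: List[float]  # hypothetical precomputed video embedding
    text_emb: List[float]   # hypothetical precomputed text embedding
    video_ref: str          # placeholder for video tokens or a file reference
    answer: str             # label, caption, or next-step text


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_video_emb, query_text_emb, pool, k=2, alpha=0.5):
    """Rank demos by a weighted sum of video and text cosine similarity."""
    def score(d):
        return (alpha * cosine(query_video_emb, d.video_emb)
                + (1 - alpha) * cosine(query_text_emb, d.text_emb))
    return sorted(pool, key=score, reverse=True)[:k]


def build_prompt(demos, query_video_ref, question):
    """Interleave demos and the query; the <video> tags are illustrative only."""
    parts = []
    for i, d in enumerate(demos, 1):
        parts.append(f"Demo {i}: <video>{d.video_ref}</video>\nAnswer: {d.answer}")
    parts.append(f"Query: <video>{query_video_ref}</video>\n{question}\nAnswer:")
    return "\n\n".join(parts)
```

In a real pipeline the embeddings would come from a video-text encoder and the `video_ref` placeholders would be replaced by the model's own visual tokens; the sketch only shows the ranking and interleaving logic.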

Distinct innovations include density-uncertainty-weighted sampling for annotation efficiency (Fujii et al., 22 Jan 2026), hybrid pools of labeled and pseudo-labeled data with confidence modeling (Fujii et al., 22 Jan 2026), preference-based optimization for demo utilization (Dong et al., 9 Feb 2026), and direct video trajectory extraction from web tutorials (Song et al., 6 Oct 2025, Liu et al., 6 Nov 2025).
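
The iterative and ensemble refinement ideas in this section can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: `predict(demos, query)` is a hypothetical callable that returns an `(answer, confidence)` pair, and the chunk size, threshold, keep-the-most-confident rule, and plain majority vote are all assumptions chosen for clarity.

```python
from collections import Counter


def iterative_icl(ranked_demos, query, predict, chunk_size=2,
                  threshold=0.8, max_rounds=4):
    """Feed ranked demos chunk by chunk; stop once the model is confident.

    Returns the (answer, confidence) pair with the highest confidence seen.
    """
    best = (None, -1.0)
    for r in range(max_rounds):
        chunk = ranked_demos[r * chunk_size:(r + 1) * chunk_size]
        if not chunk:
            break  # demo pool exhausted
        answer, conf = predict(chunk, query)
        if conf > best[1]:
            best = (answer, conf)
        if conf >= threshold:
            break  # early stop: confidence is high enough
    return best


def ensemble_vote(demo_batches, query, predict):
    """Majority vote over answers produced from independent demo batches."""
    answers = [predict(batch, query)[0] for batch in demo_batches]
    return Counter(answers).most_common(1)[0][0]
```

Both functions keep each call within a small context window: the iterative variate trades extra inference rounds for access to more demos, while the ensemble variant trades them for a consensus answer.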

3. Architectural and Training Considerations

VICL approaches are realized atop a range of architectures:

Autoregressive Transformers trained on video token sequences (e.g., VQ-GAN compressed) support pure demonstration prefixing for zero-shot video imitation (Zhang et al., 2024).
Encoder–Decoder Multimodal LLMs combine frozen (or fine-tuned) visual frontends and large LMs with cross-modal attention or interleaved embedding fusion (Dong et al., 9 Feb 2026, Yu et al., 2023, Fujii et al., 22 Jan 2026).
Diffusion-based models condition sample generation on learned latent representations distilled from reference demos (e.g., action prism tokens (Liu et al., 2024), implicit action latents (Sun et al., 2024), panel-wise spatiotemporal blocks (Fei et al., 2024)).
UI/Workflow Agents use action-labeled demonstration trajectories, sometimes segmented by VLM or LLM analysis, and inject both image and action sequence tokens into the context (Song et al., 6 Oct 2025, Liu et al., 6 Nov 2025).

Training strategies span pure self-supervision, from which imitation ability emerges (Zhang et al., 2024); distribution-centric data curation to elicit robust ICL abilities (Yu et al., 2023); in-context fine-tuning on demo-augmented corpora, e.g., SFT with hybrid demo-injected examples (Dong et al., 9 Feb 2026); preference optimization with demonstration-aware rewards (Dong et al., 9 Feb 2026); and LoRA-based lightweight adaptation to unlock cross-demo or in-context generation in massive models (Fei et al., 2024).

4. Evaluation Benchmarks and Quantitative Results

Research on demo-driven VICL proposes specialized benchmarks such as Demo-ICL-Bench, constructed from HowTo100M with rigorous annotation of stepwise instruction sequences and aligned video demonstrations (Dong et al., 9 Feb 2026). Secondary evaluation is performed on SOP generation sets (e.g., WONDERBREAD “Gold Demo” (Xu et al., 2024)), UI agent tasks (OSWorld (Song et al., 6 Oct 2025, Liu et al., 6 Nov 2025)), and multiple domain-specific video-action/caption datasets such as DriveAct, EgoSurgery, UCF-Crime, and CapERA (Fujii et al., 22 Jan 2026, Kim et al., 2024).

Quantitative outcomes widely confirm the value of demonstration-driven context:

VIOLA benchmarks demonstrate 19–54 point accuracy gains over zero-shot with tightly budgeted $B=20$ expert labels, and persistent advantage as $B$ scales (Fujii et al., 22 Jan 2026).
Demo-ICL yields 14.0% $\Delta_{\text{ICL}}$ for text-demo and 4.4% for video-demo tasks compared to demo-free baselines, with best open-source models below 30% on these challenging few-shot splits (Dong et al., 9 Feb 2026).
In SOP generation, in-context ensemble aggregation (ICE) improves recall (+6.7%), precision (+1.7%), and time-ordering accuracy (+4.3%) over 8-shot ICL (Xu et al., 2024).
Confidence-based iterative ICL (VideoICL) raises OOD classification by up to 33.2 percentage points relative to zero-shot (Kim et al., 2024).
In controllable video generation, in-context concatenation and LoRA tuning support persistent role control, style transfer, and multi-scene coherence (Fei et al., 2024, Sun et al., 2024).
For UI agents, demo-driven in-context trajectories consistently lift success rate by 2–4% over text-based or frame-only baselines (Liu et al., 6 Nov 2025, Song et al., 6 Oct 2025).

5. Distinguishing Features and Innovations

Major technical advancements over classical fine-tuning or zero-shot approaches include:

Label-Efficient Demo Selection: Density–uncertainty balancing (GMM + tokenwise entropy) for optimal expert annotation under strict budget (Fujii et al., 22 Jan 2026).
Hybrid Labeled/Pseudo-Labeled Pools: Confidence-aware retrieval and prompting to integrate uncertain pseudo-labels alongside ground-truth, and explicitly communicate reliability in prompt construction (Fujii et al., 22 Jan 2026).
Iterative and Ensemble ICL: Chunkwise demo selection with confidence-based early stopping (effective context extension) (Kim et al., 2024); pseudo-label proposal/voting to realize context-efficient ensemble ICL (Xu et al., 2024).
Distributional Data Curation for Emergent ICL: Training on “bursty,” skewed, and synonym-rich data distributions enables true few-shot transfer and semantic adaptation in VLMs (Yu et al., 2023).
Video-Driven Agentic Reasoning: Pipelines for online demonstration scraping, trajectory segmentation, in-context injection at every agentic timestep, and adaptive demo selection (Liu et al., 6 Nov 2025, Song et al., 6 Oct 2025).
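
The density-uncertainty balancing idea in the first item can be illustrated with a small sketch. It makes several assumptions that are not from the source: a Gaussian kernel density estimate is swapped in for the fitted GMM, uncertainty is the mean Shannon entropy of per-token output distributions, and the two signals are combined by a simple product. All function names are hypothetical.

```python
import math


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(ents) / len(ents)


def kernel_density(embs, bandwidth=1.0):
    """Gaussian kernel density per point; a simple stand-in for a fitted GMM."""
    dens = []
    for a in embs:
        total = 0.0
        for b in embs:
            sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
            total += math.exp(-sq_dist / (2 * bandwidth ** 2))
        dens.append(total / len(embs))
    return dens


def select_for_annotation(embs, token_dists_per_sample, budget):
    """Return indices of the `budget` samples with the highest density * entropy."""
    dens = kernel_density(embs)
    ents = [mean_token_entropy(t) for t in token_dists_per_sample]
    scores = [d * e for d, e in zip(dens, ents)]
    order = sorted(range(len(embs)), key=lambda i: scores[i], reverse=True)
    return order[:budget]
```

The product favors samples that are both representative of a dense region and uncertain for the model, so an isolated outlier or a confidently predicted sample is less likely to consume the annotation budget.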

6. Limitations and Outstanding Challenges

Noted constraints across sources:

Demo Acquisition and Alignment: Robustness depends on the quality and domain coverage of demonstration pools. Procedural alignment between demonstrations and query tasks remains challenging, especially for video-based demos (Dong et al., 9 Feb 2026).
Context Length and Compute: Long video and text sequences stress context windows; ensemble and chunkwise strategies partially alleviate this bottleneck but do not eliminate it (Kim et al., 2024, Xu et al., 2024).
Semantic Parsing and Transfer: Especially in video-demo ICL, current MLLMs often fail to abstract fine-grained motion or transfer procedural semantics as effectively as with explicit textual steps (Dong et al., 9 Feb 2026).
Evaluation: Some generation tasks, such as in-context video synthesis (Fei et al., 2024), lack standardized quantitative metrics, and the best available metrics (e.g., FVD, CLIP similarity) may incompletely capture procedural alignment.
Hallucination and Consistency: For multi-step outputs (SOPs, UI actions), models can hallucinate or reorder actions, and may default to stylistic templates rather than demo-specific reasoning (Xu et al., 2024).

7. Future Directions

The literature suggests several promising research avenues:

Architectural Innovations: Dynamic demonstration memory, cross-modal indexing, and meta-RAG modules to enhance demo retrieval and usage efficiency (Dong et al., 9 Feb 2026).
Joint Multi-modal In-Context Learning: Direct fusion of video, text, speech, diagrams, and tool outputs in a unified prompt to support realistic multimodal workflows (Dong et al., 9 Feb 2026).
Adaptive and Feedback-driven Prompting: Learning to select, order, and weight demonstrations in-context, with user or agent feedback (Fujii et al., 22 Jan 2026, Kim et al., 2024).
Scaling and Open Resource: Leveraging massively increased demonstration banks, learned domain adaptation priors, and persistent memory for cross-session VICL (Yu et al., 2023, Song et al., 6 Oct 2025).
Task-specific Extensions: Beyond immediate imitation, applications may include meta-learning (learning to learn from demonstrations), long-horizon multi-step planning, and robust procedural transfer in real-world domains (e.g., surgery, manufacturing, UI manipulation).

References

VIOLA: Towards Video In-Context Learning with Minimal Annotations (Fujii et al., 22 Jan 2026)
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition (Dong et al., 9 Feb 2026)
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding (Kim et al., 2024)
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties (Yu et al., 2023)
Learning from Online Videos at Inference Time for Computer-Use Agents (Liu et al., 6 Nov 2025)
Watch and Learn: Learning to Use Computers from Online Videos (Song et al., 6 Oct 2025)
Video Creation by Demonstration (Sun et al., 2024)
Video Diffusion Transformers are In-Context Learners (Fei et al., 2024)
Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators (Zhang et al., 2024)
In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Understanding (Xu et al., 2024)
AICL: Action In-Context Learning for Video Diffusion Model (Liu et al., 2024)