Chain-of-Frames in Multimodal AI

Updated 3 July 2026

Chain-of-Frames (CoF) is a multimodal AI paradigm that extends step-wise reasoning to visual data by generating interpretable frame sequences.
It employs autoregressive models that condition each frame on prior context, thereby unifying video question answering, text-to-image generation, and video editing tasks.
Reported results show that CoF fine-tuning improves accuracy and reduces hallucinations, with gains of 4–9 points on video benchmarks such as Video-MME, MVBench, and VSI-Bench.

Chain-of-Frames (CoF) is a paradigm in multimodal artificial intelligence that extends the concept of step-wise reasoning—originating from chain-of-thought (CoT) approaches in LLMs—into the visual and video domain. In CoF, each generated video frame (or an explicit frame selection) constitutes an interpretable reasoning step, with the sequence of frames forming a "visual reasoning trace" grounded in the underlying data and task objective. This approach enables models not only to perceive or synthesize visual content, but also to expose their internal logic and intermediate state transitions through observable frame-by-frame evolution. CoF has been applied across diverse tasks such as video question answering, text-to-image generation, video editing, and diagnostic visual reasoning benchmarks, yielding advances in transparency, accuracy, and controllability.

1. Formal Definitions and Theoretical Foundations

Chain-of-Frames is formally defined as a framework in which a model, given a task prompt $x$ (such as a question, command, or instruction), generates or selects a sequence of $T$ video frames $\{f_1, f_2, \ldots, f_T\}$, with each frame representing a distinct reasoning step (Liu et al., 17 Nov 2025, Guo et al., 30 Oct 2025). In generative video models, this sequence is typically produced autoregressively: $p(f_{1:T} \mid x) = \prod_{t=1}^T p(f_t \mid f_{<t}, x)$, where each $f_t$ is generated conditioned on the prompt and all previous frames. In multimodal LLMs, frame selection and reference are tightly integrated into language-based reasoning, so that each reasoning step can cite or request explicit frames (e.g., "In Frame 7, the key event occurs") (Ghazanfari et al., 31 May 2025).
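The autoregressive factorization can be made concrete with a minimal Python sketch. It assumes a hypothetical per-step model `step_prob(frame, history, prompt)` that returns $p(f_t \mid f_{<t}, x)$; that function, the string stand-ins for frames, and the toy uniform model are illustrative assumptions, not part of any cited system.

```python
import math
from typing import Callable, Sequence

# Hypothetical signature: returns p(frame | previous frames, prompt).
StepProb = Callable[[str, Sequence[str], str], float]


def chain_log_prob(frames: Sequence[str], prompt: str, step_prob: StepProb) -> float:
    """Sum log p(f_t | f_<t, x) over the chain, i.e. log p(f_1:T | x)."""
    total = 0.0
    for t, frame in enumerate(frames):
        history = frames[:t]  # all frames generated before step t
        total += math.log(step_prob(frame, history, prompt))
    return total


def toy_step_prob(frame: str, history: Sequence[str], prompt: str) -> float:
    """Toy stand-in (assumption): every frame is one of 4 equally likely options."""
    return 0.25


log_p = chain_log_prob(["f1", "f2", "f3"], "stack the blocks", toy_step_prob)
```

With the toy model, the chain of three frames has log-probability $3 \log 0.25$, which is the product rule from the formula above taken in log space.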

A key distinction is that, whereas CoT operates over token sequences in the symbolic domain, CoF performs reasoning in latent visual space, enabling continuous, physically grounded simulations of processes such as motion, manipulation, or causal inference (Liu et al., 17 Nov 2025).

2. Core Methodologies and Model Architectures

Several methodological variants of CoF have emerged, each tailored to specific tasks and architectural regimes:

Frame-grounded LLM Reasoning: In video LLMs, CoF involves fine-tuning models to generate chain-of-thought that explicitly references the temporal indices of frames involved in reasoning. For example, the CoF-Data corpus supports supervised learning where each reasoning step is linked to an annotated frame, directly incorporating frame numbers in the output (Ghazanfari et al., 31 May 2025).
Interleaved Reasoning and Perception: Frameworks such as FrameMind employ a sequential decision process, where the agent alternates between emitting natural language reasoning (textual tokens) and actively acquiring additional visual information by selecting frames or clips (via actions like FrameAt or VideoClip). The policy over actions is learned via reinforcement learning, targeting rewards that balance accuracy, efficiency, and tool-use exploration (Ge et al., 28 Sep 2025).
Progressive Visual Refinement for Generation: In CoF-T2I, text-to-image generation is cast as a short chain of visual refinement steps. The process samples sequential latent states $z_1, \ldots, z_T$, each decoded into a frame $F_t = D(z_t)$, representing a coarse-to-fine or semantics-to-aesthetics pathway in image synthesis (Tong et al., 15 Jan 2026).
Video Editing with Explicit Visual Grounding: VideoCoF enforces a "see, reason, edit" workflow in video diffusion models: the model first predicts "reasoning tokens" (edit-region latents), visually grounding where the edit should occur, before generating the edited frames, thereby removing the need for expert-drawn masks and unifying editing tasks (Yang et al., 8 Dec 2025).
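The interleaved reasoning-and-perception variant can be pictured as a simple control loop. The sketch below is a hedged illustration only: the action names `FrameAt` and `VideoClip` come from the description above, but their fields, the `Answer` action, the `run_episode` helper, and the scripted policy are assumptions made for this example. A real system would replace the scripted policy with a learned one and the strings with image tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class FrameAt:
    index: int  # request a single frame


@dataclass(frozen=True)
class VideoClip:
    start: int  # request frames in [start, end)
    end: int


@dataclass(frozen=True)
class Answer:
    text: str  # stop and commit to an answer


Action = Union[FrameAt, VideoClip, Answer]
Policy = Callable[[str, List[str]], Action]


def run_episode(question: str, video: Sequence[str], policy: Policy,
                max_turns: int = 8) -> Tuple[Optional[str], List[str]]:
    """Alternate between policy decisions and frame acquisition."""
    observations: List[str] = []
    for _ in range(max_turns):
        action = policy(question, observations)
        if isinstance(action, Answer):
            return action.text, observations
        if isinstance(action, FrameAt):
            observations.append(video[action.index])
        elif isinstance(action, VideoClip):
            observations.extend(video[action.start:action.end])
    return None, observations  # turn budget exhausted without an answer


def scripted_policy(question: str, observations: List[str]) -> Action:
    """Toy policy (assumption): look at one frame, then a clip, then answer."""
    if not observations:
        return FrameAt(7)
    if len(observations) == 1:
        return VideoClip(6, 9)
    return Answer(f"decided after {len(observations)} frames")


video = [f"frame{i}" for i in range(10)]
answer, seen = run_episode("When does the key event occur?", video, scripted_policy)
```

The turn budget `max_turns` is one plausible way to keep episodes bounded; how the cited systems limit or reward tool use is not specified here.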

3. Data Construction, Benchmarking, and Evaluation

Advances in CoF are underpinned by curated datasets and diagnostic benchmarks:

CoF-Data contains over 160,000 triplets of questions, answers, and frame-grounded reasoning for real and synthetic videos. Synthetic data is generated via object-attribute templates, while real video data utilizes existing key-frame captions aligned to the video stream (Ghazanfari et al., 31 May 2025).
Gen-ViRe systematically decomposes CoF reasoning into six cognitive dimensions (perceptual, spatial/temporal, procedural, analogical, algorithmic/logical, and abstract reasoning), each with four specialized subtasks, for a total of 24. Every subtask is evaluated via detailed rubrics measuring correctness, temporal coherence, object permanence, and goal achievement (Liu et al., 17 Nov 2025).
MME-CoF comprises 59 multi-step reasoning challenges across 12 dimensions, including 3D/2D geometry, physics-based tasks, embodied manipulation, GUI reasoning, and more. Each test case is scored for instruction alignment, temporal consistency, visual stability, content fidelity, and relevance (Guo et al., 30 Oct 2025).
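To illustrate what a frame-grounded training triplet might look like in practice, the following sketch defines a hypothetical record and checks that every frame cited in the reasoning text actually exists in the video. The `CoFSample` schema, the "Frame N" citation pattern, and the 1-based indexing are assumptions for illustration; the actual CoF-Data format may differ.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed citation style: "Frame 7", "frame 12" (case-insensitive).
FRAME_REF = re.compile(r"\bFrame\s+(\d+)\b", re.IGNORECASE)


@dataclass
class CoFSample:
    """Hypothetical question / answer / frame-grounded reasoning triplet."""
    question: str
    answer: str
    reasoning: str
    num_frames: int  # number of frames available in the video


def cited_frames(reasoning: str) -> List[int]:
    """Extract frame indices cited in the reasoning, in order of appearance."""
    return [int(n) for n in FRAME_REF.findall(reasoning)]


def invalid_citations(sample: CoFSample) -> List[int]:
    """Return cited indices outside 1..num_frames (1-based indexing assumed)."""
    return [i for i in cited_frames(sample.reasoning)
            if not 1 <= i <= sample.num_frames]


sample = CoFSample(
    question="When does the ball stop?",
    answer="Near the end of the clip.",
    reasoning="In Frame 7, the key event occurs; by frame 12 the ball has stopped.",
    num_frames=10,
)
bad = invalid_citations(sample)
```

Here `bad` contains the index 12, since the example video has only 10 frames. A check of this kind is one simple way to flag reasoning traces that cite frames the model could not have seen.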

Standard evaluation strategies include manual and hybrid VLM-assisted (e.g., Gemini 2.5 Pro, GPT-4o) scoring pipelines, per-frame or end-to-end video assessments, and normalization to macro-mean performance across subtasks.
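The macro-mean normalization can be sketched as follows: scores are first averaged within each reasoning dimension, and the dimension averages are then averaged with equal weight. The dimension and subtask names and the score values are placeholders, not reported results.

```python
from statistics import mean
from typing import Dict


def macro_mean(scores: Dict[str, Dict[str, float]]) -> float:
    """Average subtask scores per dimension, then average the dimensions."""
    per_dimension = [mean(subtasks.values()) for subtasks in scores.values()]
    return mean(per_dimension)


def micro_mean(scores: Dict[str, Dict[str, float]]) -> float:
    """Pool all subtask scores and average them directly, for comparison."""
    return mean(s for subtasks in scores.values() for s in subtasks.values())


# Placeholder scores in [0, 1]; names are illustrative only.
example = {
    "perceptual": {"subtask_1": 0.8, "subtask_2": 0.6},
    "abstract": {"subtask_1": 0.2},
}
macro = macro_mean(example)
micro = micro_mean(example)
```

In this toy case the macro-mean is about 0.45 while the pooled mean is about 0.53, which shows why macro-averaging keeps a dimension with few subtasks from being drowned out by one with many.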

4. Experimental Findings and Empirical Trends

CoF models consistently outperform prior methods that lack frame-aware reasoning traces. Notable empirical findings include:

Performance Gains in Video LLMs: CoF-fine-tuned InternVL architectures achieve 4–9 point gains across benchmarks (Video-MME, MVBench, VSI-Bench) and sharply reduce hallucination errors relative to prior chain-of-thought or QA-only training (Ghazanfari et al., 31 May 2025).
Zero-Shot Reasoning Limitations: State-of-the-art video generators (e.g., Veo-3, Sora-2) exhibit emergent CoF patterns (spatial coherence, locally consistent traces) but underperform on long-horizon planning, geometric constraints, and causal or abstract logic, with mean Gemini-assigned scores substantially below full competence (Guo et al., 30 Oct 2025).
Text-to-Image Generation Improvements: CoF-T2I achieves a GenEval overall score of 0.86 (vs. 0.84 for MLLM competitors and 0.55 for the base video model), with monotonic gains over the three CoF stages and ablation studies validating the value of explicit intermediate visual steps (Tong et al., 15 Jan 2026).
Video Editing Advances: VideoCoF attains state-of-the-art results on VideoCoF-Bench (e.g., Instruction-Following 8.97, Success Ratio 76.4%), outperforming competitors trained on millions of samples while using only 50,000 training pairs; the visual grounding step yields precise edits even in multi-instance or fine-grained tasks (Yang et al., 8 Dec 2025).

5. Algorithmic Innovations: RL Formulations, Optimization, and Temporal Structure

Algorithmic innovations distinguish CoF approaches in several respects:

Reinforcement Learning for Frame Acquisition: FrameMind implements a multi-turn, RL-trained strategy in which frame sampling and interleaving are governed by a policy $\pi_\theta(a_t \mid s_t)$, optimized via a group-relative policy optimization variant (DRFS-GRPO) without the need for frame-level ground-truth labels. The policy adaptively chooses whether to request a frame or progress the chain-of-thought based on accumulated evidence and task state (Ge et al., 28 Sep 2025).
Dynamic Resolution Frame Sampling: A DRFS curriculum exposes agents to diverse temporal–spatial trade-offs, varying the number and resolution of sampled frames per video segment to facilitate learned fidelity allocation (Ge et al., 28 Sep 2025).
Temporal Generalization in Video Diffusion: VideoCoF introduces a RoPE alignment scheme that uses non-colliding temporal indices for source, reasoning, and target frames, enabling length-extrapolation and stable motion alignment in editing tasks (Yang et al., 8 Dec 2025).
Independent Encoding to Avoid Artifacts: In CoF-T2I, each refinement frame is encoded as the first frame in the video-VAE's causal window, thus preventing motion blur and spatial distortion artifacts and ensuring that each reasoning step remains interpretable and decoupled (Tong et al., 15 Jan 2026).
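The group-relative idea behind GRPO-style training can be illustrated with a short sketch. It shows only the standard GRPO advantage normalization, in which each sampled rollout's reward is compared against the mean and standard deviation of its own group; the specifics of DRFS-GRPO (reward terms, clipping, sampling curriculum) are not reproduced here, and the reward values are made up for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its group's mean and std.

    `rewards` holds one scalar per rollout sampled for the same prompt.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four rollouts: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is computed from the group itself, no learned value function or frame-level label is needed: rollouts that beat their siblings receive positive advantages and the rest receive negative ones.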

6. Applications, Limitations, and Future Directions

CoF reasoning underpins advances in several application domains:

Application Area	CoF Role	Reported Outcomes
Video LLMs (QA, understanding)	Frame-citing reasoning traces	Improved accuracy, reduced hallucination (Ghazanfari et al., 31 May 2025)
Video-based visual reasoning	Solution path as frame sequence	Benchmarks for cognitive ability (Liu et al., 17 Nov 2025)
Text-to-image generation	Progressive visual refinement	Better semantic and aesthetic quality (Tong et al., 15 Jan 2026)
Video editing	Explicit edit-localization	Precise, mask-free edits (Yang et al., 8 Dec 2025)

Despite progress, current limitations include:

Deficits in long-range, causal, or geometric reasoning under zero-shot conditions; success rates drop to zero on constraint-heavy tasks (Guo et al., 30 Oct 2025, Liu et al., 17 Nov 2025).
Necessity for architecture-specific tokenization and input schemes; porting CoF to novel model classes may require additional engineering effort (Ghazanfari et al., 31 May 2025).
Continued challenges in physically plausible simulation, symmetry, complex analogy, and medical or domain-specialized reasoning benchmarks (Liu et al., 17 Nov 2025).

Recommendations for future work include incorporating explicit physics priors, symbolic intermediate representations, adversarial CoF sequence training, multimodal self-critique, and hybrid human-AI evaluation mechanisms (Liu et al., 17 Nov 2025, Guo et al., 30 Oct 2025).

7. Significance and Context in Multimodal AI

The Chain-of-Frames paradigm signifies a shift from static perception and implicit reasoning toward continuous, embodied simulation in AI systems. By aligning reasoning steps with explicit frames, CoF delivers transparency, finer-grained controllability, and improved generalization, while also revealing the gap between visual fidelity and true cognitive depth. Benchmarks such as Gen-ViRe and MME-CoF provide quantitative diagnostics necessary for principled progress, while new RL- and diffusion-based techniques broaden the capacity for adaptive perception and dynamic visual inference. As the field advances, CoF is poised to form a foundational concept bridging perception, reasoning, and action in future multimodal world simulators and reasoning agents (Liu et al., 17 Nov 2025, Ge et al., 28 Sep 2025).