Do-Undo Benchmark for Vision-Language Models

Updated 22 December 2025
  • The paper introduces a novel reversible image transformation framework that integrates flow matching, cross-entropy, and reversibility consistency losses.
  • It presents a curated dataset of 24,000 high-quality tuples from EpicKitchens videos, annotated with reversible action pairs to ground physical reasoning.
  • Empirical results show that Do-Undo-trained models achieve improved causal consistency across varied prompt lengths and expose a trade-off between forward-edit fidelity and reversibility.

The Do-Undo task and benchmark constitute a rigorous platform for evaluating and advancing the physical reasoning capabilities of vision-language models (VLMs) by requiring them both to generate physically plausible scene transformations in response to real-world actions and to accurately reverse those changes, thereby reflecting true cause-and-effect relationships in the visual domain. This paradigm departs fundamentally from prior datasets and evaluation suites focused on object-level edits or static factual alignment, instead requiring reversible, physically grounded manipulations as observed in naturalistic video data (Mahajan et al., 15 Dec 2025).

1. Formal Task Definition and Underlying Objective

Let $\mathcal{E}_\theta$ denote a vision-language model parameterized by $\theta$, capable of conditional image generation. The Do-Undo task is defined over quadruples $(\mathbf{I}_o, P_F, \mathbf{I}_F, P_R)$, where:

  • $\mathbf{I}_o \in \mathbb{R}^{H \times W \times 3}$ is the original input image.
  • $P_F$ (“forward prompt”) is a natural-language specification of a reversible physical action to apply to $\mathbf{I}_o$.
  • $\mathbf{I}_F$ is the post-action image observed in the real world, sourced from video.
  • $P_R$ (“reverse prompt”) is the description of the inverse action meant to restore $\mathbf{I}_F$ to $\mathbf{I}_o$.

The Do-Undo model performs two conditional generations:

$$\hat{\mathbf{I}}_F = \mathcal{E}_\theta(\mathbf{I}_o, P_F), \qquad \hat{\mathbf{I}}_R = \mathcal{E}_\theta(\mathbf{I}_F, P_R)$$

Training imposes three loss terms:

  • Flow-matching loss $\mathcal{L}_{\rm flow}$: Penalizes deviation from ground-truth images in a latent VAE-encoded space using rectified flow matching.
  • Cross-entropy loss $\mathcal{L}_{\rm CE}$: Applied to the interleaved vision–language tokens.
  • Reversibility consistency loss $\mathcal{L}_c = \|\mathbf{I}_o - \hat{\mathbf{I}}_R\|_1$: Encourages the model to reproduce the original image after the forward and reverse transformations.

The weighted total objective is:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm flow} + \mathcal{L}_{\rm CE} + \lambda\,\mathcal{L}_c$$

with $\lambda \approx 0.5$.

This definition targets robust action grounding, enforcing a closure property for model predictions under physically reversible operations (Mahajan et al., 15 Dec 2025).
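As a concrete illustration, the following minimal PyTorch-style sketch combines the three terms under the stated weighting. It assumes the flow-matching and cross-entropy losses have already been computed elsewhere in the training loop; the function name, tensor shapes, and toy values are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def do_undo_total_loss(l_flow: torch.Tensor,
                       l_ce: torch.Tensor,
                       I_o: torch.Tensor,
                       I_R_hat: torch.Tensor,
                       lam: float = 0.5) -> torch.Tensor:
    """Combine the three Do-Undo training terms (hypothetical helper).

    l_flow and l_ce stand for the rectified flow-matching and interleaved
    vision-language token cross-entropy losses, assumed precomputed.
    I_o is the original image; I_R_hat is the reconstruction obtained after
    the forward ("do") and reverse ("undo") generations.
    """
    # Reversibility consistency: L1 distance between the original image
    # and the twice-transformed (do + undo) prediction.
    l_c = F.l1_loss(I_R_hat, I_o)
    return l_flow + l_ce + lam * l_c

# Toy usage with random tensors standing in for real images and losses.
I_o = torch.rand(1, 3, 256, 256)
I_R_hat = torch.rand(1, 3, 256, 256)
total = do_undo_total_loss(torch.tensor(0.8), torch.tensor(1.2), I_o, I_R_hat)
```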

2. Dataset Construction and Action Coverage

The Do-Undo dataset is curated from 100 first-person “EpicKitchens” videos, yielding 24,000 high-quality $(\mathbf{I}_o, P_F, \mathbf{I}_F, P_R)$ tuples. Actions are annotated and filtered for reversibility using a predefined vocabulary of approximately 12 verbs admitting natural inverse counterparts, including:

  • pick-up ↔ put-down
  • open ↔ close
  • turn-on ↔ turn-off
  • grab ↔ place
  • move ↔ remove

Image pairs are extracted so that $\mathbf{I}_F$ genuinely reflects real-world execution of $P_F$ on $\mathbf{I}_o$. Automated screening via a Qwen-VL model ensures (i) proper visual quality, (ii) confirmed action completion, and (iii) persistent object visibility. Prompt expansion generates forward and reverse descriptions with contextual spatial and state-change information.

The primary test set includes 662 samples distributed across the top-8 action classes and several out-of-distribution (OOD) events (rotate, stack, remove-from), with both “long” (expanded, contextual) and “short” (minimal) prompt variants. Actions span hundreds of unique noun objects per verb, supporting robust semantic diversity (Mahajan et al., 15 Dec 2025).
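To make the tuple structure concrete, the sketch below shows one plausible in-memory representation of a $(\mathbf{I}_o, P_F, \mathbf{I}_F, P_R)$ sample together with the reversibility filter implied by the verb vocabulary. Field names, file paths, and the exact verb list are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass

# Hypothetical inverse-verb vocabulary mirroring the pairs listed above;
# the paper's full vocabulary (~12 verbs) may differ.
INVERSE_VERB = {
    "pick-up": "put-down", "open": "close", "turn-on": "turn-off",
    "grab": "place", "move": "remove",
}
INVERSE_VERB.update({v: k for k, v in list(INVERSE_VERB.items())})

@dataclass
class DoUndoTuple:
    """One (I_o, P_F, I_F, P_R) sample; paths and prompts are illustrative."""
    original_image: str      # path to the I_o frame extracted from video
    forward_prompt: str      # P_F, e.g. "open the fridge door"
    post_action_image: str   # path to I_F, the frame after the action completes
    reverse_prompt: str      # P_R, e.g. "close the fridge door"
    verb: str                # reversible verb drawn from the vocabulary

def is_reversible(verb: str) -> bool:
    """Keep only actions whose inverse verb exists in the vocabulary."""
    return verb in INVERSE_VERB

sample = DoUndoTuple("frames/000123.jpg", "open the fridge door",
                     "frames/000187.jpg", "close the fridge door", "open")
assert is_reversible(sample.verb)
```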

3. Model Architecture and Optimization Procedure

The primary backbone is the BAGEL VLM, built on Qwen-2.5, which incorporates both a ViT encoder (for image understanding) and a VAE (for image synthesis latents). Training alternates among:

  • Text-only LLM tuning on image–caption data
  • Interleaved vision–text instruction-following on Do-Undo tuples
  • Supervised vision–language understanding (LLaVA-OV dataset)

Image generation leverages rectified flow matching in a 16-channel VAE latent space (downsampled $\times 8$). Consistency is enforced via the $L_1$ loss $\mathcal{L}_c$ between the original and twice-transformed (do + undo) image. KV caching is used for efficient autoregressive sampling (Mahajan et al., 15 Dec 2025).
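The alternating schedule described above could be organized along the lines of the following sketch, which simply round-robins over the three supervision sources. The actual mixing ratios, batch construction, and scheduling used by the authors are not specified in this summary, and the helper names are hypothetical.

```python
import itertools

def alternate_training(caption_batches, do_undo_batches, understanding_batches,
                       steps: int):
    """Round-robin over the three supervision sources (hypothetical helper).

    Each *_batches argument is any iterable of batches; the generator yields
    (source_name, batch) pairs that a training loop could dispatch on,
    applying the loss appropriate for that source.
    """
    sources = [
        ("text_caption", itertools.cycle(caption_batches)),
        ("do_undo_interleaved", itertools.cycle(do_undo_batches)),
        ("llava_ov_understanding", itertools.cycle(understanding_batches)),
    ]
    for step in range(steps):
        name, it = sources[step % len(sources)]
        yield name, next(it)

# Toy usage with placeholder batch lists.
for name, batch in alternate_training(["c1", "c2"], ["d1"], ["u1", "u2"], steps=6):
    pass  # dispatch `batch` to the objective matching `name`
```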

4. Evaluation Protocol and Metrics

Performance is assessed separately on both the forward (“Do”) and reverse (“Undo”) stages using multiple axes:

  • Feature similarity: DINO-based cosine similarity between $\mathbf{I}_F$ and $\hat{\mathbf{I}}_F$ (DINO-F) and between $\mathbf{I}_o$ and $\hat{\mathbf{I}}_R$ (DINO-R).
  • Semantic alignment: CLIP score between generated images and ground-truth image captions (CLIP-T-F, CLIP-T-R).
  • Optical-flow consistency: RAFT-based end-point error (OF-F), and reversibility consistency $|\mathrm{EPE}(\mathbf{I}_o \to \hat{\mathbf{I}}_F) - \mathrm{EPE}(\mathbf{I}_F \to \hat{\mathbf{I}}_R)|$, with the ideal value tending to zero for perfect reversibility.
  • MLLM judge: Gemini-2 scores on four axes (0–10): instruction fidelity (IF), identity preservation (IDP), temporal coherence (TC), object consistency (OC).
  • Human preference: direct comparison tests (48 instances) collecting human annotator preferences.

These metrics jointly probe not only visual fidelity but also cause–effect coherence, explicit reversibility, and robustness to prompt formulation (Mahajan et al., 15 Dec 2025).
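For illustration, the sketch below computes two of these metrics from precomputed quantities: DINO cosine similarity from feature vectors, and the reversibility gap from RAFT flow fields. It interprets $\mathrm{EPE}(A \to B)$ as the mean magnitude of the estimated flow from $A$ to $B$, which is an assumption of this summary; the feature-extractor and flow-model calls themselves are omitted.

```python
import numpy as np

def dino_cosine_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine similarity between two precomputed DINO feature vectors."""
    return float(np.dot(feat_a, feat_b) /
                 (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))

def mean_flow_magnitude(flow: np.ndarray) -> float:
    """Mean magnitude of a dense flow field of shape (H, W, 2)."""
    return float(np.linalg.norm(flow, axis=-1).mean())

def reversibility_epe_gap(flow_do: np.ndarray, flow_undo: np.ndarray) -> float:
    """|EPE(I_o -> I_F_hat) - EPE(I_F -> I_R_hat)|: zero for perfect reversibility.

    flow_do and flow_undo are assumed to be RAFT flows between the original
    image and the forward prediction, and between the post-action image and
    the reverse prediction, respectively.
    """
    return abs(mean_flow_magnitude(flow_do) - mean_flow_magnitude(flow_undo))

# Toy usage with random stand-ins for real features and flow fields.
rng = np.random.default_rng(0)
sim = dino_cosine_similarity(rng.normal(size=384), rng.normal(size=384))
gap = reversibility_epe_gap(rng.normal(size=(240, 320, 2)),
                            rng.normal(size=(240, 320, 2)))
```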

5. Baseline Comparisons and Empirical Outcomes

Comparative assessment among Qwen-Image-2509 (zero-shot), FluxKontext, BAGEL, and two Do-Undo model variants (with and without explicit consistency term) reveals the following “long prompt” quantitative results:

| Method | DINO-R↑ | DINO-F↑ | CLIP-T-R↑ | CLIP-T-F↑ | EPE(OF)↓ | OF-F↓ |
|---|---|---|---|---|---|---|
| Qwen-Image | 0.861 | 0.788 | 0.196 | 0.188 | 0.050 | 78.78 |
| BAGEL | 0.812 | 0.770 | 0.274 | 0.275 | 0.102 | 79.20 |
| FluxKontext | 0.797 | 0.757 | 0.253 | 0.277 | 0.120 | 79.20 |
| Do-Undo | 0.807 | 0.764 | 0.268 | 0.270 | 0.122 | 85.40 |
| Do-Undo (c) | 0.795 | 0.744 | 0.264 | 0.265 | 0.113 | 84.05 |

MLLM judgements (Gemini-2):

| Method | IF↑ | IDP↑ | TC↑ | OC↑ |
|---|---|---|---|---|
| Qwen-Image | 8.44 | 9.35 | 8.15 | 9.18 |
| BAGEL | 7.97 | 9.30 | 8.17 | 9.26 |
| FluxKontext | 6.69 | 8.82 | 6.91 | 8.81 |
| Do-Undo | 8.05 | 9.12 | 8.03 | 9.06 |
| Do-Undo (c) | 8.02 | 9.11 | 7.94 | 9.09 |

Human preference (forward+reverse): Do-Undo overall preferred 58.3% vs BAGEL 41.6%.

A per-action breakdown indicates that Do-Undo variants match or surpass baseline models on instruction fidelity for key verbs (pick-up, open, place) without substantial loss of identity preservation. Do-Undo exhibits greater robustness to varying prompt length compared to Qwen-Image, which is more prompt-sensitive (Mahajan et al., 15 Dec 2025).

6. Analysis of Failure Modes and Model Limitations

  • Physical reversibility remains difficult: Models often “half-do” actions, evidenced by incomplete object removal or addition and underwhelming DINO-F scores.
  • Hallucination remains a problem: Occasional introduction of spurious objects (phantom chopping boards, extraneous jars) by Gemini and FluxKontext demonstrates ongoing limitations in grounding and state fidelity.
  • Overweighting of the consistency loss $\mathcal{L}_c$: Excessive prioritization of reversibility degrades the fidelity of the forward transformation, highlighting the optimization trade-off between “do” and “undo” quality.
  • OOD actions: Despite limited training exposure, Do-Undo finetuning produces plausible scene changes for rare or structurally novel verbs (rotate, stack, disassemble), suggesting some emergent compositional priors in the VLM latent space.
  • Prompt-length sensitivity: Short prompts favor Qwen-Image, while BAGEL and Do-Undo maintain performance stability under both short and expanded prompts (Mahajan et al., 15 Dec 2025).

A plausible implication is that physics-aware data curation and explicit consistency training targets are essential but insufficient: further advances may require explicit integration of physical simulation or dynamics-prior modules.

7. Broader Implications and Future Directions

The Do-Undo task and benchmark supply a rigorous, interpretable framework for evaluating causally-consistent physical reasoning in multimodal generative systems. The reversible-action requirement operationalizes a “mental simulation” capacity valuable for embodied AI, advanced robotics, and physical scene understanding. Inclusion of real-world video-derived actions grounds predictions in empirical state transitions.

Future work could extend Do-Undo to richer action spaces, integrate explicit or differentiable physics engines, or support hierarchical/branched action chains. Mitigating hallucination and false state transitions, alongside scaling to longer action-temporal trajectories, remain open research problems. Adoption of benchmarks structurally analogous to Do-Undo is likely to accelerate progress in robust, physically-grounded generative modeling for instruction-following agents and simulation-augmented VLMs (Mahajan et al., 15 Dec 2025).
