Determining Helpful Intermediate Visual Representations for Explicit Supervision

Determine which intermediate visual representations, if any, would be helpful to provide as explicit supervision when training Large Multimodal Models (LMMs) on tasks that require selecting the image most visually similar to a reference image, a setting in which effective intermediate steps are not obvious to specify.

Background

The paper argues that current Large Multimodal Models are predominantly text-centric and struggle with tasks that require rich visual reasoning. Many contemporary approaches attempt to improve performance by supervising intermediate visual steps (e.g., bounding boxes or helper images), but such supervision imposes human-designed biases and is costly to annotate.
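To make the contrast concrete, the following is a minimal PyTorch sketch of explicit intermediate supervision, not taken from the paper: a hypothetical model with two heads is trained on both the final answer and a human-annotated bounding box, with the loss weighting chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs of a multimodal model with two heads (illustrative only):
# token logits for the final answer and a predicted intermediate bounding box.
batch, seq_len, vocab = 2, 8, 100
answer_logits = torch.randn(batch, seq_len, vocab)
answer_targets = torch.randint(0, vocab, (batch, seq_len))
pred_boxes = torch.rand(batch, 4)  # predicted (x1, y1, x2, y2), normalized
gold_boxes = torch.rand(batch, 4)  # human-annotated intermediate supervision

# Explicit supervision: the objective rewards matching the human-designed
# intermediate representation in addition to producing the right answer.
answer_loss = F.cross_entropy(
    answer_logits.reshape(-1, vocab), answer_targets.reshape(-1)
)
box_loss = F.l1_loss(pred_boxes, gold_boxes)
loss = answer_loss + 0.5 * box_loss  # 0.5 is an arbitrary assumed weight
```

The annotation cost and the designer's bias both enter through gold_boxes: someone must decide that boxes are the right abstraction for the task and then label them for every training example.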

In the introduction, the authors highlight a concrete challenge in a visual similarity selection task: even if one wanted to use explicit supervision, it is unclear which intermediate visual abstractions would actually aid the model. This motivates their task-agnostic method (LIVR), which learns latent, implicit visual reasoning tokens without specifying such intermediates. The unresolved question is to identify which intermediate representations, if any, are genuinely helpful for these tasks under explicit supervision.
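By way of contrast, here is a minimal sketch, not LIVR's actual architecture, of the general idea of latent reasoning tokens: learnable embeddings appended to the multimodal input sequence that receive no direct labels and are shaped only by gradients from the final-answer loss.

```python
import torch
import torch.nn as nn


class LatentReasoningTokens(nn.Module):
    """Learnable latent tokens appended to the input sequence (illustrative sketch).

    Unlike explicitly supervised intermediates, these tokens have no gold
    targets; they are trained end to end by the final-answer loss alone.
    """

    def __init__(self, num_tokens: int = 8, dim: int = 512):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # embeds: (batch, seq, dim) image+text embeddings from the LMM backbone
        batch = embeds.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([embeds, latents], dim=1)


# The extended sequence would be fed to the decoder; only the answer
# cross-entropy backpropagates into the latent tokens.
module = LatentReasoningTokens()
x = torch.randn(2, 16, 512)
print(module(x).shape)  # torch.Size([2, 24, 512])
```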

References

"Training the model with explicit supervision is difficult as well since it is not clear what intermediate visual representations would be helpful to provide to the model."

Latent Implicit Visual Reasoning (2512.21218, Li et al., 24 Dec 2025), Section 1, Introduction (discussion around the visual similarity example)