Determining Helpful Intermediate Visual Representations for Explicit Supervision
Determine which intermediate visual representations, if any, are helpful to provide as explicit supervision when training Large Multimodal Models on tasks that require selecting the image most visually similar to a reference image, a setting in which effective intermediate steps are not obvious to specify.
References
Training the model with explicit supervision is difficult as well since it is not clear what intermediate visual representations would be helpful to provide to the model.
— Latent Implicit Visual Reasoning
(2512.21218 - Li et al., 24 Dec 2025) in Section 1, Introduction (discussion around the visual similarity example)