Effective coupling of visual understanding and image generation for physical realism
Establish methods that effectively couple visual understanding with generative image editing in unified multimodal vision–language architectures, so that improved scene understanding translates into physically realistic edits. This addresses the observed gap in instruction-based image editing, where enhanced understanding alone does not yield better physical realism.
References
This suggests that stronger understanding alone is insufficient, and effectively coupling understanding with generation remains an open challenge.
— PICABench: How Far Are We from Physically Realistic Image Editing?
(2510.17681 - Pu et al., 20 Oct 2025) in Section 4.2 (Benchmark Results)