Perturbation-invariant web understanding for MLLMs
Develop multimodal large language models (MLLMs) for web understanding whose predictions stay consistent and semantically faithful under diverse non-semantic perturbations of webpages. Such perturbations include color shifts of UI elements, text-level edits to button labels (e.g., homoglyph substitutions and altered spacing), and DOM/layout rearrangements. The goal is perturbation-invariant behavior across clean and perturbed screenshots of the same page.
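To make the text-level perturbations and the notion of consistency concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual procedure: the `HOMOGLYPHS` table, the function names, and the `consistency_rate` metric (fraction of pages with an unchanged prediction) are all hypothetical choices made for this example.

```python
import random

# Illustrative Latin -> Cyrillic look-alikes.
# Assumption: this small table is hypothetical, not the benchmark's list.
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}


def homoglyph_perturb(label: str, rate: float = 0.5, seed: int = 0) -> str:
    """Swap some characters for visually similar Unicode code points."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in label
    )


def spacing_perturb(label: str) -> str:
    """Insert a space between every pair of adjacent characters."""
    return " ".join(label)


def consistency_rate(clean_preds, perturbed_preds) -> float:
    """Fraction of pages whose prediction is unchanged after perturbation.

    Assumption: predictions are comparable with ==, e.g. answer strings.
    """
    if len(clean_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        return 0.0
    same = sum(c == p for c, p in zip(clean_preds, perturbed_preds))
    return same / len(clean_preds)
```

In use, one would render the page with the perturbed button label, query the model on both the clean and the perturbed screenshot, and pass the two prediction lists to `consistency_rate`. A perturbation-invariant model would score close to 1.0. Color shifts and DOM/layout rearrangements need a rendering pipeline and are left out of this sketch.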
References
This underscores that perturbation-invariant web understanding remains an open challenge, requiring models that can abstract away from surface-level cues and maintain consistency under diverse non-semantic variations.
— Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety
(2509.21782 - Liu et al., 26 Sep 2025) in Appendix, Subsection "Key Findings" (bullet: Robustness is fragile under minor perturbations)