Perturbation-invariant web understanding for MLLMs

Develop multimodal large language models (MLLMs) for web understanding whose predictions stay consistent and semantically faithful under diverse non-semantic perturbations of a webpage: color shifts of UI elements, text-level edits to button labels (e.g., homoglyph substitutions and altered spacing), and DOM/layout rearrangements. The goal is perturbation-invariant behavior, meaning the model produces semantically equivalent predictions for the clean and perturbed screenshots of the same page.
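To make the text-level edits concrete, the following is a minimal sketch of how a button label might be perturbed without changing its meaning to a human reader. The substitution table, the function names `homoglyph_substitute` and `add_spacing`, and the `every` parameter are illustrative assumptions; they are not taken from WebRSSBench, whose exact perturbation procedure is not described here.

```python
# Illustrative only: a tiny Latin-to-Cyrillic look-alike table (an assumption,
# not the benchmark's actual substitution set).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
}


def homoglyph_substitute(label: str, every: int = 2) -> str:
    """Swap every `every`-th substitutable character for a visual look-alike."""
    if every < 1:
        raise ValueError("every must be >= 1")
    out = []
    seen = 0  # count of substitutable characters encountered so far
    for ch in label:
        repl = HOMOGLYPHS.get(ch)
        if repl is not None:
            seen += 1
            if seen % every == 0:
                out.append(repl)
                continue
        out.append(ch)
    return "".join(out)


def add_spacing(label: str, sep: str = " ") -> str:
    """Insert `sep` between every pair of adjacent characters."""
    return sep.join(label)


if __name__ == "__main__":
    perturbed = homoglyph_substitute("search", every=1)
    # The strings render alike but differ at the code-point level.
    print(perturbed == "search", len(perturbed))
    print(add_spacing("Buy"))
```

A perturbation-invariant model would be expected to infer the same button function for the original and the perturbed label, since the rendered text still reads the same to a person.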

Background

WebRSSBench introduces adversarial evaluations that perturb webpages along three non-semantic dimensions (color, text, and layout) and measures whether models preserve their predictions across original and perturbed screenshots. Robust behavior requires identifying the same primary call-to-action (CTA) under color changes, inferring the same button function under text edits, and producing consistent page-purpose summaries under layout modifications.
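One simple way to quantify "preserving predictions" is the fraction of pages on which the prediction for the perturbed screenshot matches the prediction for the clean one. The sketch below assumes predictions are available as strings; `consistency_rate`, `normalize`, and the pluggable `same` comparator are hypothetical names, and the metric WebRSSBench actually reports may be defined differently.

```python
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def consistency_rate(
    clean_preds: Sequence[str],
    perturbed_preds: Sequence[str],
    same: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Fraction of paired predictions judged equivalent.

    `same` decides equivalence for one pair. The default is a normalized exact
    match, which suits short answers such as a CTA label or a button function.
    Free-form page summaries would need a softer comparator (an assumption left
    to the caller, e.g. an embedding-similarity threshold).
    """
    if len(clean_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        raise ValueError("at least one prediction pair is required")
    if same is None:
        same = lambda a, b: normalize(a) == normalize(b)
    matches = sum(1 for a, b in zip(clean_preds, perturbed_preds) if same(a, b))
    return matches / len(clean_preds)


if __name__ == "__main__":
    clean = ["Sign up", "Search"]
    perturbed = ["sign  up", "Submit"]
    print(consistency_rate(clean, perturbed))  # 0.5
```

Keeping the comparator pluggable reflects that the three tasks differ in output type: a CTA or button function can be compared almost exactly, while two summaries can be worded differently and still describe the same page purpose.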

Empirical results show widespread degradation and instability across state-of-the-art MLLMs under these perturbations, indicating that current systems depend too heavily on superficial cues and lack stable, abstraction-driven reasoning. The authors therefore identify perturbation-invariant web understanding as an open challenge.

References

This underscores that perturbation-invariant web understanding remains an open challenge, requiring models that can abstract away from surface-level cues and maintain consistency under diverse non-semantic variations.

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety (2509.21782 - Liu et al., 26 Sep 2025) in Appendix, Subsection "Key Findings" (bullet: Robustness is fragile under minor perturbations)