Mixture-of-Grounding (MoG) for Web Agents
- Mixture-of-Grounding (MoG) is a multimodal architecture that fuses visual, SoM-hybrid, and DOM experts to resolve grounding from natural language intents to UI elements.
- It integrates expert outputs via a softmax-gated fusion mechanism, translating multimodal inputs into precise pixel coordinates or DOM selectors.
- Empirical evaluations on benchmarks like Online-Mind2Web show MoG enhances task success rates while reducing error modes compared to single-modality approaches.
Mixture-of-Grounding (MoG), as operationalized in the Avenir-Web agent, refers to a formal mixture-of-experts (MoE) architecture designed to robustly resolve grounding—that is, the mapping from natural language action intent to precise UI elements or coordinates—on complex, dynamic web interfaces. MoG fuses three specialized "grounding expert" modules, each focused on a distinct modality (visual, set-of-mark hybrid, and structural/DOM), coordinated by a softmax-gated fusion mechanism. This composite system addresses the limitations of unimodal or single-expert grounding, mitigating error modes arising from both visual layout complexity and semantic ambiguity within the Document Object Model (DOM).
1. Integration of Mixture-of-Grounding into Web Agents
The MoG module functions at the core of Avenir-Web’s execution loop, bridging the agent’s high-level “action intent” (produced by an LLM backbone) to a grounded web action. At each decision step, the system takes as input a viewport screenshot $I_t$, the structured DOM representation $D_t$, and an embedding $q_t$ of the current action intent. MoG consumes these multimodal inputs and emits a concrete grounding: either a pixel coordinate for visual interaction or a DOM element selector for programmatic manipulation (Li et al., 2 Feb 2026).
The integration sequence is summarized as follows (paraphrased from the agent pipeline):
- User Instruction → Experience-Imitation Planner → Task Checklist & Perception (Image + DOM)
- Action Intent → MoG
- MoG outputs a grounded action for the browser environment
- Agent updates outcome, checklist, and memory
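The loop above can be sketched in Python. The `env`, `planner`, and `memory` interfaces here are hypothetical stand-ins for illustration, not APIs described in the paper:

```python
from dataclasses import dataclass

@dataclass
class Grounding:
    """A resolved action target: pixel coordinates or a DOM selector."""
    kind: str       # "pixel" or "selector"
    value: object   # (x, y) tuple or a CSS/XPath selector string

def agent_step(env, planner, mog, memory):
    """One decision step of a MoG-centered agent loop (hypothetical interfaces)."""
    screenshot = env.screenshot()          # viewport image I_t
    dom = env.dom_snapshot()               # structured DOM representation D_t
    intent = planner.next_intent(memory)   # high-level action intent from the LLM
    grounding = mog.ground(screenshot, dom, intent)  # MoG resolves the target
    outcome = env.execute(intent, grounding)         # click/type at the target
    memory.update(intent, grounding, outcome)        # checklist + memory update
    return outcome
```

The key design point is that MoG sits between planning and execution: the planner never emits coordinates directly, only intents that MoG grounds.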
2. Formal Structure: Experts, Gating, and Aggregation
2.1 Expert Feature Representations
Each MoG expert encodes its respective modality:
- Visual expert $E_v$: $h_v = f_v(I_t)$, employing a ViT or CNN backbone to produce a feature embedding from raw pixels.
- Set-of-Mark (SoM) hybrid expert $E_s$: $h_s = f_s(I_t, M_t)$, where $M_t$ is an overlaid tag map of detected interactive regions (links, buttons, etc.) annotated by ID.
- Structural/DOM expert $E_d$: $h_d = f_d(D_t)$, capturing hierarchical and attribute-rich DOM structure using transformers or GNNs.
Each expert’s encoder maps its input to a feature vector in $\mathbb{R}^d$.
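A toy sketch of the shared encoder interface, using random linear maps as stand-ins for the learned ViT/VLM/GNN encoders; the input dimensions are illustrative, not from the paper:

```python
import numpy as np

D = 8  # shared embedding dimension d for all experts (illustrative)
rng = np.random.default_rng(0)

def linear_encoder(in_dim, out_dim=D):
    """Toy stand-in for a learned encoder (ViT/CNN, vision-LLM, or GNN)."""
    W = rng.normal(size=(out_dim, in_dim)) / np.sqrt(in_dim)
    return lambda x: W @ np.asarray(x, dtype=float)

f_v = linear_encoder(16)   # visual expert: flattened screenshot features -> h_v
f_s = linear_encoder(20)   # SoM expert: screenshot + tag-map features -> h_s
f_d = linear_encoder(12)   # structural expert: DOM node features -> h_d

h_v = f_v(rng.normal(size=16))
h_s = f_s(rng.normal(size=20))
h_d = f_d(rng.normal(size=12))
```

The essential constraint is only that all three encoders land in the same $\mathbb{R}^d$, so the gating network can consume their concatenation.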
2.2 Gating Network
Expert features are concatenated with the intent embedding: $z_t = [h_v;\, h_s;\, h_d;\, q_t]$. The gating network computes mixture weights:

$$\alpha = \mathrm{softmax}(W_g z_t + b_g), \qquad \alpha = (\alpha_v, \alpha_s, \alpha_d)$$

These coefficients softly select and combine grounding proposals from each expert.
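A minimal NumPy sketch of the gating step; the feature dimensions and random parameters are illustrative (the paper does not specify them):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gate_weights(h_v, h_s, h_d, q, W_g, b_g):
    """Mixture weights alpha = softmax(W_g [h_v; h_s; h_d; q] + b_g)."""
    z = np.concatenate([h_v, h_s, h_d, q])   # fused features + intent embedding
    return softmax(W_g @ z + b_g)            # one weight per expert, sums to 1

# toy dimensions: d = 4 per feature vector, 3 experts
rng = np.random.default_rng(0)
d = 4
W_g = rng.normal(size=(3, 4 * d))
b_g = np.zeros(3)
alpha = gate_weights(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=d), rng.normal(size=d), W_g, b_g)
```

Because the intent embedding is part of the gating input, the same page can route to different experts depending on what the agent is trying to do.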
2.3 Output Fusion
Each expert $k \in \{v, s, d\}$ proposes a grounding:
- For pixel tasks: $\hat{p}_k = (\hat{x}_k, \hat{y}_k)$, a continuous viewport coordinate
- For selector tasks: $p_k(n \mid I_t, D_t, q_t)$, a probability distribution over candidate DOM nodes
Aggregated outputs are computed as:
- Pixel: $\hat{p} = \sum_k \alpha_k \, \hat{p}_k$
- Selector: $p(n) = \sum_k \alpha_k \, p_k(n \mid I_t, D_t, q_t)$
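The two aggregation rules can be checked numerically; the weights and proposals below are made-up values for illustration:

```python
import numpy as np

def fuse_pixel(alpha, proposals):
    """Weighted average of per-expert (x, y) coordinate proposals."""
    return np.einsum("k,kj->j", alpha, np.asarray(proposals, dtype=float))

def fuse_selector(alpha, dists):
    """Mixture of per-expert categorical distributions over DOM nodes."""
    return np.einsum("k,kn->n", alpha, np.asarray(dists, dtype=float))

alpha = np.array([0.5, 0.3, 0.2])                  # gating weights
pixel = fuse_pixel(alpha, [(100, 40), (110, 44), (90, 38)])
node_dist = fuse_selector(alpha, [[0.7, 0.2, 0.1],
                                  [0.6, 0.3, 0.1],
                                  [0.1, 0.8, 0.1]])
```

Since each expert's selector output is a proper distribution and the gating weights sum to one, the fused selector output is again a valid distribution over DOM nodes.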
This design ensures flexible adaptation across web environments of varying visual and structural regularities.
3. Grounding Expert Specializations
The three MoG experts operate as follows:
| Expert Name | Primary Modality | Main Architecture | Output Type |
|---|---|---|---|
| Visual ($E_v$) | Raw RGB screenshot | ViT/CNN + cross-attention | $[x, y]$ continuous |
| SoM-Hybrid ($E_s$) | Screenshot + tag map (regions + ID) | Vision-LLM with ID tokens | Region ID (discrete) |
| Structural ($E_d$) | Cleaned DOM/accessibility tree | Tree-transformer or GNN + node classifier | DOM node distribution |
The visual expert is robust to complex overlays and visually rich UIs; SoM-hybrid handles explicit region tagging and token-level disambiguation; the structural expert exploits explicit DOM/semantic structure for fine-grained element discrimination.
4. End-to-End Training Protocol
MoG is trained end-to-end on supervised grounding datasets of the form $\mathcal{D} = \{(I_t, D_t, q_t, y_t)\}$, where $y_t$ is the target coordinate or DOM node. The objective minimizes the negative log-likelihood of the target grounding under the overall mixture:

$$\mathcal{L} = -\log \sum_k \alpha_k \, p_k(y_t \mid I_t, D_t, q_t)$$

Here, $p_k(y_t \mid \cdot)$ is a Gaussian likelihood for coordinates or a categorical probability for selectors. Expert-specific and mixture losses are decomposed as:
- Expert loss: $\mathcal{L}_k = -\log p_k(y_t \mid I_t, D_t, q_t)$
- Mixture loss: $\mathcal{L}_{\text{mix}} = -\log \sum_k \alpha_k \exp(-\mathcal{L}_k)$
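Both likelihood forms can be sketched directly. This assumes a unit-variance isotropic Gaussian for coordinate targets; the actual variance parameterization is not specified in the source:

```python
import numpy as np

def mixture_nll_selector(alpha, expert_dists, target_idx):
    """-log sum_k alpha_k * p_k(y) for a categorical (selector) target."""
    p = sum(a * d[target_idx] for a, d in zip(alpha, expert_dists))
    return -np.log(p)

def mixture_nll_pixel(alpha, expert_means, target, sigma=1.0):
    """-log sum_k alpha_k * N(y; mu_k, sigma^2 I) for a 2D coordinate target."""
    target = np.asarray(target, dtype=float)
    lik = 0.0
    for a, mu in zip(alpha, expert_means):
        diff = target - np.asarray(mu, dtype=float)
        # 2D isotropic Gaussian density with assumed variance sigma^2
        lik += a * np.exp(-0.5 * diff @ diff / sigma**2) / (2 * np.pi * sigma**2)
    return -np.log(lik)
```

Note that the mixture NLL differentiates through both the gating weights and the expert likelihoods, which is what lets gradients jointly shape routing and grounding.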
Gradients are backpropagated jointly through the gating weights and each expert’s encoder parameters. Weight decay regularizes the model to prevent overfitting and expert collapse.
5. Empirical Advantages and Error Analysis
Avenir-Web’s MoG module mitigates characteristic failure cases in grounding:
- Layout or overlay features (e.g., iframes) that disrupt pure-DOM approaches are handled by the visual expert.
- Semantically ambiguous cases (such as repeated buttons) where coordinate-only experts fail are resolved by the structural or SoM experts.
Empirical evidence underscores MoG’s effectiveness: On the Online-Mind2Web benchmark, full Avenir-Web (with MoG) achieves 53.7% task success (Gemini 3 Pro backbone). Ablation replacing MoG with a visual-only expert reduces performance on a 50-task subset from 48.0% to 40.0%. This 8-point absolute drop demonstrates the necessity of multimodal expert fusion; no single grounding modality consistently suffices (Li et al., 2 Feb 2026).
Further, MoG yields both higher accuracy and lower latency, since it almost always resolves a grounding with a single MLLM call per step, in contrast to the multi-round resolution required by SoM-centric chains.
| Configuration | Success Rate (50-task subset) |
|---|---|
| Full Avenir-Web (MoG) | 48.0% |
| Without MoG (visual-only grounding) | 40.0% |
6. Significance and Design Implications
Mixture-of-Grounding embeds expert specialization and multimodal fusion into web agent grounding, directly addressing brittle points in both perception (visual complexity) and structure (DOM idiosyncrasies) within modern web interfaces. The formal mixture weights enable dynamic allocation of "trust" across modalities, while the capacity to backpropagate throughout the fusion enables optimal conditional weighting.
A plausible implication is that, as web interfaces grow in dynamism and complexity, mixture-based grounding architectures will become central for robust multimodal agents operating in real-world environments. MoG’s demonstrated advantage in both live success rates and computational efficiency establishes a practical baseline for future research on web action grounding (Li et al., 2 Feb 2026).