
Mixture-of-Grounding (MoG) for Web Agents

Updated 16 March 2026
  • Mixture-of-Grounding (MoG) is a multimodal architecture that fuses visual, SoM-hybrid, and DOM experts to resolve grounding from natural language intents to UI elements.
  • It integrates expert outputs via a softmax-gated fusion mechanism, translating multimodal inputs into precise pixel coordinates or DOM selectors.
  • Empirical evaluations on benchmarks like Online-Mind2Web show MoG enhances task success rates while reducing error modes compared to single-modality approaches.

Mixture-of-Grounding (MoG), as operationalized in the Avenir-Web agent, refers to a formal mixture-of-experts (MoE) architecture designed to robustly resolve grounding—that is, the mapping from natural language action intent to precise UI elements or coordinates—on complex, dynamic web interfaces. MoG fuses three specialized "grounding expert" modules, each focused on a distinct modality (visual, set-of-mark hybrid, and structural/DOM), coordinated by a softmax-gated fusion mechanism. This composite system addresses the limitations of unimodal or single-expert grounding, mitigating error modes arising from both visual layout complexity and semantic ambiguity within the Document Object Model (DOM).

1. Integration of Mixture-of-Grounding into Web Agents

The MoG module functions at the core of Avenir-Web’s execution loop, bridging the agent’s high-level “action intent” (produced by an LLM backbone) to a grounded web action. At each decision step, the system inputs a viewport screenshot $I_t \in \mathbb{R}^{H \times W \times 3}$, the structured DOM representation $D_t$, and an embedding of the present action intent $\ell_t \in \mathbb{R}^d$. MoG consumes these multimodal inputs and emits a concrete grounding: either a pixel coordinate for visual interaction or a DOM element selector for programmatic manipulation (Li et al., 2 Feb 2026).

The integration sequence is summarized as follows (paraphrased from the agent pipeline):

  • User Instruction → Experience-Imitation Planner → Task Checklist & Perception (Image + DOM)
  • Action Intent → MoG
  • MoG outputs grounded action for browser environment
  • Agent updates outcome, checklist, and memory
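The loop above can be sketched as a minimal step function. This is an illustrative skeleton, not the Avenir-Web implementation: `mog`, the `Grounding` container, and the environment methods `click`/`select` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Grounding:
    """MoG output: a pixel coordinate for visual actions, or a DOM selector."""
    pixel: Optional[Tuple[float, float]] = None
    selector: Optional[str] = None

def agent_step(screenshot, dom, intent, mog, env):
    """One decision step: ground the planner's intent, then act on it."""
    grounding = mog(screenshot, dom, intent)   # fuse the three experts
    if grounding.pixel is not None:
        return env.click(*grounding.pixel)     # visual interaction
    return env.select(grounding.selector)      # programmatic DOM action
```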

2. Formal Structure: Experts, Gating, and Aggregation

2.1 Expert Feature Representations

Each MoG expert encodes its respective modality:

  • Visual expert $E_v$: $f_v = \phi_v(I_t)$, employing a ViT or CNN backbone to produce a feature embedding from raw pixels.
  • Set-of-Mark (SoM) hybrid expert $E_s$: $f_s = \phi_s(I_t, T_t)$, where $T_t$ is an overlaid tag map of detected interactive regions (links, buttons, etc.) annotated by ID.
  • Structural/DOM expert $E_d$: $f_d = \phi_d(D_t)$, capturing hierarchical and attribute-rich DOM structure using transformers or GNNs.

Each expert’s encoder $\phi_i$ maps its input to a feature vector in $\mathbb{R}^m$.

2.2 Gating Network

Expert features are concatenated with the intent embedding: $h_i = [f_i; \ell_t] \in \mathbb{R}^{m+d}$. The gating network computes mixture weights:

$$g_i = w_g^\top h_i + b_g, \qquad \alpha_i = \frac{\exp(g_i)}{\sum_{j \in \{v, s, d\}} \exp(g_j)}$$

These $\alpha_i$ coefficients softly select and combine grounding proposals from each expert.
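The gating computation above admits a short sketch in pure Python. The shared weight vector `w_g` and bias `b_g` mirror the equation; the list-of-vectors input shape is an illustrative assumption.

```python
import math
from typing import List, Sequence

def gate(h: List[Sequence[float]], w_g: Sequence[float], b_g: float) -> List[float]:
    """Softmax gate: h[i] is expert i's feature concatenated with the intent
    embedding; returns mixture weights alpha_i that sum to 1."""
    logits = [sum(w * x for w, x in zip(w_g, h_i)) + b_g for h_i in h]
    m = max(logits)                       # subtract max to stabilize the softmax
    exps = [math.exp(g - m) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]
```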

2.3 Output Fusion

Each expert proposes a grounding:

  • For pixel tasks: $y_i = (x_i, y_i)$
  • For selector tasks: $p_i(k)$, a probability distribution over candidate DOM nodes

Aggregated outputs are computed as:

  • Pixel: $\hat{x} = \sum_i \alpha_i x_i$, $\hat{y} = \sum_i \alpha_i y_i$
  • Selector: $p_{\text{final}}(k) = \sum_i \alpha_i p_i(k)$, $\hat{k} = \arg\max_k p_{\text{final}}(k)$
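The aggregation rules above translate directly into code. A minimal sketch (pure Python; argument names and shapes are illustrative, not taken from the paper):

```python
from typing import List, Sequence, Tuple

def fuse_pixel(alpha: Sequence[float],
               coords: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Gate-weighted average of per-expert (x, y) coordinate proposals."""
    x_hat = sum(a * x for a, (x, _) in zip(alpha, coords))
    y_hat = sum(a * y for a, (_, y) in zip(alpha, coords))
    return x_hat, y_hat

def fuse_selector(alpha: Sequence[float],
                  dists: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Gate-weighted mixture over per-expert DOM-node distributions,
    plus the argmax node index."""
    n = len(dists[0])
    p_final = [sum(a * d[k] for a, d in zip(alpha, dists)) for k in range(n)]
    return p_final, max(range(n), key=p_final.__getitem__)
```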

This design ensures flexible adaptation across web environments of varying visual and structural regularities.

3. Grounding Expert Specializations

The three MoG experts operate as follows:

| Expert Name | Primary Modality | Main Architecture | Output Type |
| --- | --- | --- | --- |
| Visual ($E_v$) | Raw RGB screenshot | ViT/CNN + cross-attention | $[x, y]$ continuous |
| SoM-Hybrid ($E_s$) | Screenshot + tag map (regions + IDs) | Vision-LLM with ID tokens | Region ID (discrete) |
| Structural ($E_d$) | Cleaned DOM/accessibility tree | Tree-transformer or GNN + node classifier | DOM node distribution |

The visual expert is robust to complex overlays and visually rich UIs; SoM-hybrid handles explicit region tagging and token-level disambiguation; the structural expert exploits explicit DOM/semantic structure for fine-grained element discrimination.

4. End-to-End Training Protocol

MoG is trained end-to-end on supervised grounding datasets of the form $(I^n, D^n, \ell^n, y^n)$. The objective minimizes the negative log-likelihood of the target grounding $y^*$ under the overall mixture:

$$L(\theta) = -\log \left[ \sum_{i \in \{v, s, d\}} \alpha_i \, P_i(y^* \mid I, D, \ell) \right] + \lambda \sum_i \| \theta_i \|^2$$

Here, $P_i$ is a Gaussian likelihood for coordinates or a categorical probability for selectors. Expert-specific and mixture losses are decomposed as:

  • Expert loss: $L_i = -\log P_i(y^*)$
  • Mixture loss: $L_{\text{mix}} = -\log \sum_i \alpha_i \exp(-L_i)$

Gradients are backpropagated jointly through the gating weights and each expert’s encoder parameters. Weight decay $\lambda$ regularizes the model to prevent overfitting and expert collapse.
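The mixture loss above can be sketched with a log-sum-exp for numerical stability (pure Python; `alpha` are the gate weights and `nlls` the per-expert losses $L_i$, names illustrative):

```python
import math
from typing import Sequence

def mixture_loss(alpha: Sequence[float], nlls: Sequence[float]) -> float:
    """L_mix = -log sum_i alpha_i * exp(-L_i), computed via log-sum-exp
    so that large per-expert losses do not underflow exp(-L_i)."""
    terms = [math.log(a) - L for a, L in zip(alpha, nlls)]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))
```

Note that a gate routing all mass to one expert recovers that expert's loss exactly, while a mixed gate interpolates between the experts in probability space rather than loss space.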

5. Empirical Advantages and Error Analysis

Avenir-Web’s MoG module mitigates characteristic failure cases in grounding:

  • Layout or overlay features (e.g., iframes) that disrupt pure-DOM approaches are handled by the visual expert.
  • Semantically ambiguous cases (such as repeated buttons) where coordinate-only experts fail are resolved by the structural or SoM experts.

Empirical evidence underscores MoG’s effectiveness: On the Online-Mind2Web benchmark, full Avenir-Web (with MoG) achieves 53.7% task success (Gemini 3 Pro backbone). Ablation replacing MoG with a visual-only expert reduces performance on a 50-task subset from 48.0% to 40.0%. This 8-point absolute drop demonstrates the necessity of multimodal expert fusion; no single grounding modality consistently suffices (Li et al., 2 Feb 2026).

Further, MoG yields both higher accuracy and lower latency, typically requiring only a single MLLM call per step, in contrast to the multi-round resolution required by SoM-centric chains.

| Configuration (50-task subset) | Success Rate |
| --- | --- |
| Full Avenir-Web (MoG) | 48.0% |
| Without MoG (visual-only grounding) | 40.0% |

6. Significance and Design Implications

Mixture-of-Grounding embeds expert specialization and multimodal fusion into web agent grounding, directly addressing brittle points in both perception (visual complexity) and structure (DOM idiosyncrasies) within modern web interfaces. The formal mixture weights enable dynamic allocation of "trust" across modalities, while end-to-end backpropagation through the fusion allows the gating to learn conditional weighting.

A plausible implication is that, as web interfaces grow in dynamism and complexity, mixture-based grounding architectures will become central for robust multimodal agents operating in real-world environments. MoG’s demonstrated advantage in both live success rates and computational efficiency establishes a practical baseline for future research on web action grounding (Li et al., 2 Feb 2026).

