
Mixture-of-Grounding (MoG) for Web Agents

Updated 16 March 2026
  • Mixture-of-Grounding (MoG) is a multimodal architecture that fuses visual, SoM-hybrid, and DOM experts to resolve grounding from natural language intents to UI elements.
  • It integrates expert outputs via a softmax-gated fusion mechanism, translating multimodal inputs into precise pixel coordinates or DOM selectors.
  • Empirical evaluations on benchmarks like Online-Mind2Web show MoG enhances task success rates while reducing error modes compared to single-modality approaches.

Mixture-of-Grounding (MoG), as operationalized in the Avenir-Web agent, refers to a formal mixture-of-experts (MoE) architecture designed to robustly resolve grounding—that is, the mapping from natural language action intent to precise UI elements or coordinates—on complex, dynamic web interfaces. MoG fuses three specialized "grounding expert" modules, each focused on a distinct modality (visual, set-of-mark hybrid, and structural/DOM), coordinated by a softmax-gated fusion mechanism. This composite system addresses the limitations of unimodal or single-expert grounding, mitigating error modes arising from both visual layout complexity and semantic ambiguity within the Document Object Model (DOM).

1. Integration of Mixture-of-Grounding into Web Agents

The MoG module functions at the core of Avenir-Web’s execution loop, bridging the agent’s high-level “action intent” (produced by an LLM backbone) to a grounded web action. At each decision step, the system inputs a viewport screenshot $I_t \in \mathbb{R}^{H \times W \times 3}$, the structured DOM representation $D_t$, and an embedding of the present action intent $\ell_t \in \mathbb{R}^d$. MoG consumes these multimodal inputs and emits a concrete grounding: either a pixel coordinate for visual interaction or a DOM element selector for programmatic manipulation (Li et al., 2 Feb 2026).

The integration sequence is summarized as follows (paraphrased from the agent pipeline):

  • User Instruction → Experience-Imitation Planner → Task Checklist & Perception (Image + DOM)
  • Action Intent → MoG
  • MoG outputs grounded action for browser environment
  • Agent updates outcome, checklist, and memory
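The loop above can be sketched as a minimal step function. This is an illustrative skeleton, not the Avenir-Web implementation: `mog`, the `Grounding` container, and the environment methods `click`/`select` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Grounding:
    """MoG output: a pixel coordinate for visual actions, or a DOM selector."""
    pixel: Optional[Tuple[float, float]] = None
    selector: Optional[str] = None

def agent_step(screenshot, dom, intent, mog, env):
    """One decision step: ground the planner's intent, then act on it."""
    grounding = mog(screenshot, dom, intent)   # fuse the three experts
    if grounding.pixel is not None:
        return env.click(*grounding.pixel)     # visual interaction
    return env.select(grounding.selector)      # programmatic DOM action
```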

2. Formal Structure: Experts, Gating, and Aggregation

2.1 Expert Feature Representations

Each MoG expert encodes its respective modality:

  • Visual expert $E_v$: $f_v = \phi_v(I_t)$, employing a ViT or CNN backbone to produce a feature embedding from raw pixels.
  • Set-of-Mark (SoM) hybrid expert $E_s$: $f_s = \phi_s(I_t, T_t)$, where $T_t$ is an overlaid tag map of detected interactive regions (links, buttons, etc.) annotated by ID.
  • Structural/DOM expert $E_d$: $f_d = \phi_d(D_t)$, capturing hierarchical and attribute-rich DOM structure using transformers or GNNs.

Each expert’s encoder $\phi_i$ maps its input to a feature vector in $\mathbb{R}^m$.

2.2 Gating Network

Expert features are concatenated with the intent embedding: $h_i = [f_i; \ell_t] \in \mathbb{R}^{m+d}$. The gating network computes mixture weights:

$$g_i = w_g^\top h_i + b_g, \qquad \alpha_i = \frac{\exp(g_i)}{\sum_{j \in \{v, s, d\}} \exp(g_j)}$$

These $\alpha_i$ coefficients softly select and combine grounding proposals from each expert.
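The gating computation above admits a short sketch in pure Python. The shared weight vector `w_g` and bias `b_g` mirror the equation; the list-of-vectors input shape is an illustrative assumption.

```python
import math
from typing import List, Sequence

def gate(h: List[Sequence[float]], w_g: Sequence[float], b_g: float) -> List[float]:
    """Softmax gate: h[i] is expert i's feature concatenated with the intent
    embedding; returns mixture weights alpha_i that sum to 1."""
    logits = [sum(w * x for w, x in zip(w_g, h_i)) + b_g for h_i in h]
    m = max(logits)                       # subtract max to stabilize the softmax
    exps = [math.exp(g - m) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]
```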

2.3 Output Fusion

Each expert proposes a grounding:

  • For pixel tasks: $y_i = (x_i, y_i)$
  • For selector tasks: $p_i(k)$, a probability distribution over candidate DOM nodes

Aggregated outputs are computed as:

  • Pixel: $\hat{x} = \sum_i \alpha_i x_i$, $\hat{y} = \sum_i \alpha_i y_i$
  • Selector: $p_{\text{final}}(k) = \sum_i \alpha_i p_i(k)$, $\hat{k} = \arg\max_k p_{\text{final}}(k)$
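The aggregation rules above translate directly into code. A minimal sketch (pure Python; argument names and shapes are illustrative, not taken from the paper):

```python
from typing import List, Sequence, Tuple

def fuse_pixel(alpha: Sequence[float],
               coords: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Gate-weighted average of per-expert (x, y) coordinate proposals."""
    x_hat = sum(a * x for a, (x, _) in zip(alpha, coords))
    y_hat = sum(a * y for a, (_, y) in zip(alpha, coords))
    return x_hat, y_hat

def fuse_selector(alpha: Sequence[float],
                  dists: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Gate-weighted mixture over per-expert DOM-node distributions,
    plus the argmax node index."""
    n = len(dists[0])
    p_final = [sum(a * d[k] for a, d in zip(alpha, dists)) for k in range(n)]
    return p_final, max(range(n), key=p_final.__getitem__)
```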

This design ensures flexible adaptation across web environments of varying visual and structural regularities.

3. Grounding Expert Specializations

The three MoG experts operate as follows:

| Expert Name | Primary Modality | Main Architecture | Output Type |
| --- | --- | --- | --- |
| Visual ($E_v$) | Raw RGB screenshot | ViT/CNN + cross-attention | $[x, y]$ continuous |
| SoM-Hybrid ($E_s$) | Screenshot + tag map (regions + IDs) | Vision-LLM with ID tokens | Region ID (discrete) |
| Structural ($E_d$) | Cleaned DOM/accessibility tree | Tree-transformer or GNN + node classifier | DOM node distribution |

The visual expert is robust to complex overlays and visually rich UIs; SoM-hybrid handles explicit region tagging and token-level disambiguation; the structural expert exploits explicit DOM/semantic structure for fine-grained element discrimination.

4. End-to-End Training Protocol

MoG is trained end-to-end on supervised grounding datasets of the form $(I^n, D^n, \ell^n, y^n)$. The objective minimizes the negative log-likelihood of the target grounding $y^*$ under the overall mixture:

$$L(\theta) = -\log \left[ \sum_{i \in \{v, s, d\}} \alpha_i \, P_i(y^* \mid I, D, \ell) \right] + \lambda \sum_i \| \theta_i \|^2$$

Here, $P_i$ is a Gaussian likelihood for coordinates or a categorical probability for selectors. Expert-specific and mixture losses are decomposed as:

  • Expert loss: $L_i = -\log P_i(y^*)$
  • Mixture loss: $L_{\text{mix}} = -\log \sum_i \alpha_i \exp(-L_i)$

Gradients are backpropagated jointly through the gating weights and each expert’s encoder parameters. Weight decay $\lambda$ regularizes the model to prevent overfitting and expert collapse.
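The mixture loss above can be sketched with a log-sum-exp for numerical stability (pure Python; `alpha` are the gate weights and `nlls` the per-expert losses $L_i$, names illustrative):

```python
import math
from typing import Sequence

def mixture_loss(alpha: Sequence[float], nlls: Sequence[float]) -> float:
    """L_mix = -log sum_i alpha_i * exp(-L_i), computed via log-sum-exp
    so that large per-expert losses do not underflow exp(-L_i)."""
    terms = [math.log(a) - L for a, L in zip(alpha, nlls)]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))
```

Note that a gate routing all mass to one expert recovers that expert's loss exactly, while a mixed gate interpolates between the experts in probability space rather than loss space.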

5. Empirical Advantages and Error Analysis

Avenir-Web’s MoG module mitigates characteristic failure cases in grounding:

  • Layout or overlay features (e.g., iframes) that disrupt pure-DOM approaches are handled by the visual expert.
  • Semantically ambiguous cases (such as repeated buttons) where coordinate-only experts fail are resolved by the structural or SoM experts.

Empirical evidence underscores MoG’s effectiveness: On the Online-Mind2Web benchmark, full Avenir-Web (with MoG) achieves 53.7% task success (Gemini 3 Pro backbone). Ablation replacing MoG with a visual-only expert reduces performance on a 50-task subset from 48.0% to 40.0%. This 8-point absolute drop demonstrates the necessity of multimodal expert fusion; no single grounding modality consistently suffices (Li et al., 2 Feb 2026).

Further, MoG yields both higher accuracy and lower latency, typically requiring only a single MLLM call per step, in contrast to the multi-round resolution required by SoM-centric chains.

| Configuration (50-task subset) | Success Rate |
| --- | --- |
| Full Avenir-Web (MoG) | 48.0% |
| Without MoG (visual-only grounding) | 40.0% |

6. Significance and Design Implications

Mixture-of-Grounding embeds expert specialization and multimodal fusion into web agent grounding, directly addressing brittle points in both perception (visual complexity) and structure (DOM idiosyncrasies) within modern web interfaces. The formal mixture weights enable dynamic allocation of "trust" across modalities, while end-to-end backpropagation through the fusion allows the gating to learn conditional weighting.

A plausible implication is that, as web interfaces grow in dynamism and complexity, mixture-based grounding architectures will become central for robust multimodal agents operating in real-world environments. MoG’s demonstrated advantage in both live success rates and computational efficiency establishes a practical baseline for future research on web action grounding (Li et al., 2 Feb 2026).

