
MMR-Bench: Adaptive Routing in MLLMs

Updated 1 February 2026
  • MMR-Bench is a comprehensive benchmark for adaptive multimodal routing, quantifying cost-accuracy trade-offs via pre-tabulated model outputs and standardized budget constraints.
  • It employs modality-aware fusion and normalized cost evaluation to compare diverse routing policies from basic random selection to sophisticated parametric methods.
  • Empirical results demonstrate that adaptive routing achieves higher accuracy at reduced costs and generalizes well across varied vision–language tasks.

MMR-Bench is a comprehensive benchmark for adaptive routing among multimodal LLMs (MLLMs). It targets a core issue in practical MLLM deployment: no single model is uniformly optimal across the heterogeneous spectrum of vision–language tasks and computational efficiency constraints. MMR-Bench presents a controlled, offline framework for quantifying and comparing routing strategies, enabling precise measurement of cost–accuracy trade-offs using pre-tabulated model outputs and costs over a broad suite of multimodal tasks. Its design incorporates explicit modality-aware fusion, standardized budget constraints, and representative policies ranging from trivial bounds to parametric decision rules, laying a foundation for scalable and efficient MLLM deployment in real-world scenarios (Ma et al., 25 Jan 2026).

1. Motivation and Problem Scope

Vision–language workloads require diverse operational capabilities, such as lightweight OCR, complex diagram reasoning, or general visual question answering (VQA). The MLLM ecosystem is highly heterogeneous in architecture, alignment, and inference cost: compact open-weight models are efficient for routine perception, while large commercial APIs dominate complex reasoning but incur high latency or financial cost. Traditional approaches that deploy a single 'jack-of-all-trades' MLLM cannot simultaneously optimize for both utility and efficiency—forcing a trade-off between over-provisioned compute on simple tasks and loss of accuracy on difficult ones. Query-level routing addresses this by allocating model resources per-instance, but in the multimodal domain faces new challenges: joint representation of text-image signals, nonuniform model cost profiles, and lack of standardized benchmarks for budget-sensitive routing (Ma et al., 25 Jan 2026).

2. Dataset Composition and Task Suite

MMR-Bench is constructed as an offline table of $\approx 11{,}000$ multimodal instances $x_i=(x^{\mathrm{text}}_i, x^{\mathrm{img}}_i)$, paired with:

  • Modality availability vectors $\mathbf{m}_i\in\{0,1\}^2$ (presence of text/image),
  • Model utility vectors $\mathbf{u}_i=(u_{i,1},\dots,u_{i,K})\in[0,1]^K$ (normalized accuracy or task-specific score for each of the $K$ candidate models),
  • Model cost vectors $\mathbf{c}_i=(c_{i,1},\dots,c_{i,K})\in\mathbb{R}_+^K$ (normalized inference cost based on provider pricing, e.g., OpenRouter \$ per 1M output tokens).
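
Under these definitions, one record of the offline table can be sketched as follows. This is a minimal illustration; the field names and values are hypothetical, not the benchmark's actual schema:

```python
import numpy as np

K = 10  # size of the candidate model pool

# Hypothetical record mirroring the offline-table layout described above:
# a modality mask m_i, per-model utilities u_i in [0,1], and per-model
# normalized costs c_i. Values here are randomly generated for illustration.
rng = np.random.default_rng(0)
instance = {
    "m": np.array([1, 1]),          # both text and image present
    "u": rng.uniform(0.0, 1.0, K),  # utility of each of the K models
    "c": rng.uniform(0.0, 1.0, K),  # normalized inference cost per model
}
```

A routing policy consumes the instance features and this table, never the live models, which is what makes evaluation reproducible.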

Tasks cover three primary scenarios:

  1. Document-centric OCR and understanding: Includes benchmarks such as OCRBench and SEED-Bench V2 Plus; queries range from simple text extraction to layout-conditioned generation.
  2. General VQA and grounding: MMStar, RealWorldQA and related benchmarks; queries span natural images, charts, and scene analysis with both open-ended and multiple-choice formats.
  3. Multimodal math and diagram reasoning: MathVista, MathVerse, MathVision; queries require compositional reasoning over diagrams, equations, and text.

This variety ensures that model utility and cost profiles vary nontrivially across instances, making routing meaningful and non-degenerate (Ma et al., 25 Jan 2026).

3. Modality Fusion and Routing Feature Space

Effective multimodal routing demands representations that encode both visual and textual complexity. MMR-Bench supplies:

  • Frozen CLIP-based text/image embeddings for every instance.
  • Routable feature spaces: ϕtext(x)\phi_{\mathrm{text}}(x) (text-only), ϕimg(x)\phi_{\mathrm{img}}(x) (image-only), and fused multimodal ϕmm(x)\phi_{\mathrm{mm}}(x).

Fusion relies on a per-modality confidence estimator (prototype similarity scores and embedding norms), whose outputs are softmax-normalized into modality weights; the fused feature additionally includes an elementwise-product term and a feature-difference term:

$$z_i = w_{\mathrm{txt}}\,x_{\mathrm{txt}} + w_{\mathrm{img}}\,x_{\mathrm{img}} + \alpha\,(x_{\mathrm{txt}} \odot x_{\mathrm{img}}) + \beta\,|x_{\mathrm{txt}} - x_{\mathrm{img}}|$$

This design enables routers to exploit multimodal cues for optimal policy learning (Ma et al., 25 Jan 2026).
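A minimal sketch of such confidence-weighted fusion follows. It assumes softmax weights over two scalar per-modality confidence scores; the `alpha` and `beta` defaults and the confidence inputs are illustrative, not the paper's exact estimator:

```python
import numpy as np

def fuse(x_txt, x_img, conf_txt, conf_img, alpha=0.5, beta=0.5):
    """Fuse text/image embeddings into z_i as in the equation above:
    softmax-weighted sum plus elementwise product and absolute difference."""
    e = np.exp([conf_txt, conf_img])       # softmax over the two confidences
    w_txt, w_img = e / e.sum()
    return (w_txt * x_txt + w_img * x_img
            + alpha * (x_txt * x_img)      # elementwise-product term
            + beta * np.abs(x_txt - x_img))  # feature-difference term
```

With equal confidences the modality weights are 0.5 each, so the fused vector degrades gracefully when one modality is uninformative.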

4. Candidate Model Pool and Cost Normalization

The routing space is defined over a diverse pool of ten models:

| Type | Example Models | Cost Range (\$/1M tokens) |
|---|---|---|
| Commercial API | GPT-5-0807, Claude 3.7, Gemini 2.5 | \$3–\$10 |
| Open-weight VL backbone | InternVL3 78B, Qwen2.5-VL (3B, 7B, 72B), Gemma3 4B | \$0.03–\$0.14 |

Each model's raw cost per 1M output tokens is normalized: $c_{i,j} = \frac{\mathrm{Cost}_{\mathrm{raw}}(j)}{\max_{k}\mathrm{Cost}_{\mathrm{raw}}(k)}$. This provides uniform budgeting across models and queries, supporting direct metric computation of utility–cost trade-offs (Ma et al., 25 Jan 2026).
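The normalization is a one-liner in practice. Model names and prices below are hypothetical, chosen only to show the scaling:

```python
# Hypothetical raw prices in $/1M output tokens (illustrative values,
# not the benchmark's actual pricing table).
raw_cost = {"api_large": 10.0, "api_mid": 3.0, "open_7b": 0.05}

c_max = max(raw_cost.values())
norm_cost = {model: cost / c_max for model, cost in raw_cost.items()}
# The most expensive model maps to 1.0; every other model lands in (0, 1].
```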

5. Routing Policies and Formalism

Routing policies in MMR-Bench range from trivial to complex, including:

| Policy | Description |
|---|---|
| RandomRouter | Uniform selection over candidate models |
| OracleRouter | Selects model with true maximal utility given cost |
| KNNRouter | Nearest-neighbor selection in feature space |
| KMeansRouter | Cluster-based assignment |
| LinearRouter | Linear regression on utility/cost |
| LinearMFRouter | Low-rank matrix-factorization variant |
| MLPRouter | Shallow multi-layer perceptron |
| MLPMFRouter | Matrix-factorization style shallow MLP |
| CrossModalRouter | Cross-modal attention-based selection |
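
A nearest-neighbor policy of the kind listed above can be sketched as follows: estimate each candidate model's utility and cost from the k most similar training queries, then pick the best trade-off. This is an illustrative sketch, not the benchmark's exact implementation; the `lam` trade-off parameter is an assumption carried over from the parametric formulation:

```python
import numpy as np

def knn_route(z_query, Z_train, U_train, C_train, k=5, lam=0.5):
    """KNN-style router sketch: average the observed per-model utilities
    and costs of the k nearest training queries, then select a model."""
    d = np.linalg.norm(Z_train - z_query, axis=1)  # distances to train queries
    nn = np.argsort(d)[:k]                         # indices of k nearest
    u_hat = U_train[nn].mean(axis=0)               # estimated utility per model
    c_hat = C_train[nn].mean(axis=0)               # estimated cost per model
    return int(np.argmin(1.0 - u_hat + lam * c_hat))
```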

Parametric routers predict $\{\hat u_{i,j}, \hat c_{i,j}\}$ from input features $z_i$ and select:

$$j^\star(i;\lambda) = \arg\min_j\, [1 - \hat u_{i,j} + \lambda\,\hat c_{i,j}]$$

where $\lambda \ge 0$ trades off cost-sensitivity, enabling sweeps along the Pareto frontier $p(c)$ (Ma et al., 25 Jan 2026).
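The selection rule is a single argmin over the candidate pool. The predicted values below are hypothetical; in the benchmark they would come from a trained router head:

```python
import numpy as np

def route(u_hat, c_hat, lam):
    """Pick j* = argmin_j [1 - u_hat_j + lam * c_hat_j], as defined above."""
    return int(np.argmin(1.0 - np.asarray(u_hat) + lam * np.asarray(c_hat)))

u_hat = [0.9, 0.6]  # predicted utilities (hypothetical)
c_hat = [1.0, 0.1]  # predicted normalized costs (hypothetical)

route(u_hat, c_hat, lam=0.0)  # accuracy-only: selects the strong, costly model
route(u_hat, c_hat, lam=1.0)  # cost-sensitive: selects the cheap model
```

Sweeping `lam` from 0 upward traces out the router's cost–accuracy curve, which is how the Pareto frontier $p(c)$ is populated.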

6. Evaluation Metrics and Empirical Findings

Performance metrics are defined offline by aggregating model outputs:

  • Normalized AUC:

$$\mathrm{nAUC} = \frac{1}{c_{\max}-c_{\min}} \int_{c_{\min}}^{c_{\max}} p(c)\,\mathrm{d}c$$

  • Peak Score:

$$P_s = \max_c\, p(c)$$

  • Quality-Neutral Cost (QNC):

$$\mathrm{QNC} = \frac{1}{c_{\mathrm{best}}}\,\min\{c : p(c) \geq p_{\mathrm{best}}\}$$

where $p_{\mathrm{best}}$ and $c_{\mathrm{best}}$ are the accuracy and cost of the strongest single model; $\mathrm{QNC} < 1$ indicates the router matches best-model accuracy at lower cost.
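
Given a discretely sampled curve $p(c)$, these metrics can be computed as follows. This is a sketch under the assumption of trapezoidal integration over sorted cost samples:

```python
import numpy as np

def n_auc(costs, scores):
    """Normalized area under the cost-accuracy curve p(c) (trapezoidal rule)."""
    costs, scores = np.asarray(costs), np.asarray(scores)
    area = float(np.sum(0.5 * (scores[1:] + scores[:-1]) * np.diff(costs)))
    return area / float(costs[-1] - costs[0])

def qnc(costs, scores, p_best, c_best):
    """Quality-neutral cost: cheapest budget at which the router reaches the
    best single model's accuracy, relative to that model's cost."""
    costs, scores = np.asarray(costs), np.asarray(scores)
    feasible = costs[scores >= p_best]
    return float(feasible.min()) / c_best if feasible.size else float("inf")
```

QNC returns infinity when the router never reaches best-model accuracy, which keeps the metric well defined for weak policies.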

Core empirical outcomes:

  • Multimodal routing outperforms unimodal: At matched cost, multimodal policies achieve higher accuracy than text- or image-only policies, especially for image-governed queries.
  • Efficient Pareto gains: Adaptive routing exceeds the best single model’s accuracy at approximately 33% of its cost under tight budget constraints.
  • Generalization: Routing policies trained on subsets generalize zero-shot within scenario (e.g., OCR, VQA, math) and even to text-only datasets (GSM8K, MMLU, ARC) by masking the image channel, showing modality-agnostic difficulty modeling (Ma et al., 25 Jan 2026).

7. Significance, Limitations, and Future Directions

MMR-Bench isolates multimodal model selection as an explicit research task, providing standardized, reproducible cost–accuracy evaluations. It highlights the necessity of multimodal routing for real-world deployments, where unimodal signals systematically misallocate compute. Limitations include reliance on frozen features, fixed candidate pools, and its offline evaluation setting (no live API calls). A plausible implication is that future work should incorporate dynamic state, uncertainty modeling, or retraining on live feedback. The zero-shot generalization of routing policies suggests that difficulty signals learned in multimodal domains transfer robustly, motivating further work on hybrid and cross-modal architectures, dynamic budget management, and multi-agent systems for vision–LLM orchestration (Ma et al., 25 Jan 2026).


MMR-Bench defines a reference framework for adaptive and budget-aware multimodal model routing, enabling empirical benchmarking of policies, illuminating cost–utility trade-offs, and charting pathways for scalable MLLM system design and deployment.
