MMR-Bench: Adaptive Routing in MLLMs
- MMR-Bench is a comprehensive benchmark for adaptive multimodal routing, quantifying cost-accuracy trade-offs via pre-tabulated model outputs and standardized budget constraints.
- It employs modality-aware fusion and normalized cost evaluation to compare diverse routing policies from basic random selection to sophisticated parametric methods.
- Empirical results demonstrate that adaptive routing achieves higher accuracy at reduced costs and generalizes well across varied vision–language tasks.
MMR-Bench is a comprehensive benchmark for adaptive routing among multimodal LLMs (MLLMs). It targets the core issue in practical MLLM deployment—no single model is uniformly optimal across the heterogeneous spectrum of vision–language tasks and computational efficiency constraints. MMR-Bench presents a controlled, offline framework for quantifying and comparing routing strategies, enabling precise cost–accuracy trade-off analysis using pre-tabulated model outputs and costs over a broad suite of multimodal tasks. Its design incorporates explicit modality-aware fusion, standardized budget constraints, and representative policies from trivial bounds to parametric decision rules, defining a foundation for scalable, efficient, and budget-aware multimodal routing in real-world scenarios (Ma et al., 25 Jan 2026).
1. Motivation and Problem Scope
Vision–language workloads require diverse operational capabilities, such as lightweight OCR, complex diagram reasoning, or general visual question answering (VQA). The MLLM ecosystem is highly heterogeneous in architecture, alignment, and inference cost: compact open-weight models are efficient for routine perception, while large commercial APIs dominate complex reasoning but incur high latency or financial cost. Traditional approaches that deploy a single 'jack-of-all-trades' MLLM cannot simultaneously optimize for both utility and efficiency—forcing a trade-off between over-provisioned compute on simple tasks and loss of accuracy on difficult ones. Query-level routing addresses this by allocating model resources per-instance, but in the multimodal domain faces new challenges: joint representation of text-image signals, nonuniform model cost profiles, and lack of standardized benchmarks for budget-sensitive routing (Ma et al., 25 Jan 2026).
2. Dataset Composition and Task Suite
MMR-Bench is constructed as an offline table of multimodal instances, each paired with:
- Modality availability vectors (presence of text/image),
- Model utility vectors (the normalized accuracy or task-specific score of each candidate model),
- Model cost vectors (normalized inference cost based on provider pricing, e.g., OpenRouter \$ per 1M output tokens).
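As an illustration, one pre-tabulated instance could be stored as follows (a minimal sketch; the field names and the `oracle_choice` helper are hypothetical, not the benchmark's actual schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoutingInstance:
    """One offline benchmark record (hypothetical field names)."""
    query_id: str
    has_text: bool           # modality availability
    has_image: bool
    utilities: List[float]   # normalized score of each candidate model
    costs: List[float]       # normalized inference cost of each model

    def oracle_choice(self, lam: float = 0.0) -> int:
        """Index of the model maximizing utility minus lam * cost."""
        scores = [u - lam * c for u, c in zip(self.utilities, self.costs)]
        return max(range(len(scores)), key=scores.__getitem__)
```

Because every candidate model's output is pre-tabulated per instance, any routing policy can be evaluated offline by indexing into these vectors rather than issuing live API calls.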
Tasks cover three primary scenarios:
- Document-centric OCR and understanding: Includes benchmarks such as OCRBench and SEED-Bench V2 Plus; queries range from simple text extraction to layout-conditioned generation.
- General VQA and grounding: MMStar, RealWorldQA and related benchmarks; queries span natural images, charts, and scene analysis with both open-ended and multiple-choice formats.
- Multimodal math and diagram reasoning: MathVista, MathVerse, MathVision; queries require compositional reasoning over diagrams, equations, and text.
This variety ensures that model utility and cost profiles vary nontrivially across instances, making routing meaningful and non-degenerate (Ma et al., 25 Jan 2026).
3. Modality Fusion and Routing Feature Space
Effective multimodal routing demands representations that encode both visual and textual complexity. MMR-Bench supplies:
- Frozen CLIP-based text/image embeddings for every instance.
- Routable feature spaces: text-only, image-only, and fused multimodal representations.
Fusion is facilitated via a per-modality confidence estimator—using prototype similarity scores and embedding norms—followed by softmax weighting, inclusion of the elementwise product, and the feature difference; schematically,
z = [ w_T · x_T ; w_I · x_I ; x_T ⊙ x_I ; x_T − x_I ], with (w_T, w_I) = softmax(c_T, c_I),
where x_T and x_I are the text and image embeddings and c_T, c_I their per-modality confidence scores.
This design enables routers to exploit multimodal cues for optimal policy learning (Ma et al., 25 Jan 2026).
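Assuming fixed-dimension embedding vectors per modality (e.g., from frozen CLIP encoders), the fusion scheme described above can be sketched as follows; the function name and confidence inputs are illustrative assumptions:

```python
import numpy as np

def fuse_features(x_text, x_image, conf_text, conf_image):
    """Confidence-weighted multimodal fusion (a sketch of the scheme
    described in the text, not the benchmark's exact implementation).
    Concatenates softmax-weighted modality embeddings with their
    elementwise product and difference."""
    c = np.array([conf_text, conf_image])
    w = np.exp(c - c.max())
    w = w / w.sum()                      # softmax over modality confidences
    return np.concatenate([
        w[0] * x_text,                   # weighted text embedding
        w[1] * x_image,                  # weighted image embedding
        x_text * x_image,                # elementwise product
        x_text - x_image,                # feature difference
    ])
```

The product and difference terms let a downstream router detect agreement or divergence between modalities, a cue that text-only or image-only feature spaces cannot provide.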
4. Candidate Model Pool and Cost Normalization
The routing space is defined over a diverse pool of ten models:
| Type | Example Models | Cost Range (\$/1M tokens) |
|---|---|---|
| Commercial API | GPT-5-0807, Claude 3.7, Gemini 2.5 | \$3–\$10 |
| Open-weight VL backbone | InternVL3 78B, Qwen2.5-VL (3B, 7B, 72B), Gemma3 4B | \$0.03–\$0.14 |
Each model's raw cost per 1M output tokens is normalized across the pool (e.g., min–max scaled to [0, 1]). This provides uniform budgeting across models and queries, supporting direct computation of utility–cost trade-off metrics (Ma et al., 25 Jan 2026).
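One plausible normalization, assuming min–max scaling over the candidate pool (the paper's exact formula is not reproduced here):

```python
def normalize_costs(raw_costs):
    """Min-max scale raw $/1M-token prices into [0, 1] across the pool,
    so the cheapest model costs 0 and the most expensive costs 1.
    (A sketch of one plausible normalization, not the paper's own.)"""
    lo, hi = min(raw_costs), max(raw_costs)
    return [(c - lo) / (hi - lo) for c in raw_costs]
```

Under such a scheme, budget constraints can be stated uniformly (e.g., "average normalized cost at most 0.3") regardless of each provider's pricing units.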
5. Routing Policies and Formalism
Routing policies in MMR-Bench range from trivial to complex, including:
| Policy | Description |
|---|---|
| RandomRouter | Uniform selection over candidate models |
| OracleRouter | Selects model with true maximal utility given cost |
| KNNRouter | Nearest-neighbor selection in feature space |
| KMeansRouter | Cluster-based assignment |
| LinearRouter | Linear regression on utility/cost |
| LinearMFRouter | Low-rank matrix-factorization variant |
| MLPRouter | Shallow multi-layer perceptron |
| MLPMFRouter | Matrix-factorization style shallow MLP |
| CrossModalRouter | Cross-modal attention-based selection |
Parametric routers predict per-model utility û_m(x) and cost ĉ_m(x) from input features and select
m*(x) = argmax_m [ û_m(x) − λ ĉ_m(x) ],
where λ ≥ 0 trades off cost-sensitivity, enabling sweeps along the Pareto frontier (Ma et al., 25 Jan 2026).
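The cost-sensitivity-weighted selection rule, and the frontier sweep it enables, can be sketched as follows (function names are illustrative; `pred_utility`/`pred_cost` stand in for any router's per-model predictions):

```python
import numpy as np

def route(pred_utility, pred_cost, lam):
    """Per-query selection: argmax over models of predicted utility
    minus lam * predicted cost.
    pred_utility, pred_cost: (n_queries, n_models) arrays."""
    return np.argmax(pred_utility - lam * pred_cost, axis=1)

def pareto_sweep(true_utility, true_cost, pred_utility, pred_cost, lams):
    """Trace (mean cost, mean accuracy) points by sweeping the
    cost-sensitivity weight lam over the offline tables."""
    points = []
    for lam in lams:
        choice = route(pred_utility, pred_cost, lam)
        rows = np.arange(len(choice))
        points.append((true_cost[rows, choice].mean(),
                       true_utility[rows, choice].mean()))
    return points
```

Small lam values favor the strongest (often costliest) models; large values push selection toward cheap models, tracing the achievable cost–accuracy frontier.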
6. Evaluation Metrics and Empirical Findings
Performance metrics are defined offline by aggregating model outputs:
- Normalized AUC: the area under the router's accuracy–cost curve, traced by sweeping the cost-sensitivity weight, normalized to [0, 1].
- Peak Score: the maximum accuracy attained at any point along the sweep.
- Quality-Neutral Cost (QNC): the lowest cost at which the router's accuracy reaches A*, reported relative to c*,
where A* and c* are the accuracy and cost of the strongest single model. QNC < 1 indicates the router matches best-model accuracy at lower cost.
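As a concrete reading of the QNC definition, here is a small helper (the interpolation-free formulation and argument names are assumptions, not the paper's code):

```python
def quality_neutral_cost(curve, best_acc, best_cost):
    """Quality-Neutral Cost: cheapest point on the router's
    (cost, accuracy) curve whose accuracy matches the strongest
    single model, relative to that model's cost.
    Returns None if the router never matches best-model accuracy."""
    feasible = [c for c, a in curve if a >= best_acc]
    if not feasible:
        return None
    return min(feasible) / best_cost
```

For example, a router whose curve reaches best-model accuracy at normalized cost 0.33 against a best model costing 1.0 would report QNC = 0.33, i.e., matching accuracy at roughly a third of the cost.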
Core empirical outcomes:
- Multimodal routing outperforms unimodal: At matched cost, multimodal policies achieve higher accuracy than text- or image-only policies, especially for image-governed queries.
- Efficient Pareto gains: Adaptive routing exceeds the best single model’s accuracy at approximately 33% of its cost under tight budget constraints.
- Generalization: Routing policies trained on subsets generalize zero-shot within scenario (e.g., OCR, VQA, math) and even to text-only datasets (GSM8K, MMLU, ARC) by masking the image channel, showing modality-agnostic difficulty modeling (Ma et al., 25 Jan 2026).
7. Significance, Limitations, and Future Directions
MMR-Bench isolates multimodal model selection as an explicit research task, providing standardized, reproducible cost–accuracy evaluations. It highlights the necessity of multimodal routing for real-world deployments, where unimodal signals systematically misallocate compute. Limitations include reliance on frozen features, fixed candidate pools, and the offline evaluation nature (no 'live' API calls). A plausible implication is that future work should incorporate dynamic state, uncertainty modeling, or retraining on live feedback. The zero-shot generalization of routing policies suggests that difficulty signals learned in multimodal domains transfer robustly, motivating further work on hybrid and cross-modal architectures, dynamic budget management, and multi-agent systems for vision–LLM orchestration (Ma et al., 25 Jan 2026).
MMR-Bench defines a reference framework for adaptive and budget-aware multimodal model routing, enabling empirical benchmarking of policies, illuminating cost–utility trade-offs, and charting pathways for scalable MLLM system design and deployment.