VL-RouterBench: VLM Routing Benchmark
- VL-RouterBench is a large-scale, reproducible benchmark for vision–language routing that defines clear quality and cost metrics.
- It aggregates 30,540 samples across 14 datasets and assesses 17 vision–language models using detailed matrices and Pareto-frontier analysis.
- The evaluation protocol, featuring multiple router strategies and transparent cost accounting, supports systematic and comparative VLM research.
VL-RouterBench is a large-scale, reproducible benchmark specifically designed for vision–language model (VLM) routing. Its primary objective is to enable systematic and fair evaluation of model-routing strategies by controlling for both answer quality and computational cost across a comprehensive suite of multimodal tasks, grounded in real inference logs and transparent cost accounting (Huang et al., 29 Dec 2025).
1. Dataset and Model Scope
VL-RouterBench encompasses a diverse set of tasks and model families:
- Samples and Task Groups: The benchmark aggregates 30,540 samples spanning 14 datasets, covering three principal groups:
- General: MMBench (3,217), MMStar (1,500), MMMU (11,500), RealWorldQA (765), InfoVQA (30,035 questions over 5,485 images), HallusionBench (1,129)
- STEM: MathVista (6,141), MathVision (3,040), MathVerse (2,612), AI2D (≈5,000 diagrams, 15,000 questions)
- Charts & OCR: TextVQA (45,336 questions over 28,408 images), ChartQA (20,882 charts, 9,608 human questions + 23,111 auto questions), DocVQA (50,000 questions over 12,767 documents), OCRBench (1,000 questions over five OCR sub-tasks)
- Model Spectrum: Evaluation is performed over 17 vision–language models: 15 open-source (ranging from 1B to 78B parameters, e.g., Janus-Pro-1B, DeepSeek-VL2, Gemma3-27B, InternVL2.5-78B, Qwen2.5-VL-72B, SmolVLM2) and two commercial APIs, GPT-4o and Gemini-Flash-2.5. This yields a total of 519,180 sample–model pairs and 34,494,977 input/output tokens.
- Cost Tracking: For each sample–model pair, input and output token counts are logged, and costs are computed from per-model prices (in USD per 1M tokens).
2. Construction of Quality and Cost Matrices
At the core of the benchmarking protocol are explicitly defined matrices quantifying answer correctness and execution cost per sample–model pair:
- Quality Matrix (Q): For N samples and M models, Q ∈ {0,1}^{N×M} holds entries q_ij, set to 1 if a rule-based match identifies model j's answer on sample i as correct, 0 otherwise.
- Cost Matrix (C): Each entry is determined by
  c_ij = (t_in_ij · p_in_j + t_out_ij · p_out_j) / 10^6,
  where p_in_j and p_out_j are the input/output prices (USD per 1M tokens) for model j, and t_in_ij, t_out_ij are the logged token counts.
- Normalization: While raw matrices are unnormalized, cost normalization is performed for certain metrics via log scaling and min–max adjustment.
This methodology supports granular performance analysis at the sample–model level for both accuracy and economic efficiency.
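The matrix construction can be sketched as follows. This is a minimal illustration assuming a hypothetical log-tuple layout and price table, not the benchmark's actual schema:

```python
import numpy as np

def build_matrices(logs, prices, n_samples, n_models):
    """logs: iterable of (sample_idx, model_idx, correct, in_tokens, out_tokens);
    prices: {model_idx: (usd_per_1m_input, usd_per_1m_output)}."""
    Q = np.zeros((n_samples, n_models))  # binary quality matrix
    C = np.zeros((n_samples, n_models))  # per-pair cost matrix, in USD
    for i, j, correct, t_in, t_out in logs:
        Q[i, j] = 1.0 if correct else 0.0
        p_in, p_out = prices[j]
        # prices are quoted per 1M tokens, hence the division
        C[i, j] = (t_in * p_in + t_out * p_out) / 1e6
    return Q, C
```

A single log entry of 1,000 input and 500 output tokens against a model priced at \$1/\$2 per 1M tokens thus costs \$0.002.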
3. Evaluation Protocol and Metrics
The VL-RouterBench evaluation paradigm emphasizes the trade-off between quality and cost, and provides several ensemble metrics:
- Splitting: All samples are partitioned into training (70%), development (10%), and test (20%) sets.
- Soft-Label Training: For each sample i and model j, a soft label of the form
  y_ij = q_ij − α · ĉ_ij
  (correctness minus a scaled normalized cost) is used to control the trade-off between accuracy and cost, parameterized by α.
- Metrics:
- Average Accuracy (Acc): Fraction of correct answers across test samples.
- Average Cost (Cost): Mean cost across routed test samples (reported as \$ per 10K samples).
- Throughput (TP): Measured in thousands of tokens per second over routing decisions.
- Cost Normalization (Ĉ): Costs are log-scaled and min–max normalized,
  Ĉ = (log C − log C_min) / (log C_max − log C_min),
  where C_min and C_max denote the single-model minimum/maximum cost over all test samples.
- Joint Ranking Score (S): A weighted combination of accuracy and normalized cost,
  S = β · Acc + (1 − β) · (1 − Ĉ),
  with the default β chosen to put greater emphasis on accuracy.
Routers are always evaluated at multiple trade-off settings; the highest point on the (Cost, Accuracy) Pareto frontier, as measured by the ranking score, is reported per method.
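The metric pipeline above can be sketched as follows. The log-scaled min–max normalization follows the protocol's description; the accuracy weight `w = 0.7` is an assumed placeholder, not the benchmark's documented default:

```python
import numpy as np

def normalized_cost(costs, c_min, c_max):
    # log-scale, then min-max normalize against single-model extremes
    return (np.log(costs) - np.log(c_min)) / (np.log(c_max) - np.log(c_min))

def rank_score(acc, c_hat, w=0.7):
    # weighted combination of accuracy and (1 - normalized cost);
    # w is an assumed value emphasizing accuracy
    return w * acc + (1.0 - w) * (1.0 - c_hat)

def best_operating_point(points, w=0.7):
    """points: (normalized_cost, accuracy) pairs swept over the trade-off
    parameter; return the pair with the highest ranking score."""
    return max(points, key=lambda p: rank_score(p[1], p[0], w))
```

Sweeping the trade-off parameter yields a (Cost, Accuracy) curve; `best_operating_point` then picks the Pareto point reported for each router.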
4. Routing Algorithms and Baselines
Ten learned router strategies and three non-parametric baselines constitute the evaluated methods:
Baselines:
- Oracle: For each sample, selects the cheapest correct model or, if none correct, the absolute cheapest model.
- Strongest: Uses the single highest-accuracy model for all samples.
- Cheapest: Uses the lowest-cost model universally.
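The Oracle selection rule follows directly from the quality and cost matrices defined earlier (a minimal sketch, with `Q` and `C` as in Section 2):

```python
import numpy as np

def oracle_route(Q, C):
    """For each sample, pick the cheapest correct model; if no model
    answers correctly, fall back to the cheapest model overall."""
    choices = []
    for q_row, c_row in zip(Q, C):
        correct = np.where(q_row == 1)[0]
        if correct.size:
            # cheapest among the correct models
            choices.append(int(correct[np.argmin(c_row[correct])]))
        else:
            # no correct model: absolute cheapest
            choices.append(int(np.argmin(c_row)))
    return np.array(choices)
```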
- Feature-Level Routers (frozen encoders plus classifier):
- KNN, PRkNN, One-Vs-Rest (OVR), K-means, Linear, MLP.
- End-to-End Routers (fine-tune multimodal encoder + classifier):
- CosineClassifier, RouterDC (dual-contrastive), ZOOTER, VLC.
Ablation studies demonstrate that feature-level routers benefit from high-dimensional BGE-M3 and SigLIP-L-16 embeddings, and that “Normalize-Concat” fusion surpasses more complex alternatives. Among encoder backbones, LXMERT yields the strongest routing accuracy.
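As a rough illustration of a feature-level router with "Normalize-Concat" fusion, a KNN router over frozen embeddings might look like the following. The fusion and majority-vote details are simplified assumptions, not the paper's exact implementation:

```python
import numpy as np

def normalize_concat(text_emb, image_emb):
    # "Normalize-Concat" fusion: L2-normalize each modality, then concatenate
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    return np.concatenate([t, v], axis=-1)

def knn_route(train_feats, train_labels, query_feat, k=5):
    """Route to the model most often preferred among the k nearest
    training samples (labels are per-sample best-model indices)."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    nearest = np.argsort(d)[:k]
    votes = np.bincount(train_labels[nearest])
    return int(np.argmax(votes))
```

The frozen text/image encoders (e.g., BGE-M3 and SigLIP) would supply `text_emb` and `image_emb`; only the lightweight classifier on top changes across the feature-level router family.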
5. Benchmarking Results and Analysis
The following table (selected entries from the original data) summarizes the top results per routing strategy:
| Router | Avg Acc (%) | Avg Cost (\$/10K) | Rank Score | Throughput (K tok/s) |
|---|---|---|---|---|
| Oracle | 95.60 | 0.37 | 93.68 | — |
| Strongest | 78.01 | 2.72 | 68.88 | — |
| Cheapest | 62.43 | 0.14 | 64.63 | — |
| MLP | 77.49 | 1.13 | 74.23 | 146.7 |
| VLC | 78.09 | 1.23 | 74.33 | 6.74 |
| RouterDC | 77.52 | 1.04 | 74.59 | 6.31 |
| ZOOTER | 74.65 | 0.93 | 72.58 | 7.25 |
| PRkNN | 70.68 | 0.41 | 71.09 | 2.73 |
| KNN | 66.26 | 0.38 | 67.13 | 3.14 |
Key findings include:
- The best learned routers substantially outperform Strongest at lower cost: RouterDC, VLC, and MLP all achieve ranking scores around 74, compared to Strongest's 68.88.
- A significant performance gap persists between the best actual routers and the Oracle (rank score 93.68), indicating both the headroom for advances in routing logic and the inherent difficulty in achieving cost-optimal accuracy.
- Ablations show that simple fusion strategies and more expressive embeddings yield consistent improvements.
A plausible implication is that further gains require routers that can exploit finer-grained visual cues and capture nuanced textual structure, leading to more accurate and cost-efficient routing decisions.
6. Open-Source Assets and Community Impact
VL-RouterBench provides full transparency and reproducibility for vision–language routing evaluation:
- All data construction scripts, inference logs, routing model training code, and evaluation tools are publicly available (https://github.com/K1nght/VL-RouterBench).
- Step-by-step procedures support new dataset/model integration.
- Automated tools enable Pareto-frontier curve plotting and ranking score computation.
- Standardized splits and hyperparameter settings ensure fair comparison and facilitate reproducible research.
By establishing a unified protocol with comprehensive metrics on 14 real-world VLM tasks and 17 models, VL-RouterBench provides a foundation for reproducible research and direct comparison of routing strategies, supporting advances in both model and router development (Huang et al., 29 Dec 2025).