VL-RouterBench: VLM Routing Benchmark
- VL-RouterBench is a large-scale, reproducible benchmark for vision–language routing that defines clear quality and cost metrics.
- It aggregates 30,540 samples across 14 datasets and assesses 17 vision–language models using detailed matrices and Pareto-frontier analysis.
- The evaluation protocol, featuring multiple router strategies and transparent cost accounting, supports systematic and comparative VLM research.
VL-RouterBench is a large-scale, reproducible benchmark specifically designed for vision–language model (VLM) routing. Its primary objective is to enable systematic and fair evaluation of model-routing strategies by controlling for both answer quality and computational cost across a comprehensive suite of multimodal tasks, grounded in real inference logs and transparent cost accounting (Huang et al., 29 Dec 2025).
1. Dataset and Model Scope
VL-RouterBench encompasses a diverse set of tasks and model families:
- Samples and Task Groups: The benchmark aggregates 30,540 samples spanning 14 datasets, covering three principal groups:
- General: MMBench (3,217), MMStar (1,500), MMMU (11,500), RealWorldQA (765), InfoVQA (30,035 questions over 5,485 images), HallusionBench (1,129)
- STEM: MathVista (6,141), MathVision (3,040), MathVerse (2,612), AI2D (≈5,000 diagrams, 15,000 questions)
- Charts & OCR: TextVQA (45,336 questions over 28,408 images), ChartQA (20,882 charts, 9,608 human questions + 23,111 auto questions), DocVQA (50,000 questions over 12,767 documents), OCRBench (1,000 questions over five OCR sub-tasks)
- Model Spectrum: Evaluation is performed over 17 vision–language models: 15 open-source (ranging from 1B to 78B parameters, e.g., Janus-Pro-1B, DeepSeek-VL2, Gemma3-27B, InternVL2.5-78B, Qwen2.5-VL-72B, SmolVLM2) and two commercial APIs, GPT-4o and Gemini-Flash-2.5. This yields a total of 519,180 sample–model pairs and 34,494,977 input/output tokens.
- Cost Tracking: For each sample–model pair, input and output token counts are logged, and costs are computed from per-model prices (in USD per 1M tokens).
2. Construction of Quality and Cost Matrices
At the core of the benchmarking protocol are explicitly defined matrices quantifying answer correctness and execution cost per sample–model pair:
- Quality Matrix (Q): For N samples and M models, Q ∈ {0,1}^{N×M} holds entries q_ij, set to 1 if a rule-based match identifies model j's answer on sample i as correct, 0 otherwise.
- Cost Matrix (C): Each entry is determined by
  c_ij = (t_in_ij · p_in_j + t_out_ij · p_out_j) / 10^6,
  where p_in_j and p_out_j are the input/output prices (USD per 1M tokens) for model j, and t_in_ij, t_out_ij are the logged token counts.
- Normalization: While raw matrices are unnormalized, cost normalization is performed for certain metrics via log scaling and min–max adjustment.
This methodology supports granular performance analysis at the sample–model level for both accuracy and economic efficiency.
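The matrix construction can be sketched as follows. This is a minimal illustration assuming a hypothetical log-tuple layout and price table, not the benchmark's actual schema:

```python
import numpy as np

def build_matrices(logs, prices, n_samples, n_models):
    """logs: iterable of (sample_idx, model_idx, correct, in_tokens, out_tokens);
    prices: {model_idx: (usd_per_1m_input, usd_per_1m_output)}."""
    Q = np.zeros((n_samples, n_models))  # binary quality matrix
    C = np.zeros((n_samples, n_models))  # per-pair cost matrix, in USD
    for i, j, correct, t_in, t_out in logs:
        Q[i, j] = 1.0 if correct else 0.0
        p_in, p_out = prices[j]
        # prices are quoted per 1M tokens, hence the division
        C[i, j] = (t_in * p_in + t_out * p_out) / 1e6
    return Q, C
```

A single log entry of 1,000 input and 500 output tokens against a model priced at \$1/\$2 per 1M tokens thus costs \$0.002.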
3. Evaluation Protocol and Metrics
The VL-RouterBench evaluation paradigm emphasizes the trade-off between quality and cost, and provides several ensemble metrics:
- Splitting: All samples are partitioned into training (70%), development (10%), and test (20%) sets.
- Soft-Label Training: For each sample i and model j, a soft label of the form
  y_ij = q_ij − α · ĉ_ij
  (correctness minus a scaled normalized cost) is used to control the trade-off between accuracy and cost, parameterized by α.
- Metrics:
- Average Accuracy (Acc): Fraction of correct answers across test samples.
- Average Cost (Cost): Mean cost across routed test samples (reported as \$ per 10K samples).
- Throughput (TP): Measured in thousands of tokens per second over routing decisions.
- Cost Normalization (Ĉ): Costs are log-scaled and min–max normalized,
  Ĉ = (log C − log C_min) / (log C_max − log C_min),
  where C_min and C_max denote the single-model minimum/maximum cost over all test samples.
- Joint Ranking Score (S): A weighted combination of accuracy and normalized cost,
  S = β · Acc + (1 − β) · (1 − Ĉ),
  with the default β chosen to put greater emphasis on accuracy.
Routers are always evaluated at multiple trade-off settings; the highest point on the (Cost, Accuracy) Pareto frontier, as measured by the ranking score, is reported per method.
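The metric pipeline above can be sketched as follows. The log-scaled min–max normalization follows the protocol's description; the accuracy weight `w = 0.7` is an assumed placeholder, not the benchmark's documented default:

```python
import numpy as np

def normalized_cost(costs, c_min, c_max):
    # log-scale, then min-max normalize against single-model extremes
    return (np.log(costs) - np.log(c_min)) / (np.log(c_max) - np.log(c_min))

def rank_score(acc, c_hat, w=0.7):
    # weighted combination of accuracy and (1 - normalized cost);
    # w is an assumed value emphasizing accuracy
    return w * acc + (1.0 - w) * (1.0 - c_hat)

def best_operating_point(points, w=0.7):
    """points: (normalized_cost, accuracy) pairs swept over the trade-off
    parameter; return the pair with the highest ranking score."""
    return max(points, key=lambda p: rank_score(p[1], p[0], w))
```

Sweeping the trade-off parameter yields a (Cost, Accuracy) curve; `best_operating_point` then picks the Pareto point reported for each router.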
4. Routing Algorithms and Baselines
Ten learned router strategies and three non-parametric baselines constitute the evaluated methods:
Baselines:
- Oracle: For each sample, selects the cheapest correct model or, if none correct, the absolute cheapest model.
- Strongest: Uses the single highest-accuracy model for all samples.
- Cheapest: Uses the lowest-cost model universally.
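The Oracle selection rule follows directly from the quality and cost matrices defined earlier (a minimal sketch, with `Q` and `C` as in Section 2):

```python
import numpy as np

def oracle_route(Q, C):
    """For each sample, pick the cheapest correct model; if no model
    answers correctly, fall back to the cheapest model overall."""
    choices = []
    for q_row, c_row in zip(Q, C):
        correct = np.where(q_row == 1)[0]
        if correct.size:
            # cheapest among the correct models
            choices.append(int(correct[np.argmin(c_row[correct])]))
        else:
            # no correct model: absolute cheapest
            choices.append(int(np.argmin(c_row)))
    return np.array(choices)
```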
- Feature-Level Routers (frozen encoders plus classifier):
- KNN, PRkNN, One-Vs-Rest (OVR), K-means, Linear, MLP.
- End-to-End Routers (fine-tune multimodal encoder + classifier):
- CosineClassifier, RouterDC (dual-contrastive), ZOOTER, VLC.
Ablation studies demonstrate that feature-level routers benefit from high-dimensional BGE-M3 and SigLIP-L-16 embeddings, and that “Normalize-Concat” fusion surpasses more complex alternatives. Among encoder backbones, LXMERT yields the strongest routing accuracy.
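As a rough illustration of a feature-level router with "Normalize-Concat" fusion, a KNN router over frozen embeddings might look like the following. The fusion and majority-vote details are simplified assumptions, not the paper's exact implementation:

```python
import numpy as np

def normalize_concat(text_emb, image_emb):
    # "Normalize-Concat" fusion: L2-normalize each modality, then concatenate
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    return np.concatenate([t, v], axis=-1)

def knn_route(train_feats, train_labels, query_feat, k=5):
    """Route to the model most often preferred among the k nearest
    training samples (labels are per-sample best-model indices)."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    nearest = np.argsort(d)[:k]
    votes = np.bincount(train_labels[nearest])
    return int(np.argmax(votes))
```

The frozen text/image encoders (e.g., BGE-M3 and SigLIP) would supply `text_emb` and `image_emb`; only the lightweight classifier on top changes across the feature-level router family.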
5. Benchmarking Results and Analysis
The following table (selected entries from the original data) summarizes the top results per routing strategy:
| Router | Avg Acc (%) | Avg Cost (\$/10K) | Rank Score | Throughput (K tok/s) |
|---|---|---|---|---|
| Oracle | 95.60 | 0.37 | 93.68 | — |
| Strongest | 78.01 | 2.72 | 68.88 | — |
| Cheapest | 62.43 | 0.14 | 64.63 | — |
| MLP | 77.49 | 1.13 | 74.23 | 146.7 |
| VLC | 78.09 | 1.23 | 74.33 | 6.74 |
| RouterDC | 77.52 | 1.04 | 74.59 | 6.31 |
| ZOOTER | 74.65 | 0.93 | 72.58 | 7.25 |
| PRkNN | 70.68 | 0.41 | 71.09 | 2.73 |
| KNN | 66.26 | 0.38 | 67.13 | 3.14 |
Key findings include:
- The best learned routers substantially outperform Strongest at lower cost: RouterDC, VLC, and MLP all achieve ranking scores around 74, compared to Strongest's 68.88.
- A significant performance gap persists between the best actual routers and the Oracle (rank score 93.68), indicating both the headroom for advances in routing logic and the inherent difficulty in achieving cost-optimal accuracy.
- Ablations show that simple fusion strategies and more expressive embeddings yield consistent improvements.
A plausible implication is that further gains require routers that can exploit finer-grained visual cues and capture nuanced textual structure, leading to more accurate and cost-efficient routing decisions.
6. Open-Source Assets and Community Impact
VL-RouterBench provides full transparency and reproducibility for vision–language routing evaluation:
- All data construction scripts, inference logs, routing model training code, and evaluation tools are publicly available (https://github.com/K1nght/VL-RouterBench).
- Step-by-step procedures support new dataset/model integration.
- Automated tools enable Pareto-frontier curve plotting and ranking score computation.
- Standardized splits and hyperparameter settings ensure fair comparison and facilitate reproducible research.
By establishing a unified protocol with comprehensive metrics on 14 real-world VLM tasks and 17 models, VL-RouterBench provides a foundation for reproducible research and direct comparison of routing strategies, supporting advances in both model and router development (Huang et al., 29 Dec 2025).