CARA: Cross-Attention Routing Adapter
- CARA is a cost-aware LLM selection framework that uses a single-head cross-attention mechanism to model fine-grained query–model interactions.
- It jointly predicts response quality and generation cost through dual prediction branches with an exponential reward mechanism, improving efficiency and performance.
- Empirical evaluations on RouterBench show CARA outperforms baseline routers by up to 6.6% in response quality while retaining a lightweight design.
The Cross-Attention Routing Adapter (CARA) is a unified framework for the cost-aware selection of LLMs that employs a single-head cross-attention architecture to model fine-grained query–model interactions. CARA is designed to dynamically select the most appropriate LLM for each user prompt by jointly predicting response quality and generation cost, balancing these factors through an exponential reward mechanism. The resulting router is lightweight, generalizes across domains, and demonstrates efficiency and performance improvements over established baselines on the RouterBench benchmark (Pulishetty et al., 11 Sep 2025).
1. Architectural Overview
CARA is positioned as an intermediary in front of a candidate pool of LLMs. For each incoming prompt, the system undertakes the following process:
- Prompt Embedding: Each prompt is embedded with a fixed sentence encoder (DistilBERT), generating a prompt embedding $q \in \mathbb{R}^{d}$.
- Model Embeddings: Each LLM $m$ is represented by a pre-computed embedding $e_m \in \mathbb{R}^{C}$, where $C$ corresponds to the number of prompt clusters, with entry $e_{m,c}$ reflecting the model's average performance on cluster $c$.
- Cross-Attention Block: CARA uses a single-head cross-attention mechanism parameterized by three learnable projection matrices: $W_Q$ (query), $W_K$ (key), and $W_V$ (value), where $d_k$ is the internal attention dimension.
- Batch Processing: For a batch of $B$ prompts with stacked embeddings $X \in \mathbb{R}^{B \times d}$, queries $Q = XW_Q$, keys $K = EW_K$, and values $V = EW_V$ are computed, with $E \in \mathbb{R}^{N \times C}$ stacking the $N$ model embeddings.
- Attention Computation: Each prompt produces an $N$-length attention weight vector over the candidate models via cross-attention, which is used to aggregate information from the model embeddings.
This unified architectural design enables a "query-to-models" interaction that captures both prompt-specific and model-specific characteristics relevant for routing decisions.
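To make the data flow concrete, the following is a minimal PyTorch sketch of the cross-attention block. The class name, bias-free projections, and dimension handling are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of CARA's "query-to-models" cross-attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionBlock(nn.Module):
    def __init__(self, d: int, C: int, d_k: int):
        super().__init__()
        self.W_Q = nn.Linear(d, d_k, bias=False)   # projects prompt embeddings
        self.W_K = nn.Linear(C, d_k, bias=False)   # projects model embeddings
        self.W_V = nn.Linear(C, d_k, bias=False)
        self.d_k = d_k

    def forward(self, prompts: torch.Tensor, models: torch.Tensor) -> torch.Tensor:
        # prompts: (B, d) sentence-encoder embeddings
        # models:  (N, C) cluster-performance embeddings
        Q = self.W_Q(prompts)                      # (B, d_k)
        K = self.W_K(models)                       # (N, d_k)
        V = self.W_V(models)                       # (N, d_k)
        attn = F.softmax(Q @ K.T / self.d_k ** 0.5, dim=-1)  # (B, N) weights over models
        return attn @ V                            # (B, d_k) summary per prompt
```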
2. Mathematical Formulation
Given a prompt embedding $q \in \mathbb{R}^{d}$ and stacked model embeddings $E \in \mathbb{R}^{N \times C}$:
- Projection: $Q = qW_Q$, $K = EW_K$, $V = EW_V$
- Attention Output: $z = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
For model $i$, the attention weight is $\alpha_i = \frac{\exp\left(QK_i^{\top}/\sqrt{d_k}\right)}{\sum_{j=1}^{N}\exp\left(QK_j^{\top}/\sqrt{d_k}\right)}$, so the summary vector is $z = \sum_{i=1}^{N}\alpha_i V_i$.
CARA maintains two identical attention branches with separate projection matrices for quality and cost prediction:
- Quality Branch: Generates a summary vector $z^{(q)}$ mapped by a head $h_q$ to predicted quality scores $\hat{Q} \in \mathbb{R}^{N}$, one per model.
- Cost Branch: Generates a summary vector $z^{(c)}$ mapped by $h_c$ to predicted cost scores $\hat{C} \in \mathbb{R}^{N}$.
In implementation, each summary vector is passed through a small linear layer per branch to output the $N$-dimensional prediction vectors.
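A sketch of the dual-branch design, reusing `CrossAttentionBlock` from above: two independent attention blocks, each followed by a linear head mapping the $d_k$-dimensional summary to $N$ per-model scores. The class and attribute names are hypothetical.

```python
# Dual-branch router sketch: separate attention parameters per branch.
class CARARouter(nn.Module):
    def __init__(self, d: int, C: int, d_k: int, N: int):
        super().__init__()
        self.quality_attn = CrossAttentionBlock(d, C, d_k)
        self.cost_attn = CrossAttentionBlock(d, C, d_k)
        self.quality_head = nn.Linear(d_k, N)   # h_q: predicted quality per model
        self.cost_head = nn.Linear(d_k, N)      # h_c: predicted cost per model

    def forward(self, prompts: torch.Tensor, models: torch.Tensor):
        q_hat = self.quality_head(self.quality_attn(prompts, models))  # (B, N)
        c_hat = self.cost_head(self.cost_attn(prompts, models))        # (B, N)
        return q_hat, c_hat
```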
3. Joint Prediction and Exponential Reward
CARA produces model-wise predictions for:
- Response Quality ($\hat{Q}_i$): An estimate of the true task metric (e.g., accuracy, exact match) for model $i$ on the prompt.
- Generation Cost ($\hat{C}_i$): The predicted per-token API cost of using model $i$.
At selection time, a user specifies a willingness-to-pay parameter $\lambda \ge 0$. For each model $i$, CARA computes the exponential trade-off reward
$$R_i = \hat{Q}_i \exp\left(-\lambda \hat{C}_i\right).$$
Model selection is performed by $m^{*} = \arg\max_{i} R_i$.
Unlike the linear alternative $\hat{Q}_i - \lambda \hat{C}_i$, the exponential reward is bounded in $[0, 1]$ (for quality scores in $[0, 1]$) and exhibits reduced sensitivity to small variations in $\lambda$, as empirically validated.
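A minimal sketch of the selection rule under the reward form given above; `lam` stands for the user's willingness-to-pay $\lambda$.

```python
# Selection via the exponential trade-off reward R_i = Q_i * exp(-lambda * C_i).
def select_model(q_hat: torch.Tensor, c_hat: torch.Tensor, lam: float) -> torch.Tensor:
    reward = q_hat * torch.exp(-lam * c_hat)   # (B, N); bounded in [0, 1] when q_hat is
    return reward.argmax(dim=-1)               # index of the chosen LLM per prompt
```

Setting `lam = 0` recovers pure quality maximization, while larger values push selection toward cheaper models.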
4. Training Objectives and Optimization
Separate mean squared error (MSE) losses are used for each prediction branch:
- Quality Loss: $\mathcal{L}_{\text{quality}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{Q}_i - Q_i\right)^{2}$
- Cost Loss: $\mathcal{L}_{\text{cost}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{C}_i - C_i\right)^{2}$
Both branches are trained with the Adam optimizer, weight decay, and a CosineAnnealingLR schedule. Experimentally:
- Quality branch: trained for 1000 epochs with batch size 1024, with its own tuned learning rate and weight decay.
- Cost branch: trained with a separately tuned learning rate, weight decay, and hidden dimension.
Model embeddings are only recomputed when the candidate models change; the router itself is not retrained.
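A training sketch under the stated setup (Adam with weight decay, cosine annealing, MSE per branch), continuing from the `CARARouter` sketch in Section 2. The learning rate, weight decay, dimensions, and random tensors are placeholders; the paper tunes each branch separately.

```python
# Illustrative training loop; hyperparameter values are not the paper's.
B, N, d, C, d_k = 1024, 11, 768, 64, 128            # illustrative dimensions
prompts = torch.randn(B, d)                         # stand-in prompt embeddings
model_emb = torch.randn(N, C)                       # stand-in model embeddings
q_true, c_true = torch.rand(B, N), torch.rand(B, N) # stand-in labels

router = CARARouter(d, C, d_k, N)
opt = torch.optim.Adam(router.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

for epoch in range(1000):
    q_hat, c_hat = router(prompts, model_emb)
    loss = F.mse_loss(q_hat, q_true) + F.mse_loss(c_hat, c_true)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```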
5. Empirical Evaluation on RouterBench
RouterBench spans 11 LLMs across 8 benchmarks: MMLU, GSM8K, HellaSwag, ARC, Winogrande, MBPP, MT-Bench, and RAG; each sample is annotated with ground-truth response quality per LLM and corresponding API cost.
CARA's performance is compared with KNN-, MLP-, and SVM-based routers, and with the LLM-Blender ensemble. An oracle with access to ground-truth labels establishes the upper bound.
- Primary Metric: Average Improvement in Quality (AIQ), defined as the normalized area under the cost–quality Pareto frontier (see the sketch after this list).
- Maximum Quality attained across the sweep of $\lambda$ values is also reported.
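A rough sketch of computing AIQ from a sweep of operating points, assuming trapezoidal integration and normalization by the swept cost range; RouterBench's exact normalization may differ.

```python
# Approximate AIQ: normalized area under the cost-quality curve.
import numpy as np

def aiq(costs: np.ndarray, qualities: np.ndarray) -> float:
    order = np.argsort(costs)            # sort operating points by cost
    c, q = costs[order], qualities[order]
    area = np.trapz(q, c)                # trapezoidal area under the curve
    return float(area / (c[-1] - c[0]))  # normalize by the swept cost range
```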
Notable empirical findings:
| Router | AIQ (Pool 1) | Max Quality (Pool 1) |
|---|---|---|
| CARA | 0.7274 | 0.7808 |
| KNN | 0.7061 | — |
| MLP | 0.6760 | — |
| SVM | 0.7022 | — |
- CARA improves both AIQ and maximum quality over the best predictive baseline, with quality gains of up to 6.6%.
- On Pools 2–4, CARA improves AIQ over both the KNN and SVM routers.
- The system consistently sits at the upper (cost-effective) region of cost–performance plots across domains.
6. Efficiency, Lightweight Design, and Generalization
CARA introduces minimal overhead:
- Architecture adds only a single-head cross-attention block and two linear prediction heads.
- The full pipeline occupies less than 1 GB of GPU memory; training completes in 30 minutes on an NVIDIA A100, and inference requires 10 minutes per 10,000 prompts.
- Model embeddings are fixed; adding or removing LLMs requires only recomputing $e_m$ for the new models, not retraining the router (see the sketch after this list).
- In cross-domain and dataset-wise generalization (MMLU subdomains, ARC challenge), CARA matches or exceeds all baselines on the cost–quality trade-off.
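As referenced in the list above, here is a sketch of how a new model's embedding $e_m$ could be computed as its average quality per prompt cluster; the cluster assignments and scoring inputs are hypothetical, consistent with the description in Section 1.

```python
# Compute e_m for a newly added LLM from per-prompt evaluation results.
import numpy as np

def model_embedding(cluster_ids: np.ndarray, scores: np.ndarray, C: int) -> np.ndarray:
    # cluster_ids: cluster index of each evaluation prompt, in [0, C)
    # scores:      the model's quality score on each prompt
    e_m = np.zeros(C)
    for c in range(C):
        mask = cluster_ids == c
        if mask.any():
            e_m[c] = scores[mask].mean()  # average performance on cluster c
    return e_m
```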
A plausible implication is that CARA's modular design and reliance on stable embeddings contribute to robust transfer performance and scalability, supporting "train-and-forget" deployment across evolving LLM landscapes.
7. Summary and Context
The Cross-Attention Routing Adapter (CARA) addresses cost-aware LLM selection by leveraging a unified single-head cross-attention mechanism, explicit prediction of both model quality and generation cost, and an exponential routing reward for stability with respect to cost–quality trade-off preferences. On the RouterBench benchmark, CARA surpasses existing routers in both efficiency and empirical performance. The architectural simplicity, compatibility with fixed and extensible LLM pools, and robust generalization establish CARA as a new standard for scalable LLM routing in heterogeneous, cost-constrained environments (Pulishetty et al., 11 Sep 2025).