
CARA: Cross-Attention Routing Adapter

Updated 12 December 2025
  • CARA is a cost-aware LLM selection framework that uses a single-head cross-attention mechanism to model fine-grained query–model interactions.
  • It jointly predicts response quality and generation cost through dual prediction branches with an exponential reward mechanism, improving efficiency and performance.
  • Empirical evaluations on RouterBench show that CARA outperforms baseline routers by up to 6.6% in average improvement in quality (AIQ) while retaining a lightweight design.

The Cross-Attention Routing Adapter (CARA) is a unified framework for the cost-aware selection of LLMs that employs a single-head cross-attention architecture to model fine-grained query–model interactions. CARA is designed to dynamically select the most appropriate LLM for each user prompt by jointly predicting response quality and generation cost, balancing these factors through an exponential reward mechanism. The resulting router is lightweight, generalizes across domains, and demonstrates efficiency and performance improvements over established baselines on the RouterBench benchmark (Pulishetty et al., 11 Sep 2025).

1. Architectural Overview

CARA is positioned as an intermediary in front of a candidate pool of $K$ LLMs. For each incoming prompt, the system undertakes the following process:

  • Prompt Embedding: Each prompt is embedded with a fixed sentence encoder (DistilBERT), generating $q \in \mathbb{R}^{768}$.
  • Model Embeddings: Each LLM $j$ is represented by a pre-computed embedding $m_j \in \mathbb{R}^C$, where $C = 20$ is the number of prompt clusters and $m_j[k]$ reflects the model's average performance on cluster $k$.
  • Cross-Attention Block: CARA uses a single-head cross-attention mechanism parameterized by three learnable projection matrices: $W^Q \in \mathbb{R}^{d \times 768}$ (query), $W^K \in \mathbb{R}^{d \times C}$ (key), and $W^V \in \mathbb{R}^{d \times C}$ (value), where $d$ is the internal dimension (e.g., $d = 20$).
  • Batch Processing: For a batch of $B$ prompts, queries $Q \in \mathbb{R}^{B \times d}$ (stacking $W^Q q$ for each prompt), keys $K = M (W^K)^\top \in \mathbb{R}^{K \times d}$, and values $V = M (W^V)^\top \in \mathbb{R}^{K \times d}$ are computed, where $M \in \mathbb{R}^{K \times C}$ stacks the $K$ model embeddings.
  • Attention Computation: Each prompt produces a $K$-length weight vector $\alpha$ via cross-attention, which is used to aggregate information from the model embeddings.

This unified architectural design enables a "query-to-models" interaction that captures both prompt-specific and model-specific characteristics relevant for routing decisions.
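
The model embeddings themselves are computed offline from per-cluster performance statistics. The numpy sketch below shows one plausible way to assemble such a $K \times C$ embedding matrix, assuming a table of (model, cluster, quality) profiling records; the record format and helper name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: build the K x C model-embedding matrix M, where
# m_j[k] is model j's average observed quality on prompt cluster k.
K_MODELS, C_CLUSTERS = 11, 20  # 11 LLMs, 20 prompt clusters (from the text)

def build_model_embeddings(records):
    """records: iterable of (model_idx, cluster_idx, quality) triples
    collected offline on a profiling set (hypothetical format)."""
    sums = np.zeros((K_MODELS, C_CLUSTERS))
    counts = np.zeros((K_MODELS, C_CLUSTERS))
    for j, k, quality in records:
        sums[j, k] += quality
        counts[j, k] += 1
    # Average quality per (model, cluster); clusters with no data stay at 0.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

# M has shape (K, C) and only needs recomputing when the model pool changes.
M = build_model_embeddings([(0, 3, 1.0), (0, 3, 0.0), (1, 3, 1.0)])
```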

2. Mathematical Formulation

Given a prompt embedding $q \in \mathbb{R}^{768}$ and stacked model embeddings $M \in \mathbb{R}^{K \times C}$:

  • Projection:
    • $Q = W^Q q \in \mathbb{R}^d$
    • $K = M (W^K)^\top \in \mathbb{R}^{K \times d}$
    • $V = M (W^V)^\top \in \mathbb{R}^{K \times d}$
  • Attention Output:
    • $\alpha = \mathrm{softmax}\left(Q K^\top / \sqrt{d}\right) \in \mathbb{R}^{1 \times K}$
    • $h = \alpha V \in \mathbb{R}^{1 \times d}$

For model jj:

$$\alpha_j = \frac{\exp\left(Q \cdot K_j / \sqrt{d}\right)}{\sum_{i=1}^{K} \exp\left(Q \cdot K_i / \sqrt{d}\right)}$$

CARA maintains two identical attention branches with separate projection matrices for quality and cost prediction:

  • Quality Branch: Generates a summary vector $h^s$, mapped by a head $W^s$ to predicted quality scores $\hat{s}_j$ for each model.
  • Cost Branch: Generates a summary vector $h^c$, mapped by $W^c$ to predicted cost scores $\hat{c}_j$.

In implementation, $h$ is passed through a small linear layer for each branch to output the $K$-dimensional prediction vectors.
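
For concreteness, here is a minimal PyTorch sketch of the single-head cross-attention block and the two prediction branches described above. The layer sizes (768-dimensional prompt embeddings, $C = 20$, $d = 20$, $K = 11$) follow the text; the class names, use of nn.Linear, initialization, and all other details are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionBranch(nn.Module):
    """Single-head cross-attention from a prompt embedding to the K model
    embeddings, followed by a linear head emitting one score per model."""
    def __init__(self, prompt_dim=768, cluster_dim=20, d=20, num_models=11):
        super().__init__()
        self.d = d
        # nn.Linear stores weights of shape (d, in_dim), matching W^Q, W^K, W^V.
        self.W_q = nn.Linear(prompt_dim, d, bias=False)   # W^Q
        self.W_k = nn.Linear(cluster_dim, d, bias=False)  # W^K
        self.W_v = nn.Linear(cluster_dim, d, bias=False)  # W^V
        self.head = nn.Linear(d, num_models)              # maps h to K scores

    def forward(self, q, M):
        # q: (B, 768) prompt embeddings; M: (K, C) model embeddings
        Q = self.W_q(q)                                   # (B, d)
        K = self.W_k(M)                                   # (K, d)
        V = self.W_v(M)                                   # (K, d)
        alpha = F.softmax(Q @ K.T / self.d ** 0.5, dim=-1)  # (B, K) weights
        h = alpha @ V                                     # (B, d) summary
        return self.head(h)                               # (B, K) predictions

class CARA(nn.Module):
    """Two identical branches with separate parameters: one predicts quality
    scores s_hat, the other cost scores c_hat."""
    def __init__(self, **kw):
        super().__init__()
        self.quality = CrossAttentionBranch(**kw)
        self.cost = CrossAttentionBranch(**kw)

    def forward(self, q, M):
        return self.quality(q, M), self.cost(q, M)
```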

3. Joint Prediction and Exponential Reward

CARA produces model-wise predictions for:

  • Response Quality ($\hat{s}_j$): An estimate of the true task metric (e.g., accuracy, exact match) for model $j$ on the prompt.
  • Generation Cost ($\hat{c}_j$): The predicted per-token API cost of using model $j$.

At selection time, the user specifies a willingness-to-pay parameter $\lambda > 0$. For each model $j$, CARA computes the exponential trade-off reward:

$$R_2(j) = \hat{s}_j \cdot \exp\!\left(-\frac{\hat{c}_j}{\lambda}\right)$$

Model selection is performed by

$$j^* = \arg\max_j R_2(j)$$

Unlike the linear alternative $R_1 = \hat{s}_j - \frac{1}{\lambda}\hat{c}_j$, the exponential reward $R_2$ is bounded in $[0, 1]$ and exhibits reduced sensitivity to small variations in $\lambda$, as empirically validated.
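
A minimal sketch of this selection rule is shown below. The predicted scores, costs, and $\lambda$ values are purely illustrative.

```python
import numpy as np

def select_model(s_hat, c_hat, lam):
    """Pick the model maximizing the exponential reward
    R2(j) = s_hat[j] * exp(-c_hat[j] / lam).
    s_hat, c_hat: length-K arrays of predicted quality and cost;
    lam: the user's willingness-to-pay (lambda > 0)."""
    r2 = s_hat * np.exp(-c_hat / lam)
    return int(np.argmax(r2)), r2

# Hypothetical predictions for K = 3 candidate models.
s_hat = np.array([0.62, 0.78, 0.81])
c_hat = np.array([0.001, 0.004, 0.020])   # e.g. dollars per query

print(select_model(s_hat, c_hat, lam=0.002))  # small lambda: cost dominates, cheapest model wins
print(select_model(s_hat, c_hat, lam=1.0))    # large lambda: quality dominates, strongest model wins
```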

4. Training Objectives and Optimization

Separate mean squared error (MSE) losses are used for each prediction branch:

  • Quality Loss:

$$L_{\text{perf}} = \mathbb{E}_{q, j}\left[\left(\hat{s}_j(q) - s_j(q)\right)^2\right]$$

  • Cost Loss:

$$L_{\text{cost}} = \mathbb{E}_{q, j}\left[\left(\hat{c}_j(q) - c_j(q)\right)^2\right]$$

Both branches are trained with the Adam optimizer, weight decay, and a CosineAnnealingLR schedule. Experimentally:

  • Quality: learning rate $1\times10^{-3}$, weight decay $1\times10^{-5}$, 1000 epochs, batch size 1024.
  • Cost: learning rate $1\times10^{-4}$, weight decay $1\times10^{-7}$, hidden dimension $d = 20$.

Each model embedding $m_j$ is recomputed only when the candidate model pool changes; the router itself is not retrained.
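
The sketch below shows how the quality branch could be trained under this setup, reusing the CARA module sketched in Section 2. The hyperparameters follow the text; the data loader and target tensors are hypothetical placeholders, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def train_quality_branch(model, M, loader, epochs=1000):
    """Train only the quality branch with MSE against observed quality.
    model: the CARA sketch from Section 2; M: (K, C) model embeddings;
    loader: yields (q_batch, s_true) with shapes (B, 768) and (B, K)."""
    opt = torch.optim.Adam(model.quality.parameters(),
                           lr=1e-3, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for q_batch, s_true in loader:
            s_hat, _ = model(q_batch, M)
            loss = F.mse_loss(s_hat, s_true)   # L_perf
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()

# The cost branch is trained analogously with lr=1e-4, weight_decay=1e-7,
# targets c_true, and loss L_cost.
```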

5. Empirical Evaluation on RouterBench

RouterBench spans 11 LLMs across 8 benchmarks: MMLU, GSM8K, HellaSwag, ARC, Winogrande, MBPP, MT-Bench, and RAG; each sample is annotated with ground-truth response quality per LLM and corresponding API cost.

CARA's performance is compared with KNN-, MLP-, and SVM-based routers, as well as the LLM-Blender ensemble. An oracle using ground-truth $(s, c)$ establishes the upper bound.

  • Primary Metric: Average Improvement in Quality (AIQ), defined as the (normalized) area under the cost–quality Pareto frontier; see the illustrative sketch below.
  • Maximum quality across $\lambda$ is also reported.
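
Since AIQ is an area-under-curve metric, the sketch below gives one plausible way to compute it from a router's operating points across a $\lambda$ sweep. This is an interpretation for illustration only; the exact normalization used by RouterBench may differ, and the operating points are hypothetical.

```python
import numpy as np

def aiq(costs, qualities):
    """Illustrative reading of Average Improvement in Quality (AIQ): the
    normalized area under a router's cost-quality curve traced by sweeping
    lambda (exact RouterBench normalization may differ)."""
    c = np.asarray(costs, dtype=float)
    q = np.asarray(qualities, dtype=float)
    order = np.argsort(c)
    c, q = c[order], q[order]
    q = np.maximum.accumulate(q)   # keep the non-decreasing quality envelope
    area = np.sum((c[1:] - c[:-1]) * (q[1:] + q[:-1]) / 2.0)  # trapezoid rule
    return area / (c[-1] - c[0])   # normalize by the swept cost range

# Hypothetical operating points of one router across several lambda values.
print(aiq(costs=[0.5, 1.0, 2.0, 4.0], qualities=[0.60, 0.68, 0.72, 0.74]))
```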

Notable empirical findings:

Router    AIQ (Pool 1)    Max Perf (Pool 1)
CARA      0.7274          0.7808
KNN       0.7061          —
MLP       0.6760          —
SVM       0.7022          —
  • CARA achieves up to +6.6% higher AIQ and +2.9% higher maximum quality than the best predictive baseline.
  • On Pools 2–4, CARA improves AIQ by +23.9% over KNN and +27.3% over SVM.
  • The system consistently sits in the upper (cost-effective) region of cost–performance plots across domains.

6. Efficiency, Lightweight Design, and Generalization

CARA introduces minimal overhead:

  • Architecture adds only a single-head cross-attention block and two linear prediction heads.
  • The end-to-end pipeline occupies less than 1 GB of GPU memory; training completes in roughly 30 minutes on an NVIDIA A100, and inference takes roughly 10 minutes per 10,000 prompts.
  • Model embeddings are fixed; adding or removing LLMs requires only recomputing $m_j$ for the new models, not retraining the router.
  • In cross-domain and dataset-wise generalization tests (MMLU subdomains, ARC challenge), CARA matches or exceeds all baselines on the cost–quality trade-off.

A plausible implication is that CARA's modular design and reliance on stable embeddings contribute to robust transfer performance and scalability, supporting "train-and-forget" deployment across evolving LLM landscapes.

7. Summary and Context

The Cross-Attention Routing Adapter (CARA) addresses cost-aware LLM selection by leveraging a unified single-head cross-attention mechanism, explicit prediction of both model quality and generation cost, and an exponential routing reward for stability with respect to cost–quality trade-off preferences. On the RouterBench benchmark, CARA surpasses existing routers in both efficiency and empirical performance. The architectural simplicity, compatibility with fixed and extensible LLM pools, and robust generalization establish CARA as a new standard for scalable LLM routing in heterogeneous, cost-constrained environments (Pulishetty et al., 11 Sep 2025).
