
CARA: Cross-Attention Routing Adapter

Updated 12 December 2025
  • CARA is a cost-aware LLM selection framework that uses a single-head cross-attention mechanism to model fine-grained query–model interactions.
  • It jointly predicts response quality and generation cost through dual prediction branches with an exponential reward mechanism, improving efficiency and performance.
  • Empirical evaluations on RouterBench show that CARA outperforms baseline routers by up to 6.6% in average improvement in quality (AIQ) while retaining a lightweight design.

The Cross-Attention Routing Adapter (CARA) is a unified framework for the cost-aware selection of LLMs that employs a single-head cross-attention architecture to model fine-grained query–model interactions. CARA is designed to dynamically select the most appropriate LLM for each user prompt by jointly predicting response quality and generation cost, balancing these factors through an exponential reward mechanism. The resulting router is lightweight, generalizes across domains, and demonstrates efficiency and performance improvements over established baselines on the RouterBench benchmark (Pulishetty et al., 11 Sep 2025).

1. Architectural Overview

CARA is positioned as an intermediary in front of a candidate pool of $K$ LLMs. For each incoming prompt, the system undertakes the following process:

  • Prompt Embedding: Each prompt is embedded with a fixed sentence encoder (DistilBERT), generating $q \in \mathbb{R}^{768}$.
  • Model Embeddings: Each LLM $j$ is represented by a pre-computed embedding $m_j \in \mathbb{R}^C$, where $C = 20$ is the number of prompt clusters and $m_j[k]$ reflects the model's average performance on cluster $k$.
  • Cross-Attention Block: CARA uses a single-head cross-attention mechanism parameterized by three learnable projection matrices: $W^Q \in \mathbb{R}^{d \times 768}$ (query), $W^K \in \mathbb{R}^{d \times C}$ (key), and $W^V \in \mathbb{R}^{d \times C}$ (value), where $d$ is the internal dimension (e.g., $d = 20$).
  • Batch Processing: For a batch of $B$ prompts, queries $Q \in \mathbb{R}^{B \times d}$ (stacking $W^Q q$ for each prompt), keys $K = M (W^K)^\top \in \mathbb{R}^{K \times d}$, and values $V = M (W^V)^\top \in \mathbb{R}^{K \times d}$ are computed, where $M \in \mathbb{R}^{K \times C}$ stacks the $K$ model embeddings.
  • Attention Computation: Each prompt produces a $K$-length weight vector $\alpha$ via cross-attention, which is used to aggregate information from the model embeddings.

This unified architectural design enables a "query-to-models" interaction that captures both prompt-specific and model-specific characteristics relevant for routing decisions.
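
The model embeddings themselves are computed offline from per-cluster performance statistics. The numpy sketch below shows one plausible way to assemble such a $K \times C$ embedding matrix, assuming a table of (model, cluster, quality) profiling records; the record format and helper name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: build the K x C model-embedding matrix M, where
# m_j[k] is model j's average observed quality on prompt cluster k.
K_MODELS, C_CLUSTERS = 11, 20  # 11 LLMs, 20 prompt clusters (from the text)

def build_model_embeddings(records):
    """records: iterable of (model_idx, cluster_idx, quality) triples
    collected offline on a profiling set (hypothetical format)."""
    sums = np.zeros((K_MODELS, C_CLUSTERS))
    counts = np.zeros((K_MODELS, C_CLUSTERS))
    for j, k, quality in records:
        sums[j, k] += quality
        counts[j, k] += 1
    # Average quality per (model, cluster); clusters with no data stay at 0.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

# M has shape (K, C) and only needs recomputing when the model pool changes.
M = build_model_embeddings([(0, 3, 1.0), (0, 3, 0.0), (1, 3, 1.0)])
```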

2. Mathematical Formulation

Given a prompt embedding $q \in \mathbb{R}^{768}$ and stacked model embeddings $M \in \mathbb{R}^{K \times C}$:

  • Projection:
    • $Q = W^Q q \in \mathbb{R}^d$
    • $K = M (W^K)^\top \in \mathbb{R}^{K \times d}$
    • $V = M (W^V)^\top \in \mathbb{R}^{K \times d}$
  • Attention Output:
    • $\alpha = \mathrm{softmax}\left(Q K^\top / \sqrt{d}\right) \in \mathbb{R}^{1 \times K}$
    • $h = \alpha V \in \mathbb{R}^{1 \times d}$

For model jj:

$$\alpha_j = \frac{\exp\left(Q \cdot K_j / \sqrt{d}\right)}{\sum_{i=1}^{K} \exp\left(Q \cdot K_i / \sqrt{d}\right)}$$

CARA maintains two identical attention branches with separate projection matrices for quality and cost prediction:

  • Quality Branch: Generates a summary vector $h^s$, mapped by a head $W^s$ to predicted quality scores $\hat{s}_j$ for each model.
  • Cost Branch: Generates a summary vector $h^c$, mapped by $W^c$ to predicted cost scores $\hat{c}_j$.

In implementation, $h$ is passed through a small linear layer for each branch to output the $K$-dimensional prediction vectors.
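
For concreteness, here is a minimal PyTorch sketch of the single-head cross-attention block and the two prediction branches described above. The layer sizes (768-dimensional prompt embeddings, $C = 20$, $d = 20$, $K = 11$) follow the text; the class names, use of nn.Linear, initialization, and all other details are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionBranch(nn.Module):
    """Single-head cross-attention from a prompt embedding to the K model
    embeddings, followed by a linear head emitting one score per model."""
    def __init__(self, prompt_dim=768, cluster_dim=20, d=20, num_models=11):
        super().__init__()
        self.d = d
        # nn.Linear stores weights of shape (d, in_dim), matching W^Q, W^K, W^V.
        self.W_q = nn.Linear(prompt_dim, d, bias=False)   # W^Q
        self.W_k = nn.Linear(cluster_dim, d, bias=False)  # W^K
        self.W_v = nn.Linear(cluster_dim, d, bias=False)  # W^V
        self.head = nn.Linear(d, num_models)              # maps h to K scores

    def forward(self, q, M):
        # q: (B, 768) prompt embeddings; M: (K, C) model embeddings
        Q = self.W_q(q)                                   # (B, d)
        K = self.W_k(M)                                   # (K, d)
        V = self.W_v(M)                                   # (K, d)
        alpha = F.softmax(Q @ K.T / self.d ** 0.5, dim=-1)  # (B, K) weights
        h = alpha @ V                                     # (B, d) summary
        return self.head(h)                               # (B, K) predictions

class CARA(nn.Module):
    """Two identical branches with separate parameters: one predicts quality
    scores s_hat, the other cost scores c_hat."""
    def __init__(self, **kw):
        super().__init__()
        self.quality = CrossAttentionBranch(**kw)
        self.cost = CrossAttentionBranch(**kw)

    def forward(self, q, M):
        return self.quality(q, M), self.cost(q, M)
```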

3. Joint Prediction and Exponential Reward

CARA produces model-wise predictions for:

  • Response Quality ($\hat{s}_j$): An estimate of the true task metric (e.g., accuracy, exact match) for model $j$ on the prompt.
  • Generation Cost ($\hat{c}_j$): The predicted per-token API cost of using model $j$.

At selection time, the user specifies a willingness-to-pay parameter $\lambda > 0$. For each model $j$, CARA computes the exponential trade-off reward:

$$R_2(j) = \hat{s}_j \cdot \exp\!\left(-\frac{\hat{c}_j}{\lambda}\right)$$

Model selection is performed by

$$j^* = \arg\max_j R_2(j)$$

Unlike the linear alternative $R_1 = \hat{s}_j - \frac{1}{\lambda}\hat{c}_j$, the exponential reward $R_2$ is bounded in $[0, 1]$ and exhibits reduced sensitivity to small variations in $\lambda$, as empirically validated.
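
A minimal sketch of this selection rule is shown below. The predicted scores, costs, and $\lambda$ values are purely illustrative.

```python
import numpy as np

def select_model(s_hat, c_hat, lam):
    """Pick the model maximizing the exponential reward
    R2(j) = s_hat[j] * exp(-c_hat[j] / lam).
    s_hat, c_hat: length-K arrays of predicted quality and cost;
    lam: the user's willingness-to-pay (lambda > 0)."""
    r2 = s_hat * np.exp(-c_hat / lam)
    return int(np.argmax(r2)), r2

# Hypothetical predictions for K = 3 candidate models.
s_hat = np.array([0.62, 0.78, 0.81])
c_hat = np.array([0.001, 0.004, 0.020])   # e.g. dollars per query

print(select_model(s_hat, c_hat, lam=0.002))  # small lambda: cost dominates, cheapest model wins
print(select_model(s_hat, c_hat, lam=1.0))    # large lambda: quality dominates, strongest model wins
```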

4. Training Objectives and Optimization

Separate mean squared error (MSE) losses are used for each prediction branch:

  • Quality Loss:

$$L_{\text{perf}} = \mathbb{E}_{q, j}\left[\left(\hat{s}_j(q) - s_j(q)\right)^2\right]$$

  • Cost Loss:

$$L_{\text{cost}} = \mathbb{E}_{q, j}\left[\left(\hat{c}_j(q) - c_j(q)\right)^2\right]$$

Both branches are trained with the Adam optimizer, weight decay, and a CosineAnnealingLR schedule. Experimentally:

  • Quality: learning rate $1\times10^{-3}$, weight decay $1\times10^{-5}$, 1000 epochs, batch size 1024.
  • Cost: learning rate $1\times10^{-4}$, weight decay $1\times10^{-7}$, hidden dimension $d = 20$.

Each model embedding $m_j$ is recomputed only when the candidate model pool changes; the router itself is not retrained.
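
The sketch below shows how the quality branch could be trained under this setup, reusing the CARA module sketched in Section 2. The hyperparameters follow the text; the data loader and target tensors are hypothetical placeholders, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def train_quality_branch(model, M, loader, epochs=1000):
    """Train only the quality branch with MSE against observed quality.
    model: the CARA sketch from Section 2; M: (K, C) model embeddings;
    loader: yields (q_batch, s_true) with shapes (B, 768) and (B, K)."""
    opt = torch.optim.Adam(model.quality.parameters(),
                           lr=1e-3, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for q_batch, s_true in loader:
            s_hat, _ = model(q_batch, M)
            loss = F.mse_loss(s_hat, s_true)   # L_perf
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()

# The cost branch is trained analogously with lr=1e-4, weight_decay=1e-7,
# targets c_true, and loss L_cost.
```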

5. Empirical Evaluation on RouterBench

RouterBench spans 11 LLMs across 8 benchmarks: MMLU, GSM8K, HellaSwag, ARC, Winogrande, MBPP, MT-Bench, and RAG; each sample is annotated with ground-truth response quality per LLM and corresponding API cost.

CARA's performance is compared with KNN-, MLP-, and SVM-based routers, as well as the LLM-Blender ensemble. An oracle using ground-truth $(s, c)$ establishes the upper bound.

  • Primary Metric: Average Improvement in Quality (AIQ), defined as the (normalized) area under the cost–quality Pareto frontier; see the illustrative sketch below.
  • Maximum quality across $\lambda$ is also reported.
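
Since AIQ is an area-under-curve metric, the sketch below gives one plausible way to compute it from a router's operating points across a $\lambda$ sweep. This is an interpretation for illustration only; the exact normalization used by RouterBench may differ, and the operating points are hypothetical.

```python
import numpy as np

def aiq(costs, qualities):
    """Illustrative reading of Average Improvement in Quality (AIQ): the
    normalized area under a router's cost-quality curve traced by sweeping
    lambda (exact RouterBench normalization may differ)."""
    c = np.asarray(costs, dtype=float)
    q = np.asarray(qualities, dtype=float)
    order = np.argsort(c)
    c, q = c[order], q[order]
    q = np.maximum.accumulate(q)   # keep the non-decreasing quality envelope
    area = np.sum((c[1:] - c[:-1]) * (q[1:] + q[:-1]) / 2.0)  # trapezoid rule
    return area / (c[-1] - c[0])   # normalize by the swept cost range

# Hypothetical operating points of one router across several lambda values.
print(aiq(costs=[0.5, 1.0, 2.0, 4.0], qualities=[0.60, 0.68, 0.72, 0.74]))
```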

Notable empirical findings:

Router    AIQ (Pool 1)    Max Perf (Pool 1)
CARA      0.7274          0.7808
KNN       0.7061          —
MLP       0.6760          —
SVM       0.7022          —
  • CARA achieves up to +6.6% higher AIQ and +2.9% higher maximum quality than the best predictive baseline.
  • On Pools 2–4, CARA improves AIQ by +23.9% over KNN and +27.3% over SVM.
  • The system consistently sits in the upper (cost-effective) region of cost–performance plots across domains.

6. Efficiency, Lightweight Design, and Generalization

CARA introduces minimal overhead:

  • Architecture adds only a single-head cross-attention block and two linear prediction heads.
  • The end-to-end pipeline occupies less than 1 GB of GPU memory; training completes in roughly 30 minutes on an NVIDIA A100, and inference takes roughly 10 minutes per 10,000 prompts.
  • Model embeddings are fixed; adding or removing LLMs requires only recomputing $m_j$ for the new models, not retraining the router.
  • In cross-domain and dataset-wise generalization tests (MMLU subdomains, ARC challenge), CARA matches or exceeds all baselines on the cost–quality trade-off.

A plausible implication is that CARA's modular design and reliance on stable embeddings contribute to robust transfer performance and scalability, supporting "train-and-forget" deployment across evolving LLM landscapes.

7. Summary and Context

The Cross-Attention Routing Adapter (CARA) addresses cost-aware LLM selection by leveraging a unified single-head cross-attention mechanism, explicit prediction of both model quality and generation cost, and an exponential routing reward for stability with respect to cost–quality trade-off preferences. On the RouterBench benchmark, CARA surpasses existing routers in both efficiency and empirical performance. The architectural simplicity, compatibility with fixed and extensible LLM pools, and robust generalization establish CARA as a new standard for scalable LLM routing in heterogeneous, cost-constrained environments (Pulishetty et al., 11 Sep 2025).
