Preference-Prior LinUCB Routing (PILOT)
- The paper introduces PILOT, which extends LinUCB by integrating offline human preference data to improve adaptive LLM routing.
- PILOT employs a shared embedding space and online ridge regression to continually refine model affinity while dynamically balancing exploration and exploitation.
- The algorithm incorporates a knapsack-based cost policy to adhere to budget constraints and reduce online regret in large-scale LLM deployments.
Preference-prior Informed LinUCB for Adaptive Routing (PILOT) is an algorithmic framework developed to address the challenge of intelligent, cost-effective routing among LLMs in scenarios where both cost variability and evolving user preferences are significant factors. PILOT extends the classical linear upper confidence bound (LinUCB) algorithm to the large-scale, budget-constrained, and partially-supervised context of LLM routing by leveraging offline human preference data as a prior, contextual bandit techniques for online learning, and a knapsack-based policy for dynamic resource allocation (Panda et al., 28 Aug 2025). The following sections provide an in-depth account of its conceptual foundations, embedding architecture, online updating mechanisms, cost-awareness, and its implications for scalable and adaptive LLM deployment.
1. Contextual Bandit Formulation for LLM Routing
PILOT frames LLM routing as a contextual multi-armed bandit (CMAB) problem. Each incoming query effectively serves as a "context" and the available LLMs constitute the "arms." Given a query, the routing decision corresponds to selecting the LLM likely to provide the most acceptable output for this specific context. The stochastic reward model reflects the observed response quality, encoded as human feedback or automated heuristics. The principal aim is to maximize accumulated reward (i.e., delivered quality) while enforcing runtime constraints, such as fixed or user-defined LLM usage budgets.
Unlike traditional supervised routing, PILOT does not require exhaustive pairwise evaluation of all queries and LLMs; instead, it leverages partial bandit feedback—receiving a reward only from the chosen LLM—making it suitable and efficient for real-world, large-scale settings where comprehensive supervision is intractable.
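To make the partial-feedback protocol concrete, here is a toy sketch (not PILOT's actual policy; an ε-greedy stand-in over illustrative names) showing that per query exactly one LLM is selected and only its reward is observed:

```python
# Toy contextual-bandit loop: per query, one LLM (arm) is selected
# and only that arm's reward is observed, i.e., partial bandit feedback.
import numpy as np

rng = np.random.default_rng(0)
n_llms, dim, n_queries = 3, 8, 100
true_theta = rng.normal(size=(n_llms, dim))     # hidden per-LLM quality models
counts, means = np.zeros(n_llms), np.zeros(n_llms)

for t in range(n_queries):
    x = rng.normal(size=dim)                    # incoming query context
    # Toy epsilon-greedy policy over per-arm averages; it ignores the context,
    # which is what PILOT's contextual LinUCB policy (Section 3) addresses.
    arm = int(rng.integers(n_llms)) if rng.random() < 0.1 else int(np.argmax(means))
    reward = float(true_theta[arm] @ x)         # quality observed ONLY for the chosen LLM
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```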
2. Preference-Prior Informed Shared Embedding Space
Central to PILOT is the initialization of a shared low-dimensional embedding space in which queries and LLMs are directly comparable, with affinity measured by cosine similarity. Construction occurs in two stages:
- Offline Pretraining with Human Preferences:
  - Starting from existing query embeddings $x_q$, a learned affine projection $W$ maps queries into the shared space as $\tilde{q} = W x_q$.
  - Human preference judgments (i.e., soft pairwise wins between LLMs for specific queries) inform a triplet loss: queries for which model $m$ is preferred are pulled towards $m$'s embedding and repelled from the embeddings of dispreferred models.
  - The LLM embeddings $\theta_m^{\mathrm{pref}}$ are subsequently optimized with a cross-entropy loss against the preference data.
- Online Bandit Refinement:
  - When a new query arrives, it is mapped into the shared space via $W$ and $\ell_2$-normalized, yielding the context vector $x_t$.
  - Feedback from the selected LLM updates its embedding estimate via online ridge regression, thus continually adapting the affinity structure in the shared space.
This embedding-based affinity, initialized using robust human feedback, confers strong inductive bias: routing starts with a meaningful preference prior and only needs to adjust based on observed outcomes.
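The offline stage can be pictured with a minimal PyTorch sketch; the names (`project`, `llm_emb`, `triplet_step`, `margin`) are illustrative assumptions rather than the paper's API. An affine projection maps pretrained query embeddings into the shared space, and a cosine-similarity triplet objective pulls each query toward its human-preferred LLM while pushing it away from a dispreferred one:

```python
# Hedged sketch of the offline preference-prior pretraining stage.
import torch
import torch.nn.functional as F

dim_in, dim_shared, n_llms = 768, 64, 8
project = torch.nn.Linear(dim_in, dim_shared)       # learned affine map for queries
llm_emb = torch.nn.Embedding(n_llms, dim_shared)    # one trainable vector per LLM

def triplet_step(query_x, pos_llm, neg_llm, margin=0.2):
    """query_x: (batch, dim_in); pos_llm / neg_llm: (batch,) LLM indices
    where pos_llm is the human-preferred model for each query."""
    q = F.normalize(project(query_x), dim=-1)
    pos = F.normalize(llm_emb(pos_llm), dim=-1)
    neg = F.normalize(llm_emb(neg_llm), dim=-1)
    # Cosine-similarity triplet loss: pull q toward pos, push it away from neg.
    loss = F.relu(margin - (q * pos).sum(-1) + (q * neg).sum(-1)).mean()
    loss.backward()
    return loss.item()

# Example usage with random stand-in data:
# x = torch.randn(32, dim_in)
# pos = torch.randint(0, n_llms, (32,)); neg = torch.randint(0, n_llms, (32,))
# triplet_step(x, pos, neg)
```

The cross-entropy refinement of the LLM embeddings and the optimizer loop are omitted for brevity.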
3. Online LinUCB Updating with Preference Priors
The online decision process in PILOT integrates preference priors into the LinUCB update structure as follows:
- For each LLM $m$, the parameter matrix $A_m$ and vector $b_m$ are updated at step $t$ by:
$$A_m \leftarrow A_m + x_t x_t^{\top}, \qquad b_m \leftarrow b_m + r_t x_t,$$
where $r_t$ is the observed reward.
- The embedding estimate at each round is:
$$\hat{\theta}_m = A_m^{-1} b_m.$$
- Regularization and preference integration are achieved by initializing:
$$A_m = \lambda I, \qquad b_m = \lambda\, \theta_m^{\mathrm{pref}},$$
where $\lambda$ controls the strength of the prior and $\theta_m^{\mathrm{pref}}$ is the preference-pretrained embedding of LLM $m$.
- At each round, the routing policy selects the LLM maximizing the following upper confidence bound:
$$m_t = \arg\max_m \; x_t^{\top}\hat{\theta}_m + \alpha \sqrt{x_t^{\top} A_m^{-1} x_t},$$
where $\alpha$ is the exploration coefficient.
This structure rewards both exploitation (estimated reward) and exploration (model uncertainty): early rounds rely on the prior knowledge encoded in $\theta_m^{\mathrm{pref}}$, while later rounds are refined by online feedback, yielding lower regret than bandit approaches lacking preference-informed priors.
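A minimal NumPy sketch of these updates follows; the class and parameter names (`PriorLinUCBArm`, `lambda_reg`, `alpha`) are illustrative rather than taken from the paper, and the prior-mean initialization mirrors the equations above.

```python
# Prior-initialized LinUCB sketch: one arm object per candidate LLM.
import numpy as np

class PriorLinUCBArm:
    def __init__(self, theta_prior, lambda_reg=1.0, alpha=0.5):
        d = theta_prior.shape[0]
        self.A = lambda_reg * np.eye(d)        # A_m = lambda * I
        self.b = lambda_reg * theta_prior      # b_m = lambda * theta_m^pref
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b             # theta_hat_m = A_m^{-1} b_m
        return x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)               # A_m <- A_m + x_t x_t^T
        self.b += reward * x                   # b_m <- b_m + r_t x_t

def select_llm(arms, x):
    """Route the normalized query embedding x to the arm with the highest UCB."""
    return max(range(len(arms)), key=lambda m: arms[m].ucb(x))
```

Each arm would be constructed from the corresponding preference-pretrained LLM embedding, after which `select_llm` implements the UCB routing rule and `update` applies the partial-feedback ridge-regression step.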
4. Budget-Aware Online Cost Policy
Practical deployment of LLMs incurs query-specific per-token costs that can vary between models. PILOT formalizes adherence to overall budget constraints via an online multi-choice knapsack problem (ON-MCKP):
- Each query presents a set of candidate LLMs, each with an associated expected reward (affinity score) and cost (token-based).
- For each candidate, eligibility is governed by a dynamic threshold on the reward-to-cost ratio:
$$\Psi(z) = \left(\frac{U e}{L}\right)^{z} \frac{L}{e},$$
where $U$ and $L$ respectively denote the known upper and lower bounds on the reward-to-cost ratio, and $z \in [0, 1]$ is a normalized measure of expenditure so far.
- Only LLMs whose affinity-to-cost ratio clears this dynamic threshold (equivalently, whose cost falls below the induced cost ceiling) are eligible.
- Budget is partitioned over the query trajectory into bins; any unused budget from one bin spills to the next, providing both flexibility and strict overall adherence.
This dual-objective approach—maximizing quality subject to cost—enables the PILOT router to satisfy operational constraints central to production LLM deployment.
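A hedged sketch of this eligibility filter follows, assuming the standard online-knapsack threshold form shown above; the paper's exact functional form and bin bookkeeping may differ, and the names (`psi`, `eligible_llms`) are illustrative.

```python
# Budget-aware filter: gate candidate LLMs by a threshold on the
# reward-to-cost ratio that tightens as the budget is consumed.
import math

def psi(z, L, U):
    """Threshold on the reward-to-cost ratio at normalized expenditure z in [0, 1]."""
    return (U * math.e / L) ** z * (L / math.e)

def eligible_llms(candidates, spent, budget, L, U):
    """candidates: list of (llm_id, affinity, cost). Keep only the LLMs
    whose affinity-to-cost ratio clears the current threshold."""
    z = min(spent / budget, 1.0)
    threshold = psi(z, L, U)
    return [(m, a, c) for (m, a, c) in candidates if a / c >= threshold]
```

Bin-level budget accounting, with unused budget spilling into the next bin, would wrap this filter by resetting `spent` and `budget` per bin.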
5. Practical Benefits and Deployment Implications
PILOT offers advantages making it suitable for large-scale, real-world LLM query routing:
- Sample Efficiency: By initializing with preference data, online exploration is reduced; system regret is substantially lower than that of methods lacking an informative prior.
- Resource Optimization: The cost-aware binning strategy prevents budget overrun and supports heterogeneous cost profiles, a typical feature of commercial LLM deployments.
- Scalability: The embedding-based structure accommodates high-dimensional query/model spaces without incurring combinatorial explosion.
- Policy Modularity: Affinity estimation and cost policy are decoupled, enabling independent improvements or adaptations for application-specific objectives.
A plausible implication is that the PILOT methodology can be adapted beyond LLM routing, for instance to multi-agent task allocation (Panayotov et al., 10 Mar 2025), whenever human preference data and operational budgets are available.
6. Relation to Other Adaptive Routing Frameworks
The PILOT approach conceptually aligns with a broader class of adaptive routing algorithms that incorporate multi-factor utility or cost functions and RL-driven weighting mechanisms (Panayotov et al., 10 Mar 2025). Whereas earlier frameworks rely on explicit task-characteristic features and learned weights to guide discrete routing (e.g., in AI multi-agent networks), PILOT operates in a continuous shared embedding space, explicitly combines offline human knowledge with online reward feedback, and targets the bandit/knapsack regime typical of costly inference settings.
The integration of preference-informed priors, upper-confidence-driven arm selection, and a knapsack-based cost policy constitutes the central innovation distinguishing PILOT from legacy shortest-path, static-priority, or non-budgeted bandit methods.
7. Summary Table: Core Components of PILOT
| Component | Description | Mathematical Detail / Key Mechanism |
|---|---|---|
| Shared Embedding | Aligns queries and LLMs based on human preferences | $\tilde{q} = W x_q$, LLM embeddings $\theta_m^{\mathrm{pref}}$; optimized via triplet and cross-entropy losses |
| LinUCB Extension | Online adaptation using prior-initialized LLM embeddings | $A_m \leftarrow A_m + x_t x_t^{\top}$, $b_m \leftarrow b_m + r_t x_t$, $\hat{\theta}_m = A_m^{-1} b_m$ |
| LLM Selection | UCB on estimated reward plus model uncertainty | $\arg\max_m \; x_t^{\top}\hat{\theta}_m + \alpha \sqrt{x_t^{\top} A_m^{-1} x_t}$ |
| Cost Policy | Dynamic thresholding based on knapsack bins and cost ratios | $\Psi(z)$ as above; budget allocated across bins with spillover |
This structure captures the sequential decision-making, preference weighting, and budget awareness central to PILOT's demonstrable improvements for adaptive LLM routing (Panda et al., 28 Aug 2025).