PILOT: Preference-Prior Informed LinUCB

Updated 2 September 2025
  • The paper introduces PILOT, a cost-aware contextual bandit algorithm that integrates offline human preference data via a shared embedding space for adaptive LLM routing.
  • It employs LinUCB-style online updates and a multi-choice knapsack policy to balance exploration, exploitation, and budget constraints dynamically.
  • Extensive evaluations show PILOT achieves near-GPT-4 performance at only 25% of the cost, demonstrating robust adaptability across varying embedding architectures.

Preference-Prior Informed LinUCB (PILOT) is a contextual bandit algorithm specifically designed for adaptive routing of user queries to LLMs under budget constraints. PILOT integrates offline human preference data as an inductive prior in the spirit of LinUCB, leveraging a shared embedding space for queries and models, and employs an online multi-choice knapsack policy for resource-aware selection. By combining preference-aware model initialization and exploration-exploitation strategies, PILOT achieves cost-efficient, high-performance model routing in large-scale, dynamic LLM serving environments (Panda et al., 28 Aug 2025).

1. Conceptual Framework

PILOT formulates the LLM routing problem as a contextual bandit, departing from prior supervised approaches that require exhaustively observing rewards for all query–model pairs. Each routing decision involves selecting an LLM (“arm”) for a given query (“context”) to maximize cumulative reward—reward being the measured quality of an LLM’s response as determined by human preference or evaluative heuristics.

The principal innovation is the introduction of a “preference prior” over the contextual bandit’s reward model: prior to online deployment, extensive human preference data (e.g., pairwise model comparisons for queries) are used to construct a shared embedding space and initialize the model-specific reward parameters. During online operation, PILOT employs LinUCB-style upper confidence bounds, adjusting model parameters incrementally from observed bandit feedback (reward for the chosen model only).

Moreover, PILOT is coupled with an online cost-control policy that treats model selection as a multi-choice knapsack problem. Each model's selection incurs a query-dependent cost (e.g., token usage), and the algorithm dynamically keeps operation within a user-specified budget while maximizing reward.

2. Shared Embedding Space and Preference Prior

A central component in PILOT is the shared dₘ-dimensional embedding space linking query and LLM representations:

  • Query Embedding: An input query $q$ is encoded via a pretrained embedding model $\phi(q)$, then projected to the shared space by a trainable linear map $\psi(q) = W\phi(q) + b$.
  • LLM Embedding: Each LLM $l_i$ is associated with a parameter vector $\theta_i$ in the same space.
  • Alignment: The affinity between a query and an LLM is measured by the cosine similarity $\cos(\psi(q), \theta_i)$.
  • Pretraining: The shared space and model vectors are jointly pretrained on human preference data. Training objectives include a cosine-distance-based triplet loss (to pull preferred responses closer to the corresponding query) and a pairwise cross-entropy loss matching empirical preferences. For a query $q$, the probability that $l_i$ is preferred over $l_j$ is modeled as

$$p_i = \frac{\exp(\cos(\theta_i, \psi(q)))}{\exp(\cos(\theta_i, \psi(q))) + \exp(\cos(\theta_j, \psi(q)))}$$

After offline pretraining, the resulting $\theta_i^{\text{pref}}$ act as a prior for online bandit updates, reducing the regret associated with cold-start exploration.
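
The pairwise objective above can be sketched in a few lines of PyTorch. This is a minimal illustration only: it assumes precomputed, frozen query embeddings $\phi(q)$ and batches of (preferred, rejected) model indices; the class name, dimensions, and optimizer settings are illustrative rather than taken from the paper, and the triplet-loss term mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingSpace(nn.Module):
    """Projects frozen query embeddings phi(q) into the shared d_m-dimensional
    space (psi(q) = W phi(q) + b) and holds one trainable vector theta_i per LLM."""

    def __init__(self, query_dim: int, shared_dim: int, num_models: int):
        super().__init__()
        self.proj = nn.Linear(query_dim, shared_dim)                 # W, b
        self.theta = nn.Parameter(0.01 * torch.randn(num_models, shared_dim))

    def forward(self, phi_q: torch.Tensor) -> torch.Tensor:
        return self.proj(phi_q)                                      # psi(q)

def pairwise_preference_loss(model: SharedEmbeddingSpace,
                             phi_q: torch.Tensor,
                             preferred: torch.Tensor,
                             rejected: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on p_i = softmax over (cos(theta_i, psi(q)), cos(theta_j, psi(q)))."""
    psi_q = F.normalize(model(phi_q), dim=-1)                        # (batch, d_m)
    theta = F.normalize(model.theta, dim=-1)                         # (num_models, d_m)
    cos_pref = (psi_q * theta[preferred]).sum(-1)                    # cos(theta_i, psi(q))
    cos_rej = (psi_q * theta[rejected]).sum(-1)                      # cos(theta_j, psi(q))
    logits = torch.stack([cos_pref, cos_rej], dim=-1)
    targets = torch.zeros(phi_q.size(0), dtype=torch.long)           # preferred model is index 0
    return F.cross_entropy(logits, targets)

# Example training step on a batch of (query embedding, preferred LLM, rejected LLM).
model = SharedEmbeddingSpace(query_dim=1536, shared_dim=128, num_models=8)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
phi_q = torch.randn(32, 1536)                                        # precomputed phi(q)
preferred = torch.randint(0, 8, (32,))
rejected = torch.randint(0, 8, (32,))
loss = pairwise_preference_loss(model, phi_q, preferred, rejected)
loss.backward()
optim.step()
# After convergence, model.theta[i] serves as theta_i^pref for the online bandit prior.
```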

3. Online Bandit Learning and LinUCB Update

In online deployment, PILOT continuously refines its model selection via contextual bandit learning:

  • Context: For each query $q_t$, the context vector is $\psi(q_t)$.
  • Online Ridge Regression: For each arm (model) $a$, PILOT maintains

$$A_a^t = A_a^{t-1} + \psi(q_t)\psi(q_t)^\top, \qquad b_a^t = b_a^{t-1} + r_t\,\psi(q_t)$$

with initializations $A_a^0 = \lambda_a I$ and $b_a^0 = \lambda_a \theta_a^{\text{pref}}$, where $\lambda_a$ is a prior-strength hyperparameter.

  • Prediction: The updated parameter estimate is

$$\tilde{\theta}_a^t = (A_a^t)^{-1} b_a^t$$

  • Upper Confidence Bound: Model selection is determined by

$$a_t = \arg\max_a \left[\cos\big(\psi(q_t), \tilde{\theta}_a^t\big) + \alpha\sqrt{\psi(q_t)^\top (A_a^t)^{-1}\psi(q_t)}\right]$$

where $\alpha$ is an exploration parameter.

Only the selected model receives a reward update (bandit feedback), progressively aligning the reward estimates with actual user preferences in the current distribution.
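
A minimal NumPy sketch of this prior-initialized LinUCB state is given below. The initialization, update, and UCB rule follow the formulas in this section; the class name PilotLinUCBArm, the hyperparameter values, and the toy example at the bottom are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

class PilotLinUCBArm:
    """Per-model LinUCB state whose ridge regression is initialized from the
    offline preference prior theta_a^pref (a sketch, not the paper's code)."""

    def __init__(self, theta_pref: np.ndarray, lam: float = 1.0, alpha: float = 0.5):
        d = theta_pref.shape[0]
        self.A = lam * np.eye(d)            # A_a^0 = lambda_a * I
        self.b = lam * theta_pref           # b_a^0 = lambda_a * theta_a^pref
        self.alpha = alpha                  # exploration parameter

    def _theta_tilde(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)   # theta~_a^t = (A_a^t)^{-1} b_a^t

    def affinity(self, psi_q: np.ndarray) -> float:
        theta = self._theta_tilde()
        return float(psi_q @ theta /
                     (np.linalg.norm(psi_q) * np.linalg.norm(theta) + 1e-12))

    def ucb_score(self, psi_q: np.ndarray) -> float:
        bonus = self.alpha * float(np.sqrt(psi_q @ np.linalg.solve(self.A, psi_q)))
        return self.affinity(psi_q) + bonus      # exploitation + exploration

    def update(self, psi_q: np.ndarray, reward: float) -> None:
        # Bandit feedback: only the arm actually selected for q_t is updated.
        self.A += np.outer(psi_q, psi_q)         # A_a^t = A_a^{t-1} + psi psi^T
        self.b += reward * psi_q                 # b_a^t = b_a^{t-1} + r_t psi

# Example: two arms whose priors would come from the offline pretraining stage.
rng = np.random.default_rng(0)
arms = {name: PilotLinUCBArm(rng.normal(size=128)) for name in ("gpt-4", "mistral-7b")}
psi_q = rng.normal(size=128)                     # projected query embedding psi(q_t)
chosen = max(arms, key=lambda name: arms[name].ucb_score(psi_q))
arms[chosen].update(psi_q, reward=0.8)           # reward observed for the chosen model only
```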

4. Online Cost Policy: Multi-Choice Knapsack Optimization

To enforce budget constraints at query time, PILOT uses an online multi-choice knapsack policy:

  • Cost-Aware Arm Selection: For each candidate model $l$ at time $t$, define a threshold

$$th_t^l = \frac{\cos\big(\hat{\psi}(q_t), \hat{\theta}_l^t\big)}{(UB \cdot e / LB)^{z_t} \cdot (LB / e)}$$

where $UB$ and $LB$ are upper and lower bounds on the reward-to-cost ratio and $z_t$ is the current normalized budget utilization.

  • Eligibility: Models whose costs fall below $th_t^l$ are eligible for selection.
  • Budget Management: Queries are partitioned into bins, with each bin allocated a portion of the total budget BB. Unused budget spills into subsequent bins, enforcing strict adherence to budget constraints over the global session.
  • Optimization: At each decision, select the eligible model with maximal expected reward such that the aggregate cost remains within budget.
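
The list above can be sketched as follows. This is a minimal illustration of the eligibility test and a per-bin budget check only; the helper names and the numbers in the example are assumptions, and the spill-over of unused budget between bins is omitted.

```python
import math
from typing import Optional

def eligibility_threshold(affinity: float, z_t: float, ub: float, lb: float) -> float:
    """Cost threshold th_t^l from the estimated affinity and the normalized budget
    utilization z_t in [0, 1]; permissive early on, stricter as the budget is consumed."""
    return affinity / ((ub * math.e / lb) ** z_t * (lb / math.e))

def select_within_budget(candidates: dict, spent: float, bin_budget: float,
                         ub: float, lb: float) -> Optional[str]:
    """candidates maps model name -> (estimated affinity, per-query cost).
    Returns the eligible model with the highest affinity, or None if nothing fits."""
    z_t = min(spent / bin_budget, 1.0) if bin_budget > 0 else 1.0
    eligible = {
        name: aff
        for name, (aff, cost) in candidates.items()
        if cost <= eligibility_threshold(aff, z_t, ub, lb)   # E_t = {l : c_l <= th_t^l}
        and spent + cost <= bin_budget                        # stay inside this bin's budget
    }
    return max(eligible, key=eligible.get) if eligible else None

# Example: route one query against a per-bin budget (all numbers are illustrative).
candidates = {"gpt-4": (0.92, 0.030), "mistral-7b": (0.71, 0.002)}
print(select_within_budget(candidates, spent=0.10, bin_budget=0.50, ub=40.0, lb=2.0))
```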

5. Mathematical Formulation Summary

The table below summarizes the principal mathematical quantities defining PILOT:

| Quantity | Formula / Description |
| --- | --- |
| Query–Model Affinity | $E[r_t \mid a, q_t] = \cos(\hat{\psi}(q_t), \hat{\theta}_a)$ |
| Online LinUCB Update | $A_a^t$, $b_a^t$ as above; $\tilde{\theta}_a^t = (A_a^t)^{-1} b_a^t$ |
| UCB Arm Selection | $a_t = \arg\max_a \big[\cos(\psi(q_t), \tilde{\theta}_a^t) + \alpha\sqrt{\psi(q_t)^\top (A_a^t)^{-1}\psi(q_t)}\big]$ |
| Preference-Prior Initialization | $A_a^0 = \lambda_a I$, $b_a^0 = \lambda_a \theta_a^{\text{pref}}$ |
| Cost Policy Eligibility Threshold | $th_t^l$ as above; eligible set $E_t = \{\,l : c_l \leq th_t^l\,\}$ |

This mathematical structure grounds PILOT in both efficient exploration (via preference-based regularization) and rigorous budget-constrained exploitation.
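
Putting these quantities together, a single routing decision can be sketched as below. It reuses the PilotLinUCBArm and eligibility_threshold helpers from the earlier sketches (so it is not self-contained on its own), and the fallback behaviour when no model is eligible (returning None) is an assumption rather than something specified in the paper.

```python
def pilot_step(arms: dict, costs: dict, psi_q, spent: float,
               bin_budget: float, ub: float, lb: float):
    """One full PILOT decision: filter models by the cost threshold (set E_t),
    then pick the eligible model with the highest upper confidence bound."""
    z_t = min(spent / bin_budget, 1.0) if bin_budget > 0 else 1.0
    eligible = [
        name for name, arm in arms.items()
        if costs[name] <= eligibility_threshold(arm.affinity(psi_q), z_t, ub, lb)
        and spent + costs[name] <= bin_budget
    ]
    if not eligible:
        return None   # what to do when nothing fits the budget is left unspecified here
    return max(eligible, key=lambda name: arms[name].ucb_score(psi_q))

# chosen = pilot_step(arms, {"gpt-4": 0.030, "mistral-7b": 0.002}, psi_q,
#                     spent=0.10, bin_budget=0.50, ub=40.0, lb=2.0)
```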

6. Experimental Evaluation and Performance

Extensive evaluation of PILOT uses RouterBench (covering commonsense, math, code, and scientific queries) and the MMLU benchmark:

  • Quality–Cost Tradeoff: In multi-model routing, PILOT attains approximately 93% of GPT-4’s deployment performance at only 25% of its cost.
  • Cumulative Regret and Adaptation: In binary routing settings (e.g., GPT-4 vs. Mistral-7B), PILOT achieves lower cumulative regret and higher realized reward than standard baselines such as LinUCB, Epoch-Greedy, and Random selection.
  • Robustness to Embedding Encoder: Experiments with different query embedders (OpenAI text-embedding-3-small, Instructor-XL) show that PILOT's performance is stable across architectures.
  • Qualitative Routing: For easy queries, PILOT prefers low-cost models; for complex or high-stakes queries (e.g., from MMLU, ARC Challenge), PILOT routes to higher-quality but costlier models, respecting the system’s budget at all times.
  • Cost Policy Efficacy: The binning-based online cost control succeeds in balancing value and expenditure, with unutilized budget efficiently spilling over and preventing both overspending and performance collapse.

7. Relation to Other Preference-Prior Approaches

PILOT’s architecture embodies the transfer of human preference information, obtained offline, into rapid, high-confidence exploration in interactive settings. This design parallels a general trend in modern contextual bandit research to use preference-induced priors to reduce sample complexity and regret.

There are strong connections to other preference-prior approaches, including APRIL (Akrour et al., 2012), which studies preference-based reinforcement learning with an active querying policy, as well as recommendation algorithms that leverage hierarchical/structured feedback to guide exploration in large action spaces (Zuo et al., 2022). PILOT extends these concepts by incorporating the preference prior directly into the LinUCB parameterization and tightly coupling routing with resource-aware optimization. A plausible implication is that PILOT’s paradigm can generalize to other selection/allocation problems with similar feedback and budget characteristics.

8. Potential Applications and Future Directions

PILOT is immediately applicable to large-scale, dynamic LLM serving platforms where maximizing utility per token cost is critical (e.g., cloud-based AI assistants, SaaS LLM routers, federated LLM service markets). The design is also extensible:

  • Alternative Resource Constraints: PILOT’s knapsack-based policy can be adapted for latency, energy, or quality-of-service constraints.
  • Preference Priors Beyond LLMs: The shared embedding and preference-prior machinery is transferable to multi-modal selectors, recommender systems, and combinatorial action spaces.
  • Interactive Human-in-the-Loop Systems: Scenarios where human preference data is incrementally collected align well with PILOT’s two-stage (offline+online) updating scheme.
  • Further Research: Analysis of regret bounds for non-stationary or adversarial query distributions, richer context modeling (e.g., session-aware features), and preference-based transfer across deployment domains represent logical next steps.

In summary, Preference-Prior Informed LinUCB (PILOT) offers a unified framework for integrating human preferences, contextual bandit optimization, and budget-aware selection, providing a mathematically sound and empirically validated solution for adaptive model routing in contemporary LLM systems.
