PILOT: Preference-Prior Informed LinUCB

Updated 2 September 2025
  • The paper introduces PILOT, a cost-aware contextual bandit algorithm that integrates offline human preference data via a shared embedding space for adaptive LLM routing.
  • It employs LinUCB-style online updates and a multi-choice knapsack policy to balance exploration, exploitation, and budget constraints dynamically.
  • Extensive evaluations show PILOT achieves near-GPT-4 performance at only 25% of the cost, demonstrating robust adaptability across varying embedding architectures.

Preference-Prior Informed LinUCB (PILOT) is a contextual bandit algorithm specifically designed for adaptive routing of user queries to LLMs under budget constraints. PILOT integrates offline human preference data as an inductive prior in the spirit of LinUCB, leveraging a shared embedding space for queries and models, and employs an online multi-choice knapsack policy for resource-aware selection. By combining preference-aware model initialization and exploration-exploitation strategies, PILOT achieves cost-efficient, high-performance model routing in large-scale, dynamic LLM serving environments (Panda et al., 28 Aug 2025).

1. Conceptual Framework

PILOT formulates the LLM routing problem as a contextual bandit, departing from prior supervised approaches that require exhaustively observing rewards for all query–model pairs. Each routing decision involves selecting an LLM (“arm”) for a given query (“context”) to maximize cumulative reward—reward being the measured quality of an LLM’s response as determined by human preference or evaluative heuristics.

The principal innovation is the introduction of a “preference prior” over the contextual bandit’s reward model: prior to online deployment, extensive human preference data (e.g., pairwise model comparisons for queries) are used to construct a shared embedding space and initialize the model-specific reward parameters. During online operation, PILOT employs LinUCB-style upper confidence bounds, adjusting model parameters incrementally from observed bandit feedback (reward for the chosen model only).

Moreover, PILOT is coupled with an online cost-control policy that treats model selection as a multi-choice knapsack problem. Each model's selection incurs a query-dependent cost (e.g., token usage), and the algorithm dynamically keeps operation within a user-specified budget while maximizing reward.

2. Shared Embedding Space and Preference Prior

A central component in PILOT is the shared dₘ-dimensional embedding space linking query and LLM representations:

  • Query Embedding: An input query $q$ is encoded via a pretrained embedding model $\phi(q)$, then projected to the shared space by a trainable linear map $\psi(q) = W\phi(q) + b$.
  • LLM Embedding: Each LLM $l_i$ is associated with a parameter vector $\theta_i$ in the same space.
  • Alignment: The affinity between a query and an LLM is measured by the cosine similarity $\cos(\psi(q), \theta_i)$.
  • Pretraining: The shared space and model vectors are jointly pretrained on human preference data. Training objectives include a cosine-distance-based triplet loss (to pull preferred responses closer to the corresponding query) and a pairwise cross-entropy loss matching empirical preferences. For a query $q$, the probability that $l_i$ is preferred over $l_j$ is modeled as

$$p_i = \frac{\exp(\cos(\theta_i, \psi(q)))}{\exp(\cos(\theta_i, \psi(q))) + \exp(\cos(\theta_j, \psi(q)))}$$

After offline pretraining, the resulting $\theta_i^{\text{pref}}$ act as a prior for online bandit updates, reducing the regret associated with cold-start exploration.
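
The pairwise objective above can be sketched in a few lines of PyTorch. This is a minimal illustration only: it assumes precomputed, frozen query embeddings $\phi(q)$ and batches of (preferred, rejected) model indices; the class name, dimensions, and optimizer settings are illustrative rather than taken from the paper, and the triplet-loss term mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingSpace(nn.Module):
    """Projects frozen query embeddings phi(q) into the shared d_m-dimensional
    space (psi(q) = W phi(q) + b) and holds one trainable vector theta_i per LLM."""

    def __init__(self, query_dim: int, shared_dim: int, num_models: int):
        super().__init__()
        self.proj = nn.Linear(query_dim, shared_dim)                 # W, b
        self.theta = nn.Parameter(0.01 * torch.randn(num_models, shared_dim))

    def forward(self, phi_q: torch.Tensor) -> torch.Tensor:
        return self.proj(phi_q)                                      # psi(q)

def pairwise_preference_loss(model: SharedEmbeddingSpace,
                             phi_q: torch.Tensor,
                             preferred: torch.Tensor,
                             rejected: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on p_i = softmax over (cos(theta_i, psi(q)), cos(theta_j, psi(q)))."""
    psi_q = F.normalize(model(phi_q), dim=-1)                        # (batch, d_m)
    theta = F.normalize(model.theta, dim=-1)                         # (num_models, d_m)
    cos_pref = (psi_q * theta[preferred]).sum(-1)                    # cos(theta_i, psi(q))
    cos_rej = (psi_q * theta[rejected]).sum(-1)                      # cos(theta_j, psi(q))
    logits = torch.stack([cos_pref, cos_rej], dim=-1)
    targets = torch.zeros(phi_q.size(0), dtype=torch.long)           # preferred model is index 0
    return F.cross_entropy(logits, targets)

# Example training step on a batch of (query embedding, preferred LLM, rejected LLM).
model = SharedEmbeddingSpace(query_dim=1536, shared_dim=128, num_models=8)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
phi_q = torch.randn(32, 1536)                                        # precomputed phi(q)
preferred = torch.randint(0, 8, (32,))
rejected = torch.randint(0, 8, (32,))
loss = pairwise_preference_loss(model, phi_q, preferred, rejected)
loss.backward()
optim.step()
# After convergence, model.theta[i] serves as theta_i^pref for the online bandit prior.
```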

3. Online Bandit Learning and LinUCB Update

In online deployment, PILOT continuously refines its model selection via contextual bandit learning:

  • Context: For each query $q_t$, the context vector is $\psi(q_t)$.
  • Online Ridge Regression: For each arm (model) $a$, PILOT maintains

$$A_a^t = A_a^{t-1} + \psi(q_t)\psi(q_t)^\top, \qquad b_a^t = b_a^{t-1} + r_t\,\psi(q_t)$$

with initializations $A_a^0 = \lambda_a I$ and $b_a^0 = \lambda_a \theta_a^{\text{pref}}$, where $\lambda_a$ is a prior-strength hyperparameter.

  • Prediction: The updated parameter estimate is

$$\tilde{\theta}_a^t = (A_a^t)^{-1} b_a^t$$

  • Upper Confidence Bound: Model selection is determined by

$$a_t = \arg\max_a \left[\cos\big(\psi(q_t), \tilde{\theta}_a^t\big) + \alpha\sqrt{\psi(q_t)^\top (A_a^t)^{-1}\psi(q_t)}\right]$$

where $\alpha$ is an exploration parameter.

Only the selected model receives a reward update (bandit feedback), progressively aligning the reward estimates with actual user preferences in the current distribution.
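
A minimal NumPy sketch of this prior-initialized LinUCB state is given below. The initialization, update, and UCB rule follow the formulas in this section; the class name PilotLinUCBArm, the hyperparameter values, and the toy example at the bottom are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

class PilotLinUCBArm:
    """Per-model LinUCB state whose ridge regression is initialized from the
    offline preference prior theta_a^pref (a sketch, not the paper's code)."""

    def __init__(self, theta_pref: np.ndarray, lam: float = 1.0, alpha: float = 0.5):
        d = theta_pref.shape[0]
        self.A = lam * np.eye(d)            # A_a^0 = lambda_a * I
        self.b = lam * theta_pref           # b_a^0 = lambda_a * theta_a^pref
        self.alpha = alpha                  # exploration parameter

    def _theta_tilde(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)   # theta~_a^t = (A_a^t)^{-1} b_a^t

    def affinity(self, psi_q: np.ndarray) -> float:
        theta = self._theta_tilde()
        return float(psi_q @ theta /
                     (np.linalg.norm(psi_q) * np.linalg.norm(theta) + 1e-12))

    def ucb_score(self, psi_q: np.ndarray) -> float:
        bonus = self.alpha * float(np.sqrt(psi_q @ np.linalg.solve(self.A, psi_q)))
        return self.affinity(psi_q) + bonus      # exploitation + exploration

    def update(self, psi_q: np.ndarray, reward: float) -> None:
        # Bandit feedback: only the arm actually selected for q_t is updated.
        self.A += np.outer(psi_q, psi_q)         # A_a^t = A_a^{t-1} + psi psi^T
        self.b += reward * psi_q                 # b_a^t = b_a^{t-1} + r_t psi

# Example: two arms whose priors would come from the offline pretraining stage.
rng = np.random.default_rng(0)
arms = {name: PilotLinUCBArm(rng.normal(size=128)) for name in ("gpt-4", "mistral-7b")}
psi_q = rng.normal(size=128)                     # projected query embedding psi(q_t)
chosen = max(arms, key=lambda name: arms[name].ucb_score(psi_q))
arms[chosen].update(psi_q, reward=0.8)           # reward observed for the chosen model only
```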

4. Online Cost Policy: Multi-Choice Knapsack Optimization

To enforce budget constraints at query time, PILOT uses an online multi-choice knapsack policy:

  • Cost-Aware Arm Selection: For each candidate model $l$ at time $t$, define a threshold

$$th_t^l = \frac{\cos\big(\hat{\psi}(q_t), \hat{\theta}_l^t\big)}{(UB \cdot e / LB)^{z_t} \cdot (LB / e)}$$

where $UB$ and $LB$ are upper and lower bounds on the reward-to-cost ratio and $z_t$ is the current normalized budget utilization.

  • Eligibility: Models whose costs fall below $th_t^l$ are eligible for selection.
  • Budget Management: Queries are partitioned into bins, with each bin allocated a portion of the total budget BB. Unused budget spills into subsequent bins, enforcing strict adherence to budget constraints over the global session.
  • Optimization: At each decision, select the eligible model with maximal expected reward such that the aggregate cost remains within budget.
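
The list above can be sketched as follows. This is a minimal illustration of the eligibility test and a per-bin budget check only; the helper names and the numbers in the example are assumptions, and the spill-over of unused budget between bins is omitted.

```python
import math
from typing import Optional

def eligibility_threshold(affinity: float, z_t: float, ub: float, lb: float) -> float:
    """Cost threshold th_t^l from the estimated affinity and the normalized budget
    utilization z_t in [0, 1]; permissive early on, stricter as the budget is consumed."""
    return affinity / ((ub * math.e / lb) ** z_t * (lb / math.e))

def select_within_budget(candidates: dict, spent: float, bin_budget: float,
                         ub: float, lb: float) -> Optional[str]:
    """candidates maps model name -> (estimated affinity, per-query cost).
    Returns the eligible model with the highest affinity, or None if nothing fits."""
    z_t = min(spent / bin_budget, 1.0) if bin_budget > 0 else 1.0
    eligible = {
        name: aff
        for name, (aff, cost) in candidates.items()
        if cost <= eligibility_threshold(aff, z_t, ub, lb)   # E_t = {l : c_l <= th_t^l}
        and spent + cost <= bin_budget                        # stay inside this bin's budget
    }
    return max(eligible, key=eligible.get) if eligible else None

# Example: route one query against a per-bin budget (all numbers are illustrative).
candidates = {"gpt-4": (0.92, 0.030), "mistral-7b": (0.71, 0.002)}
print(select_within_budget(candidates, spent=0.10, bin_budget=0.50, ub=40.0, lb=2.0))
```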

5. Mathematical Formulation Summary

The table below summarizes the principal mathematical quantities defining PILOT:

| Quantity | Formula / Description |
| --- | --- |
| Query–Model Affinity | $E[r_t \mid a, q_t] = \cos(\hat{\psi}(q_t), \hat{\theta}_a)$ |
| Online LinUCB Update | $A_a^t$, $b_a^t$ as above; $\tilde{\theta}_a^t = (A_a^t)^{-1} b_a^t$ |
| UCB Arm Selection | $a_t = \arg\max_a \big[\cos(\psi(q_t), \tilde{\theta}_a^t) + \alpha\sqrt{\psi(q_t)^\top (A_a^t)^{-1}\psi(q_t)}\big]$ |
| Preference-Prior Initialization | $A_a^0 = \lambda_a I$, $b_a^0 = \lambda_a \theta_a^{\text{pref}}$ |
| Cost Policy Eligibility Threshold | $th_t^l$ as above; eligible set $E_t = \{\,l : c_l \leq th_t^l\,\}$ |

This mathematical structure grounds PILOT in both efficient exploration (via preference-based regularization) and rigorous budget-constrained exploitation.
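
Putting these quantities together, a single routing decision can be sketched as below. It reuses the PilotLinUCBArm and eligibility_threshold helpers from the earlier sketches (so it is not self-contained on its own), and the fallback behaviour when no model is eligible (returning None) is an assumption rather than something specified in the paper.

```python
def pilot_step(arms: dict, costs: dict, psi_q, spent: float,
               bin_budget: float, ub: float, lb: float):
    """One full PILOT decision: filter models by the cost threshold (set E_t),
    then pick the eligible model with the highest upper confidence bound."""
    z_t = min(spent / bin_budget, 1.0) if bin_budget > 0 else 1.0
    eligible = [
        name for name, arm in arms.items()
        if costs[name] <= eligibility_threshold(arm.affinity(psi_q), z_t, ub, lb)
        and spent + costs[name] <= bin_budget
    ]
    if not eligible:
        return None   # what to do when nothing fits the budget is left unspecified here
    return max(eligible, key=lambda name: arms[name].ucb_score(psi_q))

# chosen = pilot_step(arms, {"gpt-4": 0.030, "mistral-7b": 0.002}, psi_q,
#                     spent=0.10, bin_budget=0.50, ub=40.0, lb=2.0)
```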

6. Experimental Evaluation and Performance

Extensive evaluation of PILOT uses RouterBench (covering commonsense, math, code, and scientific queries) and the MMLU benchmark:

  • Quality–Cost Tradeoff: In multi-model routing, PILOT attains approximately 93% of GPT-4’s deployment performance at only 25% of its cost.
  • Cumulative Regret and Adaptation: In binary routing settings (e.g., GPT-4 vs. Mistral-7B), PILOT achieves lower cumulative regret and higher realized reward than standard baselines such as LinUCB, Epoch-Greedy, and Random selection.
  • Robustness to Embedding Encoder: Experiments with different query embedders (OpenAI text-embedding-3-small, Instructor-XL) show that PILOT's performance is stable across architectures.
  • Qualitative Routing: For easy queries, PILOT prefers low-cost models; for complex or high-stakes queries (e.g., from MMLU, ARC Challenge), PILOT routes to higher-quality but costlier models, respecting the system’s budget at all times.
  • Cost Policy Efficacy: The binning-based online cost control succeeds in balancing value and expenditure, with unutilized budget efficiently spilling over and preventing both overspending and performance collapse.

7. Relation to Other Preference-Prior Approaches

PILOT’s architecture embodies the transfer of human preference information, obtained offline, into rapid, high-confidence exploration in interactive settings. This design parallels a general trend in modern contextual bandit research to use preference-induced priors to reduce sample complexity and regret.

There are strong connections to other preference-prior approaches, including APRIL (Akrour et al., 2012), which studies preference-based reinforcement learning with an active querying policy, as well as recommendation algorithms that leverage hierarchical/structured feedback to guide exploration in large action spaces (Zuo et al., 2022). PILOT extends these concepts by incorporating the preference prior directly into the LinUCB parameterization and tightly coupling routing with resource-aware optimization. A plausible implication is that PILOT’s paradigm can generalize to other selection/allocation problems with similar feedback and budget characteristics.

8. Potential Applications and Future Directions

PILOT is immediately applicable to large-scale, dynamic LLM serving platforms where maximizing utility per token cost is critical (e.g., cloud-based AI assistants, SaaS LLM routers, federated LLM service markets). The design is also extensible:

  • Alternative Resource Constraints: PILOT’s knapsack-based policy can be adapted for latency, energy, or quality-of-service constraints.
  • Preference Priors Beyond LLMs: The shared embedding and preference-prior machinery is transferable to multi-modal selectors, recommender systems, and combinatorial action spaces.
  • Interactive Human-in-the-Loop Systems: Scenarios where human preference data is incrementally collected align well with PILOT’s two-stage (offline+online) updating scheme.
  • Further Research: Analysis of regret bounds for non-stationary or adversarial query distributions, richer context modeling (e.g., session-aware features), and preference-based transfer across deployment domains represent logical next steps.

In summary, Preference-Prior Informed LinUCB (PILOT) offers a unified framework for integrating human preferences, contextual bandit optimization, and budget-aware selection, providing a mathematically sound and empirically validated solution for adaptive model routing in contemporary LLM systems.
