Token-wise Constrained Optimization
- Token-wise constrained optimization is a framework where each token’s decision variable is optimized under hard and soft constraints to meet global or local requirements.
- It enables fine-grained control in LLM deployments, improving RLHF, queueing-aware inference, and resource efficiency with explicit token-level constraints.
- When posed as convex/concave programs, these objectives admit optimality guarantees characterized by KKT conditions and fixed-point iterations, complemented by integer rounding for practical deployment.
Token-wise constrained optimization objectives refer to a class of optimization problems and algorithmic frameworks in modern machine learning, especially in LLMs and multi-modal learning, where decision variables, loss functions, or control signals are defined or manipulated at the level of individual tokens, subject to global or local constraints. These objectives appear pervasively in RLHF, policy optimization, queueing-aware LLM deployment, and resource-efficient transformer architectures. Unlike sequence-level or batch-level objectives, the token-wise approach enables finer-grained control, better alignment with heterogeneous preferences or resource limits, and the ability to impose explicit constraints (e.g., budgets, stability, or memory) at the token granularity.
1. Formal Definitions and General Principles
In token-wise constrained optimization, the global objective is a function of decision variables or signals indexed by token position or task type. The constraints can be hard (e.g., resource budgets, queue stability, marginal constraints) or soft (regularization, entropy penalties). The canonical forms, as exemplified in LLM system design and reward-based fine-tuning, include:
- Decision variables: The number of "reasoning" tokens allocated to each task type, token-level routing choices, per-token action probabilities, or transport weights.
- Objectives: Weighted sums or products over per-token utility, reward, or accuracy, penalized by global system latency, memory, or divergence.
- Constraints: Linear (e.g., a total token budget), non-linear (e.g., queue-stability conditions), or combinatorial (e.g., discrete token assignment under capacity limits).
- Optimization regime: Strictly concave (for maximization) or strictly convex (for minimization) formulations are preferred, since existence and uniqueness of the optimum are then guaranteed (Ozbas et al., 15 Jan 2026), although some objectives (e.g., in allocation or RL) may be non-convex.
This general structure permits modeling diverse real-world requirements directly at token granularity.
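As a schematic illustration (the notation here is generic rather than drawn from any single cited work), a canonical token-wise constrained program can be written as

$$
\max_{x_1,\dots,x_K \ge 0}\;\sum_{k=1}^{K} u_k(x_k)
\quad \text{s.t.} \quad
\sum_{k=1}^{K} c_k\, x_k \le B,
\qquad g(x_1,\dots,x_K) \le 0,
$$

where $x_k$ is the decision variable attached to token (or task type) $k$, $u_k$ is a concave per-token utility (accuracy, reward), $c_k$ is a per-token cost, $B$ is a global budget, and $g$ collects additional system-level constraints such as queue stability or memory limits.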
2. Token-wise Constrained Optimization in LLM Deployment
Queueing-aware token allocation for LLM inference servers is rigorously posed as a strictly concave constrained optimization problem in (Ozbas et al., 15 Jan 2026). The setting models heterogeneous query types arriving with known prior probabilities, each receiving a tunable number of "thinking" tokens. Service time is affine in the allocated token count, task accuracy is a concave function of it, and the system operates as an M/G/1 queue, so the mean latency depends on both the first and second moments of the service time. The objective, mean task accuracy penalized by mean latency, is maximized subject to:
- Token budget constraint: the expected number of thinking tokens per query must not exceed a prescribed budget;
- Queue stability: server utilization must remain strictly below one;
- Positivity: every per-type token allocation must be positive.
Strict concavity guarantees a unique optimum, characterized by the KKT system and expressible as a coupled fixed point whose iterates are projected onto the feasible region; efficient convergence is established via contraction bounds for the fixed-point iteration and global convergence for projected-gradient methods. Integer rounding preserves feasibility and enables practical deployment with quantified performance loss (Ozbas et al., 15 Jan 2026).
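A minimal sketch of the projected-gradient route follows, assuming a toy concave accuracy model and a quadratic latency surrogate; the functional forms, constants, and the rescaling projection below are illustrative stand-ins, not the formulation of (Ozbas et al., 15 Jan 2026):

```python
import numpy as np

def project_to_budget(n, priors, budget):
    """Project onto {n >= 0, priors @ n <= budget} by clipping and,
    if the budget is exceeded, uniform rescaling (illustrative projection)."""
    n = np.clip(n, 0.0, None)
    load = priors @ n
    return n * (budget / load) if load > budget else n

def token_allocation(priors, acc_gain, latency_coef, budget, lr=0.5, iters=500):
    """Projected gradient ascent on a strictly concave surrogate objective:
    sum_i priors_i * acc_gain_i * log(1 + n_i) - latency_coef * (priors @ n)^2."""
    n = np.full(len(priors), budget / len(priors))
    for _ in range(iters):
        load = priors @ n
        grad = priors * acc_gain / (1.0 + n) - 2.0 * latency_coef * load * priors
        n = project_to_budget(n + lr * grad, priors, budget)
    return np.floor(n)  # rounding down keeps both budget and positivity feasible

priors = np.array([0.5, 0.3, 0.2])    # query-type frequencies
acc_gain = np.array([1.0, 2.0, 4.0])  # harder types benefit more from thinking tokens
print(token_allocation(priors, acc_gain, latency_coef=1e-3, budget=25.0))
```

Rounding the continuous optimum down can only reduce the token load, which mirrors the feasibility-preserving integer-rounding step quantified in the paper.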
3. Token-wise Reward Shaping and Policy Optimization in RLHF
Token-wise constraints and objectives are central to recent advances in RLHF and policy optimization:
- Token-importance guided DPO (TI-DPO): This framework (Yang et al., 26 May 2025) replaces the uniform per-token contribution in Direct Preference Optimization (DPO) with learned token-importance weights (computed via gradient-attribution mechanisms) that prioritize "critical" tokens. The loss combines a weighted sum of per-token log-probability ratios over the winner and loser sequences with an explicit triplet loss enforcing token-level separation. This directs fine-tuning toward locally important tokens, yields improved alignment, and increases robustness to label noise and judgment uncertainty.
- Token Hidden Reward (THR) in GRPO: In (Deng et al., 4 Oct 2025), the per-token THR quantifies each token's influence on a correct group-level reward in mathematical reasoning. The objective introduces token-level weights $w_{i,k} = \mathbf{1}[|\mathrm{THR}_{i,k}|>\tau]\,(1+\operatorname{sign}(\mathrm{THR}_{i,k})\,p)$ that reweight the per-token PPO advantage, modulating exploration and exploitation (see the sketch after this list). The value of $p$ allows tuning the optimization dynamics toward greedy accuracy (exploitation) or sample diversity (exploration).
- Token-Level Policy Optimization (TEPO): TEPO (Lin et al., 10 Oct 2025) assigns a group-level scalar advantage to every token in a sequence via a Markov-likelihood geometric-mean step. The PPO objective is averaged over all tokens, and importance sampling is performed at the sequence level and broadcast to tokens, yielding a stable, low-variance, order-invariant update suitable for sparse-reward settings.
- Token-level guidance in DPO/PPO decomposition (TGDPO and RTO): TGDPO (Zhu et al., 17 Jun 2025) and RTO (Zhong et al., 2024) both decompose sequence-level RLHF objectives into independent token-level subproblems using learned per-token reward surrogates derived from DPO-style log-ratio signals, yielding token-wise KL-penalized PPO objectives, followed by aggregation in a Bradley-Terry or MDP framework.
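To make the THR-style gating concrete, the following sketch applies the weight rule quoted above to a group-level advantage broadcast across tokens; the THR values, $\tau$, and $p$ are placeholders rather than outputs of the estimator in (Deng et al., 4 Oct 2025):

```python
import numpy as np

def thr_token_weights(thr, tau=0.1, p=0.5):
    """w = 1[|THR| > tau] * (1 + sign(THR) * p): tokens whose hidden reward
    clears the threshold are up- or down-weighted (1 + p vs. 1 - p);
    tokens below the threshold receive weight 0."""
    gate = (np.abs(thr) > tau).astype(float)
    return gate * (1.0 + np.sign(thr) * p)

thr = np.array([0.8, -0.4, 0.02, 0.6])  # hypothetical per-token hidden rewards
advantage = np.ones(4)                  # group-level advantage broadcast to every token
weighted_adv = thr_token_weights(thr) * advantage
print(weighted_adv)                     # [1.5 0.5 0.  1.5]
```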
4. Token-wise Constrained Optimal Transport and Attention Allocation
Token-wise constraints are not limited to policy gradients or budgeted inference. They are also employed for fine-grained alignment in multi-modal architectures and efficient attention:
- Tokenwise Optimal Transport in TokenCLIP: (Zhou et al., 24 Oct 2025) frames dynamic visual-to-textual subspace alignment as an entropic optimal transport (OT) problem in which the transport plan is constrained to prescribed (uniform) marginals, so each token sends and receives equal mass. This prevents trivial solutions and forces semantic specialization. The plan, computed via Sinkhorn-Knopp, is further sparsified by top-$k$ selection, preserving only the most relevant subspaces per token, and then renormalized (see the Sinkhorn sketch after this list). The resulting assignment matrix enforces coverage and specialization constraints at the token level and is used to compute per-token classification logits and the dynamic alignment loss.
- Memory/compute-constrained token-wise routing in Transformers (mixSGA): (Song et al., 16 Jun 2025) introduces a capacity-constrained token-to-expert routing framework in which a token-wise importance score assigns tokens to attention experts under strict KV-cache memory constraints. The assignment selects the highest-scoring tokens for each expert up to its fixed capacity. During inference, deterministic one-hot routing enforces per-token capacity, and an auxiliary loss aligns training and inference routing decisions.
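A compact sketch of the entropic-OT step with uniform marginals followed by top-$k$ sparsification, in the spirit of the TokenCLIP assignment; the cost matrix, entropic regularization, iteration count, and $k$ below are all illustrative:

```python
import numpy as np

def sinkhorn_uniform(cost, eps=0.1, iters=200):
    """Sinkhorn-Knopp for entropic OT with uniform row/column marginals."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                   # Gibbs kernel
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan with prescribed marginals

def topk_sparsify(plan, k):
    """Keep only the k largest entries per row (token), then renormalize rows."""
    drop = np.argsort(plan, axis=1)[:, :-k]   # column indices to zero out
    sparse = plan.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=1)
    return sparse / sparse.sum(axis=1, keepdims=True)

cost = np.random.rand(8, 4)        # 8 visual tokens vs. 4 textual subspaces (toy sizes)
assign = topk_sparsify(sinkhorn_uniform(cost), k=2)
print(assign.sum(axis=1))          # each token's assignment sums to 1 after renormalization
```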
5. Algorithmic and Theoretical Properties
The well-posedness, optimality, and convergence of token-wise constrained optimization objectives typically rely on concavity, coordinate-separable constraints, and contractivity:
- Strict concavity and uniqueness: If the utility (e.g., accuracy) is concave and the cost/penalty (e.g., latency or KL) is convex, the overall penalized objective (accuracy minus latency or deviation penalty) is strictly concave over the stability region (Ozbas et al., 15 Jan 2026), guaranteeing existence and uniqueness of the optimal solution.
- KKT/fixed-point characterizations: The Lagrangian formalism applies, with Lagrange multipliers for each hard constraint (budget, stability, nonnegativity), yielding necessary and sufficient conditions for optimality (a worked stationarity condition for the generic budgeted case follows this list). In practice, the solution can be phrased as a projected fixed-point iteration or via projected gradient ascent, with explicit global step-size bounds.
- Integer rounding and performance guarantees: After the continuous optimal token allocations are found, integer rounding preserves feasibility. Taylor-expansion bounds show that the resulting performance loss scales with the per-token service-time coefficients and vanishes as those coefficients decrease (Ozbas et al., 15 Jan 2026).
- Exploration–exploitation trade-offs and stability: In RLHF, THR-guided weighting (Deng et al., 4 Oct 2025) and TEPO's Markov-likelihood approach (Lin et al., 10 Oct 2025) provide knobs to control exploration and exploitation without entropy bonuses. Empirical results highlight robustness to entropy collapse and improved sample efficiency over sequence-level baselines.
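For the generic budgeted program sketched in Section 1 (illustrative notation, ignoring the extra system constraint $g$ for brevity), the KKT conditions reduce to a water-filling-style characterization:

$$
u_k'(x_k^\star) = \lambda^\star c_k - \nu_k^\star, \qquad
\lambda^\star\Big(\textstyle\sum_k c_k x_k^\star - B\Big) = 0, \qquad
\nu_k^\star x_k^\star = 0, \qquad
\lambda^\star,\ \nu_k^\star \ge 0,
$$

so any token type with a positive allocation satisfies $u_k'(x_k^\star) = \lambda^\star c_k$: marginal utility per unit cost is equalized across active types, and the shared multiplier $\lambda^\star$ is the natural unknown targeted by the fixed-point and projected-gradient iterations described above.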
6. Applications, Impact, and Comparative Results
Token-wise constrained optimization frameworks have demonstrated empirical superiority in a range of alignment, reasoning, and efficiency benchmarks:
- LLM server throughput/accuracy: Queueing-aware optimal allocation achieves high-accuracy responses under strict latency constraints, supporting SLAs in heterogeneous inference settings (Ozbas et al., 15 Jan 2026).
- RLHF on reasoning tasks: TI-DPO, THR, TEPO, and TGDPO all achieve statistically significant improvements in mathematical reasoning, instruction following, and diverse preference benchmarks, with average accuracy gains of 1–7.5 points over standard DPO or PPO baselines (Yang et al., 26 May 2025, Deng et al., 4 Oct 2025, Lin et al., 10 Oct 2025, Zhu et al., 17 Jun 2025).
- Efficiency-aware language modeling: mixSGA's dynamic expert routing improves perplexity and ROUGE-L over static or greedy-kv optimization by enforcing global memory budgets without discarding tokens (Song et al., 16 Jun 2025).
- Zero-shot anomaly detection: TokenCLIP’s tokenwise OT assignment yields superior anomaly segmentation by enforcing token-marginal constraints and semantic sparsity, outperforming token-agnostic alignments (Zhou et al., 24 Oct 2025).
7. Synthesis and Outlook
Token-wise constrained optimization objectives represent a significant advance in modeling and controlling learning, inference, and resource allocation at the most granular level in neural architectures. Their rigorous formulation as convex/concave programs or optimal transport problems ensures theoretical guarantees and practical convergence. The approach enables precise trade-off management between competing metrics, tighter alignment with human preferences, targeted deployment of computation and memory, and robust exploration–exploitation dynamics. As architectures and deployment models grow in heterogeneity and scale, token-wise constrained optimization is likely to remain central to both theoretical analysis and practical system design in LLMs and beyond (Ozbas et al., 15 Jan 2026, Yang et al., 26 May 2025, Deng et al., 4 Oct 2025, Lin et al., 10 Oct 2025, Song et al., 16 Jun 2025, Zhu et al., 17 Jun 2025, Zhou et al., 24 Oct 2025, Zhong et al., 2024).