Trajectory Alignment Coefficient (TAC) Overview

Updated 7 April 2026

Trajectory Alignment Coefficient (TAC) is a statistical metric that quantifies how closely a candidate reward function aligns with stakeholder pairwise preferences over trajectories using Kendall’s Tau-b.
It is invariant to positive linear transformations and potential-based shaping, ensuring robust comparisons across different reward designs in reinforcement learning systems.
The differentiable variant, Soft-TAC, employs a tanh-based relaxation to allow gradient-based optimization of reward models, demonstrating improved performance in various control and simulation tasks.

The Trajectory Alignment Coefficient (TAC) is a statistical metric for quantifying how closely the preferences induced by a candidate reward function in reinforcement learning (RL) align with a stakeholder’s pairwise preferences over trajectories or trajectory distributions. Formally grounded in the theory of partial rankings, TAC enables practitioners to assess reward alignment without access to a ground-truth reward function. Its computation leverages Kendall’s Tau-b rank correlation, offering invariance to potential-based shaping, insensitivity to positive linear transformations, and direct applicability in both reward design and reward learning workflows. Recent work has also introduced a differentiable variant, Soft-TAC, for direct optimization of alignment via gradient-based methods.

1. Formal Definition and Mathematical Structure

Given a Markov Decision Process (MDP) $(S, A, r, p, \mu, \gamma)$, a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$, and cumulative discounted return $G_r(\tau) = \sum_{t=0}^T \gamma^t r(s_t, a_t, s_{t+1})$, human preferences are elicited as a set of pairwise comparisons $\mathcal{D}_h = \{(\tau^i, \tau^j, y)\}$, $y \in \{+1, 0, -1\}$, where $y = 0$ denotes a tie. For a candidate $(r, \gamma)$, the induced preference for each pair is obtained by comparing $G_r(\tau^i)$ with $G_r(\tau^j)$.
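As a minimal sketch of how a candidate $(r, \gamma)$ induces a preference label for one pair, the snippet below computes the discounted return of two finite reward sequences and takes the sign of their difference. The function names, the tie tolerance `tol`, the convention that $+1$ means the first trajectory is preferred, and the toy reward sequences are all illustrative assumptions, not taken from the cited papers.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G_r(tau) = sum_t gamma^t * r_t for one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def induced_label(rewards_i: Sequence[float], rewards_j: Sequence[float],
                  gamma: float, tol: float = 1e-9) -> int:
    """+1 if tau^i has the higher return, -1 if lower, 0 if tied within tol.

    The sign convention and the tolerance are assumptions of this sketch.
    """
    diff = discounted_return(rewards_i, gamma) - discounted_return(rewards_j, gamma)
    if abs(diff) <= tol:
        return 0
    return 1 if diff > 0 else -1


# Hypothetical per-step rewards for two trajectories.
example_i = [0.0, 0.0, 1.0]   # reward arrives late, so it is discounted
example_j = [1.0, 0.0, 0.0]   # reward arrives immediately
label = induced_label(example_i, example_j, gamma=0.9)
```

With $\gamma = 0.9$ the late reward is worth less than the immediate one, so the induced label favors the second trajectory; with $\gamma = 1$ the two returns tie.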

Define:

$P$: number of concordant pairs (human and reward agree),
$Q$: number of discordant pairs,
$X_0$: pairs tied only in the reward ranking,
$Y_0$: pairs tied only in the human ranking.

The Trajectory Alignment Coefficient is then:

$$\mathrm{TAC} = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}}$$

with $\mathrm{TAC} \in [-1, 1]$. $\mathrm{TAC} = 1$ indicates perfect agreement, $\mathrm{TAC} = -1$ perfect disagreement, and $\mathrm{TAC} = 0$ chance-level correlation. Notably, TAC operates solely on pairwise preferences and requires neither ground-truth reward nor access to absolute reward magnitudes (Muslimani et al., 8 Mar 2025, Muslimani et al., 23 Jan 2026).
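As a worked illustration with made-up counts (not drawn from the cited studies), suppose nine labeled pairs yield $P = 7$ concordant pairs, $Q = 1$ discordant pair, $X_0 = 1$ pair tied only in the reward ranking, and $Y_0 = 0$ pairs tied only in the human ranking. Substituting into the Kendall's Tau-b form $(P - Q)/\sqrt{(P + Q + X_0)(P + Q + Y_0)}$ gives:

$$\mathrm{TAC} = \frac{7 - 1}{\sqrt{(7 + 1 + 1)(7 + 1 + 0)}} = \frac{6}{\sqrt{72}} = \frac{1}{\sqrt{2}} \approx 0.71$$

The single reward-side tie lowers the value below the $6/8 = 0.75$ that the same concordant and discordant counts would give without ties.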

2. Theoretical Properties and Invariances

TAC possesses several desirable theoretical properties:

Ground-truth Agnosticism: Only requires stakeholder preferences and the candidate reward function, not any oracle or predefined ground-truth reward.
Invariance to Positive Linear Transformations: If $r' = c \cdot r$ with $c > 0$, then the ordering of expected returns and thus TAC are preserved.
Invariance to Potential-Based Shaping: For reward shaping of the form $r'(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$ and pairwise comparisons of trajectory distributions that share the same initial state distribution $\mu$, the induced orderings, and hence TAC, remain invariant. This holds by design if all trajectories or distributions share the same initial state distribution, which matches standard practice.
Applicability to Trajectory Distributions: Can be formulated over probability distributions $\eta$ on trajectories, defining orderings via expected returns $\mathbb{E}_{\tau \sim \eta}[G_r(\tau)]$ (Muslimani et al., 8 Mar 2025, Muslimani et al., 23 Jan 2026).
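The two invariances listed above can be checked numerically on a toy example. The sketch below is illustrative only: the state indices, reward values, and potential values are hypothetical, both trajectories start in the same state, and the potential is assumed to be zero at the terminal state so that the shaping terms telescope to the constant $-\Phi(s_0)$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def shaped_rewards(states, rewards, gamma, phi):
    """Apply r' = r + gamma * phi(s') - phi(s) along one trajectory.

    `states` holds one more entry than `rewards` (it includes the final state).
    """
    return [r + gamma * phi(s_next) - phi(s)
            for r, s, s_next in zip(rewards, states[:-1], states[1:])]


# Hypothetical potential, zero at the terminal state 3 (an assumption of this sketch).
phi = {0: 2.0, 1: -1.0, 2: 5.0, 3: 0.0}.get
gamma = 0.9

# Two toy trajectories (states, rewards), both starting in state 0.
traj_a = ([0, 1, 3], [1.0, 0.5])
traj_b = ([0, 2, 2, 3], [0.0, 0.2, 2.0])
trajs = (traj_a, traj_b)

base = [discounted_return(r, gamma) for _, r in trajs]
shaped = [discounted_return(shaped_rewards(s, r, gamma, phi), gamma) for s, r in trajs]
scaled = [discounted_return([3.0 * x for x in r], gamma) for _, r in trajs]

# Shaping shifts each return by the same constant, -phi(s_0).
shifts = [g_shaped - g for g_shaped, g in zip(shaped, base)]

# The pairwise ordering is identical under the base, shaped, and scaled rewards.
same_order = (base[0] > base[1]) == (shaped[0] > shaped[1]) == (scaled[0] > scaled[1])
```

Because every return moves by the same amount under shaping and by the same positive factor under scaling, the sign of each pairwise return difference, and therefore every concordant or discordant count, is unchanged.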

3. Computation and Algorithms

Practical Workflow

Trajectory Sampling: Sample a finite collection of qualitatively diverse trajectories or trajectory distributions (e.g., from various policies, demonstrations, or stochastic rollouts).
Human Preference Collection: Present trajectory (or distribution) pairs to human stakeholders for pairwise ordering, recording a label $y \in \{+1, 0, -1\}$ for each pair.
Reward-Induced Ordering: For each pair, compute $G_r(\tau^i)$ and $G_r(\tau^j)$ (or their expectations over the trajectory distributions being compared). Assign the induced direction by the sign of the difference, yielding the reward-based partial ranking.
Compute TAC: Count concordant, discordant, and singly-tied pairs. Plug the counts into the TAC formula.
Iterative Use: Practitioners use TAC as an alignedness signal for reward selection or tuning.

High-Level TAC Algorithm

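For each labeled pair in $\mathcal{D}_h$, compute the sign of the return difference under the candidate reward, classify the pair as concordant, discordant, or singly tied against the human label, and then evaluate the TAC formula on the resulting counts (Muslimani et al., 23 Jan 2026).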
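The counting procedure described in this section can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the authors' code: each pair is given as precomputed returns plus a human label, ties in return are detected with a small tolerance, pairs tied in both rankings are skipped, and a degenerate denominator returns NaN. All names and the example data are hypothetical.

```python
import math
from typing import Iterable, Tuple


def sign(x: float, tol: float = 1e-9) -> int:
    """Sign of x, treating |x| <= tol as a tie (tolerance is an assumption)."""
    if abs(x) <= tol:
        return 0
    return 1 if x > 0 else -1


def tac(pairs: Iterable[Tuple[float, float, int]]) -> float:
    """Kendall's Tau-b between reward-induced and human pairwise labels.

    Each element of `pairs` is (G_i, G_j, y), with y in {+1, 0, -1}.
    """
    P = Q = X0 = Y0 = 0
    for g_i, g_j, y in pairs:
        s = sign(g_i - g_j)
        if s != 0 and y != 0:
            if s == y:
                P += 1          # concordant
            else:
                Q += 1          # discordant
        elif s == 0 and y != 0:
            X0 += 1             # tied only in the reward ranking
        elif s != 0 and y == 0:
            Y0 += 1             # tied only in the human ranking
        # pairs tied in both rankings contribute to no count
    denom = math.sqrt((P + Q + X0) * (P + Q + Y0))
    return (P - Q) / denom if denom > 0 else float("nan")


# Hypothetical data: 2 concordant, 1 discordant, 1 reward-only tie, 1 human-only tie.
example_pairs = [
    (1.0, 0.5, +1),
    (0.2, 0.9, -1),
    (0.7, 0.1, -1),
    (0.4, 0.4, +1),
    (0.3, 0.8, 0),
]
value = tac(example_pairs)
```

Returning NaN when every pair is tied in at least one ranking is a design choice of this sketch; raising an error would be an equally reasonable alternative.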

4. Differentiable Approximation: Soft-TAC

To enable direct optimization, Soft-TAC relaxes the sign function via a scaled $\tanh$ with a sensitivity parameter $\beta > 0$:

$$\operatorname{sign}\big(G_r(\tau^i) - G_r(\tau^j)\big) \;\approx\; \tanh\big(\beta\,(G_r(\tau^i) - G_r(\tau^j))\big)$$

Soft-TAC is the TAC statistic computed with this soft sign in place of the hard sign, which makes it differentiable in the parameters of the reward function.

As $\beta \to \infty$, Soft-TAC converges to the original TAC for strictly ordered, tie-free data. Training a parametric reward model $r_\theta$ uses the Soft-TAC loss, which is minimized by maximizing Soft-TAC on the preference data:

$$\mathcal{L}(\theta) = -\,\text{Soft-TAC}(r_\theta; \mathcal{D}_h)$$

Soft-TAC is robust to label noise and, under noise-free, realizable settings, global minima correspond to reward models matching all preferences. Gradients are efficiently computable by backpropagating through $\tanh$ and return summations (Muslimani et al., 23 Jan 2026).
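To make the relaxation concrete, the sketch below uses one plausible form of a tanh-relaxed Tau-b: the hard sign of each return difference is replaced by `tanh(beta * diff)`, and the result is normalized in the sign-product form of Kendall's coefficient. This specific formula, the parameter name `beta`, and the toy data are assumptions made for illustration; the exact expression in the cited paper may differ.

```python
import math


def soft_tac(return_diffs, labels, beta):
    """Assumed soft Tau-b: sum(y * s) / sqrt(sum(y^2) * sum(s^2)), s = tanh(beta * diff)."""
    soft = [math.tanh(beta * d) for d in return_diffs]
    num = sum(y * s for y, s in zip(labels, soft))
    den = math.sqrt(sum(y * y for y in labels) * sum(s * s for s in soft))
    return num / den if den > 0 else 0.0


def hard_tac_no_ties(return_diffs, labels):
    """TAC for tie-free data: (P - Q) / number of pairs."""
    agree = sum(1 if (d > 0) == (y > 0) else -1
                for d, y in zip(return_diffs, labels))
    return agree / len(labels)


def soft_tac_loss(return_diffs, labels, beta):
    """Assumed training objective: the negative of the soft coefficient."""
    return -soft_tac(return_diffs, labels, beta)


# Hypothetical return differences G(tau^i) - G(tau^j) and human labels;
# exactly one of the six pairs (the third) is discordant.
diffs = [0.5, -0.7, 0.6, -0.1, 0.3, 0.2]
labels = [+1, -1, -1, -1, +1, +1]

smooth_value = soft_tac(diffs, labels, beta=1.0)     # smooth, informative gradients
sharp_value = soft_tac(diffs, labels, beta=1000.0)   # effectively the hard statistic
hard_value = hard_tac_no_ties(diffs, labels)
```

In this tie-free example the large-`beta` value coincides with the hard coefficient, while the small-`beta` value weights each pair by the size of its return difference. In practice the return differences would come from a parametric model and the loss would be differentiated with an automatic-differentiation library.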

5. Empirical Evaluation and Use Cases

Reward Design and Selection

User Study in Hungry–Thirsty and Lunar Lander Domains: In the Hungry–Thirsty gridworld, 11 RL practitioners selecting reward functions with TAC feedback exhibited a 1.5x workload reduction, were more likely (success rate increase of 41%) to select reward functions yielding higher-performing policies, and showed a strong preference (82–100%) for having alignment feedback. In Lunar Lander, alignment feedback via TAC improved mean agent landing success and reduced workload (Muslimani et al., 8 Mar 2025, Muslimani et al., 23 Jan 2026).

Reward Learning

Soft-TAC in Gran Turismo 7: Using Soft-TAC for reward learning from human pairwise preferences yielded agents in GT7 that achieved higher time-trial performance and more accurate preference-specific driving styles (aggressive, timid) than agents trained with the standard cross-entropy preference loss (Muslimani et al., 23 Jan 2026).

Example Ordering

In an autonomous driving toy example, TAC correctly distinguishes reward functions whose induced trajectory orderings are discordant with human preferences. For a case with 1 discordant out of 6 comparisons and no ties, the TAC is $(5 - 1)/6 \approx 0.67$, reflecting substantial but imperfect alignment (Muslimani et al., 8 Mar 2025).

6. Limitations, Practical Guidelines, and Future Directions

Limitations: TAC assumes the chosen trajectory/feature set is sufficiently expressive for an aligned reward to exist. TAC computations require simulating all trajectory pairs, which may become expensive for large datasets. Current published results focus on linear reward models; adapting to nonlinear (e.g., neural) cases is ongoing work.
Guidelines: Select trajectory pairs that span the diversity of meaningful behaviors; tune the Soft-TAC sensitivity parameter to balance gradient smoothness and sharpness; report TAC as practitioners navigate reward design.
Future Work: Extending Soft-TAC to parametric, black-box models (deep networks), integrating active preference selection guided by TAC gradients, leveraging TAC to diagnose missing features via low-agreement pairs, and applying TAC-driven alignment to settings such as LLMs or other sequential decision-making architectures (Muslimani et al., 8 Mar 2025, Muslimani et al., 23 Jan 2026).
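The guideline on tuning the Soft-TAC sensitivity parameter can be illustrated by looking at the derivative of a scaled tanh, $\frac{d}{dx}\tanh(\beta x) = \beta\,(1 - \tanh^2(\beta x))$. The sketch below is a hedged illustration: the name `beta`, the chosen values, and the two sample return differences are assumptions, not settings from the cited work.

```python
import math


def soft_sign_grad(x: float, beta: float) -> float:
    """Derivative of tanh(beta * x) with respect to x."""
    return beta * (1.0 - math.tanh(beta * x) ** 2)


# Compare a small and a large return difference under three sensitivities.
for beta in (0.5, 5.0, 50.0):
    near_tie = soft_sign_grad(0.05, beta)    # pair whose returns are almost equal
    far_apart = soft_sign_grad(2.0, beta)    # pair whose returns differ clearly
    print(f"beta={beta:5.1f}  grad near tie={near_tie:.4g}  grad far apart={far_apart:.4g}")
```

A small `beta` keeps gradients alive for pairs with clearly separated returns but approximates the sign function loosely; a large `beta` tracks the sign closely but concentrates nearly all gradient on near-tied pairs.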

7. Broader Significance and Research Context

TAC addresses the longstanding challenge in RL of reward misspecification by providing a rigorous, preference-theoretic quantitative alignment statistic. Its invariance properties enable robust reward shaping and comparison across candidate reward definitions. Empirical studies across control and simulated environments, including user trials with RL practitioners and large-scale behavioral domains, demonstrate its utility in both manual and automated reward alignment settings.

The differentiable Soft-TAC loss further bridges reward alignment and reward learning, enabling direct optimization of preference-consistent rewards from human data. The approach shows promise for high-dimensional, complex domains where expressive and interpretable agent behaviors are essential (Muslimani et al., 8 Mar 2025, Muslimani et al., 23 Jan 2026).


References

Muslimani et al. (8 Mar 2025). Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners.

Muslimani et al. (23 Jan 2026). The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning.
