GC-TTT: Goal-Conditioned Test-Time Training

Updated 30 July 2025
  • Goal-Conditioned Test-Time Training (GC-TTT) is a paradigm that enables models to adapt at inference by fine-tuning on goal-relevant, self-supervised data.
  • It integrates dynamic adaptation and goal-conditioning by selecting offline experiences based on state proximity and estimated returns for localized policy updates.
  • GC-TTT has demonstrated significant performance improvements in offline RL tasks, achieving higher success rates with modest compute overhead.

Goal-Conditioned Test-Time Training (GC-TTT) denotes a paradigm in which a predictive or control model adapts its parameters at inference time so as to specialize its behavior toward achieving a specified goal, using only goal-relevant (often self-supervised) information available at test time. GC-TTT is conceptually rooted in test-time training (TTT), originally introduced for distributional robustness in supervised learning, but extends the notion to settings where the test-time “goal” can vary per instance—typically in reinforcement learning, goal-conditioned control, or high-level planning. GC-TTT techniques address the challenge of dynamically tailoring a generalist model to the immediate requirements of the evaluation episode, leveraging on-the-fly fine-tuning or adaptation based on experience or context related to the current goal.

1. Key Algorithmic Principles

Goal-conditioned test-time training is typified by the integration of two core mechanisms: dynamic adaptation at inference, and goal-conditioning of both adaptation and policy evaluation. In contrast to “train-once, test-frozen” models or traditional TTT schemes (which adapt toward generic robustness objectives), GC-TTT adapts model parameters specifically toward the desiderata set by the provided goal.

The principal workflow, as instantiated in the offline RL setting (Bagatella et al., 24 Jul 2025), involves:

  • Maintaining a goal-conditioned policy (or value function) pre-trained on a large offline dataset with diverse goals.
  • At each test episode (and in receding-horizon fashion, periodically during an episode), selecting from available offline data a set of transitions relevant to both the current agent state $s$ and the evaluation goal $g^*$.
  • Fine-tuning the parameters $\theta$ of the pre-trained universal policy $\pi_\theta(a \mid s, g^*)$ by performing a number of SGD steps on a loss over only these carefully filtered, goal-relevant trajectories.
  • Resetting or updating the policy periodically, thereby balancing between “local specialization” for short-horizon goal achievement and the globally robust prior.

This adaptive process is distinct in that it uses a self-supervised data selection criterion, requiring no additional annotation beyond what was used in original offline training.
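
A minimal sketch of this inference loop, assuming a generic gym-style environment interface and hypothetical helpers `select_relevant_optimal` (the filtering step detailed in Section 2) and `finetune` (sketched in Section 6); names and defaults are illustrative, not the reference implementation:

```python
import copy

def gc_ttt_episode(env, pretrained_policy, dataset, goal,
                   num_grad_steps=50, rollout_horizon=25):
    """Illustrative GC-TTT inference loop with receding-horizon adaptation."""
    state, done = env.reset(), False
    while not done:
        # 1) Select offline sub-trajectories relevant to the current state and
        #    near-optimal for the evaluation goal (two-stage filter, Section 2).
        batch = select_relevant_optimal(dataset, state, goal)

        # 2) Specialize a copy of the pre-trained policy on the filtered batch,
        #    leaving the global prior untouched so it can be restored later.
        local_policy = copy.deepcopy(pretrained_policy)
        finetune(local_policy, batch, goal, steps=num_grad_steps)

        # 3) Roll the specialized policy out for K steps, then re-adapt.
        for _ in range(rollout_horizon):
            action = local_policy.act(state, goal)
            state, reward, done, _ = env.step(action)
            if done:
                break
    return state
```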

2. Self-Supervised Data Selection for Goal-Relevant Adaptation

GC-TTT’s efficacy pivots on selecting data from the offline experience that is maximally relevant for the given state–goal pair encountered during evaluation (Bagatella et al., 24 Jul 2025). The selection uses two criteria:

  • Relevance: From the offline dataset $D$, sub-trajectories whose starting state $s_1$ is “close” (under a problem-appropriate distance metric $d$) to the current state $s$ are retained: $D_{\mathrm{rel}}(s) = \{(s_1, \ldots, s_H) \in D \mid d(s, s_1) < \epsilon\}$.
  • Optimality: For each relevant sub-trajectory, an estimated $H$-step return with respect to the evaluation goal is computed:

$$\hat{V}((s_1, \ldots, s_H) \mid g^*) = \sum_{i=1}^{H-1} \gamma^{i-1} R(s_i, g^*) + \gamma^{H-1} V(s_H \mid g^*).$$

Only sub-trajectories whose $\hat{V}$ exceeds the $q$-th percentile among candidates are selected.

The agent then fine-tunes on this pruned batch, minimizing a task-appropriate loss (e.g., behavior cloning, Q-learning), but now goal-conditioned and state-local. This ensures the local policy adapts using only relevant, goal-oriented experience, countering the global “underfitting” observed in large, universal policies.
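
The two filters can be written out directly. The sketch below fills in the hypothetical `select_relevant_optimal` used in the earlier loop sketch, assuming sub-trajectories are stored as lists of states and that a distance metric `dist`, a goal-conditioned reward `R`, a value function `V`, and the thresholds `eps`/`q` are supplied by the caller; all of these names and defaults are assumptions:

```python
import numpy as np

def estimated_return(states, goal, R, V, gamma=0.99):
    """H-step return estimate V_hat for a sub-trajectory (s_1, ..., s_H) w.r.t. goal g*."""
    H = len(states)
    ret = sum(gamma ** (i - 1) * R(states[i - 1], goal) for i in range(1, H))
    return ret + gamma ** (H - 1) * V(states[-1], goal)

def select_relevant_optimal(dataset, state, goal, dist, R, V,
                            eps=0.5, q=90, gamma=0.99):
    """Two-stage filter: keep sub-trajectories starting near `state`,
    then keep those above the q-th percentile of the estimated return."""
    # Relevance: starting state within eps of the current state.
    relevant = [traj for traj in dataset if dist(state, traj[0]) < eps]
    if not relevant:
        return []
    # Optimality: rank candidates by the H-step return estimate toward g*.
    returns = np.array([estimated_return(traj, goal, R, V, gamma) for traj in relevant])
    threshold = np.percentile(returns, q)
    return [traj for traj, r in zip(relevant, returns) if r >= threshold]
```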

3. Inference-Time Adaptation and Receding-Horizon Specialization

The adaptation is applied in a receding-horizon scheme: after $N$ gradient updates on the filtered, goal-relevant batch, the updated policy is rolled forward for $K$ steps in the environment before being reset (or re-specialized) as the trajectory evolves (Bagatella et al., 24 Jul 2025). This prevents catastrophic drift and preserves the broader capabilities of the foundation model, while allowing substantial short-horizon specialization: the agent can recover from deviations, stitch together experience for rarely seen or long-horizon goals, and perform effective local correction.

Such flexibility is crucial for tasks with sparse reward structure, complex geometry, or significant domain shift between offline data and online evaluation. By enabling “just-in-time” specialization to the immediate state and goal context, GC-TTT surpasses static offline RL policies, which often underperform on outlier or long-tail goals despite universal value function pretraining.

4. Empirical Performance and Compute Scaling

GC-TTT outperforms standard offline RL baselines, including behavior cloning (BC), implicit Q-learning (IQL) variants, and hierarchical approaches, often by dramatic margins. For example, in the pointmaze environment, GC-BC success rates increase from roughly 5% to over 80%. Similar trends appear for antmaze, humanoidmaze, and challenging manipulation domains (Bagatella et al., 24 Jul 2025).

The extra compute cost incurred by online fine-tuning is analyzed explicitly. Each GC-TTT trial uses only a few forward and backward passes per episode segment, and the overall additional FLOPs are budgeted against the alternative of scaling up the global model. The findings indicate that investing even modest compute in GC-TTT adaptation (relative to globally increasing model width) yields improvements in success rates that static scaling cannot match at the same computational cost.
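
As a rough, back-of-the-envelope illustration (not the paper's actual accounting), the adaptation overhead can be estimated from the number of re-adaptation phases per episode, using the common approximation that a forward+backward pass costs roughly three times a forward pass; every constant below is an assumption:

```python
import math

def gc_ttt_extra_flops(episode_len, rollout_horizon, grad_steps,
                       batch_size, forward_flops_per_sample):
    """Rough extra FLOPs from test-time adaptation over one episode."""
    num_phases = math.ceil(episode_len / rollout_horizon)   # how often we re-adapt
    per_phase = grad_steps * batch_size * 3 * forward_flops_per_sample
    return num_phases * per_phase

# Illustrative numbers only: a small policy network (~1e6 FLOPs per forward pass),
# re-adapting every 25 steps of a 500-step episode with 50 SGD steps on 256 samples.
extra = gc_ttt_extra_flops(episode_len=500, rollout_horizon=25,
                           grad_steps=50, batch_size=256,
                           forward_flops_per_sample=1e6)
print(f"extra FLOPs per episode ~= {extra:.2e}")
```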

Method | Success Rate (AntMaze-U) | Extra Compute (FLOPs)
------ | ------------------------ | ---------------------
BC     | 10%                      | 1x
GC-BC  | 70%                      | 1.5x
IQL    | 50%                      | 1x
GC-IQL | 85%                      | 1.5x

(Table constructed from (Bagatella et al., 24 Jul 2025); for illustrative comparison.)

This suggests that practical GC-TTT can be deployed with minimal hardware overhead, and that per-goal/in-episode plasticity is more impactful than brute-force global scaling.

5. Connections to Broader Theories and Methodologies

GC-TTT in offline RL formalizes and extends the “specialization at inference” paradigm previously observed in other goal-conditioned and test-time training methodologies.

A plausible implication is that GC-TTT, by decoupling adaptation from global model retraining and conditioning both adaptation and data selection on the current goal, can deliver both robust generalization and highly tailored local behavior.

6. Mathematical Foundations and Algorithmic Structure

The core mathematical mechanism of GC-TTT in offline RL is as follows:

  • Given a universal value function or policy $\pi_\theta(a \mid s, g)$,
  • At time $t$ with current $(s_t, g^*)$:

    1. Select a relevant, near-optimal subset $D(s_t, g^*)$ from the offline buffer using the two-stage criterion above.
    2. Fine-tune $\theta_{\text{TTT}}$ for $N$ steps to minimize the loss $L$ over $D(s_t, g^*)$, i.e., to maximize

$$J_{\text{TTT}}(\theta) = -\mathbb{E}_{s' \sim D(s_t, g^*)}\left[L(s', g^*; \theta)\right]$$

    3. Deploy $\pi_{\theta_{\text{TTT}}}$ for the next $K$ environment steps.
    4. Reset or re-adapt for the new state and goal as the trajectory unfolds.

This procedure is strictly self-supervised post-training—requiring no new demonstrations or labels.
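
For concreteness, a minimal PyTorch sketch of one adaptation phase, filling in the hypothetical `finetune` from the loop sketch in Section 1 with a goal-conditioned behavior-cloning loss (one of the task losses mentioned in Section 2); the batch format, network input convention, optimizer, and hyperparameters are assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

def finetune(policy: nn.Module, batch: dict, goal: torch.Tensor,
             steps: int = 50, lr: float = 3e-4) -> nn.Module:
    """One GC-TTT adaptation phase: N gradient steps of goal-conditioned
    behavior cloning on the filtered batch D(s_t, g*)."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    # Broadcast the (1-D) evaluation goal g* across the batch of states.
    goal_batch = goal.unsqueeze(0).expand(batch["states"].shape[0], -1)
    for _ in range(steps):
        # The policy is assumed to take the concatenated (state, goal) as input.
        pred_actions = policy(torch.cat([batch["states"], goal_batch], dim=-1))
        loss = nn.functional.mse_loss(pred_actions, batch["actions"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```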

7. Application Domains, Limitations, and Research Directions

GC-TTT has been validated across high-dimensional navigation (AntMaze, HumanoidMaze) and manipulation (CubeSingle) tasks, and can be extended to other domains where episodes or environments are defined by temporally varying or externally specified goals (Bagatella et al., 24 Jul 2025). The specialized data selection procedure is particularly effective when offline data is broad in coverage but sparse in per-goal local optimality, enabling dynamic stitching and reweighting of experience to meet difficult target outcomes.

Open research directions include:

  • Extending GC-TTT to incorporate online trajectories, leveraging fresh experience to further improve adaptation.
  • Understanding the causes of underfitting in universal GCRL policies and developing better pretraining schemes.
  • Adapting GC-TTT for “lazy” update schedules suitable for high-frequency control.
  • Transferring similar techniques to non-RL domains, e.g., LLMs, planning, or vision tasks where goal-conditioned plasticity at test time may yield both efficiency and accuracy gains.

GC-TTT formalizes a rigorous, goal-aware inference-time adaptation paradigm: model parameters are locally specialized per state–goal context using self-supervised selection and fine-tuning, yielding substantial and cost-efficient performance gains over both globally robust and purely “frozen” approaches in goal-conditioned RL and related settings (Bagatella et al., 24 Jul 2025).