GC-TTT: Goal-Conditioned Test-Time Training

Updated 30 July 2025
  • Goal-Conditioned Test-Time Training (GC-TTT) is a paradigm that enables models to adapt at inference by fine-tuning on goal-relevant, self-supervised data.
  • It integrates dynamic adaptation and goal-conditioning by selecting offline experiences based on state proximity and estimated returns for localized policy updates.
  • GC-TTT has demonstrated significant performance improvements in offline RL tasks, achieving higher success rates with modest compute overhead.

Goal-Conditioned Test-Time Training (GC-TTT) denotes a paradigm in which a predictive or control model adapts its parameters at inference time so as to specialize its behavior toward achieving a specified goal, using only goal-relevant (often self-supervised) information available at test time. GC-TTT is conceptually rooted in test-time training (TTT), originally introduced for distributional robustness in supervised learning, but extends the notion to settings where the test-time “goal” can vary per instance—typically in reinforcement learning, goal-conditioned control, or high-level planning. GC-TTT techniques address the challenge of dynamically tailoring a generalist model to the immediate requirements of the evaluation episode, leveraging on-the-fly fine-tuning or adaptation based on experience or context related to the current goal.

1. Key Algorithmic Principles

Goal-conditioned test-time training is typified by the integration of two core mechanisms: dynamic adaptation at inference, and goal-conditioning of both adaptation and policy evaluation. In contrast to “train-once, test-frozen” models or traditional TTT schemes (which adapt toward generic robustness objectives), GC-TTT adapts model parameters specifically toward the desiderata set by the provided goal.

The principal workflow, as instantiated in the offline RL setting (Bagatella et al., 24 Jul 2025), involves:

  • Maintaining a goal-conditioned policy (or value function) pre-trained on a large offline dataset with diverse goals.
  • At each test episode (and in receding-horizon fashion, periodically during an episode), selecting from available offline data a set of transitions relevant to both the current agent state $s$ and the evaluation goal $g^*$.
  • Fine-tuning the parameters $\theta$ of the pre-trained universal policy $\pi_\theta(a \mid s, g^*)$ by performing a number of SGD steps on a loss over only these carefully filtered, goal-relevant trajectories.
  • Resetting or updating the policy periodically, thereby balancing between “local specialization” for short-horizon goal achievement and the globally robust prior.

This adaptive process is distinct in that it uses a self-supervised data selection criterion, requiring no additional annotation beyond what was used in original offline training.
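
A minimal sketch of this inference loop, assuming a generic gym-style environment interface and hypothetical helpers `select_relevant_optimal` (the filtering step detailed in Section 2) and `finetune` (sketched in Section 6); names and defaults are illustrative, not the reference implementation:

```python
import copy

def gc_ttt_episode(env, pretrained_policy, dataset, goal,
                   num_grad_steps=50, rollout_horizon=25):
    """Illustrative GC-TTT inference loop with receding-horizon adaptation."""
    state, done = env.reset(), False
    while not done:
        # 1) Select offline sub-trajectories relevant to the current state and
        #    near-optimal for the evaluation goal (two-stage filter, Section 2).
        batch = select_relevant_optimal(dataset, state, goal)

        # 2) Specialize a copy of the pre-trained policy on the filtered batch,
        #    leaving the global prior untouched so it can be restored later.
        local_policy = copy.deepcopy(pretrained_policy)
        finetune(local_policy, batch, goal, steps=num_grad_steps)

        # 3) Roll the specialized policy out for K steps, then re-adapt.
        for _ in range(rollout_horizon):
            action = local_policy.act(state, goal)
            state, reward, done, _ = env.step(action)
            if done:
                break
    return state
```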

2. Self-Supervised Data Selection for Goal-Relevant Adaptation

GC-TTT’s efficacy pivots on selecting data from the offline experience that is maximally relevant for the given state–goal pair encountered during evaluation (Bagatella et al., 24 Jul 2025). The selection uses two criteria:

  • Relevance: From the offline dataset $D$, sub-trajectories whose starting state $s_1$ is “close” (under a problem-appropriate distance metric $d$) to the current state $s$ are retained: $D_{\mathrm{rel}}(s) = \{(s_1, \ldots, s_H) \in D \mid d(s, s_1) < \epsilon\}$.
  • Optimality: For each relevant sub-trajectory, an estimated $H$-step return with respect to the evaluation goal is computed:

$$\hat{V}((s_1, \ldots, s_H) \mid g^*) = \sum_{i=1}^{H-1} \gamma^{i-1} R(s_i, g^*) + \gamma^{H-1} V(s_H \mid g^*).$$

Only sub-trajectories whose $\hat{V}$ exceeds the $q$-th percentile among candidates are selected.

The agent then fine-tunes on this pruned batch, minimizing a task-appropriate loss (e.g., behavior cloning, Q-learning), but now goal-conditioned and state-local. This ensures the local policy adapts using only relevant, goal-oriented experience, countering the global “underfitting” observed in large, universal policies.
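
The two filters can be written out directly. The sketch below fills in the hypothetical `select_relevant_optimal` used in the earlier loop sketch, assuming sub-trajectories are stored as lists of states and that a distance metric `dist`, a goal-conditioned reward `R`, a value function `V`, and the thresholds `eps`/`q` are supplied by the caller; all of these names and defaults are assumptions:

```python
import numpy as np

def estimated_return(states, goal, R, V, gamma=0.99):
    """H-step return estimate V_hat for a sub-trajectory (s_1, ..., s_H) w.r.t. goal g*."""
    H = len(states)
    ret = sum(gamma ** (i - 1) * R(states[i - 1], goal) for i in range(1, H))
    return ret + gamma ** (H - 1) * V(states[-1], goal)

def select_relevant_optimal(dataset, state, goal, dist, R, V,
                            eps=0.5, q=90, gamma=0.99):
    """Two-stage filter: keep sub-trajectories starting near `state`,
    then keep those above the q-th percentile of the estimated return."""
    # Relevance: starting state within eps of the current state.
    relevant = [traj for traj in dataset if dist(state, traj[0]) < eps]
    if not relevant:
        return []
    # Optimality: rank candidates by the H-step return estimate toward g*.
    returns = np.array([estimated_return(traj, goal, R, V, gamma) for traj in relevant])
    threshold = np.percentile(returns, q)
    return [traj for traj, r in zip(relevant, returns) if r >= threshold]
```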

3. Inference-Time Adaptation and Receding-Horizon Specialization

The adaptation is applied in a receding-horizon scheme: after $N$ gradient updates on the filtered, goal-relevant batch, the updated policy is rolled forward for $K$ steps in the environment before being reset (or re-specialized) as the trajectory evolves (Bagatella et al., 24 Jul 2025). This prevents catastrophic drift and preserves the broader capabilities of the foundation model, while allowing substantial short-horizon specialization: the agent can recover from deviations, stitch together experience for rarely seen or long-horizon goals, and perform effective local correction.

Such flexibility is crucial for tasks with sparse reward structure, complex geometry, or significant domain shift between offline data and online evaluation. By enabling “just-in-time” specialization to the immediate state and goal context, GC-TTT surpasses static offline RL policies, which often underperform on outlier or long-tail goals despite universal value function pretraining.

4. Empirical Performance and Compute Scaling

GC-TTT outperforms standard offline RL baselines, including behavior cloning (BC), implicit Q-learning (IQL) variants, and hierarchical approaches, often by dramatic margins. For example, in the pointmaze environment, GC-BC success rates increase from roughly 5% to over 80%. Similar trends appear for antmaze, humanoidmaze, and challenging manipulation domains (Bagatella et al., 24 Jul 2025).

The extra compute cost incurred by online fine-tuning is analyzed explicitly. Each GC-TTT trial uses only a few forward and backward passes per episode segment, and the overall additional FLOPs are budgeted against the alternative of scaling up the global model. The findings indicate that investing even modest compute in GC-TTT adaptation (relative to globally increasing model width) yields improvements in success rates that static scaling cannot match at the same computational cost.
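
As a rough, back-of-the-envelope illustration (not the paper's actual accounting), the adaptation overhead can be estimated from the number of re-adaptation phases per episode, using the common approximation that a forward+backward pass costs roughly three times a forward pass; every constant below is an assumption:

```python
import math

def gc_ttt_extra_flops(episode_len, rollout_horizon, grad_steps,
                       batch_size, forward_flops_per_sample):
    """Rough extra FLOPs from test-time adaptation over one episode."""
    num_phases = math.ceil(episode_len / rollout_horizon)   # how often we re-adapt
    per_phase = grad_steps * batch_size * 3 * forward_flops_per_sample
    return num_phases * per_phase

# Illustrative numbers only: a small policy network (~1e6 FLOPs per forward pass),
# re-adapting every 25 steps of a 500-step episode with 50 SGD steps on 256 samples.
extra = gc_ttt_extra_flops(episode_len=500, rollout_horizon=25,
                           grad_steps=50, batch_size=256,
                           forward_flops_per_sample=1e6)
print(f"extra FLOPs per episode ~= {extra:.2e}")
```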

Method | Success Rate (AntMaze-U) | Extra Compute (FLOPs)
------ | ------------------------ | ---------------------
BC     | 10%                      | 1x
GC-BC  | 70%                      | 1.5x
IQL    | 50%                      | 1x
GC-IQL | 85%                      | 1.5x

(Table constructed from (Bagatella et al., 24 Jul 2025); for illustrative comparison.)

This suggests that practical GC-TTT can be deployed with minimal hardware overhead, and that per-goal/in-episode plasticity is more impactful than brute-force global scaling.

5. Connections to Broader Theories and Methodologies

GC-TTT in offline RL formalizes and extends the “specialization at inference” paradigm previously observed in other goal-conditioned and test-time training methodologies.

A plausible implication is that GC-TTT, by decoupling adaptation from global model retraining and conditioning both adaptation and data selection on the current goal, can deliver both robust generalization and highly tailored local behavior.

6. Mathematical Foundations and Algorithmic Structure

The core mathematical mechanism of GC-TTT in offline RL is as follows:

  • Given a universal value function or policy $\pi_\theta(a \mid s, g)$,
  • At time $t$ with current $(s_t, g^*)$:

    1. Select a relevant, near-optimal subset $D(s_t, g^*)$ from the offline buffer using the two-stage criterion above.
    2. Fine-tune $\theta_{\text{TTT}}$ for $N$ steps to minimize the loss $L$ over $D(s_t, g^*)$, i.e., to maximize

$$J_{\text{TTT}}(\theta) = -\mathbb{E}_{s' \sim D(s_t, g^*)}\left[L(s', g^*; \theta)\right]$$

    3. Deploy $\pi_{\theta_{\text{TTT}}}$ for the next $K$ environment steps.
    4. Reset or re-adapt for the new state and goal as the trajectory unfolds.

This procedure is strictly self-supervised post-training—requiring no new demonstrations or labels.
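
For concreteness, a minimal PyTorch sketch of one adaptation phase, filling in the hypothetical `finetune` from the loop sketch in Section 1 with a goal-conditioned behavior-cloning loss (one of the task losses mentioned in Section 2); the batch format, network input convention, optimizer, and hyperparameters are assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

def finetune(policy: nn.Module, batch: dict, goal: torch.Tensor,
             steps: int = 50, lr: float = 3e-4) -> nn.Module:
    """One GC-TTT adaptation phase: N gradient steps of goal-conditioned
    behavior cloning on the filtered batch D(s_t, g*)."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    # Broadcast the (1-D) evaluation goal g* across the batch of states.
    goal_batch = goal.unsqueeze(0).expand(batch["states"].shape[0], -1)
    for _ in range(steps):
        # The policy is assumed to take the concatenated (state, goal) as input.
        pred_actions = policy(torch.cat([batch["states"], goal_batch], dim=-1))
        loss = nn.functional.mse_loss(pred_actions, batch["actions"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```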

7. Application Domains, Limitations, and Research Directions

GC-TTT has been validated across high-dimensional navigation (AntMaze, HumanoidMaze) and manipulation (CubeSingle) tasks, and can be extended to other domains where episodes or environments are defined by temporally varying or externally specified goals (Bagatella et al., 24 Jul 2025). The specialized data selection procedure is particularly effective when offline data is broad in coverage but sparse in per-goal local optimality, enabling dynamic stitching and reweighting of experience to meet difficult target outcomes.

Open research directions include:

  • Extending GC-TTT to incorporate online trajectories, leveraging fresh experience to further improve adaptation.
  • Understanding the causes of underfitting in universal GCRL policies and developing better pretraining schemes.
  • Adapting GC-TTT for “lazy” update schedules suitable for high-frequency control.
  • Transferring similar techniques to non-RL domains, e.g., LLMs, planning, or vision tasks where goal-conditioned plasticity at test time may yield both efficiency and accuracy gains.

GC-TTT formalizes a rigorous, goal-aware inference-time adaptation paradigm: model parameters are locally specialized per state–goal context using self-supervised selection and fine-tuning, yielding substantial and cost-efficient performance gains over both globally robust and purely “frozen” approaches in goal-conditioned RL and related settings (Bagatella et al., 24 Jul 2025).