
Relative Trajectory Balance (RTB)

Updated 4 July 2026
  • Relative Trajectory Balance (RTB) is an off-policy objective that matches a learned trajectory distribution to a reward- or likelihood-tilted reference distribution.
  • It uses a squared log-ratio loss which enforces a balance condition, ensuring that the marginal posterior aligns with the terminal reward-weighted prior.
  • RTB supports stable training through replay buffers and loss clipping, and it is theoretically equivalent to KL-regularized methods like Trust-PCL under specific reward designs.

Relative Trajectory Balance (RTB) is a trajectory-level, off-policy squared-residual objective for sequential generative models in which a learned trajectory distribution is trained relative to a reference or prior trajectory distribution, with terminal weighting provided by a reward, likelihood, or energy term. In the diffusion-posterior formulation, RTB enforces the pathwise relation $Z_\phi\,p_\phi(\tau)\approx r(x_1)\,p_\theta(\tau)$, so that marginalizing trajectories yields $p_\phi(x_1)\propto p_\theta(x_1)\,r(x_1)$; in the sequential fine-tuning formulation, it targets a terminal marginal of the form $P_\phi^\top(s_T)\propto \pi_{\mathrm{prior}}^\top(s_T)\exp(-E(s_T)/\alpha)$ (Scimeca et al., 12 Mar 2025, Deleu et al., 1 Sep 2025). A later theoretical analysis places RTB inside KL-regularized reinforcement learning by proving an exact equivalence to Trust-PCL up to a positive constant factor, reframing RTB as a path-consistency objective rather than a fundamentally separate paradigm (Deleu et al., 1 Sep 2025).

1. Core formulation

In the diffusion-based presentation, RTB is defined on a denoising trajectory

$$\tau = (x_0 \rightarrow x_{\Delta t}\rightarrow \cdots \rightarrow x_1), \qquad \Delta t=\frac{1}{T},$$

with Markov transitions

$$p(x_{t+\Delta t}\mid x_t)=\mathcal N\!\left(x_{t+\Delta t}\mid x_t+u_t(x_t)\Delta t,\ \sigma_t^2\Delta t\,I\right),$$

and trajectory density

$$p(x_0,x_{\Delta t},\dots,x_1)=p(x_0)\prod_{i=1}^T p(x_{i\Delta t}\mid x_{(i-1)\Delta t}).$$
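The factorization above can be evaluated directly as a sum of Gaussian log-densities. The following is a minimal sketch, assuming an isotropic noise scale and plain Python lists for states; the names `gaussian_logpdf`, `trajectory_logprob`, `drift`, `sigma`, and `log_p0` are illustrative and do not come from the cited papers.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * var) + sq / var)

def trajectory_logprob(traj, drift, sigma, log_p0):
    """log p(x_0, ..., x_1) = log p(x_0) + sum of one-step transition log-densities.

    traj   : list of T + 1 states, each a list of floats
    drift  : hypothetical callable (t, x) -> u_t(x), a list of floats
    sigma  : hypothetical callable t -> sigma_t
    log_p0 : hypothetical callable x -> log p(x_0)
    """
    T = len(traj) - 1              # number of transitions
    dt = 1.0 / T
    total = log_p0(traj[0])
    for i in range(T):
        t = i * dt
        x, x_next = traj[i], traj[i + 1]
        u = drift(t, x)
        mean = [xj + uj * dt for xj, uj in zip(x, u)]   # x_t + u_t(x_t) * dt
        total += gaussian_logpdf(x_next, mean, sigma(t) ** 2 * dt)
    return total

# Usage: zero drift, unit noise scale, standard normal start, T = 2 steps in 1D.
zero_drift = lambda t, x: [0.0 for _ in x]
unit_sigma = lambda t: 1.0
std_normal_logp = lambda x: gaussian_logpdf(x, [0.0] * len(x), 1.0)
lp = trajectory_logprob([[0.0], [0.0], [0.0]], zero_drift, unit_sigma, std_normal_logp)
```

Both the prior and the posterior trajectory log-densities that enter RTB have this form; only the drift differs between them.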

Given a pretrained prior diffusion model $p_\theta$, a learned posterior model $p_\phi$, a positive terminal reward $r(x_1)$, and a learnable scalar normalizer $Z_\phi$, the RTB loss is

$$\mathcal{L}_{\mathrm{RTB}}(\tau;\phi)=\left(\log\frac{Z_\phi\,p_\phi(\tau)}{r(x_1)\,p_\theta(\tau)}\right)^2.$$

The zero-loss condition implies

$$p_\phi(x_1)\propto p_\theta(x_1)\,r(x_1),$$

so the objective is designed to recover a reward-tilted posterior rather than merely a high-reward policy (Scimeca et al., 12 Mar 2025).
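To make the squared log-ratio residual concrete, the sketch below evaluates it on a toy problem with three terminal states, where each "trajectory" is a single step so that trajectory probabilities equal terminal probabilities. The function name `rtb_loss` and the toy numbers are assumptions for illustration, not values from the paper.

```python
import math

def rtb_loss(log_Z, logp_post, logp_prior, log_r):
    """Squared residual log(Z * p_phi(tau)) - log(r(x_1) * p_theta(tau))."""
    residual = (log_Z + logp_post) - (log_r + logp_prior)
    return residual ** 2

# Toy setting (illustrative): three terminal states, one-step trajectories.
prior = [0.5, 0.3, 0.2]          # p_theta
reward = [1.0, 2.0, 4.0]         # r, strictly positive
Z = sum(p * r for p, r in zip(prior, reward))
tilted = [p * r / Z for p, r in zip(prior, reward)]   # reward-tilted posterior

# Posterior equal to the tilted prior, with the matching normalizer: zero loss.
balanced = [rtb_loss(math.log(Z), math.log(q), math.log(p), math.log(r))
            for q, p, r in zip(tilted, prior, reward)]

# Posterior left equal to the prior: the residual is nonzero on every state.
unbalanced = [rtb_loss(math.log(Z), math.log(p), math.log(p), math.log(r))
              for p, r in zip(prior, reward)]
```

In this toy case the loss vanishes on every state exactly when the posterior is the normalized product of prior and reward, which is the sense in which the objective targets a tilted posterior rather than a reward maximizer.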

The same paper gives a conditional, amortized version for inverse problems:

$$\mathcal{L}_{\mathrm{RTB}}(\tau, y;\phi)=\left(\log\frac{Z_\phi(y)\,p_\phi(\tau\mid y)}{p(y\mid x_1)\,p_\theta(\tau)}\right)^2.$$

Here the learned sampler is conditioned on an observation $y$, and the scalar normalizer becomes a function $Z_\phi(y)$. This turns RTB into an amortized posterior-training objective: a single model can be trained to generalize across measurements rather than solve each posterior at sampling time from scratch (Scimeca et al., 12 Mar 2025).

In the finite-horizon sequential fine-tuning presentation, RTB is written as the expectation, over an arbitrary behavior distribution, of a squared trajectory-level residual. This formulation emphasizes the same structure in a different notation: a learned trajectory law is matched to a reference (prior) trajectory law tilted by the terminal energy $E(s_T)$ through the factor $\exp(-E(s_T)/\alpha)$ and normalized by a learned constant (Deleu et al., 1 Sep 2025).

2. From trajectory balance to relative trajectory balance

RTB inherits its basic pathwise logic from Trajectory Balance (TB), introduced for GFlowNets as a trajectory-level alternative to flow matching and detailed balance. In the original TB formulation, for a complete trajectory $\tau=(s_0\rightarrow s_1\rightarrow\cdots\rightarrow s_n=x)$,

$$\mathcal{L}_{\mathrm{TB}}(\tau)=\left(\log\frac{Z_\theta\prod_{t=1}^{n}P_F(s_t\mid s_{t-1};\theta)}{R(x)\prod_{t=1}^{n}P_B(s_{t-1}\mid s_t;\theta)}\right)^2,$$

and any global minimizer defines a policy that samples exactly from the reward-proportional target distribution (Malkin et al., 2022). TB is therefore an absolute balance condition: forward trajectory mass, multiplied by a learned partition function, is matched directly to terminal reward times backward path mass.

RTB is a relative version of this construction. In the diffusion derivation, the target unnormalized density is $p_\theta(x_1)\,r(x_1)$, and the appendix shows that RTB arises from a TB-style objective in which posterior and prior share the same fixed backward process, causing the backward terms to cancel. The resulting balance equation is

$$Z_\phi\,p_\phi(\tau)=r(x_1)\,p_\theta(\tau),$$

so the reference process appears explicitly inside the numerator-denominator ratio rather than only through a terminal reward (Scimeca et al., 12 Mar 2025).
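The cancellation of the backward terms can be written out in a few lines. The following is a sketch under the assumption stated in the text, namely that prior and posterior share one fixed backward (noising) process; the symbol $q(\tau\mid x_1)$ for that backward process is notation introduced here for illustration and is not taken from the paper.

```latex
\begin{align*}
Z_\phi\, p_\phi(\tau) &= p_\theta(x_1)\, r(x_1)\, q(\tau \mid x_1)
  && \text{(TB condition with target } p_\theta(x_1)\, r(x_1)\text{)} \\
p_\theta(\tau) &= p_\theta(x_1)\, q(\tau \mid x_1)
  && \text{(prior factorized through the same backward process)} \\
\Longrightarrow\quad Z_\phi\, p_\phi(\tau) &= r(x_1)\, p_\theta(\tau)
  && \text{(substitute the second line into the first)}
\end{align*}
```

The intractable marginal $p_\theta(x_1)$ never has to be evaluated on its own: it only appears bundled with $q(\tau\mid x_1)$ inside the tractable trajectory density $p_\theta(\tau)$.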

A later theoretical paper sharpens this relation. It studies RTB as a method for fine-tuning sequential generative models toward

$$P_\phi^\top(s_T)\propto \pi_{\mathrm{prior}}^\top(s_T)\exp(-E(s_T)/\alpha)$$

and proves that RTB is exactly equivalent to Trust-PCL under a specific reward design and an explicit correspondence between the quantities of the two objectives. Under these identifications, the RTB and Trust-PCL objectives coincide up to a positive multiplicative constant. This is an exact objective equivalence, not merely an analogy at the level of fixed points or asymptotic optima (Deleu et al., 1 Sep 2025).

The same paper also shows an on-policy gradient relation. If one differentiates RTB while ignoring derivatives through the sampling distribution, then the resulting gradient equals a REINFORCE-with-KL gradient up to a constant factor, provided the REINFORCE baseline takes the specific form given in the paper.

This places RTB even more directly inside the family of KL-regularized path-consistency methods rather than outside it (Deleu et al., 1 Sep 2025).

3. Off-policy character and optimization mechanics

A defining property of RTB is that it is explicitly formulated as an off-policy objective. In the sequential formulation, the expectation is taken under an arbitrary behavior distribution, and in the diffusion formulation the gradient with respect to $\phi$ does not require backpropagation through the trajectory-generation process. This permits training trajectories to come from distributions other than the current model, including replay-buffer policies and backtracking-based behavior policies (Deleu et al., 1 Sep 2025, Scimeca et al., 12 Mar 2025).

The diffusion paper makes this operational with replay and backtracking exploration. It defines a replay buffer of terminal samples and a behavior distribution that selects stored terminal samples according to a prioritization distribution. The intended use is to obtain high-target-density terminal states $x_1$, sample reverse (noising) trajectories from them, and then train RTB on those trajectories even if the current sampler $p_\phi$ cannot yet reach those modes on-policy. The paper presents this as a direct mechanism for mode discovery and for preventing catastrophic forgetting once good terminal samples have been discovered (Scimeca et al., 12 Mar 2025).
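The replay-and-backtrack idea can be sketched as a small buffer plus a noising routine. This is a hypothetical illustration only: the class `TerminalReplayBuffer`, the reward-proportional prioritization, the lowest-reward eviction rule, and the Brownian noising step are all assumptions made here, not the paper's definitions.

```python
import math
import random

class TerminalReplayBuffer:
    """Stores (terminal sample, log-reward) pairs; hypothetical design."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, x1, log_reward):
        self.items.append((list(x1), log_reward))
        if len(self.items) > self.capacity:
            # Assumed eviction rule: drop the stored sample with the lowest log-reward.
            worst = min(range(len(self.items)), key=lambda i: self.items[i][1])
            self.items.pop(worst)

    def sample(self, rng):
        # Assumed prioritization: probability proportional to reward = exp(log_reward).
        top = max(lr for _, lr in self.items)
        weights = [math.exp(lr - top) for _, lr in self.items]
        x1, _ = rng.choices(self.items, weights=weights, k=1)[0]
        return x1

def noising_trajectory(x1, num_steps, sigma, rng):
    """Walk backward from x_1 by adding Gaussian noise (assumed backward process)."""
    dt = 1.0 / num_steps
    traj = [list(x1)]
    for _ in range(num_steps):
        earlier = [xj + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0) for xj in traj[0]]
        traj.insert(0, earlier)     # prepend, so the list runs x_0 -> ... -> x_1
    return traj

rng = random.Random(0)
buffer = TerminalReplayBuffer(capacity=2)
buffer.add([1.0, 1.0], log_reward=-0.1)
buffer.add([5.0, 5.0], log_reward=-9.0)
buffer.add([2.0, 0.0], log_reward=-0.5)    # evicts the -9.0 entry
x1 = buffer.sample(rng)
traj = noising_trajectory(x1, num_steps=4, sigma=1.0, rng=rng)
```

A trajectory produced this way ends at a known good terminal state, so the RTB residual can be evaluated on it even when the current sampler would rarely generate it.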

The same source describes two practical stabilizations. First, it replaces the squared loss by the empirical variance over a minibatch of the quantity inside the square, thereby removing dependence on $Z_\phi$; this is described as a relative variant of VarGrad. Second, it applies loss clipping by skipping updates on trajectories whose loss is already close to zero. Both modifications are presented as numerical stabilizers, especially in conditional settings (Scimeca et al., 12 Mar 2025).
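The variance-based stabilization can be illustrated in a few lines. This is a minimal sketch, assuming per-trajectory log-probabilities and log-rewards are already available as lists; the function name `vargrad_rtb_loss` and the toy numbers are hypothetical.

```python
def vargrad_rtb_loss(logp_post, logp_prior, log_r):
    """Batch variance of log p_phi(tau) - log r(x_1) - log p_theta(tau).

    No normalizer is needed: any constant added to every residual
    (such as a log-normalizer) leaves the variance unchanged.
    """
    resid = [lq - lr - lp for lq, lp, lr in zip(logp_post, logp_prior, log_r)]
    mean = sum(resid) / len(resid)
    return sum((d - mean) ** 2 for d in resid) / len(resid)

# Toy batch (illustrative numbers): the posterior satisfies the balance
# relation with log-normalizer 0.7, so every residual equals -0.7.
logp_prior = [-3.0, -4.5, -2.0]
log_r = [0.5, 1.0, -0.5]
log_Z = 0.7
logp_post = [lp + lr - log_Z for lp, lr in zip(logp_prior, log_r)]

at_balance = vargrad_rtb_loss(logp_post, logp_prior, log_r)

# Rescaling the reward by a constant factor does not change the loss.
shifted = vargrad_rtb_loss(logp_post, logp_prior, [lr + 5.0 for lr in log_r])

# Breaking the balance on one trajectory makes the variance positive.
perturbed = vargrad_rtb_loss([logp_post[0] + 1.0] + logp_post[1:], logp_prior, log_r)
```

Because the batch mean of the residuals plays the role of the (negated) log-normalizer, there is one fewer learned quantity to track during training.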

The equivalence paper uses off-policy capability to revisit an illustrative example from the RTB literature. It argues that earlier negative results for KL-regularized RL were partly caused by a reward mismatch, and reports that off-policy REINFORCE with self-normalized importance sampling (SNIS), when given the correct reward design and the same exploratory behavior policy, can recover the expected tilted target distribution as accurately as RTB on the toy task (Deleu et al., 1 Sep 2025). This does not negate RTB’s off-policy utility; rather, it narrows the dispute to formulation and reward specification.
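Self-normalized importance sampling itself is a standard technique and its weight computation is short. The sketch below shows only that weight computation, not the paper's full off-policy REINFORCE estimator; the function name `snis_weights` and the toy distributions are assumptions for illustration.

```python
import math

def snis_weights(log_target_unnorm, log_behavior):
    """Self-normalized importance weights for samples drawn from a behavior policy.

    log_target_unnorm : log of the unnormalized target density at each sample
    log_behavior      : log-density of the behavior distribution at each sample
    """
    log_w = [a - b for a, b in zip(log_target_unnorm, log_behavior)]
    top = max(log_w)                          # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]             # weights sum to one

# Toy check (illustrative): three terminal states, each visited once by a
# uniform behavior policy; the unnormalized target is prior times reward.
prior = [0.5, 0.3, 0.2]
reward = [1.0, 2.0, 4.0]
samples = [0, 1, 2]
log_target = [math.log(prior[s] * reward[s]) for s in samples]
log_behavior = [math.log(1.0 / 3.0)] * len(samples)
weights = snis_weights(log_target, log_behavior)
```

Since the weights are normalized within the batch, the unknown normalizing constant of the tilted target cancels, which is what lets an exploratory behavior policy be reused without estimating that constant separately.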

4. RTB for Bayesian inverse problems with diffusion priors

The first major applied use of RTB in the supplied literature is posterior training for Bayesian inverse problems with pretrained diffusion priors. The inferential target is

$$p(x_1\mid y)\propto p_\theta(x_1)\,p(y\mid x_1),$$

with the prior $p_\theta(x_1)$ represented implicitly by a diffusion model rather than a tractable closed-form density. RTB addresses the resulting intractability by working on path space: it compares entire posterior trajectories to prior trajectories and uses the likelihood $p(y\mid x_1)$ as terminal reward $r(x_1)$ (Scimeca et al., 12 Mar 2025).

The paper instantiates this framework on several concrete inverse problems, each with its own forward model and likelihood: inpainting, Fourier phase retrieval, nonlinear deblurring, and gravitational lensing. The conditional RTB objective then trains an amortized posterior sampler conditioned on $y$ rather than a training-free guided sampler specialized to a single observation (Scimeca et al., 12 Mar 2025).

A practical modification introduces a Langevin-biased posterior drift. This is described as an inductive bias that allows reward information to influence intermediate diffusion states rather than only the terminal state. The paper also notes that one may use parameter-efficient fine-tuning, that full backpropagation through the sampling process is not required, and that one may backpropagate only through a subset of trajectory summands when full trajectory memory is too expensive (Scimeca et al., 12 Mar 2025).

Empirically, the paper compares RTB with training-free posterior methods such as DPS, FPS, FPS-SMC, classifier-guidance-style methods, CLA in gravitational lensing, and hybrids such as RTB+DPS and RTB+DPS+CLA. Its main reported pattern is that guidance-based methods often achieve stronger likelihood or reconstruction metrics, whereas RTB tends to score much better on the measure that the authors interpret as evidence of greater posterior faithfulness. The reported metrics include LPIPS and FID, among others. In gravitational lensing, the paper contrasts the number of diffusion steps used by RTB with the number CLA required for reasonable samples; at the same time, it notes that training can be unstable for very peaky rewards and that a fraction of the lensing runs diverged (Scimeca et al., 12 Mar 2025).

The paper’s abstract states that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases. The supplied body text does not provide a formal latent-variable derivation of this bias, but it does present the broader empirical claim that heuristic posterior-guidance dynamics can shift samples away from the true posterior, often yielding artificially high likelihood values at the cost of reduced diversity (Scimeca et al., 12 Mar 2025).

5. Relation to adjacent trajectory-balance methods

RTB is closely connected to, but distinct from, several later or adjacent trajectory-balance methods. The most direct precursor is standard TB in GFlowNets, whose motivation was improved long-range credit assignment relative to flow matching and detailed balance. TB imposes a single global constraint on a complete sampled trajectory and was empirically shown to improve convergence, diversity of generated samples, and robustness to long action sequences and large action spaces (Malkin et al., 2022).

Not every trajectory-balance method built around a reference model is RTB. The paper “Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training” does not introduce or discuss Relative Trajectory Balance. Its method, TBA, uses standard TB (implemented through a batch-estimated “VarGrad variant of trajectory balance”) inside an asynchronous actor–learner system with replay buffers, and its key systems claim is that TB is off-policy. It contains no objective named RTB, no derivation of an RTB loss, and no comparison between RTB and standard TB (Bartoldson et al., 24 Mar 2025).

A conceptually close but terminologically distinct development appears in “Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion LLMs.” That paper does not mention RTB by name, but its TraFL objective defines a reward-tilted target relative to a frozen reference trajectory distribution and minimizes a squared residual against that reference-anchored target. The paper presents TraFL as a response to “trajectory locking,” a failure mode in diffusion-language-model post-training in which sampled reward-driven updates over-concentrate probability mass on a narrow set of denoising paths (Ahmadi et al., 13 May 2026). A plausible implication is that RTB-style reference-relative balance principles had already become a reusable design pattern beyond the original diffusion-posterior setting.

Another related method is Rooted absorbed Prefix Trajectory Balance (RapTB). That paper is not presented as RTB, but it introduces a root-relative auxiliary residual that cancels the global normalizer and provides dense prefix-level supervision. The paper is explicit that standard terminal TB remains the only exact balance condition whose optimum matches the reward-proportional target, while the rooted relative term is an auxiliary variance-reducing regularizer layered on top (Wang et al., 28 Feb 2026). This suggests a lineage in which “relative” constructions increasingly serve credit-assignment and normalization-cancellation roles even when full RTB is not used.

6. Terminological boundaries and recurrent misconceptions

The term “RTB” is not unique to machine learning. In cosmology, RTB denotes Ricci-trace-based gravity, a subclass of $f(R,T)$ theories in which the gravitational action depends linearly on the Ricci scalar $R$ and nontrivially on the trace $T$ of the matter energy-momentum tensor (Silva et al., 2024). This usage is entirely unrelated to Relative Trajectory Balance in generative modeling and RL.

Within generative-model post-training, another recurring misconception is to treat every TB-derived, reference-anchored, or replay-compatible method as RTB. The available papers do not support that conflation. TBA is explicitly standard TB in an asynchronous system rather than RTB (Bartoldson et al., 24 Mar 2025). TraFL is RTB-like in the sense of being reference-relative and reward-tilted, but it is not introduced as RTB and uses a diffusion-compatible sequence-level surrogate rather than exact pathwise log-probabilities (Ahmadi et al., 13 May 2026). RapTB employs a rooted relative auxiliary residual, but it retains standard TB as the exact global anchor (Wang et al., 28 Feb 2026).

A further point of debate concerns novelty. One line of work presents RTB as a recently introduced off-policy objective that is asymptotically unbiased for posterior inference in diffusion inverse problems (Scimeca et al., 12 Mar 2025). A later theoretical analysis argues that RTB is not a fundamentally new alternative to KL-regularized RL because it is exactly equivalent to Trust-PCL under a specific reward design and parameter identification (Deleu et al., 1 Sep 2025). The literature therefore supports two simultaneously true statements: RTB is a useful and practically distinctive pathwise objective for posterior and generative-model fine-tuning, and RTB can also be understood as a re-expression of pre-existing KL-regularized path-consistency machinery rather than a separate theoretical family.

In that sense, the most stable definition of RTB is not institutional or historical but structural: it is a relative trajectory-balance objective in which a learned trajectory distribution is matched, via a squared log-ratio residual, to a reward- or likelihood-tilted reference trajectory distribution, with a learned normalization term and explicit off-policy compatibility.
