
Interval Score Matching for Text-to-3D

Updated 9 March 2026
  • ISM is a score-based distillation framework for text-to-3D generation that overcomes SDS limitations by using deterministic DDIM inversion and interval score matching.
  • It provides consistent supervision signals that reduce over-smoothing and improve reconstruction fidelity through multi-step score alignment.
  • ISM facilitates faster convergence and robust 3D asset generation, forming the basis for advanced methods like Trajectory Score Matching (TSM).

Interval Score Matching (ISM) is a score-based distillation framework for text-to-3D generation that addresses the limitations of Score Distillation Sampling (SDS) by employing deterministic Denoising Diffusion Implicit Models (DDIM) inversion and interval-based score matching. ISM produces consistent supervision signals for optimizing 3D representations directly with pretrained 2D diffusion models, resulting in improved reconstruction fidelity, reduced over-smoothing, and greater training efficiency compared to prior approaches. ISM was introduced in the LucidDreamer system and has influenced the development of further generalizations such as Trajectory Score Matching (TSM) (Liang et al., 2023, Miao et al., 2024).

1. Motivation and Conceptual Foundations

Score Distillation Sampling (SDS), first used in DreamFusion, distills knowledge from text-conditioned 2D diffusion models (such as DDPMs) into 3D neural fields or explicit representations. SDS performs a reconstruction-based objective: for a rendered image $x_0 = \mathrm{render}(\theta, c)$ (with 3D parameters $\theta$ and camera $c$), SDS adds noise and attempts to match the predicted denoising direction to the true score of the forward diffusion kernel $q(x_t \mid x_0)$:

$$\mathcal{L}_\text{SDS}(\theta) = \mathbb{E}_{t,c}\bigl[\omega(t)\,\bigl\|\epsilon_\phi(x_t, t, y) - \epsilon(x_t, t)\bigr\|^2\bigr],$$

where $\epsilon_\phi$ is the diffusion model's predicted noise (conditioned on prompt $y$), $\epsilon(x_t, t)$ is the noise actually added by the forward process (serving as the ground-truth score direction), and $\omega(t)$ is a time-dependent weighting. Due to stochastic noise draws and the need for faithful reconstructions at high noise levels, SDS exhibits semantically inconsistent pseudo-ground-truths that, when averaged over many steps, lead to over-smoothing of geometry and texture.
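The stochastic supervision described above can be made concrete with a toy sketch. The following NumPy snippet (an illustrative stand-in, not DreamFusion's actual pipeline: the renderer is the identity and `eps_phi` is a placeholder denoiser) computes one SDS update direction; note that the pseudo-ground-truth noise is redrawn on every call:

```python
import numpy as np

def sds_grad(theta, render, eps_phi, t, alpha_t, omega, rng):
    """One SDS update direction for a toy 1-D 'scene' (hypothetical setup)."""
    x0 = render(theta)
    eps = rng.standard_normal(x0.shape)                         # fresh random noise each call
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps  # forward kernel q(x_t | x_0)
    # SDS drops the denoiser Jacobian; with an identity toy renderer,
    # dx0/dtheta is also identity, so the update is the weighted residual:
    return omega(t) * (eps_phi(x_t, t) - eps)
```

Two successive calls with identical arguments yield different gradients because `eps` is resampled each time — precisely the inconsistency that ISM's deterministic inversion removes.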

ISM introduces deterministic DDIM inversion and interval score matching to address these challenges. Rather than targeting a denoised $x_0$ from a noisy $x_t$, ISM encourages the matching of predicted diffusion scores across a fixed interval in the DDIM trajectory. DDIM inversion removes the randomness in noisy-latent generation, yielding supervision that is consistently aligned across views and time steps. The objective targets the difference between the score at a later timestep $t$ (conditioned on text) and an earlier timestep $s$ (unconditional), regularizing the learning of 3D parameters via interval consistency (Liang et al., 2023).

2. Formalism and Loss Function

The ISM loss for a rendered view $x_0 = g(\theta, c)$ is defined as follows. Let $(x_s, x_t)$ be noisy latents at inversion steps $s$ and $t$ ($0 < s < t$), computed by deterministic DDIM inversion. The denoiser outputs $\epsilon_\phi(x_u, u, y)$ (conditioned) and $\epsilon_\phi(x_u, u, \emptyset)$ (unconditioned), and $\omega(t)$ is a reweighting function. The squared-error loss and gradient are:

$$\mathcal{L}_\text{ISM}(\theta) = \mathbb{E}_{t,c}\bigl[\omega(t)\,\bigl\|\epsilon_\phi(x_t, t, y) - \epsilon_\phi(x_s, s, \emptyset)\bigr\|^2\bigr]$$

$$\nabla_\theta \mathcal{L}_\text{ISM}(\theta) = \mathbb{E}_{t,c}\Bigl[\omega(t)\,\bigl(\epsilon_\phi(x_t, t, y) - \epsilon_\phi(x_s, s, \emptyset)\bigr)\,\frac{\partial x_0}{\partial \theta}\Bigr].$$

Here, the “pseudo-ground-truth” is the unconditional score $\epsilon_\phi(x_s, s, \emptyset)$, and the target is the text-conditioned score at $x_t$.

Key differences from SDS are:

  • Latent Generation: ISM latents are created by deterministic DDIM inversion, eliminating random noise.
  • Supervision Interval: ISM matches scores over the interval $(x_s, x_t)$ rather than reconstructing all the way from $x_t$ to $x_0$.
  • Score Alignment: ISM matches conditional and unconditional scores at different timesteps, exploiting multi-step denoising quality for enhanced stability (Liang et al., 2023, Miao et al., 2024).
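Given the inverted latents, the ISM gradient above reduces to a one-line residual. A minimal sketch (hypothetical `eps_cond`/`eps_uncond` denoiser callables; the Jacobian $\partial x_0 / \partial \theta$ would be applied outside this function):

```python
import numpy as np

def ism_residual(x_s, x_t, eps_cond, eps_uncond, s, t, omega):
    """ISM update direction: text-conditioned score at x_t minus the
    unconditional pseudo-ground-truth at x_s (toy sketch)."""
    return omega(t) * (eps_cond(x_t, t) - eps_uncond(x_s, s))
```

Unlike the SDS residual, this quantity is fully deterministic in $(x_s, x_t)$, so repeated evaluations of the same view agree exactly.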

3. DDIM Inversion and Trajectory Consistency

The DDIM inversion procedure traverses “upwards” along the time axis, deterministically mapping the rendered image $x_0$ to noisier latents $x_s$ and $x_t$:

  • Starting from $x_0$, apply inverted DDIM updates for $u = 1, \dots, s$ (producing $x_s$), then continue for $u = s+1, \dots, t$ to compute $x_t$.
  • The inversion step applies the deterministic DDIM update in the noising direction:

$$x_{u+1} = \sqrt{\alpha_{u+1}} \left( \frac{x_u - \sqrt{1-\alpha_u}\,\epsilon_\phi(x_u, u, \emptyset)}{\sqrt{\alpha_u}} \right) + \sqrt{1-\alpha_{u+1}}\,\epsilon_\phi(x_u, u, \emptyset),$$

where $\alpha_u$ are noise-schedule terms.

This approach ensures that for a given $(\theta, c)$ and prompt, both $x_s$ and $x_t$ are consistent across evaluations. The ISM algorithm may use large inversion strides without material impact on supervision quality, providing computational efficiency (Liang et al., 2023). The interval length $\delta_T = t - s$ and the inversion stride $\delta_S$ act as key hyperparameters controlling granularity and speed.
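The inversion loop described in this section can be sketched as follows (assumptions: a toy cumulative-alpha schedule `alphas` indexed by timestep, a hypothetical unconditional denoiser, and a `stride` parameter playing the role of $\delta_S$):

```python
import numpy as np

def ddim_invert(x, eps_uncond, lo, hi, alphas, stride=1):
    """Deterministically map a latent from timestep `lo` up to `hi`
    by running the DDIM update in the noising direction (toy sketch)."""
    for u in range(lo, hi, stride):
        v = min(u + stride, hi)                                           # next, noisier timestep
        e = eps_uncond(x, u)
        x0_hat = (x - np.sqrt(1.0 - alphas[u]) * e) / np.sqrt(alphas[u])  # predicted clean image
        x = np.sqrt(alphas[v]) * x0_hat + np.sqrt(1.0 - alphas[v]) * e    # re-noise to level v
    return x
```

Because nothing here is sampled, two inversions of the same input agree exactly; a larger stride saves denoiser calls at the cost of slightly larger linearization error.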

4. Pseudo-Ground-Truth Inconsistency and ISM Limitations

Despite the determinism of DDIM inversion, ISM still suffers from two main sources of error:

  • Linearization Error: Each inversion step approximates $\epsilon_\phi(x_u, u) \approx \epsilon_\phi(x_{u-1}, u-1)$, accumulating small deviations from the exact diffusion trajectory.
  • Target Drift: The “pseudo-ground-truth” $\epsilon_\phi(x_s, s, \emptyset)$ varies depending on the choice of $s$ and the path taken, as accumulated errors differ across inversion runs.

These errors manifest as local blurring or inconsistency in the resulting 3D asset, especially in high-detail or ambiguous regions. When discrepancies are large, the supervision signal becomes an average of incompatible pseudo-GTs, leading to the smoothing out of geometry or textures (Miao et al., 2024).

A summary of ISM's strengths and limitations compared to SDS is given below:

| Method | Latent Generation | Score Supervision | Key Limitations |
|--------|-------------------|-------------------|-----------------|
| SDS | Random-noise DDPM | One-step, $x_0$ reconstruction | Over-smoothing, noisy pseudo-GTs |
| ISM | DDIM inversion | Interval, $(x_s, x_t)$ | Drift in pseudo-GT, accumulated error |

5. Generalization: Connection to Trajectory Score Matching (TSM)

Trajectory Score Matching (TSM) generalizes ISM by introducing an intermediate time $\mu \in (s, t)$. After inversion to $x_s$, two forward paths are taken: one to $x_\mu$, one to $x_t$. The TSM loss is:

$$\mathcal{L}_\text{TSM}(\theta) = \mathbb{E}_{t,c}\bigl[\omega(t)\,\|\epsilon_\phi(x_t, t, y) - \epsilon_\phi(x_\mu, \mu, \emptyset)\|^2\bigr].$$

ISM is the special case with $\mu = s$. For any intermediate $\mu \in (s, t)$, the accumulated drift between $x_\mu$ and $x_t$ is strictly smaller than the drift between $x_s$ and $x_t$, reducing pseudo-ground-truth inconsistency and increasing stability. Consequently, TSM produces sharper and more consistent outputs than ISM due to its reduced error propagation (Miao et al., 2024).
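The relationship to ISM is easiest to see in code: the TSM residual below (a toy sketch with hypothetical denoiser callables) collapses to the ISM residual when $\mu = s$:

```python
import numpy as np

def tsm_residual(x_mu, x_t, eps_cond, eps_uncond, mu, t, omega):
    """TSM update direction (sketch). With mu == s and x_mu == x_s this is
    exactly the ISM residual; intermediate mu shortens the drift interval."""
    return omega(t) * (eps_cond(x_t, t) - eps_uncond(x_mu, mu))
```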

6. Training Algorithm and Practical Implementation

The ISM training procedure for text-to-3D distillation proceeds as follows (paraphrased version, omitting constants):

  1. Sample a camera $c$ and render $x_0$ from the current 3D parameters $\theta$.
  2. Sample $t$ uniformly and set $s = t - \delta_T$.
  3. Perform accelerated DDIM inversion with stride $\delta_S$ to obtain $x_s$ from $x_0$.
  4. Continue inversion one more step to get $x_t$ from $x_s$.
  5. Evaluate the conditional ($\epsilon_\phi(x_t, t, y)$) and unconditional ($\epsilon_\phi(x_s, s, \emptyset)$) scores.
  6. Compute the ISM gradient:

$$g = \omega(t)\bigl(\epsilon_\phi(x_t, t, y) - \epsilon_\phi(x_s, s, \emptyset)\bigr)\frac{\partial\,\mathrm{render}(\theta, c)}{\partial \theta}$$

  7. Update parameters: $\theta \leftarrow \theta - \eta g$.

This procedure is agnostic to the type of 3D representation (e.g., NeRF, 3D Gaussian Splatting), and the hyperparameters $(\delta_T, \delta_S)$ can be tuned to trade off speed, sharpness, and stability (Liang et al., 2023).
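Steps 1–7 can be strung together into a runnable toy loop. This is a sketch under strong assumptions, not LucidDreamer's implementation: a 1-D "scene" with an identity renderer, no cameras, hypothetical denoisers, $\omega(t) = 1$, and stride 1:

```python
import numpy as np

def train_ism_toy(theta, n_iters, alphas, eps_cond, eps_uncond,
                  delta_T=2, lr=0.1, rng=None):
    """End-to-end ISM optimization sketch following steps 1-7 above."""
    rng = rng or np.random.default_rng(0)
    T = len(alphas) - 1
    for _ in range(n_iters):
        x = theta.copy()                              # steps 1-2: 'render' x0; sample t, s
        t = int(rng.integers(delta_T, T))
        s = t - delta_T
        x_s = x.copy()                                # covers the edge case s == 0
        for u in range(t):                            # steps 3-4: DDIM-invert to x_s, then x_t
            e = eps_uncond(x, u)
            x0_hat = (x - np.sqrt(1 - alphas[u]) * e) / np.sqrt(alphas[u])
            x = np.sqrt(alphas[u + 1]) * x0_hat + np.sqrt(1 - alphas[u + 1]) * e
            if u + 1 == s:
                x_s = x.copy()
        x_t = x
        g = eps_cond(x_t, t) - eps_uncond(x_s, s)     # steps 5-6: ISM gradient, omega = 1
        theta = theta - lr * g                        # step 7: update (identity Jacobian)
    return theta
```

Swapping the identity "renderer" for a differentiable renderer and the placeholder denoisers for a pretrained 2D diffusion model recovers the structure of the full algorithm.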

7. Empirical Results and Impact

Experiments with LucidDreamer (ISM + 3D Gaussian Splatting) demonstrate notable improvements over SDS-based methods in terms of geometric accuracy, detail preservation, training efficiency, and user preference. Specifically:

  • Qualitatively, ISM distills fine geometry (e.g., hair strands, clothing folds) where SDS and variants produce over-smoothed models.
  • User studies report LucidDreamer (ISM) as most preferred, with a ranking of 1.25 compared to DreamFusion (3.28), Magic3D (3.44), ProlificDreamer (2.37), and others.
  • ISM achieves faster convergence (e.g., ∼5 hours on A100 vs. 10–15 hours for SDS-based pipelines at equal batch size and settings).
  • Larger inversion strides $\delta_S$ speed up inversion with negligible impact on fidelity; varying $\delta_T$ alters the trade-off between local detail and global structure.

ISM's improvements validate deterministic inversion and interval-based supervision as mechanisms for robust 3D distillation from 2D diffusion priors. Successors such as TSM further ameliorate pseudo-ground-truth drift and yield enhanced stability (Liang et al., 2023, Miao et al., 2024).


ISM’s theoretical and practical advances over SDS are foundational in the current landscape of score-based text-to-3D model distillation, providing a template for further innovation in trajectory-level score supervision and robust 3D synthesis pipelines.
