Diffuse-CLoC: Unified Diffusion for Character Control

Updated 22 August 2025
  • Diffuse-CLoC is a unified diffusion framework that jointly models state and action trajectories to achieve physically plausible character control in dynamic environments.
  • It leverages classifier guidance and inpainting techniques to condition future trajectories, facilitating obstacle avoidance and precise task-space control.
  • The framework employs a transformer-based architecture with differentiated attention for state planning and reactive action generation, outperforming modular baselines in long-horizon planning.

Diffuse-CLoC is a unified, guided diffusion framework for physically realistic character control that merges the intuitive steerability of kinematic motion generation with the dynamic viability of state-action-based control policies. The distinguishing principle is joint modeling and co-diffusion of both state (character pose) and action (control signals) sequences within a single denoising diffusion probabilistic model, enabling sampling of entire future trajectories that can be conditioned at inference time for diverse control tasks.

1. Unified Joint Diffusion of States and Actions

Diffuse-CLoC employs a single DDPM to model the joint distribution over state and action trajectories, setting it apart from hierarchical or modular methods that separately handle planning and control. The trajectory is represented as $\tau_t = [a_t, s_{t+1}, a_{t+1}, \ldots, s_{t+H}, a_{t+H}]$ over a preview horizon $H$. By conditioning action generation directly on predicted future states, the method ensures that prospective physical requirements (e.g., balance, obstacle negotiation, goal reachability) inform control signals throughout the entire planning window.
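The interleaved trajectory layout can be sketched as follows; the dimensions (`STATE_DIM`, `ACTION_DIM`, `H`) are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Sketch of the interleaved state-action trajectory described above.
# STATE_DIM, ACTION_DIM, and H are illustrative, not from the paper.
STATE_DIM, ACTION_DIM, H = 6, 3, 4

def build_trajectory(actions, states):
    """Interleave tokens as tau_t = [a_t, s_{t+1}, a_{t+1}, ..., s_{t+H}, a_{t+H}]."""
    assert actions.shape == (H + 1, ACTION_DIM)   # a_t ... a_{t+H}
    assert states.shape == (H, STATE_DIM)         # s_{t+1} ... s_{t+H}
    tokens = [actions[0]]
    for i in range(H):
        tokens.append(states[i])
        tokens.append(actions[i + 1])
    return tokens

traj = build_trajectory(np.zeros((H + 1, ACTION_DIM)), np.zeros((H, STATE_DIM)))
# 1 leading action + H (state, action) pairs = 2H + 1 tokens
assert len(traj) == 2 * H + 1
```

Keeping states and actions in one token sequence is what lets a single denoiser model their joint distribution rather than factorizing planning and control.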

Training is performed via noise scheduling applied independently to state and action tokens, with the denoising objective:

\mathcal{L} = \mathrm{MSE}\left( x_{0,\theta}(\tau_t^{(k)}, O_t, k),\ \tau_t \right)

where $O_t$ is the observation buffer containing past state-action pairs and the current state, and $x_{0,\theta}$ is the denoiser. Noise levels $k_s$ and $k_a$ can be applied separately to states and actions, allowing different degrees of uncertainty to propagate through the trajectory.
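A minimal NumPy sketch of this objective: states and actions are noised at independently sampled diffusion levels, and the model regresses the clean trajectory ($x_0$ prediction) under an MSE loss. The schedule, step count, and denoiser here are toy stand-ins, not the paper's transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diffusion schedule (K steps, linearly decaying signal fraction); the
# paper's actual schedule is not specified here, so this is an assumption.
K = 100
alpha_bar = np.linspace(1.0, 0.01, K)

def noise_tokens(x0, k):
    """Forward-noise clean tokens x0 to diffusion level k."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1 - alpha_bar[k]) * eps

def training_loss(denoiser, states, actions, obs):
    # Independent noise levels for states and actions, as in the text.
    k_s, k_a = rng.integers(K), rng.integers(K)
    noisy_s = noise_tokens(states, k_s)
    noisy_a = noise_tokens(actions, k_a)
    pred_s, pred_a = denoiser(noisy_s, noisy_a, obs, k_s, k_a)
    # x_0-prediction MSE against the clean trajectory.
    return np.mean((pred_s - states) ** 2) + np.mean((pred_a - actions) ** 2)

# An oracle denoiser that returns the clean targets achieves zero loss.
states, actions = rng.standard_normal((8, 6)), rng.standard_normal((9, 3))
oracle = lambda ns, na, obs, ks, ka: (states, actions)
assert training_loss(oracle, states, actions, obs=None) == 0.0
```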

2. Conditioning and Guided Sampling

Diffuse-CLoC achieves steerable control through inference-time conditioning, directly leveraging classifier guidance and inpainting strategies from kinematic diffusion models. Task objectives are formalized using cost functions on the state sequence, such as signed distance function (SDF) costs for obstacle avoidance:

G_{\text{obs}}(\tau) = \sum_{j} \sum_{t'=t}^{t+H} \exp\left( -c \cdot \mathrm{SDF}^j(s_{t'+1}) \right)

Gradients of these costs, $-\nabla_{\tau} G_{\text{obs}}$, are incorporated via a posterior guidance rule:

\nabla_{\tau} \log p(\tau^* \mid \tau) = -\nabla_{\tau} G(\tau^c(\tau))

thus steering the denoising process toward desired behaviors, for example, navigating around obstacles or achieving specified end-effector positions.
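The guidance mechanism can be sketched as a gradient nudge on the denoised trajectory between diffusion steps. The point-obstacle SDF, the finite-difference gradient, and `guidance_scale` are illustrative choices standing in for the paper's analytic setup:

```python
import numpy as np

def sdf_cost(traj, obstacle_center, c=5.0):
    """exp(-c * SDF) summed over the trajectory, with a point obstacle whose
    SDF is just the Euclidean distance to its center."""
    dists = np.linalg.norm(traj - obstacle_center, axis=-1)
    return np.sum(np.exp(-c * dists))

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient (illustrative; autodiff in practice)."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        xp = x.copy(); xp[idx] += eps
        xm = x.copy(); xm[idx] -= eps
        g[idx] = (f(xp) - f(xm)) / (2 * eps)
    return g

def guided_step(traj, obstacle_center, guidance_scale=0.1):
    """One guidance update: move along -grad G to steer away from the obstacle."""
    grad = numeric_grad(lambda t: sdf_cost(t, obstacle_center), traj)
    return traj - guidance_scale * grad

traj = np.zeros((4, 2))                  # toy 2D waypoint trajectory
obstacle = np.array([0.5, 0.0])
new_traj = guided_step(traj, obstacle)
# The obstacle cost should not increase after the guidance step.
assert sdf_cost(new_traj, obstacle) <= sdf_cost(traj, obstacle)
```

In Diffuse-CLoC this update is interleaved with denoising, so the guidance gradient reshapes the sampled trajectory rather than post-processing a finished one.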

For in-betweening tasks, the model supports mixed-noise schedules, effectively "fixing" keyframe states by maintaining zero noise while diffusing the rest, enabling the inpainting of physically plausible and smooth transitions.
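The keyframe-fixing idea amounts to re-clamping known tokens to their clean values at every denoising step while the rest diffuse normally. A minimal sketch, with a toy denoiser and illustrative indices:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_with_keyframes(denoise_step, traj, keyframes, key_idx, n_steps=10):
    """Inpainting-style sampling: keyframe tokens are held at zero effective
    noise by overwriting them with their known values after each step."""
    for k in reversed(range(n_steps)):
        traj = denoise_step(traj, k)
        traj[key_idx] = keyframes        # "fix" the keyframe states
    return traj

# Toy denoiser that merely shrinks the sample toward zero (stand-in only).
toy_step = lambda x, k: 0.9 * x

traj = rng.standard_normal((8, 4))       # 8 tokens of dimension 4
keyframes = np.ones((2, 4))              # known first and last states
out = denoise_with_keyframes(toy_step, traj, keyframes, key_idx=[0, 7])
assert np.allclose(out[[0, 7]], 1.0)     # keyframes held fixed throughout
```

Because the in-between tokens are denoised jointly with the clamped ones, the sampled transition is pulled toward trajectories consistent with the keyframes.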

3. Network Architecture and Attention Strategy

Diffuse-CLoC utilizes a transformer-based architecture that architecturally separates planning from reactive control. "State tokens" attend across the previewed (future) trajectory, allowing information to flow temporally in both directions and inform holistic planning. "Action tokens," by contrast, employ causal attention: each attends only to historical and current states, preventing the contamination of immediate control signals by potentially noisy future state predictions. A rolling inference scheme maintains previously predicted clean tokens in a FIFO buffer, providing a warm start and consistency across successive planning cycles.
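The two attention patterns can be expressed as a boolean attention mask; the alternating token layout below is an illustrative simplification:

```python
import numpy as np

def build_mask(n_tokens, is_action):
    """mask[i, j] = True means token i may attend to token j.
    State tokens get full (bidirectional) attention; action tokens get
    causal attention and are blocked from all future positions."""
    mask = np.ones((n_tokens, n_tokens), dtype=bool)
    for i in range(n_tokens):
        if is_action[i]:
            mask[i, i + 1:] = False      # causal: no access to the future
    return mask

# Alternating [s, a, s, a] layout over a short preview window.
is_action = np.array([False, True, False, True])
mask = build_mask(4, is_action)
assert mask[0].all()                      # state token sees the whole window
assert not mask[1, 2] and not mask[1, 3]  # action token blocked from future
assert mask[1, 0] and mask[1, 1]          # ...but sees past and present
```

Such a mask would typically be passed to the transformer's attention layers so both patterns coexist in one forward pass.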

4. Applications: Physical Plausibility and Task Diversity

The framework demonstrably supports a broad spectrum of downstream tasks without retraining, including:

  • Static/Dynamic Obstacle Avoidance: By guiding state predictions with SDF-based cost gradients, the system navigates complex environments while actively avoiding collisions.
  • Motion In-betweening: Given sparse keyframes, the model inpaints the entire trajectory by conditioning only on the fixed frames; transitions are physically plausible and robust.
  • Task-Space Control: Arbitrary body part targets at selected timesteps (e.g., controller-driven root position, hand reaches) are accommodated via targeted cost functions:

G_{\text{ts}}(\tau) = \sum_{t' \in T} \left\| P_x(s_{t'+1}) - g_{t'} \right\|^2

where $P_x$ projects the state into the task space and $g_{t'}$ is the target.

  • Long-horizon Planning: The method achieves physically realistic motion over extended horizons without needing a distinct high-level planner.
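The task-space cost above is straightforward to sketch; the projection here simply selects the first two state dimensions (a stand-in for, say, root xy position), whereas the real $P_x$ depends on the targeted body part:

```python
import numpy as np

def task_space_cost(states, targets, timesteps, P=lambda s: s[..., :2]):
    """Sum of squared distances between projected states and their targets
    at the selected timesteps. P is a toy projection (illustrative)."""
    return sum(np.sum((P(states[t]) - targets[t]) ** 2) for t in timesteps)

states = np.zeros((5, 6))                         # 5 future states, dim 6
targets = {2: np.array([1.0, 0.0]),               # root target at step 2
           4: np.array([0.0, 2.0])}               # root target at step 4
cost = task_space_cost(states, targets, timesteps=[2, 4])
assert np.isclose(cost, 1.0 + 4.0)   # ||(0,0)-(1,0)||^2 + ||(0,0)-(0,2)||^2
```

Its gradient with respect to the state tokens is what gets injected during guided sampling, exactly as with the obstacle cost.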

5. Experimental Quantitative and Qualitative Results

Empirical evaluation shows that Diffuse-CLoC consistently outperforms modular baselines such as Kin+PHC (hierarchical motion diffusion plus tracking controller) in terms of physical realism (lower Fréchet Inception Distance, FID), success rates, and robustness to obstacles. For example, in the "Walk-Perturb" setting, Diffuse-CLoC produces a 16% fall rate compared to 30–44% for hierarchical models, and yields superior motion quality. Obstacle navigation (Forest, Jump tasks) and in-betweening tasks similarly benefit from joint state-action generation. Ablation studies on attention styles and preview lengths indicate that causal attention for actions and receding-horizon state planning provide optimal trade-offs.

6. Limitations and Future Directions

Current limitations include tuning classifier guidance strength for inference-time conditioning, as overly strong signals can induce unnatural motion artifacts. Dataset coverage remains a constraint—certain underrepresented motion types (e.g., fine limb control) may suffer degraded quality, suggesting a need for expanded training regimes or augmentation. Foot jitter artifacts, likely introduced by aggressive augmentation noise, point to refinements required in preprocessing pipelines. Longer-term planning and improved transitioning between disparate behaviors may be achievable by reducing dependency on historical context. Extensions to richer human–object interactions and incorporation of sensory feedback (e.g., visual or terrain maps) constitute important next development steps.

7. Significance for Physics-Based Control

Diffuse-CLoC establishes a paradigm for unified, end-to-end, physics-grounded character control, achieving both steerability and biomechanical plausibility across previously unseen and long-horizon tasks. By directly co-diffusing actions and states within a single conditional model and integrating robust conditioning strategies, it eliminates the need for separate planners or handcrafted control logic, providing a scalable and versatile solution for physically realistic animation, robotics, and interactive virtual environments.