
PhysXGen: Physics-Informed 3D Asset Generation

Updated 17 July 2025
  • PhysXGen is a physics-grounded framework that fuses 3D generative modeling with explicit physical property annotation.
  • It features a dual-branch architecture separately encoding geometric and physical attributes to ensure realistic and simulation-compatible outputs.
  • Leveraging the extensive PhysXNet dataset and joint diffusion-based training, it markedly reduces property prediction errors compared to prior approaches.

PhysXGen is a feed-forward framework for physics-grounded 3D asset generation, designed to couple state-of-the-art 3D generative models with explicit modeling and prediction of physical properties. The goal is to produce 3D assets that are not only visually and structurally plausible but also annotated and constrained with meaningful physical attributes—including absolute scale, material properties, affordance, part-level kinematics, and functional descriptions—thereby facilitating real-world simulation, embodied AI, and other physically aware applications (2507.12465).

1. Fundamental Principles and Objectives

PhysXGen addresses a significant gap in the domain of 3D generation: most prior methods focus on geometric detail and texture while neglecting attributes essential for physical realism and downstream simulation. By explicitly injecting physical knowledge into the generative process, PhysXGen enables the end-to-end creation of 3D assets that are natively compatible with simulation engines and robotics systems. This is achieved through dual-branch latent modeling and the integration of a physics-annotated 3D dataset, PhysXNet, supporting extensible physical annotation and property learning.

2. Dual-Branch Architecture and Model Design

A central distinguishing feature of PhysXGen is its dual-branch architecture, which learns structural–physical correlations at the latent-representation level:

  • Structural Branch: Encodes geometric and appearance features (e.g., using pre-trained VAE modules such as those from DINOv2/TRELLIS), producing the structural latent:

$$P_{\text{slat}} = \mathcal{E}_{\text{aes}}(P_{\text{aes}})$$

where $P_{\text{aes}}$ encodes the input asset's appearance and geometry, and $\mathcal{E}_{\text{aes}}$ is the pre-trained VAE encoder.

  • Physical Branch: Compresses a wide range of physical properties to a latent code:

$$P_{\text{plat}} = \mathcal{E}_{\text{phy}}(P_{\text{phy}}, P_{\text{sem}})$$

where $P_{\text{phy}}$ aggregates physical metrics (scale, density, kinematics, affordance) and $P_{\text{sem}}$ is a function/utility text embedding (via CLIP or similar models).
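The following is a minimal PyTorch-style sketch of the two encoders and the learnable residual coupling discussed in the next paragraph. Module choices, dimensions, and names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Sketch of PhysXGen-style dual-branch latent encoding (illustrative only)."""
    def __init__(self, feat_dim=512, latent_dim=64):
        super().__init__()
        # Structural branch: stand-in for the pre-trained aesthetic VAE encoder E_aes.
        self.enc_aes = nn.Linear(feat_dim, latent_dim)
        # Physical branch E_phy: compresses physical metrics plus a text embedding.
        self.enc_phy = nn.Linear(2 * feat_dim, latent_dim)
        # Learnable residual connection letting physical latents modulate structure.
        self.res_phy_to_slat = nn.Linear(latent_dim, latent_dim)

    def forward(self, p_aes, p_phy, p_sem):
        p_slat = self.enc_aes(p_aes)                          # P_slat = E_aes(P_aes)
        p_plat = self.enc_phy(torch.cat([p_phy, p_sem], -1))  # P_plat = E_phy(P_phy, P_sem)
        # Couple the branches so physical constraints influence structural detail.
        p_slat = p_slat + self.res_phy_to_slat(p_plat)
        return p_slat, p_plat
```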

These latents are further coupled by learnable residual connections—ensuring that, for example, physical constraints influence the fine-scale details synthesized by the structural decoder. The entire system is optimized via joint diffusion-based training. The structural branch and the physical branch have separate loss functions (e.g., geometry and property prediction errors) but are jointly aligned in the diffusion latent space using a combined loss:

$$\mathcal{L}_{\text{vae}} = \mathcal{L}_{\text{aes}}^{(\text{color})} + \mathcal{L}_{\text{aes}}^{(\text{geometry})} + \mathcal{L}_{\text{phy}} + \mathcal{L}_{\text{sem}} + \mathcal{L}_{\text{kl}} + \mathcal{L}_{\text{reg}}$$

$$\mathcal{L}_{\text{diff}} = \mathcal{L}_{\text{aes}} + \mathcal{L}_{\text{phy}}$$

where $\mathcal{L}_{\text{phy}}$ enforces accuracy of the physical property outputs and $\mathcal{L}_{\text{reg}}$ supports mesh-structure regularization.
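As a rough illustration of the joint diffusion objective, the sketch below assumes a denoiser that processes both latent branches together; the linear noising scheme and model interface are assumptions for clarity, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_loss(model, p_slat, p_plat):
    """One training step of L_diff = L_aes + L_phy over both latent branches.
    The flow-style linear noising and denoiser interface are illustrative."""
    t = torch.rand(p_slat.shape[0], 1)                   # random diffusion time per sample
    noise_s, noise_p = torch.randn_like(p_slat), torch.randn_like(p_plat)
    noisy_s = (1 - t) * p_slat + t * noise_s             # noised structural latent
    noisy_p = (1 - t) * p_plat + t * noise_p             # noised physical latent
    pred_s, pred_p = model(noisy_s, noisy_p, t)          # denoiser sees both branches jointly
    l_aes = F.mse_loss(pred_s, noise_s)                  # structural diffusion loss
    l_phy = F.mse_loss(pred_p, noise_p)                  # physical diffusion loss
    return l_aes + l_phy
```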

3. PhysXNet: Physics-Grounded 3D Dataset

PhysXNet is the foundational dataset for PhysXGen and represents the first large-scale resource of 3D object/part assets systematically annotated across five “physics-first” dimensions:

| Dimension | Examples/Notes |
| --- | --- |
| Absolute Scale | Object-part dimensions in standard units (e.g., cubic centimeters) |
| Material | Explicit material class, Young's modulus, Poisson's ratio, density |
| Affordance | Likelihood of a part being grasped/touched, ranked at part granularity |
| Kinematics | Joint type, axis, parent–child mesh, movement range and direction |
| Function Desc. | CLIP-based multi-level textual annotation (basic, function, kinematic) |
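To make the five dimensions concrete, the following is a hypothetical sketch of a single part-level record; every field name and unit here is an assumption for illustration, not PhysXNet's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PartAnnotation:
    """Hypothetical PhysXNet-style part record (field names/units are assumed)."""
    scale_cm3: float                  # absolute scale, e.g., part volume in cubic centimeters
    material: str                     # explicit material class, e.g., "oak"
    youngs_modulus_pa: float          # material stiffness
    poisson_ratio: float
    density_kg_m3: float
    affordance_rank: int              # grasp/touch priority at part granularity
    joint_type: str                   # e.g., "revolute", "prismatic", "fixed"
    joint_axis: tuple = (0.0, 0.0, 1.0)          # axis in the parent mesh frame
    movement_range_deg: tuple = (0.0, 90.0)      # allowed motion range
    descriptions: dict = field(default_factory=dict)  # basic/function/kinematic text
```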

PhysXNet and its extended version PhysXNet-XL (containing 6M procedurally annotated objects) enable not only supervised learning of property prediction but also modeling of high-level structure–property dependencies. Annotation uses a scalable human-in-the-loop process: visual isolation and rendering of object parts, automatic VLM annotation (e.g., with GPT-4o), followed by expert refinement using mesh analysis and clustering (e.g., k-means for revolute joint axes).

4. Learning and Property Prediction

PhysXGen’s training strategy emphasizes not only generative fidelity but also the accurate synthesis of physical attributes, as measured by mean absolute error (MAE) on individual property dimensions. This is facilitated by:

  • Rendering and evaluating properties like absolute scale, material, and affordance from multiple random viewpoints, enforcing viewpoint-invariant property predictions.
  • Using a combination of property regression, semantic text–geometry alignment, and explicit geometry–physics latent coupling.
  • Evaluating geometry via Chamfer Distance and F-Score, and visual quality via PSNR across rendered viewpoints (a minimal sketch of the geometry metrics follows this list).
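For reference, here is a hedged, brute-force NumPy sketch of the two geometry metrics; production evaluation code would typically use KD-trees or GPU batching, and the F-Score threshold is an illustrative parameter rather than a value from the paper.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(a: np.ndarray, b: np.ndarray, tau: float = 0.01) -> float:
    """F-Score at threshold tau: harmonic mean of precision and recall."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()  # fraction of a within tau of b
    recall = (d.min(axis=0) < tau).mean()     # fraction of b within tau of a
    return 2 * precision * recall / (precision + recall + 1e-8)
```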

Experimental results indicate that PhysXGen surpasses prior pipelines such as TRELLIS + PhysPre, roughly halving property prediction errors on several attributes (e.g., absolute-scale MAE drops from 12.46 to 6.63 and material MAE from 0.262 to 0.141).

5. Human-in-the-Loop Data Annotation

A structured data annotation pipeline underpins PhysXNet’s corpus quality:

  • Visual Isolation: Alpha compositing for clean component renders.
  • Automated VLM Labeling: Vision-language models annotate fundamental part properties and semantic utility.
  • Expert Refinement: Mesh point clouds (derived from child–parent relationships) enable algorithmic extraction of kinematics (rotation axes, movement ranges), with clustering and plane-fitting for revolute joints (see the sketch after this list).
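As one plausible instance of the kinematic refinement step, a revolute axis can be recovered from the trajectories that part vertices sweep between poses; this sketch (with an assumed trajectory input) shows one reasonable approach, not necessarily the paper's exact procedure.

```python
import numpy as np

def estimate_revolute_axis(traj: np.ndarray) -> np.ndarray:
    """Estimate a revolute joint axis from vertex trajectories.
    traj: (T, N, 3) positions of N part vertices across T articulated poses.
    Each vertex sweeps an arc in a plane normal to the axis, so pooled
    displacement vectors vary least along the axis direction."""
    disp = (traj - traj.mean(axis=0)).reshape(-1, 3)  # center each vertex's arc
    _, _, vt = np.linalg.svd(disp, full_matrices=False)
    return vt[-1] / np.linalg.norm(vt[-1])            # least-variance direction
```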

This yields consistent, fine-grained annotations, enabling robust learning of physics-grounded representations.

6. Applications and Future Developments

PhysXGen’s physically-grounded 3D assets have broad utility in:

  • Simulation environments (robotics, virtual/augmented reality, digital twins) where assets must possess both realistic geometry and simulation-ready physical labels.
  • Training embodied AI and manipulation policies, where affordance maps, material properties, and kinematics are essential for learning robust sensorimotor skills.
  • Industrial workflows requiring digital replicas with empirically meaningful property annotations.
  • Asset design for simulation in physics-based gaming, engineering, and model-based control.

Anticipated advancements include addressing fine-grained property prediction challenges, reducing geometric/physical inconsistencies during generation, enriching the framework with additional material and kinematic property types, and improving semantic-text/geometry alignment for more nuanced functional annotation.

7. Comparative Performance

PhysXGen has been empirically validated to outperform standalone property predictors and prior baselines, especially in learning cross-property consistency. Its dual-branch VAE/diffusion design is a critical factor in achieving low structural and physical property error. Ablation studies corroborate that removing cross-branch latent connections or replacing the joint model with independent predictors leads to a measurable drop in property accuracy and geometry quality.


PhysXGen thus establishes a paradigm for explicitly physics-grounded 3D asset generation, merging generative structural priors with detailed property modeling through a robust dual-branch architecture, supported by a scalable annotation pipeline (2507.12465).
