PhysXGen Framework: 3D Asset Generation

Updated 14 August 2025
  • PhysXGen is a physically grounded framework for 3D asset generation that integrates geometric, material, and physical attributes in a unified process.
  • It utilizes a dual-branch architecture to jointly encode visual and physical data, ensuring coherent asset synthesis without post-hoc refinements.
  • The framework leverages PhysXNet, a comprehensive physics-annotated dataset, to validate performance in simulation, robotics, and digital content creation.

The PhysXGen Framework is an end-to-end, feed-forward paradigm for physically grounded 3D asset generation. Its design explicitly integrates physical knowledge spanning the geometric, material, affordance, kinematic, and semantic domains into the generative process of 3D structure synthesis. The core architectural innovation is a dual-branch model that produces 3D assets with plausible, verifiable physical predictions while preserving high-fidelity geometry and appearance. This approach is supported by the creation of PhysXNet, a comprehensive physics-annotated 3D dataset for joint learning and rigorous evaluation. By optimizing structure and physics together, PhysXGen systematically advances generative physical AI and lays a foundation for simulation, embodied AI, and robotics.

1. Framework Structure and Objectives

PhysXGen’s chief motivation is to bridge the gap between traditional geometry-focused generative models and the requirements of physically plausible asset generation for real-world simulation. Most preceding approaches ignore quantitative physical attributes, limiting their applicability to rigorous simulation tasks. PhysXGen addresses this via:

  • Joint encoding of geometry/appearance and physics, producing assets annotated with real-world scale, material properties, affordance scores, and detailed kinematic specifications.
  • Dual-branch architecture (structural and physical), allowing latent feature interaction and consistency enforcement between geometric and physical spaces.
  • Integration of a variational autoencoder (VAE) for efficient and disentangled latent compression of asset properties.

The feed-forward nature of the design means that asset generation does not rely on post-hoc annotation or sequential refinement; instead, physical knowledge is injected directly and globally in the generation process.

2. Dual-Branch Architecture and Latent Modeling

Central to PhysXGen is the dual-branch mechanism:

  • Structural Branch: Encodes geometric and appearance signals, employing pretrained encoders such as DINOv2 for appearance processing. The output, $P_{aes}$, is passed to a VAE encoder $\mathcal{E}_{aes}$ to yield the structured latent representation $P_{slat}$.
  • Physical Branch: Encodes physics attributes, including absolute scale, material (density, Young's modulus, Poisson's ratio), affordance, kinematic parameters, and semantic descriptors (function narratives via CLIP embeddings). These are encoded by a physical VAE encoder $\mathcal{E}_{phy}$, yielding the physical latent $P_{plat}$.

The two branches are coupled through residual and skip connections during the diffusion-based generation phase. Joint training uses a conditional flow matching (CFM) loss that enforces consistency between structure and physics. The branches' latent representations are given by:

  • $P_{plat} = \mathcal{E}_{phy}(P_{phy}, P_{sem})$
  • $P_{slat} = \mathcal{E}_{aes}(P_{aes})$

where $P_{phy}$ is a multidimensional attribute tensor and $P_{sem}$ encompasses the semantic function descriptions. PhysXGen leverages this mechanism to ensure that structural choices logically cohere with physical properties.
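To make the dual-branch encoding concrete, the following is a minimal PyTorch sketch. Module sizes, layer choices, and the reparameterization interface are illustrative assumptions rather than the authors' implementation; the code only mirrors the structure of the two equations above.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Minimal sketch of PhysXGen-style dual-branch latent encoding.
    Dimensions and module shapes are illustrative assumptions."""
    def __init__(self, d_aes=768, d_phy=32, d_sem=512, d_lat=64):
        super().__init__()
        # Structural branch: compresses appearance/geometry features
        # (e.g., DINOv2 tokens) into the structured latent P_slat.
        self.enc_aes = nn.Sequential(
            nn.Linear(d_aes, 256), nn.GELU(), nn.Linear(256, 2 * d_lat)
        )
        # Physical branch: compresses physical attributes P_phy (scale,
        # material, affordance, kinematics) together with semantic CLIP
        # embeddings P_sem into the physical latent P_plat.
        self.enc_phy = nn.Sequential(
            nn.Linear(d_phy + d_sem, 256), nn.GELU(), nn.Linear(256, 2 * d_lat)
        )

    @staticmethod
    def reparameterize(stats):
        # Split into VAE posterior parameters and sample with the
        # reparameterization trick.
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, p_aes, p_phy, p_sem):
        p_slat, mu_s, lv_s = self.reparameterize(self.enc_aes(p_aes))
        p_plat, mu_p, lv_p = self.reparameterize(
            self.enc_phy(torch.cat([p_phy, p_sem], dim=-1))
        )
        return p_slat, p_plat, (mu_s, lv_s, mu_p, lv_p)
```

In this sketch, `p_aes` would come from a frozen appearance encoder such as DINOv2, and `p_sem` from CLIP embeddings of the function narrative; the posterior statistics feed the KL term of the training objective described in Section 4.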

3. PhysXNet: Physics-Annotated Asset Dataset

A major contribution underpinning PhysXGen is PhysXNet—the first systematic, large-scale physics-annotated 3D dataset. Its annotation scheme spans five foundational dimensions:

| Dimension | Description |
| --- | --- |
| Absolute Scale | Real-world size and scale parameters |
| Material | Material type, density, Young's modulus, Poisson's ratio |
| Affordance | Graspability and part-interaction priority |
| Kinematics | Detailed movement parameters, joint types, axes, etc. |
| Function | Semantic, functional, and kinematic descriptions |
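As a reading aid, the five dimensions might map onto a per-part record like the following Python schema. Every field name, type, and unit here is a hypothetical assumption for illustration; the released PhysXNet format is not reproduced in this summary.

```python
from dataclasses import dataclass

@dataclass
class PartAnnotation:
    """Hypothetical per-part record covering PhysXNet's five dimensions.
    Field names and units are illustrative, not the released schema."""
    # Absolute scale: real-world bounding-box extent in meters
    scale_m: tuple[float, float, float] = (0.0, 0.0, 0.0)
    # Material: type plus continuum parameters
    material: str = "unknown"
    density_kg_m3: float = 0.0
    youngs_modulus_pa: float = 0.0
    poisson_ratio: float = 0.0
    # Affordance: graspability / interaction priority score in [0, 1]
    affordance: float = 0.0
    # Kinematics: joint type, axis, and motion range
    joint_type: str = "fixed"  # e.g., "revolute", "prismatic"
    joint_axis: tuple[float, float, float] = (0.0, 0.0, 1.0)
    motion_range: tuple[float, float] = (0.0, 0.0)
    # Function: free-text semantic description (CLIP-encoded downstream)
    function_text: str = ""
```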

Annotations are acquired with a scalable human-in-the-loop pipeline:

  1. Visual Isolation: Each part of a 3D asset is rendered with alpha compositing to generate isolated prompts for robust labeling.
  2. Automated VLM Annotation: Vision-language models (e.g., GPT-4o) provide initial labels for fundamental attributes on the isolated renders.
  3. Expert Refinement: Human annotators verify and refine complex property assignments (e.g., detailed kinematics). Movable parts are analyzed using point cloud extraction, plane fitting, and axis selection (including k-means for rotational joints), yielding precise morphology and movement specifications; see the sketch after this list.
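A minimal sketch of the axis-selection step for a movable part is shown below, assuming numpy and scikit-learn. The candidate-direction generation is a stand-in, since the paper's exact procedure is not spelled out in this summary; the sketch only combines the stated ingredients (plane fitting and k-means).

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_rotation_axis(points: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Illustrative kinematics refinement: fit a plane to a movable
    part's point cloud via SVD, project candidate directions into that
    plane, and select the dominant direction with k-means."""
    centered = points - points.mean(axis=0)
    # Plane fitting: the right-singular vector with the smallest singular
    # value is the normal of the best-fit plane through the points.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    # Candidate axes: chords between random point pairs, projected into
    # the fitted plane and normalized (a stand-in for edge extraction).
    rng = np.random.default_rng(0)
    pairs = rng.choice(len(points), size=(256, 2))
    chords = points[pairs[:, 0]] - points[pairs[:, 1]]
    chords -= np.outer(chords @ normal, normal)
    chords /= np.linalg.norm(chords, axis=1, keepdims=True) + 1e-9
    chords[chords[:, 0] < 0] *= -1  # d and -d describe the same axis
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(chords)
    dominant = int(np.bincount(km.labels_).argmax())  # densest cluster wins
    axis = km.cluster_centers_[dominant]
    return axis / np.linalg.norm(axis)
```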

PhysXNet establishes a standardized foundation for learning robust latent representations in PhysXGen.

4. Training Strategy and Technical Formulations

Training proceeds by minimizing a composite loss designed to enforce consistency across visual, geometric, and physical domains:

$$\mathcal{L}_{vae} = \mathcal{L}_{aes}^{color} + \mathcal{L}_{aes}^{geo} + \mathcal{L}_{phy} + \mathcal{L}_{sem} + \mathcal{L}_{kl} + \mathcal{L}_{reg}$$

where:

  • $\mathcal{L}_{aes}^{color}$: Color reconstruction for visual fidelity.
  • $\mathcal{L}_{aes}^{geo}$: Geometric structure correspondence.
  • $\mathcal{L}_{phy}$: Physical property matching (absolute scale, material, affordance, kinematics).
  • $\mathcal{L}_{sem}$: Latent consistency for function descriptions.
  • $\mathcal{L}_{kl}$: KL divergence regularization in the VAE latent space.
  • $\mathcal{L}_{reg}$: Structural mesh regularization.
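A hedged sketch of how these six terms could be combined in code follows. The individual reconstruction losses and the term weights are assumptions for illustration, not choices or values reported in the paper.

```python
import torch
import torch.nn.functional as F

def vae_loss(out, gt, weights=None):
    """Sketch of the composite objective L_vae. `out`/`gt` are dicts of
    predicted and ground-truth tensors; weights are assumed hyperparameters."""
    w = weights or {"color": 1.0, "geo": 1.0, "phy": 1.0,
                    "sem": 1.0, "kl": 1e-3, "reg": 1e-2}
    # Color reconstruction over rendered views
    l_color = F.l1_loss(out["color"], gt["color"])
    # Geometric correspondence (e.g., occupancy / SDF regression)
    l_geo = F.mse_loss(out["geo"], gt["geo"])
    # Physical property matching: scale, material, affordance, kinematics
    l_phy = F.l1_loss(out["phy"], gt["phy"])
    # Latent consistency for function descriptions (CLIP-space cosine)
    l_sem = 1.0 - F.cosine_similarity(out["sem"], gt["sem"], dim=-1).mean()
    # KL regularization of the VAE posterior N(mu, sigma^2) against N(0, I)
    mu, logvar = out["mu"], out["logvar"]
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Structural mesh regularization (placeholder: latent norm penalty)
    l_reg = out["latent"].pow(2).mean()
    return (w["color"] * l_color + w["geo"] * l_geo + w["phy"] * l_phy
            + w["sem"] * l_sem + w["kl"] * l_kl + w["reg"] * l_reg)
```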

The transformer diffusion module further enables compositional sampling, reinforcing the integration of physical knowledge during asset generation. This strategy ensures predicted physical parameters remain tightly coupled with generated geometry.
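For the conditional flow matching loss used in the diffusion phase, a minimal formulation under a linear interpolation path could look like the following. The model signature, conditioning interface, and schedule are assumptions; only the CFM objective itself is standard.

```python
import torch

def cfm_loss(model, x1, cond):
    """Minimal conditional flow-matching objective: regress the velocity
    field along straight paths between noise x0 and data latents x1,
    conditioned on the other branch's latent (`cond`)."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                   # point on the linear path
    v_target = x1 - x0                             # constant path velocity
    v_pred = model(xt, t.flatten(), cond)          # transformer predicts velocity
    return torch.mean((v_pred - v_target) ** 2)
```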

5. Empirical Validation and Performance

Experiments on benchmark assets quantitatively and qualitatively validate PhysXGen’s superiority over both geometry-only models and those augmented with decoupled physical predictors ("PhysPre"). Key evaluation metrics include:

  • Appearance: Mean PSNR over rendered views.
  • Geometry: Chamfer Distance (CD) and F-Score.
  • Physical Accuracy: Euclidean MAE between generated and ground-truth physical attribute arrays across multiple views.
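The geometry and physics metrics admit compact reference implementations. The numpy sketch below assumes point sets of shape (N, 3) and flattened attribute arrays; these layouts are assumptions, not the paper's evaluation code.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).
    Brute-force version for illustration; large sets would use a KD-tree."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) squared dists
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def physical_mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error over per-view physical attribute arrays."""
    return float(np.abs(pred - gt).mean())
```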

Results show that joint latent modeling in PhysXGen yields systematically lower MAE across all physical dimensions compared to baselines. Additionally, the framework exhibits generalization to unseen asset categories and maintains geometric and physics fidelity on fine-grained parts, supporting credible downstream simulation and manipulation.

6. Applications, Limitations, and Implications

PhysXGen’s fusion of structure and physics has multiple implications:

  • Simulation/Embodied AI: Physics-annotated assets facilitate robust simulation of interactions, force propagation, and functional movement, enabling more accurate embodied reasoning and planning.
  • Robotics: Detailed kinematic and material labels permit real-time adaptability for manipulation and grasp planning tasks.
  • Digital Content Creation: Physics-first asset generation reduces post-processing and manual annotation, streamlining pipelines for gaming, industrial simulation, and VR.
  • Dataset Expansion: The authors recommend further dataset growth (synthetic and scanned assets), refinement of physical property modeling, and normalization improvements for attributes like absolute scale and density.

A plausible implication is that the joint optimization architecture can be extended to other modalities (e.g., dynamic meshes, multibody systems) where cross-domain knowledge is necessary for realistic simulation and AI control.

7. Future Directions

Promising future research avenues articulated by the authors include:

  • Enhanced learning mechanisms for capturing subtle physical details and correlations.
  • Integration with more sophisticated normalization and regularization schemes to ensure robustness across large dynamic ranges of physical parameters.
  • Expansion to broader asset domains, supporting complex assemblies, material composites, and highly articulated bodies.
  • Transfer and deployment in advanced simulation platforms, robotics, and virtual world synthesis.

Such developments may accelerate the maturation of physically plausible generative frameworks, critically advancing simulation-based scientific computing and embodied decision-making.


PhysXGen is distinguished by its dual-branch architecture, comprehensive physics-annotated dataset (PhysXNet), and rigorous training protocol, collectively defining a new paradigm in physically grounded 3D asset generation (Cao et al., 16 Jul 2025). These contributions substantiate its relevance as a foundation for future generative physical AI and simulation research.
