PhysXGen: Physics-Informed 3D Asset Generation
- PhysXGen is a physics-grounded framework that fuses 3D generative modeling with explicit physical property annotation.
- It features a dual-branch architecture separately encoding geometric and physical attributes to ensure realistic and simulation-compatible outputs.
- Leveraging the extensive PhysXNet dataset and joint diffusion-based training, it markedly reduces property prediction errors compared to prior approaches.
PhysXGen is a feed-forward framework for physics-grounded 3D asset generation, designed to couple state-of-the-art 3D generative models with explicit modeling and prediction of physical properties. The goal is to produce 3D assets that are not only visually and structurally plausible but also annotated and constrained with meaningful physical attributes—including absolute scale, material properties, affordance, part-level kinematics, and functional descriptions—thereby facilitating real-world simulation, embodied AI, and other physically-aware applications (2507.12465).
1. Fundamental Principles and Objectives
PhysXGen addresses a significant gap in the domain of 3D generation: most prior methods focus on geometric detail and texture while neglecting attributes essential for physical realism and downstream simulation. By explicitly injecting physical knowledge into the generative process, PhysXGen enables the end-to-end creation of 3D assets that are natively compatible with simulation engines and robotics systems. This is achieved through dual-branch latent modeling and the integration of a physics-annotated 3D dataset, PhysXNet, supporting extensible physical annotation and property learning.
2. Dual-Branch Architecture and Model Design
A central distinguishing feature of PhysXGen is its dual-branch architecture, which learns structural–physical correlations at the latent-representation level:
- Structural Branch: Encodes geometric and appearance features (e.g., using pre-trained VAE modules such as those from DINOv2/TRELLIS), producing the structural latent $z_{\text{struct}} = \mathcal{E}(x)$, where $x$ encodes the input asset's appearance and geometry and $\mathcal{E}$ is the pre-trained VAE encoder.
- Physical Branch: Compresses a wide range of physical properties into a latent code $z_{\text{phys}} = \mathcal{E}_{\text{phys}}(p, t)$, where $p$ aggregates physical metrics (scale, density, kinematics, affordance) and $t$ is a function/utility text embedding (via CLIP or similar models).
These latents are further coupled by learnable residual connections, ensuring that, for example, physical constraints influence the fine-scale details synthesized by the structural decoder. The entire system is optimized via joint diffusion-based training. The structural branch and the physical branch have separate loss functions (e.g., geometry and property prediction errors) but are jointly aligned in the diffusion latent space using a combined loss $\mathcal{L} = \mathcal{L}_{\text{struct}} + \lambda \mathcal{L}_{\text{phys}}$, where $\mathcal{L}_{\text{phys}}$ enforces accuracy of physical property outputs and $\mathcal{L}_{\text{struct}}$ supports mesh structure regularization.
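As a minimal PyTorch-style sketch of this design: the module below couples the two latents with a learnable residual projection and forms the combined loss. The encoder layers, feature dimensions, and L1 placeholders are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualBranchLatent(nn.Module):
    """Illustrative dual-branch encoder: structural and physical latents
    coupled by a learnable residual connection (hypothetical layout)."""
    def __init__(self, d_struct=512, d_phys=128):
        super().__init__()
        # Stand-ins for the pre-trained structural VAE encoder and the
        # physical-property encoder; the real model uses TRELLIS-style modules.
        self.enc_struct = nn.Linear(1024, d_struct)
        self.enc_phys = nn.Linear(64, d_phys)
        # Learnable residual projection letting the physical latent modulate
        # the structural latent (the cross-branch coupling).
        self.phys_to_struct = nn.Linear(d_phys, d_struct)

    def forward(self, x_geo, x_phys):
        z_struct = self.enc_struct(x_geo)       # z_struct = E(x)
        z_phys = self.enc_phys(x_phys)          # z_phys = E_phys(p, t)
        z_struct = z_struct + self.phys_to_struct(z_phys)  # residual coupling
        return z_struct, z_phys

def combined_loss(geo_pred, geo_target, phys_pred, phys_target, lam=1.0):
    """L = L_struct + lam * L_phys, with L1 terms as placeholders."""
    l_struct = nn.functional.l1_loss(geo_pred, geo_target)
    l_phys = nn.functional.l1_loss(phys_pred, phys_target)
    return l_struct + lam * l_phys
```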
3. PhysXNet: Physics-Grounded 3D Dataset
PhysXNet is the foundational dataset for PhysXGen and represents the first large-scale resource of 3D object/part assets systematically annotated across five “physics-first” dimensions:
| Dimension | Examples / Notes |
|---|---|
| Absolute Scale | Object- and part-level dimensions in standard units (e.g., centimeters) |
| Material | Explicit material class, Young's modulus, Poisson's ratio, density |
| Affordance | Likelihood of grasp/touch interaction, ranked at part granularity |
| Kinematics | Joint type, axis, parent–child mesh, movement range and direction |
| Function Desc. | CLIP-based multi-level textual annotation (basic, function, kinematic) |
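To make the schema concrete, the sketch below shows a hypothetical per-part annotation record spanning the five dimensions; field names, units, and values are illustrative and do not reflect PhysXNet's actual serialization format.

```python
from dataclasses import dataclass, field

@dataclass
class PartAnnotation:
    """Hypothetical per-part record covering PhysXNet's five dimensions."""
    # Absolute scale: part bounding-box dimensions in centimeters.
    scale_cm: tuple = (12.0, 4.5, 3.0)
    # Material: class plus mechanical parameters.
    material: str = "ABS plastic"
    youngs_modulus_gpa: float = 2.3
    poisson_ratio: float = 0.35
    density_kg_m3: float = 1050.0
    # Affordance: ranked likelihood of grasp/touch interaction.
    affordance_rank: dict = field(
        default_factory=lambda: {"grasp": 0.9, "press": 0.2})
    # Kinematics: joint type, axis, parent part, and movement range.
    joint_type: str = "revolute"
    joint_axis: tuple = (0.0, 0.0, 1.0)
    parent_part: str = "body"
    range_deg: tuple = (0.0, 120.0)
    # Function description: multi-level text (basic / function / kinematic).
    description: str = "handle; rotates about the hinge to open the lid"
```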
PhysXNet and its extended version PhysXNet-XL (containing 6M procedurally-annotated objects) enable not only supervised learning of property prediction but also modeling of high-level structure–property dependencies. Annotation utilizes a scalable human-in-the-loop process: visual isolation and rendering of object parts, automatic VLM annotation (e.g., GPT-4o), followed by expert refinement using mesh analysis and clustering (e.g., k-means for revolute joint axes).
4. Learning and Property Prediction
PhysXGen’s training strategy emphasizes not only generative fidelity but also the accurate synthesis of physical attributes, as measured by mean absolute error (MAE) on individual property dimensions. This is facilitated by:
- Rendering and evaluating properties like absolute scale, material, and affordance from multiple random viewpoints, enforcing viewpoint-invariant property predictions.
- Using a combination of property regression, semantic text–geometry alignment, and explicit geometry–physics latent coupling.
- Evaluating geometry via Chamfer Distance and F-Score, and visual quality with PSNR across rendered viewpoints (see the metric sketch after this list).
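As a minimal NumPy sketch of two of the reported metrics, the functions below compute MAE over a property dimension and a symmetric Chamfer Distance between sampled point clouds. The brute-force nearest-neighbor search is an assumption for brevity; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def property_mae(pred, target):
    """Mean absolute error over a predicted property dimension."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(target)))

def chamfer_distance(p1, p2):
    """Symmetric Chamfer Distance between point sets of shape (N, 3)
    and (M, 3), using brute-force pairwise distances (fine for small
    sampled clouds)."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```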
Experimental results indicate that PhysXGen surpasses prior pipelines such as TRELLIS + PhysPre, roughly halving property prediction errors across several attributes (e.g., absolute scale MAE drops from 12.46 to 6.63, material MAE from 0.262 to 0.141).
5. Human-in-the-Loop Data Annotation
A structured data annotation pipeline underpins PhysXNet’s corpus quality:
- Visual Isolation: Alpha compositing for clean component renders.
- Automated VLM Labeling: Vision-language models annotate fundamental part properties and semantic utility.
- Expert Refinement: Mesh point clouds (derived from child–parent relationships) enable algorithmic extraction of kinematics (rotational axes, movement ranges), with clustering and plane fitting for revolute joints (a minimal sketch of the axis fitting follows below).
This yields consistent, fine-grained annotations, enabling robust learning of physics-grounded representations.
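As a minimal sketch of the plane-fitting idea for revolute joints, the function below assumes corresponding vertices of a part observed at two articulation states; the actual pipeline additionally uses clustering (e.g., k-means) and operates on child–parent mesh point clouds, so this is an illustration of the geometric principle only.

```python
import numpy as np

def estimate_revolute_axis(verts_a, verts_b):
    """Estimate a revolute joint axis direction from one part observed at
    two articulation states (corresponding vertices, shape (N, 3)).

    For a pure rotation, every per-vertex displacement vector lies in a
    plane perpendicular to the rotation axis, so the axis is the direction
    of least variance of the displacements (smallest right singular vector).
    """
    disp = verts_b - verts_a
    disp = disp - disp.mean(axis=0)              # center before SVD
    _, _, vt = np.linalg.svd(disp, full_matrices=False)
    axis = vt[-1]                                # normal of displacement plane
    return axis / np.linalg.norm(axis)
```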
6. Applications and Future Developments
PhysXGen’s physically-grounded 3D assets have broad utility in:
- Simulation environments (robotics, virtual/augmented reality, digital twins) where assets must possess both realistic geometry and simulation-ready physical labels.
- Training embodied AI and manipulation policies, where affordance maps, material properties, and kinematics are essential for learning robust sensorimotor skills.
- Industrial workflows requiring digital replicas with empirically meaningful property annotations.
- Asset design for simulation in physics-based gaming, engineering, and model-based control.
Anticipated advancements include addressing fine-grained property prediction challenges, reducing geometric/physical inconsistencies during generation, enriching the framework with additional material and kinematic property types, and improving semantic-text/geometry alignment for more nuanced functional annotation.
7. Comparative Performance
PhysXGen has been empirically validated to outperform standalone property predictors and prior baselines, especially in learning cross-property consistency. Its dual-branch VAE/diffusion design is a critical factor in achieving low structural and physical property error. Ablation studies corroborate that removing cross-branch latent connections or replacing the joint model with independent predictors leads to a measurable drop in property accuracy and geometry quality.
PhysXGen thus establishes a paradigm for explicitly physics-grounded 3D asset generation, merging generative structural priors with detailed property modeling via a robust dual-branch architecture, supported by a scalable annotation pipeline (2507.12465).