PhysXNet: Physically-Grounded 3D Asset Modeling

Updated 17 July 2025
  • PhysXNet is a physics-grounded 3D dataset that systematically annotates objects across five dimensions: scale, material, affordance, kinematics, and function.
  • It integrates a dual-branch model, PhysXGen, with a human-in-the-loop pipeline to generate 3D assets that merge structural fidelity with reliable physical properties.
  • The framework enables advanced applications in simulation, robotics, and embodied AI by ensuring that generated models exhibit both geometric accuracy and meaningful physical behavior.

PhysXNet is a large-scale, systematically annotated dataset and accompanying generative framework designed for physically grounded 3D asset modeling. By providing detailed annotations across multiple physical dimensions (scale, material, affordance, kinematics, and function) and leveraging a sophisticated image-to-3D architecture (PhysXGen), PhysXNet directly addresses the need for physically meaningful 3D asset creation and simulation. These developments enable more accurate and functional integration of 3D models in simulation environments, robotics, and embodied AI, facilitating real-world applications where geometric fidelity must be paired with reliable physical properties (2507.12465).

1. Definition and Scope

PhysXNet is the first physics-grounded 3D dataset that systematically annotates objects and their parts across five foundational physical dimensions: absolute scale, material, affordance, kinematics, and function description. Its design is tightly integrated with PhysXGen, a feed-forward, dual-branch framework for image-to-3D asset generation, operating in a unified latent space that captures both structural (geometry and appearance) and physical attributes. The dataset includes over 26,000 meticulously labeled assets and offers extensions (PhysXNet-XL) comprising millions of procedural augmentations. All assets and the corresponding generative model are intended for open release, providing a foundation for research and applications in generative physical AI, simulation, robotics, and beyond.

2. Five-Dimensional Physical Annotation Paradigm

PhysXNet introduces a structured annotation protocol that encompasses:

  • Absolute Scale: Precise physical measurements (e.g., length, width, height) assigned to each object or part, conforming to standard metric conventions. The distribution of these measurements is long-tailed, covering a broad spectrum from small household items to large structures.
  • Material Properties: Each part receives both categorical (e.g., "wood," "metal") and numeric material attributes such as density, Young’s modulus, and Poisson's ratio. These quantitative values are crucial for simulation and downstream physical prediction.
  • Affordance Ranking: Annotated as a prioritized score (1–10) per part, affordance reflects the likelihood and priority of human interaction, motivated by the functional role and usage frequency of the part.
  • Kinematics: Parts are classified with respect to kinematic types—including no constraint (type A), prismatic joints (B), revolute joints (C), hinge joints (D), and rigid connectors (E). Additional parameters, such as joint range and axis or parent-child hierarchy, further detail the potential or allowed motion.
  • Function Description: Semantic labeling at multiple levels—basic, functional, and kinematic—encapsulates the object’s purpose and its operable dynamics, using both free-form and structured text to capture nuances pertinent to practical use.

These multi-level annotations support downstream learning and simulation tasks that require more than geometric fidelity.
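To make the schema concrete, the sketch below shows one way a per-part annotation record spanning these five dimensions could be represented. The field names, units, and enum labels are illustrative assumptions (the A–E labels follow the kinematic taxonomy above), not the released PhysXNet schema.

```python
# Hypothetical per-part annotation record; field names and units are
# illustrative, not the released PhysXNet schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class KinematicType(Enum):
    NO_CONSTRAINT = "A"
    PRISMATIC = "B"
    REVOLUTE = "C"
    HINGE = "D"
    RIGID = "E"

@dataclass
class PartAnnotation:
    # Absolute scale in metres: (length, width, height).
    scale_m: tuple[float, float, float]
    # Categorical and numeric material properties.
    material: str                  # e.g. "wood", "metal"
    density_kg_m3: float
    youngs_modulus_pa: float
    poisson_ratio: float
    # Affordance priority score in [1, 10].
    affordance: int
    # Kinematics: joint type, motion parameters, and part hierarchy.
    kinematic_type: KinematicType
    joint_axis: Optional[tuple[float, float, float]] = None
    joint_range_deg: Optional[tuple[float, float]] = None
    parent_part: Optional[str] = None
    # Free-form / structured function description.
    function_description: str = ""
```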

3. Human-in-the-Loop Annotation Pipeline

The creation of PhysXNet leverages a scalable, semi-automated pipeline that combines vision-language models (VLMs) with expert refinement:

  1. Target Visual Isolation: Components are foregrounded via alpha compositing, enabling focused visual interpretation and reducing ambiguity for automated systems.
  2. VLM Annotation: A large vision-language model (e.g., GPT-4o) processes optimized prompts over rendered views to generate candidate physical, kinematic, and functional properties for each object or part.
  3. Expert Verification and Refinement: Human experts systematically inspect, correct, and enrich difficult cases, particularly nuanced or relational properties (for instance, fitting joint contact surfaces using point-plane analysis and k-means clustering, as sketched below).

This process ensures both throughput and fine-grained annotation fidelity, supporting rapid extension of the dataset (including PhysXNet-XL, with over 6 million augmented assets).
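As an illustration of the joint-fitting refinement in step 3, the sketch below clusters candidate contact points with k-means and fits a plane to each cluster via SVD (a simple form of point-plane analysis). The function and its parameters are our own illustration, not code from the PhysXNet pipeline.

```python
# Illustrative joint-contact fitting: k-means clustering of candidate
# contact points followed by an SVD plane fit per cluster.
import numpy as np
from sklearn.cluster import KMeans

def fit_contact_planes(points: np.ndarray, n_contacts: int = 2):
    """points: (N, 3) candidate contact points between two parts."""
    labels = KMeans(n_clusters=n_contacts, n_init=10).fit_predict(points)
    planes = []
    for k in range(n_contacts):
        cluster = points[labels == k]
        centroid = cluster.mean(axis=0)
        # The plane normal is the right singular vector associated with the
        # smallest singular value of the centred point cloud.
        _, _, vt = np.linalg.svd(cluster - centroid)
        planes.append((centroid, vt[-1]))
    return planes  # list of (point_on_plane, unit_normal) pairs
```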

4. PhysXGen: Dual-Branch Generative Modeling

PhysXGen is a core component for generating 3D assets imbued with physically plausible attributes. The architecture is organized as follows:

  • Dual-Branch Latent Representation: One branch encodes structural (geometry and appearance) features from pretrained 3D models (e.g., TRELLIS, DINOv2), while the other captures physical properties using a dedicated variational autoencoder (VAE) and semantic embeddings (such as CLIP). The latent spaces are combined explicitly, allowing the model to learn and preserve correlations between form and function; a simplified sketch of this layout appears at the end of this section.

P_{plat} = \mathcal{E}_{phy}(P_{phy}, P_{sem}), \qquad P_{slat} = \mathcal{E}_{aes}(P_{aes})

Here P_{phy} contains the concatenated physical attributes (scale, affordance, density, and the 11-D kinematic parameters); P_{sem} incorporates CLIP-derived semantic features; P_{aes} covers appearance and geometry.

  • Residual and Cross-domain Decoding: The physical decoder (\mathcal{D}_{phy}) and the structural decoder (\mathcal{D}_{aes}) interact via residual connections, ensuring that structural cues can influence physical predictions, which is crucial for mutual consistency in generated results.
  • Diffusion-based Generation: Both branches utilize a transformer-based diffusion module, supporting stochastic and diverse asset generation. The overall optimization includes color, geometry, semantic, physical, regularization, and Kullback–Leibler divergence losses:

\mathcal{L}_{vae} = \mathcal{L}^{(color)}_{aes} + \mathcal{L}^{(geometry)}_{aes} + \mathcal{L}_{phy} + \mathcal{L}_{sem} + \mathcal{L}_{kl} + \mathcal{L}_{reg}

A Conditional Flow Matching (CFM) loss is used for the structural branch:

\mathcal{L}_{aes} = \mathbb{E}_{t, x_0, \epsilon} \left\| f(x, t) - (\epsilon - x_0) \right\|_2^2

with a similar term for the physical branch.
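To make the architecture concrete, here is a deliberately simplified sketch of the dual-branch layout with cross-domain conditioning; the actual PhysXGen modules are transformer-based, and the residual pathway is approximated here by concatenation-based conditioning. All module names and dimensions are assumptions.

```python
# Simplified dual-branch sketch: structural and physical encoders produce
# separate latents, and the physical decoder is conditioned on structural
# features (standing in for the paper's residual cross-domain connections).
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    def __init__(self, d_phy=32, d_sem=512, d_aes=512, z_phy=64, z_aes=256):
        super().__init__()
        self.enc_phy = nn.Linear(d_phy + d_sem, z_phy)  # E_phy(P_phy, P_sem) -> P_plat
        self.enc_aes = nn.Linear(d_aes, z_aes)          # E_aes(P_aes) -> P_slat
        self.dec_aes = nn.Linear(z_aes, d_aes)          # D_aes
        self.dec_phy = nn.Linear(z_phy + z_aes, d_phy)  # D_phy sees structural latent

    def forward(self, p_phy, p_sem, p_aes):
        z_phy = self.enc_phy(torch.cat([p_phy, p_sem], dim=-1))
        z_aes = self.enc_aes(p_aes)
        rec_aes = self.dec_aes(z_aes)
        # Cross-domain path: the structural latent informs physical decoding.
        rec_phy = self.dec_phy(torch.cat([z_phy, z_aes], dim=-1))
        return rec_aes, rec_phy
```

The CFM objective above can likewise be written directly from its definition. The sketch below assumes the standard linear interpolation path between a clean latent x_0 and Gaussian noise, with f regressing the constant velocity \epsilon - x_0; the function and tensor names are illustrative rather than taken from the PhysXGen implementation.

```python
# Conditional Flow Matching loss: E_{t, x0, eps} || f(x_t, t) - (eps - x0) ||^2
import torch

def cfm_loss(f, x0: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(x0)                      # Gaussian noise sample
    t = torch.rand(x0.shape[0], device=x0.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcastable time
    x_t = (1.0 - t_) * x0 + t_ * eps                # linear interpolation path
    target = eps - x0                               # velocity of the path
    return ((f(x_t, t) - target) ** 2).flatten(1).sum(-1).mean()
```

An analogous velocity-regression term supervises the physical branch.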

5. Experimental Validation and Generalization

PhysXGen is evaluated for both geometric and physical consistency using PhysXNet data splits (approximately 24K training, 1K validation, and 1K test assets):

  • Geometric Metrics: Peak Signal-to-Noise Ratio (PSNR), Chamfer Distance (CD), and F-score are applied to reconstructed 3D models, showing that PhysXGen achieves comparable or superior fidelity to baselines such as TRELLIS+PhysPre; a brute-force Chamfer Distance computation is sketched after this list.
  • Physical Attribute Accuracy: PhysXGen demonstrates lower mean absolute errors across absolute scale, material, affordance, kinematics, and function labeling compared to geometric-only baselines.
  • Ablation Studies: Incorporating cross-domain (geometry–physics) interactions improves not only physical prediction, but also geometric reconstruction quality, substantiating the efficacy of joint learning.
  • Qualitative Results: Generated models display coherent part-level physical attributes and structurally consistent functionality, even from challenging single-image prompts.
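To make the geometric evaluation concrete, the following is a minimal, brute-force Chamfer Distance between two sampled point clouds; it is included only to illustrate the metric, and practical evaluation code would use a KD-tree or GPU nearest-neighbour search rather than this O(N*M) form.

```python
# Brute-force symmetric Chamfer Distance between point clouds p and q.
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """p: (N, 3), q: (M, 3); sum of mean nearest-neighbour distances."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```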

6. Applications and Impact

PhysXNet and PhysXGen enable practical advances in:

  • Simulation and Embodied AI: Physically annotated assets support high-fidelity virtual environments and robotic agents requiring realistic interaction modeling.
  • Manipulation Planning: Detailed kinematic and affordance labels facilitate reasoning about object operation and manipulation in robotics and AR/VR.
  • Generative Modeling: PhysXGen’s unified latent framework permits synthesis of new 3D assets that are geometrically and physically grounded, overcoming a major limitation of prior geometry-centric 3D generative models.
  • Dataset Scalability: Through automated annotation and procedural augmentation, PhysXNet provides a resource capable of supporting large-scale supervised and self-supervised research in physical scene understanding.

7. Future Directions

Several research opportunities are highlighted:

  • Refinement of Fine-grained Predictions: Improving the accuracy of difficult physical properties (e.g., subtle distinctions in large objects, resolving embedding artifacts for complex function descriptions).
  • Dataset Enrichment: Introducing additional physical parameters, rare kinematic configurations, and integrating real-world scan data to expand diversity.
  • Coupling Enhancement: Exploring architectures or learning strategies that further disentangle and/or couple structural and physical latent representations for improved generalizability and control.
  • Downstream Integration: Potential utilization for physical simulation, robotic manipulation, and dynamic scene understanding is anticipated, as well as transfer learning to real-world data.

PhysXNet, accompanied by its generative architecture and systematic physical annotation, represents a significant advancement in aligning virtual 3D modeling with the requirements of physical simulation, robotics, and AI. Its open-source availability is intended to catalyze subsequent progress and application in generative physical AI (2507.12465).

References

  1. arXiv:2507.12465