
PhysX-Anything: Simulation-Ready 3D Assets

Updated 18 November 2025
  • PhysX-Anything is a generative framework that converts single images into simulation-ready 3D models with explicit geometry, articulation, and physical properties.
  • It employs a vision-language model and efficient voxel tokenization to compress complex 3D geometry by 193×, enabling diverse asset creation.
  • The framework outputs assets in standard formats like URDF and SDF, supporting seamless integration with robotics, embodied AI, and physics-based simulations.

PhysX-Anything is a simulation-ready 3D physical asset generative framework introduced to bridge the gap between visual 3D modeling and simulation-capable, physically parameterized, articulated object models suitable for embodied AI, robotics, and physics-based interaction. Unlike previous 3D generation approaches, which treat geometry largely as a static visual representation, PhysX-Anything produces assets that capture explicit shape, articulation, and physical properties directly from a single in-the-wild input image (Cao et al., 17 Nov 2025). Central to this contribution are a vision-language model (VLM)-driven generative architecture, a highly efficient 3D tokenization representation, and the large-scale, richly annotated PhysX-Mobility dataset spanning over 2,000 objects with physical and articulation metadata.

1. Simulation-Ready Physical 3D Asset Generation

PhysX-Anything is the first framework that generates high-quality, simulation-ready 3D assets containing explicit geometry, per-part articulation, and physical parameters directly inferable from a single RGB image. The generative model employs a VLM backbone, enabling it to interpret and condition both textual and visual context, and advances beyond prior approaches by learning explicit geometry without sacrificing simulation fidelity or generative diversity. Key differentiators include:

  • Generation of assets with physical attributes (mass, density, friction, center of mass, inertia tensor, restitution, material label) and articulated joints (type, axis, frame, limits, damping, stiffness, effort); a schema sketch follows this list.
  • Efficiency in geometry representation, allowing generative training within conventional VLM token budgets (achieved by a 193× reduction in token count relative to naive mesh serialization).
  • Assets output in standard simulation formats (URDF, SDF, glTF, MJCF), enabling direct deployment into MuJoCo, Gazebo, Isaac Gym, and other physics engines (Cao et al., 17 Nov 2025).
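
The per-part attribute set above can be pictured as a simple record. The following dataclass sketch is our own illustration of such a schema, with field names mirroring the bullets above rather than the framework's actual API:

from dataclasses import dataclass
from typing import Tuple

@dataclass
class PartPhysics:
    """Hypothetical per-part physical record (illustrative, not the real API)."""
    mass: float                                  # m, in kg
    density: float                               # rho, in kg/m^3
    friction: float                              # Coulomb friction coefficient
    center_of_mass: Tuple[float, float, float]   # r_com in the part frame, m
    inertia: Tuple[float, ...]                   # (ixx, iyy, izz, ixy, ixz, iyz), kg*m^2
    restitution: float                           # coefficient of restitution e
    material: str                                # material label, e.g. "plastic"

@dataclass
class JointSpec:
    """Hypothetical joint record mirroring the articulation fields above."""
    joint_type: str                              # "revolute" | "prismatic" | "fixed"
    axis: Tuple[float, float, float]             # joint axis a in the parent frame
    limits: Tuple[float, float]                  # (q_min, q_max)
    damping: float                               # b
    stiffness: float                             # k
    effort: float                                # maximum actuator effort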

2. The PhysX-Mobility Dataset: Scale and Annotation Protocol

PhysX-Mobility, released alongside PhysX-Anything, is a large-scale dataset comprising 2,063 unique 3D object instances distributed over 47 categories, a more than 2× increase in categories and 2.5× in instance count over previous physical 3D datasets such as PhysXGen. Data sources include PartNet-Mobility, ShapeNet, Objaverse, and proprietary real-world 3D scans.

The annotation pipeline is as follows:

  1. Geometry cleaning and retopology (non-manifold face removal; metric unification).
  2. Semantic segmentation (using PartNet labels or learned part-field networks).
  3. Physical attribute assignment by material lookup, volume integration, and CAD-based calculations. Density $\rho$ per part is assigned by material; mass follows as $m = \rho V$; the center of mass $r_\mathrm{com}$ and inertia tensor $I$ are computed using standard CAD integrals (see the sketch after this list).
  4. Articulation parameters are annotated for all movable parts, including joint type (revolute/prismatic), axis $a$, frame $(p, R)$, limits $[q_\mathrm{min}, q_\mathrm{max}]$, and dynamic parameters (damping $b$, stiffness $k$, maximum effort). Limits and frames are validated by in-simulation tests (rotating joints to collision); $b$ and $k$ are inferred from oscillation tests.
  5. Synthetic data augmentation is applied, perturbing densities and joint limits to increase coverage.
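
Step 3 reduces to discretized volume integrals once each part is sampled on a voxel grid. The sketch below illustrates this under a uniform-density-per-part assumption; the function and its argument names are our own, not the actual annotation pipeline:

import numpy as np

def part_mass_properties(voxel_centers, voxel_volume, density):
    """Approximate m, r_com, and I for a uniform-density part.

    voxel_centers: (N, 3) array of occupied-voxel center coordinates, in m
    voxel_volume:  volume of a single voxel cell, in m^3
    density:       material density rho, in kg/m^3
    """
    cell_mass = density * voxel_volume        # mass of one voxel cell
    mass = cell_mass * len(voxel_centers)     # m = rho * V
    r_com = voxel_centers.mean(axis=0)        # center of mass
    r = voxel_centers - r_com                 # offsets about the COM
    # Discretized inertia: I = sum_i m_i * (|r_i|^2 * I_3 - r_i r_i^T)
    inertia = cell_mass * (np.eye(3) * (r ** 2).sum() - r.T @ r)
    return mass, r_com, inertia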

Assets are exported using the following schemas (a minimal URDF illustration follows the list):

  • URDF: For ROS/MuJoCo, encoding all inertial, visual, collision, and joint specs.
  • SDF: For Gazebo/Isaac Gym.
  • glTF: Textured visualization.
  • MJCF: For direct MuJoCo use.
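
As a concrete illustration of the URDF schema, the fragment below sketches a single revolute-joint asset. Element and attribute names follow the standard URDF specification; the link names and numeric values are invented for illustration, not drawn from PhysX-Mobility:

import xml.etree.ElementTree as ET

# Hand-written minimal URDF: one articulated lid on a base body.
URDF_SKETCH = """\
<robot name="example_asset">
  <link name="base">
    <inertial>
      <origin xyz="0 0 0.05" rpy="0 0 0"/>
      <mass value="1.2"/>
      <inertia ixx="0.01" ixy="0" ixz="0" iyy="0.01" iyz="0" izz="0.005"/>
    </inertial>
  </link>
  <link name="lid"/>
  <joint name="lid_hinge" type="revolute">
    <parent link="base"/>
    <child link="lid"/>
    <origin xyz="0 0 0.1" rpy="0 0 0"/>
    <axis xyz="0 1 0"/>
    <limit lower="0.0" upper="1.57" effort="5.0" velocity="1.0"/>
    <dynamics damping="0.05" friction="0.01"/>
  </joint>
</robot>
"""

# Sanity-check that the fragment is well-formed XML.
assert ET.fromstring(URDF_SKETCH).tag == "robot"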

3. Efficient Geometry Representation and Tokenization

To address VLM token budget constraints, PhysX-Anything introduces a coarse-to-fine voxel-based tokenization. The pipeline consists of:

  1. Coarse voxelization: the input mesh is downsampled to a $32 \times 32 \times 32$ grid; each occupied voxel maps to a unique linear index.
  2. Token serialization: consecutive voxel indices are merged into range tokens (e.g., "15-27"). Compared to raw mesh text ($\sim$38,000 tokens) or vertex quantization ($\sim$7,200 tokens), range-merged voxelization reduces the input to $\sim$197 tokens, a 193× reduction.
  3. Decoding: the output token string is parsed into a set of voxels, which conditions mesh refinement by a structured-latent diffusion decoder. The loss

$$\mathcal{L}_\mathrm{flow} = \lVert f_\theta(x_t, c, V^{\mathrm{low}}, t) - (\epsilon - x_0) \rVert_2^2$$

is minimized; the final high-resolution mesh is reconstructed using Marching Cubes. Part segmentation propagates via nearest-neighbor assignment from voxels to mesh faces.

The following Python function implements the encoding procedure:

def encode_voxels(voxel_set, grid=32):
    """Serialize occupied (i, j, k) voxels into compact range tokens, e.g. "3 15-27"."""
    if not voxel_set:
        return ""
    # Flatten each voxel to a linear index i*grid^2 + j*grid + k, then sort.
    idxs = sorted(i * grid * grid + j * grid + k for (i, j, k) in voxel_set)

    def flush(start, prev):
        # Emit a single index or an inclusive "start-end" range token.
        return str(start) if start == prev else f"{start}-{prev}"

    ranges, start, prev = [], idxs[0], idxs[0]
    for idx in idxs[1:]:
        if idx == prev + 1:          # extend the current consecutive run
            prev = idx
        else:                        # run broken: emit it and start a new one
            ranges.append(flush(start, prev))
            start = prev = idx
    ranges.append(flush(start, prev))
    return " ".join(ranges)

4. Physical and Articulation Parameter Specification

PhysX-Mobility assets are annotated with dense per-part physical and joint specifications. For physical properties:

  • Mass ($m$) and density ($\rho$): derived from per-part material and volume.
  • Center of mass ($r_\mathrm{com}$) and inertia tensor ($I$): computed by

$$I = \int_V \rho \left( \lVert x \rVert^2 I_3 - x x^\mathrm{T} \right) \mathrm{d}V$$

  • Friction coefficients ($\mu_s$, $\mu_d$) and restitution ($e$): material-table lookup.

Articulation for each joint specifies:

  • Type: revolute (rotational), prismatic (translational), or fixed.
  • Axis $a$ and frame $(p, R)$: for revolute/prismatic joints.
  • Limits: $\theta_\mathrm{min} \leq \theta \leq \theta_\mathrm{max}$ for revolute; $d_\mathrm{min} \leq d \leq d_\mathrm{max}$ for prismatic.
  • Dynamics: damping $b$, stiffness $k$, and effort limits.
  • Actuation and contact models: spring-damper torque/force and friction at contact.

Physics simulation integration follows standard rigid-body joint and friction models, e.g.,

$$\tau_\mathrm{spring} = -k (q - q_\mathrm{eq}), \quad \tau_\mathrm{damp} = -b \dot{q}$$

and at contact,

$$\lVert F_\mathrm{fric} \rVert \leq \mu F_n, \quad F_\mathrm{fric} = -\mu\, \mathrm{sgn}(v_t)\, F_n$$
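
A minimal numerical sketch of these two models, with symbols matching the equations above (function names are ours):

import math

def joint_torque(q, q_dot, q_eq, k, b):
    """Spring-damper actuation: tau = -k*(q - q_eq) - b*q_dot."""
    return -k * (q - q_eq) - b * q_dot

def sliding_friction(v_t, mu, f_n):
    """Kinetic Coulomb friction: F = -mu * sgn(v_t) * F_n, so |F| <= mu * F_n.

    The static (v_t == 0) case is simplified to zero here; real engines bound
    static friction by mu * F_n instead.
    """
    return 0.0 if v_t == 0.0 else -mu * math.copysign(f_n, v_t)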

5. Benchmarks and Empirical Evaluation

The generative capacity of PhysX-Anything has been benchmarked on PhysX-Mobility using both geometric and physical metrics (averaged over 2,000 test images):

Metric                            Value
PSNR (geometry)                   20.35
Chamfer Distance (m)              1.443 × 10⁻²
F-score @ 1 cm                    77.5%
Absolute scale error              0.30
Material classification accuracy  87.6%
Affordance prediction (/20)       14.28
Kinematic-parameter match         0.94
Text-description coherence (/20)  19.36
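
For reference, the Chamfer distance in the table follows the standard symmetric definition over sampled surface points; a brute-force sketch, not the paper's evaluation code:

import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Brute-force O(N*M) pairwise distances; fine for small evaluation samples.
    """
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()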

For downstream embodied policy learning, DDPG-trained agents, operating in MuJoCo environments instantiated from synthesized PhysX-Anything assets, achieve an 85% ± 4% task success rate across standard manipulation and interaction tasks, compared to 60% ± 8% for retrieval-based 3D asset baselines.

6. File Formats, Simulation Integration, and Usability

PhysX-Mobility assets are available in URDF, SDF, and glTF, supporting immediate workflow integration in MuJoCo, ROS, Gazebo, and Isaac Gym pipelines. The URDF schema includes all inertial, collision, joint, and friction data for each asset. Simulation loading is straightforward; an example mujoco-py invocation:

import mujoco_py

# Load a generated asset and step it in simulation with live rendering.
model = mujoco_py.load_model_from_path('coffee_machine.urdf')
sim = mujoco_py.MjSim(model)
viewer = mujoco_py.MjViewer(sim)
while True:
    sim.step()        # advance physics by one timestep
    viewer.render()   # draw the current state
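
mujoco-py is no longer maintained; an equivalent loop with the official mujoco Python bindings, sketched from that package's documented API under the same asset assumption, would be:

import mujoco
import mujoco.viewer

# MuJoCo's compiler also accepts URDF directly.
model = mujoco.MjModel.from_xml_path('coffee_machine.urdf')
data = mujoco.MjData(model)
with mujoco.viewer.launch_passive(model, data) as viewer:
    while viewer.is_running():
        mujoco.mj_step(model, data)   # advance physics by one timestep
        viewer.sync()                 # push the new state to the viewer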

7. Impact, Limitations, and Prospects

PhysX-Anything—by combining VLM-based generation, a highly compressed tokenization scheme, and the largest simulation-ready 3D asset dataset to date—substantially extends the reach of generative 3D modeling into simulation- and control-heavy domains. It enables the rapid acquisition of sim-ready assets from single images, with sufficient physical and kinematic detail for robotic training, embodied AI, and physical reasoning tasks.

The dataset’s unprecedented scale (47 classes, >2,000 objects) and depth of annotation allow for broad evaluation coverage and realistic diversity in simulated environments, which had previously been limited by categorical and physical attribute constraints. While the coarse-to-fine voxel strategy achieves dramatic token compression with minimal loss of geometry, further work could assess fidelity in edge cases involving extremely complex surfaces or minute parts. The integration methodology assumes standard rigid-body dynamics and parameterizes typical household and office objects; domains requiring non-rigid or highly deformable bodies are not presently addressed.

Overall, PhysX-Anything and PhysX-Mobility jointly enable the next phase of physical 3D asset generation, simulation, and AI-driven manipulation research, providing a critical substrate for embodied intelligence and robotics (Cao et al., 17 Nov 2025).
