TransPhy3D: Synthetic Dataset for Transparent Scenes
- TransPhy3D is a synthetic video dataset designed for transparent and reflective scenes, offering physics-accurate rendering and complete triplet annotations.
- The accompanying DKT framework uses a LoRA-adapted video diffusion backbone to achieve temporally consistent depth and normal estimates across long video sequences.
- Empirical results show DKT-1.3B substantially outperforms prior depth estimators on transparent-scene benchmarks, improving robotic grasping and manipulation in optically challenging environments.
TransPhy3D denotes a large-scale synthetic video dataset specifically constructed to address the persistent challenge of transparent and highly reflective object perception within computer vision systems. Traditional stereo, time-of-flight (ToF), and monocular depth pipelines are typically confounded by the complex optical phenomena—refraction, reflection, transmission—exhibited by such objects, frequently resulting in incomplete or temporally unstable estimates. The TransPhy3D dataset, together with a dedicated video-to-video translation framework (DKT), leverages generative video diffusion priors to enable robust and temporally consistent depth and normal estimation across arbitrarily long input video sequences featuring transparent scenes (Xu et al., 29 Dec 2025).
1. Dataset Construction and Structure
TransPhy3D is constructed as the first large-scale synthetic video corpus specifically targeting transparent and highly reflective scenes. Its creation encompasses the following stages:
- Asset Bank Assembly:
- Category-Rich Static Assets: 5,574 meshes from BlenderKit are rendered and scored for transparency/reflectivity using Qwen2.5-VL-7B, with the top 574 retained.
- Shape-Rich Parametric Assets: Drawing on methodologies from T²SQNet [Kim et al. 2024], asset families are generated by sampling shape parameters from a low-dimensional distribution, producing an effectively infinite variety of mesh silhouettes.
- PBR Materials: Meshes are randomly assigned glass, plastic, or metal materials via categorical sampling.
- Scene Creation:
- Objects are sampled per scene and given randomized 6-DOF poses and scales. Bullet-based physics in Blender simulates a 2 s drop, allowing objects to settle naturally within container or tabletop contexts.
- Camera Trajectories:
- Each scene is captured as a fixed-length clip at 24 fps. Camera motion follows a horizontal sweep whose elevation is perturbed sinusoidally, parameterized by a fixed angular amplitude (in rad) and oscillation frequency (in Hz); a minimal trajectory sketch follows this list.
- Rendering Pipeline:
- Scenes are rendered with Blender/Cycles under HDR illumination. Outputs include RGB, metric depth, and object-space normals, with post-render denoising via NVIDIA OptiX; a hedged bpy configuration sketch follows the corpus summary below.
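As a concrete illustration of the camera motion above, the following NumPy sketch generates a horizontal sweep with sinusoidal elevation; the amplitude, frequency, sweep rate, radius, and clip length shown are placeholder values, not the dataset's actual parameters.

```python
import numpy as np

def sweep_trajectory(n_frames, fps=24.0,
                     sweep_rate=0.3,      # horizontal azimuth rate (rad/s), placeholder
                     elev_amplitude=0.1,  # elevation amplitude (rad), placeholder
                     elev_freq=0.5,       # elevation frequency (Hz), placeholder
                     radius=1.5):         # camera distance to scene center (m), placeholder
    """Generate camera positions for a horizontal sweep with sinusoidal elevation."""
    t = np.arange(n_frames) / fps                                    # timestamps in seconds
    azimuth = sweep_rate * t                                         # linear horizontal sweep
    elevation = elev_amplitude * np.sin(2 * np.pi * elev_freq * t)   # sinusoidal bobbing
    # Spherical-to-Cartesian conversion around the scene origin.
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = radius * np.sin(elevation)
    return np.stack([x, y, z], axis=1)                               # (n_frames, 3) positions

positions = sweep_trajectory(n_frames=121)                           # e.g., a ~5 s clip at 24 fps
```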
The resulting corpus comprises 11,000 videos (1.32 million frames), each with triplet annotations (RGB, depth, normals), thus providing physics-accurate training material unattainable with real-world data given current annotation limitations.
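A hedged sketch of the corresponding Blender/Cycles setup is shown below; the pass toggles and OptiX denoiser setting follow Blender's Python API (bpy), but the resolution, HDR environment path, and other render settings are illustrative assumptions rather than the actual TransPhy3D configuration.

```python
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.cycles.device = 'GPU'
scene.cycles.use_denoising = True
scene.cycles.denoiser = 'OPTIX'          # NVIDIA OptiX denoising, as described above

# Enable the auxiliary passes needed for the (RGB, depth, normal) triplets.
view_layer = scene.view_layers[0]
view_layer.use_pass_z = True             # metric depth pass
view_layer.use_pass_normal = True        # surface normal pass

# Illustrative HDR environment lighting (path is a placeholder).
world = scene.world
world.use_nodes = True
env = world.node_tree.nodes.new('ShaderNodeTexEnvironment')
env.image = bpy.data.images.load('/path/to/environment.hdr')
world.node_tree.links.new(env.outputs['Color'],
                          world.node_tree.nodes['Background'].inputs['Color'])

scene.render.resolution_x = 832          # placeholder; the source does not state the render resolution
scene.render.resolution_y = 480
```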
2. Video-to-Video Translation Framework ("DKT")
TransPhy3D is foundational for training DKT (Diffusion Knows Transparency), a video-to-video translator that predicts per-frame dense depth and normals via a LoRA-adapted WAN video diffusion backbone [WAN, Wan et al. 2025]:
- Architecture Overview:
- LoRA Adaptation:
- In each transformer attention layer, the base weight matrices are frozen; only low-rank LoRA adapters (a trainable update $\Delta W = BA$ with rank $r \ll \min(d_{\text{in}}, d_{\text{out}})$) are learned (see the sketch after this overview).
- Input Conditioning:
- RGB frames and ground-truth depth maps are separately encoded by the frozen video VAE into latents $z_{\text{rgb}}$ and $z_0$; the noisy depth latent at flow time $t \in [0, 1]$ is $z_t = (1 - t)\,z_0 + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
- Channel-wise concatenation yields the model input $[z_t;\, z_{\text{rgb}}]$.
- Loss Function:
- The ground-truth velocity is $v = \epsilon - z_0$. Training minimizes the flow-matching objective $\mathcal{L}_{\text{depth}} = \mathbb{E}_{t,\epsilon}\,\big\| v_\theta([z_t;\, z_{\text{rgb}}],\, t) - v \big\|_2^2$.
- For normals, an analogous branch with a separate loss operates in parallel.
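A minimal PyTorch sketch of the LoRA pattern described above (a frozen base projection plus a trainable low-rank update) follows; the module name, rank, and scaling are illustrative assumptions, not the DKT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear projection with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze base weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_B.weight)                # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Usage: wrap an attention projection of a transformer block (names hypothetical).
q_proj = LoRALinear(nn.Linear(1024, 1024), rank=16)
out = q_proj(torch.randn(2, 77, 1024))                    # (batch, tokens, dim)
```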
This design achieves temporally coherent predictions, a necessary property for manipulation-focused robotics pipelines.
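To make the conditioning and loss concrete, here is a hedged sketch of a single flow-matching training step; the linear noising schedule, the channel-wise concatenation, and the `model`/`vae` interfaces are assumptions for illustration rather than the exact DKT code.

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, rgb_frames, depth_frames):
    """One flow-matching step: predict the velocity carrying noise to the clean depth latent."""
    with torch.no_grad():
        z_rgb = vae.encode(rgb_frames)             # conditioning latent (frozen VAE)
        z0 = vae.encode(depth_frames)              # clean depth latent

    t = torch.rand(z0.shape[0], device=z0.device)  # flow time in [0, 1], one per sample
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))       # broadcast over latent dimensions
    eps = torch.randn_like(z0)

    z_t = (1.0 - t_) * z0 + t_ * eps               # linearly interpolated noisy latent
    v_target = eps - z0                            # ground-truth velocity

    v_pred = model(torch.cat([z_t, z_rgb], dim=1), t)   # channel-wise concat conditioning
    return F.mse_loss(v_pred, v_target)
```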
3. Training Protocol and Inference Workflow
- Optimization Details:
- AdamW optimizer (no weight decay), batch size of 8 clips, 70,000 total iterations on 8× NVIDIA H100 GPUs (about 2 days of training).
- Data Augmentation:
- Horizontal flips, brightness/contrast jitter (0.1), small camera roll perturbations (2°).
- Source Data Sampling:
- Single-frame samples are drawn from existing image datasets (e.g., HISS, DREDS, ClearGrasp), while multi-frame clips are sampled exclusively from TransPhy3D.
- Inference Procedure:
- 5 denoising steps per clip. Arbitrary-length input videos are processed with a sliding window of K = 16 frames and cosine-weighted overlap blending, following DepthCrafter [Hu et al. 2025]; a minimal blending sketch follows this list.
- Runtime:
- DKT-1.3B: 0.17 s/frame, 11.2 GB peak GPU memory; DKT-14B: 0.41 s/frame (both at 832×480 resolution).
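The sliding-window inference described above can be sketched as follows; the stride and the exact cosine taper are assumptions patterned on DepthCrafter-style overlap blending, not the published DKT implementation.

```python
import numpy as np

def blend_sliding_windows(predict_clip, frames, window=16, stride=8):
    """Run a fixed-length clip predictor over a long video and blend overlapping windows.

    predict_clip: callable mapping a (window, H, W, 3) frame array to (window, H, W) depth.
    """
    n = len(frames)
    assert n >= window, "video must contain at least one full window"
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:            # make sure the tail of the video is covered
        starts.append(n - window)

    # Cosine (Hann-like) taper so overlapping windows blend smoothly at their seams.
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * (np.arange(window) + 0.5) / window))
    w = w[:, None, None]                    # broadcast over spatial dimensions

    out = None
    weight_sum = None
    for s in starts:
        pred = predict_clip(frames[s:s + window])          # (window, H, W) depth prediction
        if out is None:
            out = np.zeros((n,) + pred.shape[1:])
            weight_sum = np.zeros((n, 1, 1))
        out[s:s + window] += w * pred
        weight_sum[s:s + window] += w

    return out / np.maximum(weight_sum, 1e-8)              # weighted average per frame
```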
4. Evaluation Metrics and Empirical Results
Performance is established on both synthetic and real-world benchmarks—ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test—using zero-shot protocols:
| Method | REL (↓) | RMSE (↓) | δ1.05 (↑) | δ1.10 (↑) | δ1.25 (↑) |
|---|---|---|---|---|---|
| DAv2 (Large) | 10.85 | 12.21 | 32.21 | 56.37 | 89.94 |
| DepthCrafter | 11.32 | 12.34 | 31.92 | 55.46 | 88.59 |
| DKT-1.3B (Ours) | 9.72 | 14.58 | 38.17 | 65.50 | 93.04 |
On TransPhy3D-Test, DKT-1.3B obtains REL = 2.96, δ1.05 = 87.17, and δ1.25 = 98.56. Results on DREDS-STD (CatKnown/Novel) surpass previous methods (REL = 5.30, RMSE = 4.96 for CatKnown).
For video normal estimation on ClearPose, DKT-Normal-14B yields a mean angular error of 26.03°, a median of 18.59°, and 30.06% of pixels within the angular threshold θ, outperforming NormalCrafter (27.08°, 20.29°, 26.10%).
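For reference, a minimal sketch of the depth metrics reported in the table above (absolute relative error, RMSE, and δ-threshold accuracy); the percentage scaling mirrors the reported numbers, but the benchmark's exact conventions (units, valid-pixel masks) are assumptions.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Compute REL, RMSE, and delta-threshold accuracies over valid ground-truth pixels."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]

    rel = np.mean(np.abs(p - g) / g)                   # absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))              # RMSE, in the benchmark's depth unit

    ratio = np.maximum(p / g, g / p)                   # symmetric per-pixel ratio
    deltas = {f"delta_{t}": np.mean(ratio < t) * 100   # % of pixels within threshold t
              for t in (1.05, 1.10, 1.25)}
    return {"REL": rel * 100, "RMSE": rmse, **deltas}  # REL reported in %, as in the table
```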
5. Robotic Manipulation and Practical Integration
DKT-1.3B depth outputs are tested within a robotic grasping stack comprising AprilTag, AnyGrasp, and CuRobo modules across three surface types (values are grasp success rates):
| Method | Translucent | Reflective | Diffusive | Mean |
|---|---|---|---|---|
| Raw D435 | 0.47 | 0.18 | 0.56 | 0.38 |
| DAv2-Large | 0.60 | 0.27 | 0.56 | 0.46 |
| DepthCrafter | 0.67 | 0.23 | 0.63 | 0.48 |
| DKT-1.3B (Ours) | 0.80 | 0.59 | 0.81 | 0.73 |
Relative to the best prior method in each column, DKT-1.3B improves grasping success by +13 pp (translucent), +32 pp (reflective), and +18 pp (diffusive); the arithmetic is sketched below. This indicates a substantial gain in robot manipulation capability when depth is reliably estimated for optically challenging objects.
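The per-column gains quoted above follow directly from the table; a small sketch of that arithmetic:

```python
# Success rates from the table above (fractions).
rates = {
    "Raw D435":     {"translucent": 0.47, "reflective": 0.18, "diffusive": 0.56},
    "DAv2-Large":   {"translucent": 0.60, "reflective": 0.27, "diffusive": 0.56},
    "DepthCrafter": {"translucent": 0.67, "reflective": 0.23, "diffusive": 0.63},
    "DKT-1.3B":     {"translucent": 0.80, "reflective": 0.59, "diffusive": 0.81},
}

for surface in ("translucent", "reflective", "diffusive"):
    best_prior = max(rates[m][surface] for m in rates if m != "DKT-1.3B")
    gain_pp = round((rates["DKT-1.3B"][surface] - best_prior) * 100)
    print(f"{surface}: +{gain_pp} pp over best prior ({best_prior:.2f})")
# -> translucent: +13 pp, reflective: +32 pp, diffusive: +18 pp
```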
6. Significance, Broader Implications, and Methodological Advances
TransPhy3D enables robust training of perception models by supplying physically accurate synthetic video sequences otherwise unattainable via manual annotation. The LoRA-adapted diffusion approach demonstrates that generative video models internalize nontrivial optical rules, facilitating temporally coherent and label-free perception. A plausible implication is that physics-aware synthetic datasets, coupled with large generative backbones, can be efficiently repurposed for high-fidelity real-world tasks beyond transparent object perception, potentially impacting other vision domains where annotation or ground-truth acquisition is challenging. The explicit co-training with existing frame-wise benchmarks further supports generalization and stability in predictions. These advances collectively support the broader claim: "Diffusion knows transparency" (Xu et al., 29 Dec 2025).