TransPhy3D: Synthetic Dataset for Transparent Scenes
- TransPhy3D is a synthetic video dataset designed for transparent and reflective scenes, offering physics-accurate rendering and complete triplet annotations.
- The accompanying DKT framework uses a LoRA-adapted video diffusion backbone to achieve temporally consistent depth and normal estimates across long video sequences.
- Empirical results show DKT-1.3B substantially outperforms prior depth estimators on transparent-scene benchmarks, improving robotic grasping and manipulation in optically challenging environments.
TransPhy3D denotes a large-scale synthetic video dataset specifically constructed to address the persistent challenge of transparent and highly reflective object perception within computer vision systems. Traditional stereo, time-of-flight (ToF), and monocular depth pipelines are typically confounded by the complex optical phenomena—refraction, reflection, transmission—exhibited by such objects, frequently resulting in incomplete or temporally unstable estimates. The TransPhy3D dataset, together with a dedicated video-to-video translation framework (DKT), leverages generative video diffusion priors to enable robust and temporally consistent depth and normal estimation across arbitrarily long input video sequences featuring transparent scenes (Xu et al., 29 Dec 2025).
1. Dataset Construction and Structure
TransPhy3D is constructed as the first large-scale synthetic video corpus specifically targeting transparent and highly reflective scenes. Its creation encompasses the following stages:
- Asset Bank Assembly:
- Category-Rich Static Assets: 5,574 meshes from BlenderKit are rendered and scored for transparency/reflectivity using Qwen2.5-VL-7B, with the top 574 retained.
- Shape-Rich Parametric Assets: Drawing on methodologies from T²SQNet [Kim et al. 2024], asset families are generated by sampling shape parameters from a low-dimensional distribution, producing an effectively infinite variety of mesh silhouettes.
- PBR Materials: Meshes are randomly assigned glass, plastic, or metal materials via categorical sampling.
- Scene Creation:
- Objects are sampled per scene and given randomized 6-DOF poses and scales. Bullet-based physics in Blender simulates a 2 s drop, allowing objects to settle naturally within container or tabletop contexts.
- Camera Trajectories:
- Each scene is captured as a fixed-length clip at 24 fps. Camera motion follows a horizontal sweep whose elevation is perturbed sinusoidally, parameterized by a fixed angular amplitude (in rad) and oscillation frequency (in Hz); a minimal trajectory sketch follows this list.
- Rendering Pipeline:
- Scenes are rendered with Blender/Cycles under HDR illumination. Outputs include RGB, metric depth, and object-space normals, with post-render denoising via NVIDIA OptiX; a hedged bpy configuration sketch follows the corpus summary below.
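As a concrete illustration of the camera motion above, the following NumPy sketch generates a horizontal sweep with sinusoidal elevation; the amplitude, frequency, sweep rate, radius, and clip length shown are placeholder values, not the dataset's actual parameters.

```python
import numpy as np

def sweep_trajectory(n_frames, fps=24.0,
                     sweep_rate=0.3,      # horizontal azimuth rate (rad/s), placeholder
                     elev_amplitude=0.1,  # elevation amplitude (rad), placeholder
                     elev_freq=0.5,       # elevation frequency (Hz), placeholder
                     radius=1.5):         # camera distance to scene center (m), placeholder
    """Generate camera positions for a horizontal sweep with sinusoidal elevation."""
    t = np.arange(n_frames) / fps                                    # timestamps in seconds
    azimuth = sweep_rate * t                                         # linear horizontal sweep
    elevation = elev_amplitude * np.sin(2 * np.pi * elev_freq * t)   # sinusoidal bobbing
    # Spherical-to-Cartesian conversion around the scene origin.
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = radius * np.sin(elevation)
    return np.stack([x, y, z], axis=1)                               # (n_frames, 3) positions

positions = sweep_trajectory(n_frames=121)                           # e.g., a ~5 s clip at 24 fps
```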
The resulting corpus comprises 11,000 videos (1.32 million frames), each with triplet annotations (RGB, depth, normals), thus providing physics-accurate training material unattainable with real-world data given current annotation limitations.
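A hedged sketch of the corresponding Blender/Cycles setup is shown below; the pass toggles and OptiX denoiser setting follow Blender's Python API (bpy), but the resolution, HDR environment path, and other render settings are illustrative assumptions rather than the actual TransPhy3D configuration.

```python
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.cycles.device = 'GPU'
scene.cycles.use_denoising = True
scene.cycles.denoiser = 'OPTIX'          # NVIDIA OptiX denoising, as described above

# Enable the auxiliary passes needed for the (RGB, depth, normal) triplets.
view_layer = scene.view_layers[0]
view_layer.use_pass_z = True             # metric depth pass
view_layer.use_pass_normal = True        # surface normal pass

# Illustrative HDR environment lighting (path is a placeholder).
world = scene.world
world.use_nodes = True
env = world.node_tree.nodes.new('ShaderNodeTexEnvironment')
env.image = bpy.data.images.load('/path/to/environment.hdr')
world.node_tree.links.new(env.outputs['Color'],
                          world.node_tree.nodes['Background'].inputs['Color'])

scene.render.resolution_x = 832          # placeholder; the source does not state the render resolution
scene.render.resolution_y = 480
```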
2. Video-to-Video Translation Framework ("DKT")
TransPhy3D is foundational for training DKT (Diffusion Knows Transparency), a video-to-video translator that predicts per-frame dense depth and normals via a LoRA-adapted WAN video diffusion backbone [WAN, Wan et al. 2025]:
- Architecture Overview:
- LoRA Adaptation:
- In each transformer attention layer, the base weight matrices are frozen; only low-rank LoRA adapters (a trainable update $\Delta W = BA$ with rank $r \ll \min(d_{\text{in}}, d_{\text{out}})$) are learned (see the sketch after this overview).
- Input Conditioning:
- RGB frames and ground-truth depth maps are separately encoded by the frozen video VAE into latents $z_{\text{rgb}}$ and $z_0$; the noisy depth latent at flow time $t \in [0, 1]$ is $z_t = (1 - t)\,z_0 + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
- Channel-wise concatenation yields the model input $[z_t;\, z_{\text{rgb}}]$.
- Loss Function:
- The ground-truth velocity is $v = \epsilon - z_0$. Training minimizes the flow-matching objective $\mathcal{L}_{\text{depth}} = \mathbb{E}_{t,\epsilon}\,\big\| v_\theta([z_t;\, z_{\text{rgb}}],\, t) - v \big\|_2^2$.
- For normals, an analogous branch with a separate loss operates in parallel.
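A minimal PyTorch sketch of the LoRA pattern described above (a frozen base projection plus a trainable low-rank update) follows; the module name, rank, and scaling are illustrative assumptions, not the DKT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear projection with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze base weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_B.weight)                # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Usage: wrap an attention projection of a transformer block (names hypothetical).
q_proj = LoRALinear(nn.Linear(1024, 1024), rank=16)
out = q_proj(torch.randn(2, 77, 1024))                    # (batch, tokens, dim)
```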
This design achieves temporally coherent predictions, a necessary property for manipulation-focused robotics pipelines.
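To make the conditioning and loss concrete, here is a hedged sketch of a single flow-matching training step; the linear noising schedule, the channel-wise concatenation, and the `model`/`vae` interfaces are assumptions for illustration rather than the exact DKT code.

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, rgb_frames, depth_frames):
    """One flow-matching step: predict the velocity carrying noise to the clean depth latent."""
    with torch.no_grad():
        z_rgb = vae.encode(rgb_frames)             # conditioning latent (frozen VAE)
        z0 = vae.encode(depth_frames)              # clean depth latent

    t = torch.rand(z0.shape[0], device=z0.device)  # flow time in [0, 1], one per sample
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))       # broadcast over latent dimensions
    eps = torch.randn_like(z0)

    z_t = (1.0 - t_) * z0 + t_ * eps               # linearly interpolated noisy latent
    v_target = eps - z0                            # ground-truth velocity

    v_pred = model(torch.cat([z_t, z_rgb], dim=1), t)   # channel-wise concat conditioning
    return F.mse_loss(v_pred, v_target)
```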
3. Training Protocol and Inference Workflow
- Optimization Details:
- AdamW optimizer (no weight decay), batch size of 8 clips, 70,000 total iterations on 8× NVIDIA H100 GPUs (about 2 days of training).
- Data Augmentation:
- Horizontal flips, brightness/contrast jitter (0.1), small camera roll perturbations (2°).
- Source Data Sampling:
- Single-frame samples are drawn from existing image datasets (e.g., HISS, DREDS, ClearGrasp), while multi-frame clips are sampled exclusively from TransPhy3D.
- Inference Procedure:
- 5 denoising steps per clip. Arbitrary-length input videos are processed with a sliding window of K = 16 frames and cosine-weighted overlap blending, following DepthCrafter [Hu et al. 2025]; a minimal blending sketch follows this list.
- Runtime:
- DKT-1.3B: 0.17 s/frame, 11.2 GB peak GPU memory; DKT-14B: 0.41 s/frame (both at 832×480 resolution).
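The sliding-window inference described above can be sketched as follows; the stride and the exact cosine taper are assumptions patterned on DepthCrafter-style overlap blending, not the published DKT implementation.

```python
import numpy as np

def blend_sliding_windows(predict_clip, frames, window=16, stride=8):
    """Run a fixed-length clip predictor over a long video and blend overlapping windows.

    predict_clip: callable mapping a (window, H, W, 3) frame array to (window, H, W) depth.
    """
    n = len(frames)
    assert n >= window, "video must contain at least one full window"
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:            # make sure the tail of the video is covered
        starts.append(n - window)

    # Cosine (Hann-like) taper so overlapping windows blend smoothly at their seams.
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * (np.arange(window) + 0.5) / window))
    w = w[:, None, None]                    # broadcast over spatial dimensions

    out = None
    weight_sum = None
    for s in starts:
        pred = predict_clip(frames[s:s + window])          # (window, H, W) depth prediction
        if out is None:
            out = np.zeros((n,) + pred.shape[1:])
            weight_sum = np.zeros((n, 1, 1))
        out[s:s + window] += w * pred
        weight_sum[s:s + window] += w

    return out / np.maximum(weight_sum, 1e-8)              # weighted average per frame
```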
4. Evaluation Metrics and Empirical Results
Performance is established on both synthetic and real-world benchmarks—ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test—using zero-shot protocols:
| Method | REL (↓) | RMSE (↓) | δ1.05 (↑) | δ1.10 (↑) | δ1.25 (↑) |
|---|---|---|---|---|---|
| DAv2 (Large) | 10.85 | 12.21 | 32.21 | 56.37 | 89.94 |
| DepthCrafter | 11.32 | 12.34 | 31.92 | 55.46 | 88.59 |
| DKT-1.3B (Ours) | 9.72 | 14.58 | 38.17 | 65.50 | 93.04 |
On TransPhy3D-Test, DKT-1.3B obtains REL = 2.96, δ1.05 = 87.17, and δ1.25 = 98.56. Results on DREDS-STD (CatKnown/Novel) surpass previous methods (REL = 5.30, RMSE = 4.96 for CatKnown).
For video normal estimation on ClearPose, DKT-Normal-14B yields a mean angular error of 26.03°, a median of 18.59°, and 30.06% of pixels within the angular threshold θ, outperforming NormalCrafter (27.08°, 20.29°, 26.10%).
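For reference, a minimal sketch of the depth metrics reported in the table above (absolute relative error, RMSE, and δ-threshold accuracy); the percentage scaling mirrors the reported numbers, but the benchmark's exact conventions (units, valid-pixel masks) are assumptions.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Compute REL, RMSE, and delta-threshold accuracies over valid ground-truth pixels."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]

    rel = np.mean(np.abs(p - g) / g)                   # absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))              # RMSE, in the benchmark's depth unit

    ratio = np.maximum(p / g, g / p)                   # symmetric per-pixel ratio
    deltas = {f"delta_{t}": np.mean(ratio < t) * 100   # % of pixels within threshold t
              for t in (1.05, 1.10, 1.25)}
    return {"REL": rel * 100, "RMSE": rmse, **deltas}  # REL reported in %, as in the table
```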
5. Robotic Manipulation and Practical Integration
DKT-1.3B depth outputs are tested within a robotic grasping stack comprising AprilTag, AnyGrasp, and CuRobo modules across three surface types (values are grasp success rates):
| Method | Translucent | Reflective | Diffusive | Mean |
|---|---|---|---|---|
| Raw D435 | 0.47 | 0.18 | 0.56 | 0.38 |
| DAv2-Large | 0.60 | 0.27 | 0.56 | 0.46 |
| DepthCrafter | 0.67 | 0.23 | 0.63 | 0.48 |
| DKT-1.3B (Ours) | 0.80 | 0.59 | 0.81 | 0.73 |
Relative to the best prior method in each column, DKT-1.3B improves grasping success by +13 pp (translucent), +32 pp (reflective), and +18 pp (diffusive); the arithmetic is sketched below. This indicates a substantial gain in robot manipulation capability when depth is reliably estimated for optically challenging objects.
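The per-column gains quoted above follow directly from the table; a small sketch of that arithmetic:

```python
# Success rates from the table above (fractions).
rates = {
    "Raw D435":     {"translucent": 0.47, "reflective": 0.18, "diffusive": 0.56},
    "DAv2-Large":   {"translucent": 0.60, "reflective": 0.27, "diffusive": 0.56},
    "DepthCrafter": {"translucent": 0.67, "reflective": 0.23, "diffusive": 0.63},
    "DKT-1.3B":     {"translucent": 0.80, "reflective": 0.59, "diffusive": 0.81},
}

for surface in ("translucent", "reflective", "diffusive"):
    best_prior = max(rates[m][surface] for m in rates if m != "DKT-1.3B")
    gain_pp = round((rates["DKT-1.3B"][surface] - best_prior) * 100)
    print(f"{surface}: +{gain_pp} pp over best prior ({best_prior:.2f})")
# -> translucent: +13 pp, reflective: +32 pp, diffusive: +18 pp
```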
6. Significance, Broader Implications, and Methodological Advances
TransPhy3D enables robust training of perception models by supplying physically accurate synthetic video sequences otherwise unattainable via manual annotation. The LoRA-adapted diffusion approach demonstrates that generative video models internalize nontrivial optical rules, facilitating temporally coherent and label-free perception. A plausible implication is that physics-aware synthetic datasets, coupled with large generative backbones, can be efficiently repurposed for high-fidelity real-world tasks beyond transparent object perception, potentially impacting other vision domains where annotation or ground-truth acquisition is challenging. The explicit co-training with existing frame-wise benchmarks further supports generalization and stability in predictions. These advances collectively support the broader claim: "Diffusion knows transparency" (Xu et al., 29 Dec 2025).