FlashDepth: 3D Sensing & Real-Time Depth
- FlashDepth names two distinct technologies: a direct time-of-flight LiDAR sensor with in-pixel histogramming, and a real-time video depth estimator.
- The sensor employs adaptive sliding gates, dynamic peak locking, and center-of-mass refinement to achieve high depth precision with power-efficient operation.
- The video model delivers temporally stable 2K depth maps in real time; applications span automotive, robotics, AR/VR, and video post-production.
FlashDepth encompasses two distinct yet related entities within the domain of depth sensing and estimation: (1) a direct time-of-flight (dToF) solid-state LiDAR image sensor architecture that utilizes in-pixel histogramming and dynamic readout for high-throughput, low-latency 3D imaging (Gyongy et al., 2022); and (2) a real-time streaming high-resolution video depth estimation model that augments state-of-the-art single-image depth networks to provide consistent, accurate, and temporally stable depth maps from standard RGB video at 2K resolution (Chou et al., 9 Apr 2025). Both paradigms are designed for applications demanding rapid, data-efficient, and scalable 3D perception—including automotive, robotics, and augmented/mixed reality.
1. Direct Time-of-Flight FlashDepth Sensor: Architectural Principles
The FlashDepth sensor is a fully solid-state, flash-based dToF imager implemented as a single CMOS chip. Its SPAD array is organized as a 64×32 grid of 2048 “macropixels,” each integrating a 4×4 SPAD sub-array with dedicated local processing (Gyongy et al., 2022).
Flood pulses (e.g., 850 nm, 10 ns) from a VCSEL or similar laser illuminate the entire scene. Returning photons are temporally quantized by each pixel into an 8-bin histogram using an in-pixel sliding time gate. Unlike full histogramming approaches that offload raw timestamps or 1024-bin histograms, FlashDepth performs histogramming, statistical peak locking, and background estimation directly within each macropixel.
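In-pixel logic applies a cycling gate that is dynamically shifted across the unambiguous range and locked around photon return peaks, discarding background and out-of-range events. The local histogram, $h$, accumulated over $N$ pulses, captures the counts in bin $k$ as:
$$h_k = \sum_{n=1}^{N} \#\{\, t \in \mathcal{T}_n : t_0 + k\,\Delta t \le t < t_0 + (k+1)\,\Delta t \,\}$$
where $\mathcal{T}_n$ is the set of photon arrival times registered after pulse $n$, $t_0$ is gate start and $\Delta t$ is bin width. Depth is subsequently extracted by time-of-flight, $d = c\,t/2$, with $c$ the speed of light and $t$ the estimated round-trip time.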
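The gated binning and the time-of-flight conversion can be illustrated with a minimal software sketch. It is not the in-pixel logic itself: the function names, the use of integer picosecond timestamps, and the 1 ns bin width in the usage note are assumptions chosen for illustration.

```python
C_M_PER_S = 299_792_458.0  # speed of light


def gated_histogram(timestamps_ps, gate_start_ps, bin_width_ps, n_bins=8):
    """Count photon timestamps that fall inside an n_bins-wide time gate.

    Timestamps are integer picoseconds measured from the laser pulse.
    Events before the gate start or after the gate end are discarded.
    """
    hist = [0] * n_bins
    for t in timestamps_ps:
        k = (t - gate_start_ps) // bin_width_ps  # floor division, negative if early
        if 0 <= k < n_bins:
            hist[k] += 1
    return hist


def depth_from_tof_m(round_trip_ps):
    """Convert a round-trip time in picoseconds to a one-way distance in metres."""
    return C_M_PER_S * round_trip_ps * 1e-12 / 2.0
```

For example, with a gate starting at 100 ns and 1 ns bins, `gated_histogram([100_200, 100_900, 101_500, 99_000, 120_000], 100_000, 1_000)` keeps three events and drops the two that fall outside the gate, and `depth_from_tof_m(100_000)` is roughly 15 m.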
SPAD front-end signals are pulse-shortened ( ps), summed, and funneled into a multi-event time-to-digital converter (TDC). Timing sources can be on-chip (GRO, DL) or external, supporting histogram bin widths down to 250 ps.
2. Embedded Histogramming, Peak Locking, and “Dynamic Vision” Readout
A major innovation is the in-pixel partial histogramming with adaptive sliding windows and dynamic peak locking. For each frame:
- The minimum bin within the 8-bin window estimates the ambient background level.
- The peak bin and its value are determined via a comparator tree.
- A detection threshold is computed.
- If the peak value exceeds this threshold, the pixel is flagged as detecting a true return; otherwise, the gate continues sliding.
- The gate index is incremented or decremented if the peak is near the edge (bins 1-2 or 7-8), or “locked” if central.
Motion detection arises naturally: a “moving peak” flag is set if the gate shifts between frames, yielding a per-pixel motion mask.
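The per-frame decision sequence above can be sketched as a small state update. Several details here are assumptions rather than the sensor's documented behavior: the threshold form (background plus a multiple of its square root, a common Poisson-noise heuristic), the parameter `k_sigma`, the wrap-around search when no peak is found, and raising the motion flag only for pixels that already hold a detected peak.

```python
import math


def update_gate(hist, gate_index, max_gate_index, k_sigma=3.0):
    """One per-frame update of a sliding-gate pixel.

    Returns (new_gate_index, detected, moving).
    """
    background = min(hist)  # minimum bin estimates the ambient level
    peak_bin = max(range(len(hist)), key=hist.__getitem__)
    peak_value = hist[peak_bin]

    # ASSUMPTION: Poisson-style threshold; the real formula may differ.
    threshold = background + k_sigma * math.sqrt(max(background, 1))
    detected = peak_value > threshold

    if not detected:
        # No return found: keep cycling the gate across the range.
        new_index = (gate_index + 1) % (max_gate_index + 1)
    elif peak_bin <= 1:
        # Peak in bins 1-2 (0-based 0-1): shift the gate earlier.
        new_index = max(gate_index - 1, 0)
    elif peak_bin >= len(hist) - 2:
        # Peak in bins 7-8 (0-based 6-7): shift the gate later.
        new_index = min(gate_index + 1, max_gate_index)
    else:
        # Peak is central: the gate stays locked.
        new_index = gate_index

    moving = detected and new_index != gate_index
    return new_index, detected, moving
```

Collecting the `moving` flag over all macropixels gives the motion mask, and collecting `detected` gives the set of pixels a surface-only readout would transmit.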
Two principal selective readout modes are supported: (a) surface-only, outputting data only for pixels with confirmed peaks; and (b) motion-triggered, transmitting data solely for pixels experiencing gate movement. Both minimize output toggles and reduce downstream bandwidth and power consumption.
Sub-bin depth refinement is achieved via a center-of-mass (CMM) computation over the histogram bins.
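A generic center-of-mass estimate of this kind can be sketched as follows. The exact on-chip formula is not reproduced here; the background subtraction, the three-bin window around the peak (`half_window=1`), and the use of bin centers when converting to time are all assumptions made for illustration.

```python
def center_of_mass_bin(hist, half_window=1):
    """Fractional bin position of the peak via a background-subtracted centroid."""
    background = min(hist)
    peak = max(range(len(hist)), key=hist.__getitem__)
    lo = max(peak - half_window, 0)
    hi = min(peak + half_window, len(hist) - 1)
    bins = range(lo, hi + 1)
    weights = [max(hist[k] - background, 0) for k in bins]
    total = sum(weights)
    if total == 0:
        return float(peak)  # flat histogram: fall back to the peak bin
    return sum(k * w for k, w in zip(bins, weights)) / total


def sub_bin_time_ps(hist, gate_start_ps, bin_width_ps):
    """Convert the fractional bin to a time, taking bin centers as reference."""
    return gate_start_ps + (center_of_mass_bin(hist) + 0.5) * bin_width_ps
```

A symmetric peak such as `[2, 2, 2, 12, 22, 12, 2, 2]` yields bin 4.0, while a peak skewed toward later bins, `[2, 2, 2, 2, 22, 12, 2, 2]`, yields about 4.33, which is the sub-bin shift the refinement is meant to capture.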
3. Performance Metrics, Data Efficiency, and Scalability
FlashDepth balances accuracy, throughput, and I/O efficiency through architectural and algorithmic choices:
- Depth Precision and Accuracy: Under one operating condition, ranging extends to 50 m at 50 fps, even under bright (20 klux) ambient light, with non-linearity and standard deviation reported in centimeters; under a second condition, reliable detection extends to 40 m.
- Frame Rate and Compression: In histogram mode, 108 bits/macro-pixel at up to 29 kFPS; direct depth (bin-res, 12 bits) up to 260 kFPS; sub-bin (CMM, 15 bits) up to 208 kFPS, achieving 7–9× data reduction compared to raw histograms. Smart readout can reduce I/O pad switching to below 1 mW.
- SPAD Non-Linearity and False Positives: Differential non-linearity (DNL) is 14% with 1 ns DL timing and 1.6% with an 8 ns external clock. The false positive rate (FPR) remains at 0.2% under the tested detection settings (background specified in counts/bin).
- Power: 70 mW total at 50 FPS/30 klux (SPADs 32 mW; digital 36 mW; I/O 1.3 mW). Smart modes further lower I/O power.
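The quoted compression and power figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above; the payload rate is an illustrative assumption that ignores any framing or protocol overhead, and the function names are hypothetical.

```python
MACROPIXELS = 64 * 32  # 2048 macropixels, from the array geometry
BITS_PER_MACROPIXEL = {"histogram": 108, "depth": 12, "cmm": 15}


def reduction_factor(mode):
    """Data reduction of a readout mode relative to full histogram readout."""
    return BITS_PER_MACROPIXEL["histogram"] / BITS_PER_MACROPIXEL[mode]


def payload_mbit_per_s(mode, fps):
    """Raw payload rate in Mbit/s (ASSUMPTION: no framing overhead)."""
    return BITS_PER_MACROPIXEL[mode] * MACROPIXELS * fps / 1e6


def total_power_mw(spad=32.0, digital=36.0, io=1.3):
    """Sum of the quoted power contributions in mW."""
    return spad + digital + io
```

Here `reduction_factor("depth")` is 9 and `reduction_factor("cmm")` is 7.2, consistent with the stated 7–9× range, and the three power contributions sum to 69.3 mW, consistent with the rounded 70 mW total.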
The sensor is implemented in standard 40 nm FSI CMOS. With row skipping, per-pixel gate control, and selective readout, the design scales favorably to larger arrays, especially via 3D stacking (macropixel logic area matches SPAD area, which enables backside-illuminated architectures).
4. FlashDepth for Real-time High-Resolution Video Depth Estimation
A parallel innovation in the FlashDepth nomenclature is a model for streaming video depth estimation at 2K resolution and real-time rates (Chou et al., 9 Apr 2025). This approach leverages modifications to the Depth Anything v2 (DAv2) single-image depth network, embedding lightweight spatial and temporal modules:
- Base Model Architecture: Utilizes DINO v2 ViT (encoder) and DPT decoder, operating on 2K (2044×1148) video frames.
- Temporal Scale Alignment: Integrates a recurrent state-space (Mamba) module before the final convolution head, enforcing online temporal scale and shift consistency, thus mitigating depth flicker while introducing minimal computational overhead.
- Hybrid Dual-Resolution Stream: Runs two streams in parallel, a fast small model on full-resolution frames (FlashDepth-S) and a more accurate large model on downsampled frames (FlashDepth-L, short side 518 px), using cross-attention blocks to fuse intermediate decoder features from the large stream into the small stream.
- Training Strategy: Employs a two-stage protocol:
  - Stage 1: Consistency pre-training (low-res synthetic) with L1 depth supervision, training Mamba only.
  - Stage 2: Hybrid fine-tuning (2K images), initializing cross-attention to zero and freezing large-stream weights, teaching the small stream to exploit large-stream features.
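The effect of zero-initialized fusion in Stage 2 can be shown with a toy example. This is a simplified stand-in, not the model's architecture: it uses single-head dot-product attention without learned projections, and a scalar `gate` plays the role that a zero-initialized output projection would play in a real block. All names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attend(query_tokens, context_tokens):
    """Each query token gathers a weighted average of the context tokens."""
    dim = len(query_tokens[0])
    out = []
    for q in query_tokens:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim)
                  for c in context_tokens]
        w = softmax(scores)
        out.append([sum(wi * c[d] for wi, c in zip(w, context_tokens))
                    for d in range(dim)])
    return out


def fuse(small_tokens, large_tokens, gate):
    """Residual fusion: small-stream features plus gated large-stream context."""
    attended = cross_attend(small_tokens, large_tokens)
    return [[s + gate * a for s, a in zip(s_row, a_row)]
            for s_row, a_row in zip(small_tokens, attended)]
```

With `gate = 0.0` the fused features equal the small-stream features exactly, so fine-tuning starts from the small stream's existing behavior and only gradually learns to draw on large-stream features as the gate (or projection) moves away from zero.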
Inference is performed in an online, per-frame autoregressive loop without any batch processing or sliding-window stitching.
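The shape of such an online loop, with a state carried from frame to frame, can be sketched as below. The stabilizer here is deliberately not Mamba: it swaps in a plain running average of per-frame shift (median) and scale (mean absolute deviation) as a stand-in for a learned recurrent module. The class, the `momentum` parameter, and the `predict_depth` callable are hypothetical.

```python
from statistics import median


class ScaleShiftStabilizer:
    """Toy recurrent state: running shift and scale statistics of past frames."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.shift = None
        self.scale = None

    def step(self, depth):
        """Align one frame (a flat list of depth values) to the running state."""
        frame_shift = median(depth)
        frame_scale = sum(abs(d - frame_shift) for d in depth) / len(depth) or 1.0
        if self.shift is None:
            self.shift, self.scale = frame_shift, frame_scale
        else:
            m = self.momentum
            self.shift = m * self.shift + (1 - m) * frame_shift
            self.scale = m * self.scale + (1 - m) * frame_scale
        return [(d - frame_shift) / frame_scale * self.scale + self.shift
                for d in depth]


def stream_depth(frames, predict_depth, stabilizer):
    """Process frames strictly one at a time, yielding each result immediately."""
    for frame in frames:
        yield stabilizer.step(predict_depth(frame))
```

Because each output depends only on the current frame and the carried state, latency stays at one frame and no window of future frames is needed. Like any scale/shift-only alignment, this toy version cannot correct per-pixel inconsistencies.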
5. Comparative Evaluation and Quantitative Benchmarks
The video-based FlashDepth is evaluated across five challenging datasets (ETH3D, Sintel, Waymo, Unreal4K, UrbanSyn) using AbsRel (mean absolute relative error), $\delta_1$ (percentage of pixels whose predicted-to-true depth ratio is within a factor of 1.25), and boundary F1 scores.
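The two accuracy metrics follow their standard definitions in the depth-estimation literature and can be computed as in this minimal sketch. It operates on flat lists of per-pixel depths and skips pixels without valid ground truth; any scale/shift alignment an evaluation protocol applies before scoring is omitted, and the function names are illustrative.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < thresh)
    return hits / len(pairs)
```

Lower AbsRel is better, while higher $\delta_1$ is better; the boundary F1 score used for edge sharpness needs a boundary-extraction step and is not sketched here.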
Summary Table: Boundary Sharpness and Throughput
| Method | Unreal4K F1 | UrbanSyn F1 | FPS | Resolution |
|---|---|---|---|---|
| DAv2 | 0.058 | 0.118 | 30 | 924×518 |
| DepthCrafter | 0.021 | 0.044 | 2.1 | 1024×576 |
| VidDepthAny | 0.049 | 0.097 | 24 | 924×518 |
| CUT3R | 0.007 | 0.019 | 14 | 512×288 |
| FlashDepth-L | 0.048 | 0.136 | 30 | 924×518 |
| FlashDepth-L High-Res | 0.143 | 0.271 | 6.0 | 2044×1148 |
| FlashDepth (Full) | 0.109 | 0.185 | 24 | 2044×1148 |
FlashDepth achieves 24 FPS streaming at 2K, with substantially higher boundary F1 (edge recovery) than any streaming or online baseline. In $\delta_1$ accuracy, it is within 1–2% of the strongest batch (non-streaming) approach, despite operating at higher resolution and lower latency.
6. Application Scenarios, Limitations, and Prospects
FlashDepth, both as a dToF sensor and as a real-time depth estimator, targets scenarios requiring high-frame-rate, precise 3D imaging and minimal latency:
- Autonomous Vehicles and Advanced Driver Assistance: Solid-state, compact dToF LiDAR for mid-range mapping, with robust performance under high ambient light.
- Mobile Robotics and Drones: On-device, low-latency depth for navigation, obstacle avoidance, and manipulation.
- VR/AR/MR: Perceptually accurate environment mapping and hand/body tracking, with sustained throughput for interactive systems.
- Video Post-Production: Accurate, real-time 2K depth maps for object compositing, relighting, and synthetic view generation.
A key limitation for video-based FlashDepth is residual flicker in challenging scenes with extreme motion, due to the Mamba module’s scale/shift-only alignment. Addressing pixel-perfect temporal consistency—possibly via lightweight per-frame optimization or bundle adjustment—is identified as a major research direction. In the sensor context, further scaling and 3D stacking are practical avenues for expanding spatial resolution and efficiency (Gyongy et al., 2022, Chou et al., 9 Apr 2025).
FlashDepth demonstrates the efficacy of architectural and algorithmic innovations—such as in-pixel histogramming, dynamic sliding gates, and hybrid recurrent pipelines—for achieving robust, real-time, and scalable 3D perception across hardware and software modalities.