
FlashDepth: 3D Sensing & Real-Time Depth

Updated 10 April 2026
  • FlashDepth names two separate systems: a direct time-of-flight LiDAR sensor with in-pixel histogramming, and a real-time video depth estimator.
  • The sensor employs adaptive sliding gates, dynamic peak locking, and center-of-mass refinement to achieve high depth precision with low output bandwidth.
  • The video model produces temporally stable 2K depth maps in real time, and the sensor operates at low power; target applications include automotive, robotics, AR/VR, and video post-production.

FlashDepth encompasses two distinct yet related entities within the domain of depth sensing and estimation: (1) a direct time-of-flight (dToF) solid-state LiDAR image sensor architecture that utilizes in-pixel histogramming and dynamic readout for high-throughput, low-latency 3D imaging (Gyongy et al., 2022); and (2) a real-time streaming high-resolution video depth estimation model that augments state-of-the-art single-image depth networks to provide consistent, accurate, and temporally stable depth maps from standard RGB video at 2K resolution (Chou et al., 9 Apr 2025). Both paradigms are designed for applications demanding rapid, data-efficient, and scalable 3D perception—including automotive, robotics, and augmented/mixed reality.

1. Direct Time-of-Flight FlashDepth Sensor: Architectural Principles

The FlashDepth sensor is a fully solid-state, flash-based dToF imager implemented as a single CMOS chip. Its SPAD array comprises 2048 “macropixels” arranged as a 64×32 grid, and each macropixel integrates a 4×4 SPAD sub-array with dedicated local processing (Gyongy et al., 2022).

Flood pulses (e.g., 850 nm, 10 ns) from a VCSEL or similar laser illuminate the entire scene. Returning photons are temporally quantized by each pixel into an 8-bin histogram using an in-pixel sliding time gate. Unlike full histogramming approaches that offload raw timestamps or 1024-bin histograms, FlashDepth performs histogramming, statistical peak locking, and background estimation directly within each macropixel.

In-pixel logic applies a cycling gate that is dynamically shifted across the unambiguous range and locked around photon return peaks, discarding background and out-of-range events. The local histogram $H_i(k)$ of pixel $i$, accumulated over $P$ pulses, records the count in bin $k$ as:

$$H_i(k) = \sum_{p=1}^{P} \mathbf{1}\left[\, k - \left\lfloor (t_{i,p} - T_{\text{gate}})/\Delta t \right\rfloor = 0 \,\right]$$

where $t_{i,p}$ is the photon arrival time registered by pixel $i$ on pulse $p$, $T_{\text{gate}}$ is the gate start time, and $\Delta t$ is the bin width. Depth then follows from the time of flight: $d = (c \cdot t_{\text{TOF}})/2$.
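The gated binning and the time-of-flight conversion above can be sketched in a few lines of Python. This is a minimal software illustration, not the chip's logic: the function names, the 0-based bin indices, and the example timestamps are assumptions made for the sketch.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def gated_histogram(timestamps_s, gate_start_s, bin_width_s, n_bins=8):
    """Bin photon arrival times into an n_bins histogram that starts at
    gate_start_s. Events falling outside the gate are discarded.
    Bins are 0-based here (hypothetical convention for this sketch)."""
    hist = [0] * n_bins
    for t in timestamps_s:
        k = math.floor((t - gate_start_s) / bin_width_s)
        if 0 <= k < n_bins:
            hist[k] += 1
    return hist


def tof_to_depth_m(t_tof_s):
    """Round-trip time of flight to one-way distance: d = c * t / 2."""
    return C * t_tof_s / 2.0


# Illustrative values only: 1 ns bins, gate opening at 60 ns.
example_hist = gated_histogram(
    [63.2e-9, 63.4e-9, 63.9e-9, 61.1e-9, 75e-9, 10e-9],
    gate_start_s=60e-9,
    bin_width_s=1e-9,
)
```

With these illustrative inputs, three events land in bin 3, one in bin 1, and the two out-of-gate events are dropped. Applying `tof_to_depth_m` to a bin width gives the depth span of one bin; a 250 ps bin corresponds to roughly 3.75 cm.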

SPAD front-end signals are pulse-shortened (~300 ps), summed, and funneled into a multi-event time-to-digital converter (TDC). Timing sources can be on-chip (gated ring oscillator, GRO, or delay line, DL) or external, supporting histogram bin widths down to 250 ps.

2. Embedded Histogramming, Peak Locking, and “Dynamic Vision” Readout

A major innovation is the in-pixel partial histogramming with adaptive sliding windows and dynamic peak locking. For each frame:

  1. The minimum bin within the 8-bin window estimates the ambient background $B$.
  2. The peak bin $h_\text{max}$ and its value are determined via a comparator tree.
  3. A detection threshold is computed:

PP0

  4. If the peak value exceeds this threshold, the pixel is flagged as detecting a true return; otherwise, the gate continues sliding.
  5. The gate index is incremented or decremented if the peak is near the edge of the window (bins 1-2 or 7-8), or “locked” if the peak is central.
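The per-frame update described in these steps can be sketched as a small state machine. The threshold form used here (background plus a multiple of its square root), the single-position slide while searching, the wrap-around over `n_positions`, and all names are assumptions for illustration; the source's actual threshold expression is not reproduced here.

```python
import math


def gate_step(hist, gate_index, k_sigma=3.0, n_positions=64):
    """One frame of the peak-locking loop for a single macropixel.

    hist        : the 8 bin counts of the current gate window
    gate_index  : current gate position (hypothetical integer index)
    Returns (detected, new_gate_index, moving_peak).
    """
    background = min(hist)                 # step 1: ambient estimate
    peak_value = max(hist)                 # step 2: peak value
    peak_bin = hist.index(peak_value) + 1  # 1-based, as in the text

    # Step 3: ASSUMED threshold form, not taken from the source.
    threshold = background + k_sigma * math.sqrt(max(background, 1))

    if peak_value <= threshold:
        # Step 4: no return found, keep sliding the gate over the range.
        return False, (gate_index + 1) % n_positions, False

    # Step 5: track the peak if it drifts towards a window edge.
    if peak_bin <= 2:
        return True, (gate_index - 1) % n_positions, True
    if peak_bin >= 7:
        return True, (gate_index + 1) % n_positions, True
    return True, gate_index, False         # peak is central: gate locked
```

In this sketch the third return value plays the role of the “moving peak” flag: it is raised only when a detected peak forces the gate to shift, not while the gate is still searching.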

Motion detection arises naturally: a “moving peak” flag is set if the gate shifts between frames, yielding a per-pixel motion mask PP2.

Two principal selective readout modes are supported: (a) surface-only, outputting data only for pixels with confirmed peaks; and (b) motion-triggered, transmitting data solely for pixels experiencing gate movement. Both minimize output toggles and reduce downstream bandwidth and power consumption.

Sub-bin depth refinement is achieved via center-of-mass (CMM) computation:

PP3
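A center-of-mass estimate over an 8-bin window can be sketched as a background-subtracted centroid. Subtracting the minimum bin as the background and taking the centroid over the whole window are assumptions of this sketch; the sensor's exact window and weighting may differ.

```python
def center_of_mass_bin(hist, background=None):
    """Fractional (0-based) bin position of the return peak.

    hist       : bin counts of the gate window
    background : per-bin ambient level; defaults to the minimum bin
                 (assumption for this sketch)
    Returns None when no counts remain above the background.
    """
    if background is None:
        background = min(hist)
    weights = [max(h - background, 0) for h in hist]
    total = sum(weights)
    if total == 0:
        return None
    return sum(k * w for k, w in enumerate(weights)) / total
```

A symmetric peak such as `[2, 2, 2, 12, 22, 12, 2, 2]` returns exactly the peak bin, while a peak skewed towards the next bin returns a fractional position between the two, which is what provides resolution finer than one bin.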

3. Performance Metrics, Data Efficiency, and Scalability

FlashDepth balances accuracy, throughput, and I/O efficiency through architectural and algorithmic choices:

  • Depth Precision and Accuracy: For PP4, ranges up to 50 m yield non-linearity PP5 cm and standard deviation PP6 cm at 50 fps, even under bright (20 klux) ambient conditions; for PP7, reliable detection extends to 40 m with PP8 cm.
  • Frame Rate and Compression: Histogram mode outputs 108 bits per macropixel at up to 29 kFPS; direct depth at bin resolution (12 bits) reaches 260 kFPS, and sub-bin depth (CMM, 15 bits) reaches 208 kFPS, a 7–9× data reduction compared to raw histograms. Smart readout can reduce I/O pad switching power to below 1 mW.
  • SPAD Non-Linearity and False Positives: Differential non-linearity (DNL) of 14% (1 ns DL timing), or 1.6% (8 ns external clock). False positive rate (FPR) remains at 0.2% (for PP9, kk0 counts/bin).
  • Power: 70 mW total at 50 FPS/30 klux (SPADs 32 mW; digital 36 mW; I/O 1.3 mW). Smart modes further lower I/O power.
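The data-reduction factors quoted above follow directly from the per-macropixel word sizes, as the short calculation below shows. The output-rate estimate additionally assumes that every one of the 64×32 macropixels is read out each frame with no protocol overhead, which is an assumption of this sketch rather than a figure from the source.

```python
MACROPIXELS = 64 * 32  # macropixel grid size stated in Section 1

# Bits per macropixel and maximum frame rate for each readout mode.
MODES = {
    "histogram": (108, 29_000),
    "depth_bin": (12, 260_000),
    "depth_cmm": (15, 208_000),
}


def reduction_vs_histogram(mode):
    """How many times fewer bits a mode emits than histogram mode."""
    return MODES["histogram"][0] / MODES[mode][0]


def output_rate_gbit_s(mode):
    """Full-array output rate, assuming every macropixel is read each frame."""
    bits, fps = MODES[mode]
    return MACROPIXELS * bits * fps / 1e9
```

Under these assumptions the three modes come out at nearly the same output rate of about 6.4 Gbit/s, which suggests the higher frame rates of the depth modes are obtained by spending the same bandwidth on smaller words.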

The sensor is implemented in standard 40 nm FSI CMOS. With row skipping, per-pixel gate control, and selective readout, the design scales favorably to larger arrays, especially via 3D stacking (macropixel logic area matches SPAD area, which enables backside-illuminated architectures).

4. FlashDepth for Real-time High-Resolution Video Depth Estimation

The second system bearing the FlashDepth name is a model for streaming video depth estimation at 2K resolution and real-time rates (Chou et al., 9 Apr 2025). It modifies the Depth Anything v2 (DAv2) single-image depth network by embedding lightweight spatial and temporal modules:

  • Base Model Architecture: Uses a DINOv2 ViT encoder and a DPT decoder, operating on 2K (2044×1148) video frames.
  • Temporal Scale Alignment: Integrates a recurrent state-space (Mamba) module before the final convolution head, enforcing online temporal scale and shift consistency, thus mitigating depth flicker while introducing minimal computational overhead.
  • Hybrid Dual-Resolution Stream: Runs two streams in parallel: a fast small stream on the full-resolution frame (FlashDepth-S) and a more accurate large stream on a downsampled copy (FlashDepth-L, short side 518 px). Cross-attention blocks fuse intermediate decoder features from the large stream into the small stream.
  • Training Strategy: Employs a two-stage protocol:
    • Stage 1: Consistency pre-training (low-res synthetic) with L1 depth supervision, training Mamba only.
    • Stage 2: Hybrid fine-tuning (2K images), initializing cross-attention to zero and freezing large-stream weights, teaching the small stream to exploit large-stream features.

Inference is performed in an online, per-frame autoregressive loop without any batch processing or sliding-window stitching.
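The shape of such an online loop can be sketched as a generator that carries a small recurrent state from frame to frame. The learned Mamba module is replaced here by a plain exponential moving average of each frame's shift (median) and scale (mean absolute deviation), purely as a stand-in; `predict_frame`, `ema_scale_shift_step`, and the momentum value are hypothetical names and settings, not part of the published model.

```python
from statistics import median


def ema_scale_shift_step(raw_depth, state, momentum=0.9):
    """Stand-in for the learned temporal module: re-express one frame's
    depth values using a smoothed (scale, shift) carried in `state`."""
    shift = median(raw_depth)
    scale = sum(abs(d - shift) for d in raw_depth) / len(raw_depth) or 1.0
    if state is None:
        state = (scale, shift)  # first frame defines the reference
    else:
        state = (
            momentum * state[0] + (1.0 - momentum) * scale,
            momentum * state[1] + (1.0 - momentum) * shift,
        )
    aligned = [(d - shift) / scale * state[0] + state[1] for d in raw_depth]
    return aligned, state


def stream_depth(frames, predict_frame, step=ema_scale_shift_step):
    """Online inference: one frame in, one depth map out, no look-ahead.
    `predict_frame` is a hypothetical per-frame depth predictor."""
    state = None
    for frame in frames:
        aligned, state = step(predict_frame(frame), state)
        yield aligned
```

Because the generator consumes frames one at a time and only the small state tuple persists between iterations, memory use does not grow with video length, which is the property the per-frame streaming design relies on.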

5. Comparative Evaluation and Quantitative Benchmarks

The video-based FlashDepth is evaluated on five challenging datasets (ETH3D, Sintel, Waymo, Unreal4K, and UrbanSyn) using AbsRel (mean absolute relative error), $\delta_1$ (the percentage of pixels whose predicted-to-true depth ratio, taken in either direction, is below 1.25), and boundary F1 scores.
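The two depth-accuracy metrics have standard definitions that are easy to state in code. The sketch below works on flat lists of per-pixel depths and skips pixels without valid ground truth; the function names are illustrative, and any scale/shift alignment that a benchmark applies before scoring is outside its scope.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)
```

Lower AbsRel is better, while higher $\delta_1$ is better; multiplying the `delta1` result by 100 gives the percentage form used in most benchmark tables.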

Summary Table: Boundary Sharpness and Throughput

Method                | Unreal4K F1 | UrbanSyn F1 | FPS | Resolution
DAv2                  | 0.058       | 0.118       | 30  | 924×518
DepthCrafter          | 0.021       | 0.044       | 2.1 | 1024×576
VidDepthAny           | 0.049       | 0.097       | 24  | 924×518
CUT3R                 | 0.007       | 0.019       | 14  | 512×288
FlashDepth-L          | 0.048       | 0.136       | 30  | 924×518
FlashDepth-L High-Res | 0.143       | 0.271       | 6.0 | 2044×1148
FlashDepth (Full)     | 0.109       | 0.185       | 24  | 2044×1148

FlashDepth streams at 24 FPS at 2K resolution, with substantially higher boundary F1 (edge recovery) than any streaming or online baseline. In $\delta_1$ accuracy, it is within 1–2% of the strongest batch (non-streaming) approach, despite operating at higher resolution and lower latency.

6. Application Scenarios, Limitations, and Prospects

FlashDepth, both as a dToF sensor and as a real-time depth estimator, targets scenarios requiring high-frame-rate, precise 3D imaging and minimal latency:

  • Autonomous Vehicles and Advanced Driver Assistance: Solid-state, compact dToF LiDAR for mid-range mapping, with robust performance under high ambient light.
  • Mobile Robotics and Drones: On-device, low-latency depth for navigation, obstacle avoidance, and manipulation.
  • VR/AR/MR: Perceptually accurate environment mapping and hand/body tracking, with sustained throughput for interactive systems.
  • Video Post-Production: Accurate, real-time 2K depth maps for object compositing, relighting, and synthetic view generation.

A key limitation of the video-based FlashDepth is residual flicker in challenging scenes with extreme motion, because the Mamba module aligns only scale and shift. Achieving pixel-accurate temporal consistency, possibly via lightweight per-frame optimization or bundle adjustment, is identified as a major research direction. For the sensor, further scaling and 3D stacking are practical avenues for increasing spatial resolution and efficiency (Gyongy et al., 2022; Chou et al., 9 Apr 2025).

FlashDepth demonstrates the efficacy of architectural and algorithmic innovations—such as in-pixel histogramming, dynamic sliding gates, and hybrid recurrent pipelines—for achieving robust, real-time, and scalable 3D perception across hardware and software modalities.
