FlashDepth: 3D Sensing & Real-Time Depth

Updated 10 April 2026

FlashDepth refers to two approaches to depth sensing: a direct time-of-flight LiDAR sensor with in-pixel histogramming, and a real-time video depth estimator.
The sensor employs adaptive sliding gates, dynamic peak locking, and center-of-mass refinement to achieve high depth precision with power-efficient operation.
The video model delivers temporally stable 2K depth maps in real time, with applications in automotive, robotics, AR/VR, and video post-production.

FlashDepth encompasses two distinct yet related entities within the domain of depth sensing and estimation: (1) a direct time-of-flight (dToF) solid-state LiDAR image sensor architecture that utilizes in-pixel histogramming and dynamic readout for high-throughput, low-latency 3D imaging (Gyongy et al., 2022); and (2) a real-time streaming high-resolution video depth estimation model that augments state-of-the-art single-image depth networks to provide consistent, accurate, and temporally stable depth maps from standard RGB video at 2K resolution (Chou et al., 9 Apr 2025). Both paradigms are designed for applications demanding rapid, data-efficient, and scalable 3D perception—including automotive, robotics, and augmented/mixed reality.

1. Direct Time-of-Flight FlashDepth Sensor: Architectural Principles

FlashDepth is a fully solid-state, flash-based dToF imager implemented as a single CMOS chip, featuring a SPAD array organized as a 64×32 grid of 2048 “macropixels.” Each macropixel integrates a 4×4 SPAD sub-array with dedicated local processing (Gyongy et al., 2022).

Flood pulses (e.g., 850 nm, 10 ns) from a VCSEL or similar laser illuminate the entire scene. Returning photons are temporally quantized by each pixel into an 8-bin histogram using an in-pixel sliding time gate. Unlike full histogramming approaches that offload raw timestamps or 1024-bin histograms, FlashDepth performs histogramming, statistical peak locking, and background estimation directly within each macropixel.

In-pixel logic applies a cycling gate that is dynamically shifted across the unambiguous range and locked around photon return peaks, discarding background and out-of-range events. The local histogram $H_i(k)$ of pixel $i$, accumulated over $P$ pulses, holds the count in bin $k$:

$H_i(k) = \sum_{p=1}^{P} \mathbf{1}\left[ \lfloor (t_{i,p} - T_{\text{gate}})/\Delta t \rfloor = k \right]$

where $t_{i,p}$ is the photon arrival time recorded at pixel $i$ for pulse $p$, $T_{\text{gate}}$ is the gate start time, and $\Delta t$ is the bin width. Depth is subsequently extracted from the time of flight, $d = (c \cdot t_{\text{TOF}})/2$.
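The histogram definition and the time-of-flight conversion above can be sketched in a few lines of Python. This is a minimal software illustration, not the sensor's in-pixel logic: the function names, the use of integer picosecond timestamps, and the example values are all assumptions made for the sketch.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum

def gated_histogram(timestamps_ps, gate_start_ps, bin_width_ps, n_bins=8):
    """Count photon timestamps into n_bins bins that start at gate_start_ps.

    Timestamps are integers in picoseconds (an assumption of this sketch),
    so the floor division below is exact. Events outside the gate are dropped.
    """
    hist = [0] * n_bins
    for t in timestamps_ps:
        k = (t - gate_start_ps) // bin_width_ps  # floor((t - T_gate) / dt)
        if 0 <= k < n_bins:
            hist[k] += 1
    return hist

def tof_to_depth_m(t_tof_ps):
    """Convert a round-trip time of flight in picoseconds to depth in metres."""
    return C_M_PER_S * t_tof_ps * 1e-12 / 2

# Hypothetical example: gate opens at 100 ns, 1 ns bins, seven photon events.
events = [100_500, 103_200, 103_900, 103_100, 99_000, 108_000, 107_999]
hist = gated_histogram(events, gate_start_ps=100_000, bin_width_ps=1_000)
print(hist)                     # [1, 0, 0, 3, 0, 0, 0, 1]
print(tof_to_depth_m(100_000))  # roughly 14.99 m for a 100 ns round trip
```

The two events at 99 ns and 108 ns fall outside the 8 ns gate and are discarded, which mirrors how the gate rejects out-of-range returns.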

SPAD front-end signals are pulse-shortened ($\sim 300$ ps), summed, and funneled into a multi-event time-to-digital converter (TDC). Timing sources can be on-chip (gated ring oscillator, GRO, or delay line, DL) or external, supporting histogram bin widths down to 250 ps.

2. Embedded Histogramming, Peak Locking, and “Dynamic Vision” Readout

A major innovation is the in-pixel partial histogramming with adaptive sliding windows and dynamic peak locking. For each frame:

The minimum bin within the 8-bin window estimates ambient background $B$ .
The peak bin $h_\text{max}$ and its value are determined via a comparator tree.
A detection threshold is computed from the background estimate $B$.
If the peak value exceeds this threshold, the pixel is flagged as detecting a true return; otherwise, the gate continues sliding.
The gate index is incremented or decremented if the peak is near the edge (bins 1-2 or 7-8), or “locked” if central.

Motion detection arises naturally: a “moving peak” flag is set if the gate shifts between frames, yielding a per-pixel motion mask.
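The per-frame steps above can be sketched as a small state update for one macropixel. Several details here are assumptions rather than the sensor's documented behaviour: the threshold form (background plus `k_sigma` times a square-root noise term), the direction in which the gate steps for an early or late peak, the number of gate positions, and the choice to raise the motion flag only when a detected peak forces a gate shift. The name `update_gate` is hypothetical.

```python
import math

def update_gate(hist, gate_index, k_sigma=3.0, n_gates=64):
    """One frame of peak detection and gate tracking for a single macropixel.

    Returns (detected, new_gate_index, moving).
    """
    background = min(hist)  # minimum bin estimates the ambient level B
    peak_bin = max(range(len(hist)), key=hist.__getitem__)  # 0-based index
    peak_val = hist[peak_bin]

    # ASSUMED threshold form: B plus k_sigma times a shot-noise-like term.
    threshold = background + k_sigma * math.sqrt(max(background, 1))
    detected = peak_val > threshold

    if not detected:
        new_gate = (gate_index + 1) % n_gates    # keep sliding (cycling gate)
    elif peak_bin <= 1:                          # bins 1-2: peak near early edge
        new_gate = max(gate_index - 1, 0)
    elif peak_bin >= len(hist) - 2:              # bins 7-8: peak near late edge
        new_gate = min(gate_index + 1, n_gates - 1)
    else:
        new_gate = gate_index                    # peak is central: lock

    moving = detected and new_gate != gate_index
    return detected, new_gate, moving

print(update_gate([2, 3, 2, 20, 4, 2, 3, 2], gate_index=10))  # (True, 10, False)
print(update_gate([2, 18, 3, 2, 2, 3, 2, 2], gate_index=10))  # (True, 9, True)
print(update_gate([2, 3, 2, 3, 2, 3, 2, 3], gate_index=10))   # (False, 11, False)
```

Collecting the third return value over all macropixels gives the per-pixel motion mask described above.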

Two principal selective readout modes are supported: (a) surface-only, outputting data only for pixels with confirmed peaks; and (b) motion-triggered, transmitting data solely for pixels experiencing gate movement. Both minimize output toggles and reduce downstream bandwidth and power consumption.

Sub-bin depth refinement is achieved via a center-of-mass (CMM) computation over the histogram bins.
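As an illustration of sub-bin refinement, the sketch below computes a background-subtracted center of mass over the 8-bin window and converts it to depth. It is one common form of the technique, not necessarily the exact expression used on the chip: subtracting the minimum bin as background, clipping negative weights to zero, and taking the bin centre (the `+ 0.5` offset) as the reference time are all assumptions, as are the function names.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum

def center_of_mass_bin(hist):
    """Return the fractional (0-based) bin position of the return peak.

    The minimum bin is treated as the background level B and subtracted
    first (assumption). Returns None when no counts remain above background.
    """
    background = min(hist)
    weights = [max(h - background, 0) for h in hist]
    total = sum(weights)
    if total == 0:
        return None
    return sum(k * w for k, w in enumerate(weights)) / total

def sub_bin_depth_m(hist, gate_start_ps, bin_width_ps):
    """Depth in metres from the center-of-mass position within the gate."""
    k_hat = center_of_mass_bin(hist)
    if k_hat is None:
        return None
    t_tof_ps = gate_start_ps + (k_hat + 0.5) * bin_width_ps  # bin centre (assumption)
    return C_M_PER_S * t_tof_ps * 1e-12 / 2

print(center_of_mass_bin([2, 2, 2, 12, 22, 12, 2, 2]))  # 4.0 (symmetric peak)
print(center_of_mass_bin([2, 2, 2, 22, 12, 2, 2, 2]))   # about 3.33 (skewed early)
print(sub_bin_depth_m([2, 2, 2, 12, 22, 12, 2, 2], 100_000, 1_000))
```

A symmetric peak lands exactly on its central bin, while an asymmetric one shifts the estimate by a fraction of a bin, which is where the resolution beyond the bin width comes from.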

3. Performance Metrics, Data Efficiency, and Scalability

FlashDepth balances accuracy, throughput, and I/O efficiency through architectural and algorithmic choices:

Depth Precision and Accuracy: For $P$ 4, ranges up to 50 m yield non-linearity $P$ 5 cm and standard deviation $P$ 6 cm at 50 fps, even under bright (20 klux) ambient conditions; for $P$ 7, reliable detection extends to 40 m with $P$ 8 cm.
Frame Rate and Compression: In histogram mode, 108 bits per macropixel at up to 29 kFPS; direct depth (bin resolution, 12 bits) at up to 260 kFPS; sub-bin depth (CMM, 15 bits) at up to 208 kFPS, achieving 7–9× data reduction compared to raw histograms. Smart readout can reduce I/O pad switching power to below 1 mW.
SPAD Non-Linearity and False Positives: Differential non-linearity (DNL) of 14% (1 ns DL timing), or 1.6% (8 ns external clock). False positive rate (FPR) remains at 0.2% (for $P$ 9, $k$ 0 counts/bin).
Power: 70 mW total at 50 FPS/30 klux (SPADs 32 mW; digital 36 mW; I/O 1.3 mW). Smart modes further lower I/O power.
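The quoted bit widths, frame rates, and power figures can be cross-checked with simple arithmetic. The sketch below assumes that the bit counts are per macropixel per frame and that all 64×32 macropixels are read out every frame (that is, no selective readout); the function names are hypothetical.

```python
MACROPIXELS = 64 * 32  # 2048 macropixels
HISTOGRAM_BITS = 108   # bits per macropixel in histogram mode

def output_rate_gbps(bits_per_macropixel, frames_per_second):
    """Raw output data rate in Gbit/s when every macropixel is read out."""
    return MACROPIXELS * bits_per_macropixel * frames_per_second / 1e9

def reduction_vs_histogram(bits_per_macropixel):
    """Per-frame data reduction relative to the 108-bit histogram readout."""
    return HISTOGRAM_BITS / bits_per_macropixel

modes = {
    "histogram": (108, 29_000),
    "depth, bin resolution": (12, 260_000),
    "depth, sub-bin (CMM)": (15, 208_000),
}
for name, (bits, fps) in modes.items():
    print(f"{name}: {output_rate_gbps(bits, fps):.3f} Gbit/s, "
          f"{reduction_vs_histogram(bits):.1f}x smaller than a histogram")

total_power_mw = 32 + 36 + 1.3  # SPADs + digital + I/O
print(f"power budget: {total_power_mw:.1f} mW")
```

Under these assumptions the reduction factors come out as 9.0× and 7.2×, matching the quoted 7–9× range, and the power components sum to 69.3 mW, consistent with the rounded 70 mW total. The three modes also land within about 0.4% of one another at roughly 6.4 Gbit/s, which would be consistent with the maximum frame rates being bounded by a shared output bandwidth, although the source does not state this.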

The sensor is implemented in standard 40 nm FSI CMOS. With row skipping, per-pixel gate control, and selective readout, the design scales favorably to larger arrays, especially via 3D stacking (macropixel logic area matches SPAD area, which enables backside-illuminated architectures).

4. FlashDepth for Real-time High-Resolution Video Depth Estimation

A parallel innovation in the FlashDepth nomenclature is a model for streaming video depth estimation at 2K resolution and real-time rates (Chou et al., 9 Apr 2025). This approach leverages modifications to the Depth Anything v2 (DAv2) single-image depth network, embedding lightweight spatial and temporal modules:

Base Model Architecture: Uses a DINOv2 ViT encoder and a DPT decoder, operating on 2K (2044×1148) video frames.
Temporal Scale Alignment: Integrates a recurrent state-space (Mamba) module before the final convolution head, enforcing online temporal scale and shift consistency, thus mitigating depth flicker while introducing minimal computational overhead.
Hybrid Dual-Resolution Stream: Runs two streams in parallel, a fast small model (FlashDepth-S) on full-resolution frames and a more accurate large model (FlashDepth-L) on downsampled frames (short side 518 px). Cross-attention blocks fuse intermediate decoder features from the large stream into the small stream.
Training Strategy: Employs a two-stage protocol:
- Stage 1: Consistency pre-training (low-res synthetic) with L1 depth supervision, training Mamba only.
- Stage 2: Hybrid fine-tuning (2K images), initializing cross-attention to zero and freezing large-stream weights, teaching the small stream to exploit large-stream features.
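To see why zero initialization matters in Stage 2, the toy sketch below fuses "small-stream" tokens with "large-stream" tokens through a single-head cross-attention block added as a residual. It is a pure-Python illustration with tiny hand-written matrices, not the model's implementation: placing the zero-initialized weights in the output projection `w_out`, the residual form, and all names and values are assumptions of the sketch.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention_fuse(small, large, w_q, w_k, w_v, w_out):
    """Residual cross-attention: queries from small tokens, keys/values from large."""
    q, k, v = matmul(small, w_q), matmul(large, w_k), matmul(large, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    update = matmul(matmul(softmax_rows(scores), v), w_out)
    return [[s + u for s, u in zip(s_row, u_row)] for s_row, u_row in zip(small, update)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three small-stream tokens
large = [[2.0, 0.0], [0.0, 2.0]]              # two large-stream tokens

at_init = cross_attention_fuse(small, large, identity, identity, identity, zeros)
trained = cross_attention_fuse(small, large, identity, identity, identity, identity)
print(at_init == small)  # True: the fusion starts as a no-op
print(trained == small)  # False: non-zero weights inject large-stream features
```

With the output projection at zero, the fused features equal the small stream's own features, so fine-tuning starts from the small model's existing behaviour and learns to draw on the large stream gradually.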

Inference is performed in an online, per-frame autoregressive loop without any batch processing or sliding-window stitching.

5. Comparative Evaluation and Quantitative Benchmarks

The video-based FlashDepth is evaluated across five challenging datasets (ETH3D, Sintel, Waymo, Unreal4K, and UrbanSyn) using AbsRel (mean absolute relative error), $\delta_1$ (percentage of pixels within a 25% relative error), and boundary F1 scores.
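For reference, the two accuracy metrics can be written in a few lines. The sketch follows the usual convention in the depth estimation literature, where a pixel counts toward the threshold metric when max(pred/gt, gt/pred) < 1.25. It assumes strictly positive, valid depths and predictions that have already been aligned to the ground truth; any scale and shift alignment used in the actual evaluation protocol is not reproduced here.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

# Hypothetical depths in metres for four pixels.
gt = [1.0, 2.0, 4.0, 10.0]
pred = [1.1, 2.0, 5.5, 9.0]
print(abs_rel(pred, gt))  # about 0.144
print(delta1(pred, gt))   # 0.75 (the 5.5 m vs 4.0 m pixel misses the threshold)
```

Lower AbsRel and higher threshold accuracy are better; neither metric says anything about edge sharpness, which is why the boundary F1 score is reported separately.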

Summary Table: Boundary Sharpness and Throughput

Method	Unreal4K F1	UrbanSyn F1	FPS	Resolution
DAv2	0.058	0.118	30	924×518
DepthCrafter	0.021	0.044	2.1	1024×576
VidDepthAny	0.049	0.097	24	924×518
CUT3R	0.007	0.019	14	512×288
FlashDepth-L	0.048	0.136	30	924×518
FlashDepth-L High-Res	0.143	0.271	6.0	2044×1148
FlashDepth (Full)	0.109	0.185	24	2044×1148

FlashDepth achieves 24 FPS streaming at 2K, with substantially higher boundary F1 (edge recovery) than any streaming or online baseline. In $\delta_1$ accuracy, it is within 1–2% of the strongest batch (non-streaming) approach, despite operating at higher resolution and lower latency.

6. Application Scenarios, Limitations, and Prospects

FlashDepth, both as a dToF sensor and as a real-time depth estimator, targets scenarios requiring high-frame-rate, precise 3D imaging and minimal latency:

Autonomous Vehicles and Advanced Driver Assistance: Solid-state, compact dToF LiDAR for mid-range mapping, with robust performance under high ambient light.
Mobile Robotics and Drones: On-device, low-latency depth for navigation, obstacle avoidance, and manipulation.
VR/AR/MR: Perceptually accurate environment mapping and hand/body tracking, with sustained throughput for interactive systems.
Video Post-Production: Accurate, real-time 2K depth maps for object compositing, relighting, and synthetic view generation.

A key limitation for video-based FlashDepth is residual flicker in challenging scenes with extreme motion, due to the Mamba module’s scale/shift-only alignment. Addressing pixel-perfect temporal consistency—possibly via lightweight per-frame optimization or bundle adjustment—is identified as a major research direction. In the sensor context, further scaling and 3D stacking are practical avenues for expanding spatial resolution and efficiency (Gyongy et al., 2022, Chou et al., 9 Apr 2025).

FlashDepth demonstrates the efficacy of architectural and algorithmic innovations—such as in-pixel histogramming, dynamic sliding gates, and hybrid recurrent pipelines—for achieving robust, real-time, and scalable 3D perception across hardware and software modalities.

References

Gyongy et al. (2022). A direct time-of-flight image sensor with in-pixel surface detection and dynamic vision.

Chou et al. (2025). FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution.
