Video-Rate Ptychography: Real-Time Phase Imaging

Updated 15 November 2025
  • Video-rate ptychography is a dynamic imaging method that fuses optical multiplexing with neural field techniques for real-time, high-resolution phase retrieval.
  • The approach leverages a low-rank, coordinate-based representation to efficiently reconstruct gigapixel-scale diffraction patterns with minimal measurement redundancy.
  • Practical implementations have demonstrated 308 nm resolution at 30 fps, enabling direct visualization of rapid mesoscale biological and material processes.

Video-rate ptychography refers to coherent diffraction imaging methods whose data acquisition, algorithmic reconstruction, and feedback throughput jointly achieve or exceed standard video frame rates (≥30 Hz), thus enabling direct visualization and quantification of dynamic mesoscale processes with nanometric or sub-micron resolution over extended fields of view. Unlike classical, sequential ptychography, which traditionally faces prohibitive trade-offs between spatial bandwidth, measurement redundancy, and computational latency, recent advances exploit neural representations, algorithmic parallelism, optical multiplexing, and hardware-accelerated inverse solvers to achieve real-time, gigapixel-scale phase retrieval. This entry reviews the mathematical underpinnings, algorithmic frameworks, physical implementations, performance metrics, empirical validations, and technological implications as delineated across the contemporary literature.

1. Mathematical Foundations and Inverse Problem Formulation

The space-time ptychographic inverse problem concerns the recovery of a dynamic, complex optical wavefield $\Psi(x,y,t)$ from temporal sequences of intensity-only measurements acquired in transmission or reflection, typically under constraints of lensless, coherent illumination and a space-bandwidth product (SBP) in excess of $10^{9}$ pixels·frames/s.

Recent video-rate paradigms recast $\Psi(x,y,t)\in\mathbb{C}$ as a low-rank factorization over space and time:

$$\Psi(x,y,t) \approx \sum_{i=1}^{r} S_i(x,y)\, T_i(t)$$

where $\{S_i(x,y)\}$ are spatial modes and $\{T_i(t)\}$ the corresponding temporal coefficients. The rationale is that a dynamic object, subject to physical continuity and slow temporal evolution (relative to sensor rate), occupies a restricted submanifold in the high-dimensional spatiotemporal product space, hence is compressible in rank-$r$ form, with $r \ll \min\{N_x N_y,\, N_t\}$.
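
To illustrate the storage argument, the following minimal sketch assembles a rank-$r$ field from its factors. Array sizes and variable names here are illustrative assumptions, not values from the referenced work:

```python
# Minimal sketch of a rank-r space-time factorization:
# Psi(x, y, t) ≈ sum_i S_i(x, y) * T_i(t), stored as two small factor arrays.
import numpy as np

Nx, Ny, Nt, r = 256, 256, 64, 8          # illustrative sizes, not from the paper

S = np.random.randn(r, Nx, Ny) + 1j * np.random.randn(r, Nx, Ny)   # spatial modes S_i
T = np.random.randn(r, Nt) + 1j * np.random.randn(r, Nt)           # temporal coefficients T_i

def wavefield(S, T):
    """Assemble Psi[t, x, y] = sum_i S[i] * T[i, t] from the low-rank factors."""
    return np.einsum('ixy,it->txy', S, T)

Psi = wavefield(S, T)                     # (Nt, Nx, Ny) complex field

# Storage: r*(Nx*Ny + Nt) complex numbers instead of Nx*Ny*Nt for the dense field.
```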

In the neural space–time field framework, these modes are not discretized arrays but are instead embedded in continuous coordinate-based representations: spatial coordinates $(x,y)\in[0,1]^2$ are encoded via multi-resolution hash embeddings $\mathcal{H}(x,y)\in\mathbb{R}^{FL}$, and temporal coordinates $t\in[0,1]$ are interpolated from a learned basis $\{\tau_j\}$. The joint features are fused by Hadamard product:

$$f(x,y,t) = \mathcal{H}(x,y) \odot \tau(t)$$

This approach enables a continuous, memory-efficient parameterization that scales favorably for gigapixel-scale reconstructions (Wang et al., 8 Nov 2025).
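
A simplified sketch of the fused coordinate encoding is given below. A single learnable feature grid with bilinear interpolation stands in for the multi-resolution hash table, and the temporal basis is linearly interpolated; all module names, dimensions, and the grid substitution are assumptions for illustration rather than the reference implementation:

```python
# Simplified stand-in for the space-time feature encoding and Hadamard fusion.
import torch
import torch.nn.functional as F

class SpaceTimeFeatures(torch.nn.Module):
    def __init__(self, feat_dim=32, grid=128, n_temporal=16):
        super().__init__()
        # single dense feature grid as a stand-in for the multi-resolution hash table
        self.grid = torch.nn.Parameter(0.01 * torch.randn(1, feat_dim, grid, grid))
        # learned temporal basis {tau_j}
        self.tau = torch.nn.Parameter(0.01 * torch.randn(n_temporal, feat_dim))

    def forward(self, xy, t):
        # xy: (N, 2) in [0,1]^2; t: (N,) in [0,1]
        coords = (xy * 2 - 1).view(1, -1, 1, 2)                    # to [-1,1] for grid_sample
        h = F.grid_sample(self.grid, coords, align_corners=True)   # (1, F, N, 1)
        h = h.squeeze(0).squeeze(-1).t()                           # (N, F) spatial features H(x,y)
        # linear interpolation of the learned temporal basis
        pos = t * (self.tau.shape[0] - 1)
        lo = pos.floor().long().clamp(max=self.tau.shape[0] - 2)
        w = (pos - lo.float()).unsqueeze(-1)
        tau_t = (1 - w) * self.tau[lo] + w * self.tau[lo + 1]      # (N, F) temporal features tau(t)
        return h * tau_t                                           # Hadamard fusion f(x,y,t)
```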

Physical measurement models are incorporated via generalized angular-spectrum diffraction propagation:

$$I_t(x,y) = \left| \mathcal{P}_{d_2}\!\left\{ \mathcal{C}(x-x_t,\, y-y_t)\; \mathcal{P}_{d_1}\{\Psi(\cdot,t)\} \right\} \right|^2$$

where the coded surface $\mathcal{C}(x,y)$ (implemented via a lithographically defined or randomly deposited phase/amplitude mask) and the shift $(x_t, y_t)$ parameterize the illumination/sensor apparatus, and $\mathcal{P}_{d}$ denotes angular-spectrum propagation over distance $d$.
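
A minimal numpy sketch of this forward model is shown below, under simplifying assumptions (integer-pixel mask shift, evanescent components dropped); distances, wavelength, and pixel size are user-supplied parameters, and the function names are illustrative:

```python
# Sketch of the coded-sensor forward model: propagate by d1, multiply by the
# shifted coded surface, propagate by d2, and record intensity.
import numpy as np

def angular_spectrum(field, d, wavelength, pixel_size):
    """Free-space propagation of a 2-D complex field over distance d."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, pixel_size)
    fy = np.fft.fftfreq(ny, pixel_size)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))      # evanescent components suppressed
    H = np.exp(1j * kz * d)
    return np.fft.ifft2(np.fft.fft2(field) * H)

def forward_measurement(psi_t, coded_surface, shift, d1, d2, wavelength, pixel_size):
    """I_t = | P_d2{ C(x - x_t, y - y_t) * P_d1{ Psi(.,t) } } |^2"""
    at_mask = angular_spectrum(psi_t, d1, wavelength, pixel_size)
    shifted_mask = np.roll(coded_surface, shift, axis=(0, 1))   # integer-pixel shift for brevity
    at_sensor = angular_spectrum(at_mask * shifted_mask, d2, wavelength, pixel_size)
    return np.abs(at_sensor) ** 2
```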

2. Algorithmic Architectures and Loss Functionals

Neural-field-based video-rate ptychography employs a pair of small, independent multi-layer perceptrons (MLPs), one each for the real and imaginary wavefield components:

$$\begin{aligned} \Re[\hat\Psi(x,y,t)] &= \mathrm{MLP}_\mathrm{real}\big(f(x,y,t)\big) \\ \Im[\hat\Psi(x,y,t)] &= \mathrm{MLP}_\mathrm{imag}\big(f(x,y,t)\big) \end{aligned}$$

This sidesteps the non-differentiable phase-wrapping that afflicts amplitude–phase representations, ensuring convergence for high-phase-gradient samples.
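
A sketch of such a dual-MLP head is shown below; layer widths and depths are illustrative assumptions, not values from the reference:

```python
# Two small MLPs map the fused feature f(x,y,t) to the real and imaginary parts
# of the wavefield, avoiding phase-wrapping in an amplitude/phase head.
import torch

def make_mlp(in_dim, hidden=64, depth=2):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [torch.nn.Linear(d, hidden), torch.nn.ReLU()]
        d = hidden
    layers += [torch.nn.Linear(d, 1)]
    return torch.nn.Sequential(*layers)

class ComplexFieldHead(torch.nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp_real = make_mlp(feat_dim)
        self.mlp_imag = make_mlp(feat_dim)

    def forward(self, features):                  # features: (N, feat_dim)
        re = self.mlp_real(features)
        im = self.mlp_imag(features)
        return torch.complex(re, im).squeeze(-1)  # complex wavefield samples Psi_hat
```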

To ensure fidelity in reconstructing fine spatial features and mitigate the limitations of standard amplitude losses—especially in the presence of measurement noise and complex-valued random-phase data—a gradient-domain loss is defined in terms of the square-root intensity spatial derivatives:

$$L_\mathrm{grad} = \sum_t \left\| \nabla_{x,y}\sqrt{I_t(x,y)} - \nabla_{x,y}\sqrt{J_t(x,y)} \right\|_2^2$$

$J_t$ denotes the experimentally observed frames. This objective explicitly matches edge features and is robust against high spatial-frequency errors, yielding more stable and rapid convergence in phase-rich image regions.
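
A hedged sketch of this loss is given below, with finite differences standing in for the spatial gradient and all variable names assumed:

```python
# Gradient-domain loss: compare d/dx and d/dy of the square-root intensities
# of predicted and measured frames.
import torch

def gradient_domain_loss(I_pred, I_meas, eps=1e-8):
    """I_pred, I_meas: (T, H, W) nonnegative intensities."""
    a = torch.sqrt(I_pred.clamp_min(0) + eps)
    b = torch.sqrt(I_meas.clamp_min(0) + eps)
    dax, day = a[..., :, 1:] - a[..., :, :-1], a[..., 1:, :] - a[..., :-1, :]
    dbx, dby = b[..., :, 1:] - b[..., :, :-1], b[..., 1:, :] - b[..., :-1, :]
    return ((dax - dbx) ** 2).sum() + ((day - dby) ** 2).sum()
```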

All network parameters (spatial hash tables, temporal embeddings, MLP weights) are optimized end-to-end via Adam (learning rate $10^{-3}$), backpropagating the joint loss across all $N_t$ frames to exploit the spatiotemporal correlations (Wang et al., 8 Nov 2025). The batch optimization over time transforms what would be a massively underdetermined per-frame phase retrieval into a globally over-constrained identification problem.
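
The sketch below ties the preceding pieces together in a training loop. Dummy data and a trivial intensity model stand in for the coded-sensor physics; only the end-to-end optimization structure and the Adam learning rate reflect the text, and all sizes are assumptions:

```python
# End-to-end optimization sketch (assumes the SpaceTimeFeatures, ComplexFieldHead,
# and gradient_domain_loss sketches above; in practice a differentiable coded-sensor
# forward model would replace the |psi|^2 placeholder).
import torch

T, H, W = 8, 32, 32
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing='ij')
xy = torch.stack([xs.flatten(), ys.flatten()], dim=-1).repeat(T, 1)   # (T*H*W, 2) coords
tt = torch.linspace(0, 1, T).repeat_interleave(H * W)                 # (T*H*W,) times
I_meas = torch.rand(T, H, W)                                          # placeholder measurements

features, head = SpaceTimeFeatures(), ComplexFieldHead()
opt = torch.optim.Adam(list(features.parameters()) + list(head.parameters()), lr=1e-3)

for step in range(200):
    opt.zero_grad()
    psi = head(features(xy, tt)).reshape(T, H, W)   # predicted complex field per frame
    I_pred = psi.abs() ** 2                         # stand-in for the full coded forward model
    loss = gradient_domain_loss(I_pred, I_meas)     # joint loss over all N_t frames
    loss.backward()
    opt.step()
```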

3. Physical Implementations and Empirical Performance

The archetypal hardware instantiation is a single-camera lensless imaging system with a custom-coded sensor. In (Wang et al., 8 Nov 2025), an ON Semiconductor AR2020 CMOS chip (5 MP, $1.4\,\mu\mathrm{m}$ pitch, 30 fps) is used with a 405 nm diode laser for centimeter-scale coherent illumination. A voice-coil actuator scans the coded surface beneath the sample in sub-pixel increments, monitored via fiducial diffraction signatures.

Experimental results demonstrate:

  • 308 nm linewidth resolution at 30 fps over $1.5\,\mathrm{cm}^2$, a true gigapixel SBP at video throughput.
  • Dynamic sample monitoring, including snowflake melting (sub-micron resolvability over $40\,\mathrm{mm}^2$ at 30 fps), live label-free phase visualization of stem-cell migration and wound healing, and in-situ bacterial-mass quantification via the phase–mass conversion $m(x,y) \propto \phi(x,y)$ (see the sketch after this list).
  • High-throughput 3D microneedle dissolution kinetics mapping (hundreds of $700\,\mu\mathrm{m}$ needles).
  • Extreme ultraviolet (EUV) ptychography at 29 nm wavelength, with joint static-object, time-varying-probe recovery achieved in only 40 iterations and measurement overlap as low as 1.2% (compared to $\sim$1000+ iterations for previous orthogonal-probe approaches).
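
In standard quantitative-phase practice, the phase-to-mass conversion cited above is realized through the specific refractive increment. A hedged sketch follows; the value of $\alpha$ and all names are conventional assumptions, not taken from the reference:

```python
# Phase-to-dry-mass conversion (standard QPI relation, assumed here):
# dry-mass density sigma(x,y) = phi(x,y) * lambda / (2*pi*alpha).
import numpy as np

def dry_mass(phi, wavelength_um, pixel_area_um2, alpha_um3_per_pg=0.19):
    """Integrate dry mass (pg) over a phase map phi (radians)."""
    sigma = phi * wavelength_um / (2 * np.pi * alpha_um3_per_pg)   # pg / um^2
    return float(np.sum(sigma) * pixel_area_um2)                   # total picograms
```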

SBP scaling is sub-linear with respect to acquisition time: each new frame incrementally refines the spatiotemporal latent bases, unlike sequential tiling or scanning strategies in which $T_\mathrm{acq} \propto \mathrm{SBP}$.

4. Comparison with Sequential and Deep-Learning Accelerated Approaches

Traditional iterative model-based solvers for ptychography (e.g., ePIE, Difference Map, Alternating Projections) require extensive measurement redundancy (typically $\gtrsim 50\%$ spatial overlap) and hundreds to thousands of iterations per probe position, severely limiting frame rates and imposing high computational latency for large SBP (Cherukara et al., 2020, Zhou et al., 2022, Welker et al., 2022). Parallelization, as in (Zhou et al., 2022), provides up to $38\times$ acceleration but rarely achieves beyond $\sim$1 fps without substantial hardware investment.
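
For context, the sketch below shows a single ePIE-style object/probe update at one scan position, illustrating the per-position iterative structure contrasted above. Step-size values and names are assumptions; this is not the implementation of any cited work:

```python
# One ePIE-style update at a single scan position.
import numpy as np

def epie_update(obj_patch, probe, measured_amp, beta_obj=1.0, beta_probe=1.0):
    """obj_patch, probe: complex 2-D arrays; measured_amp: sqrt of measured intensity."""
    exit_wave = obj_patch * probe
    far_field = np.fft.fft2(exit_wave)
    # Fourier-magnitude projection: keep the phase, enforce the measured modulus
    far_field = measured_amp * np.exp(1j * np.angle(far_field))
    revised = np.fft.ifft2(far_field)
    diff = revised - exit_wave
    obj_new = obj_patch + beta_obj * np.conj(probe) / (np.abs(probe).max() ** 2) * diff
    probe_new = probe + beta_probe * np.conj(obj_patch) / (np.abs(obj_patch).max() ** 2) * diff
    return obj_new, probe_new
```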

Deep convolutional architectures such as PtychoNN (Cherukara et al., 2020) learn a direct mapping from a single (or batched) diffraction pattern to object amplitude/phase, enabling near-millisecond per-frame reconstructions for modest fields of view and tolerating aggressive sparse sampling (step size up to $5\times$ the beam width). These methods, while extremely fast once trained, are typically input–output (per-frame) predictors and lack explicit modeling of space-time evolution or compressed collective correlations across video sequences.

The neural-field, low-rank approach instead addresses the SBP–frame-rate trade-off by holistically exploiting temporal redundancy, transforming temporal variation from a constraint into a statistical prior.

5. Extensions, Generalizations, and Broader Impact

The described framework generalizes across:

  • Wavelength regimes: demonstrated at visible and EUV wavelengths, and applicable to X-ray and electron ptychography, accommodating beam-source fluctuations, turbulence, and radiation damage by embedding temporal probe evolution within the neural field.
  • Three-dimensionality: multi-angle or multi-wavelength variants enable volumetric refractive-index reconstructions.
  • Throughput-critical applications: pharmaceutical screening (e.g., microneedle characterization), antibiotic resistance phenotyping (bacterial dry-mass), and dynamic live-cell assays benefit from single-camera, label-free, high-speed volumetric imaging.

The elimination of the linear trade-off between SBP and acquisition time renders gigapixel video tractable with a single sensor, without the need for multiplexed camera arrays or specialized optics.

In the broader context, the methodology of recasting high-dimensional, dynamic inversion problems in low-rank, coordinate-based neural field form—with physics-informed forward operators and customized loss functionals—offers a general recipe for computational imaging across dynamic, bandwidth-constrained settings.

6. Technical Considerations and Practical Limitations

Resource requirements scale with the latent basis rank $r$, spatial encoding depth, and temporal embedding resolution. For the full gigapixel video shown in (Wang et al., 8 Nov 2025), significant GPU memory (full precision) and high-throughput inference pipelines were required to maintain the 30 Hz frame rate.

Potential limitations include:

  • The need to tune $r$ and the embedding dimensions for each class of spatiotemporal complexity.
  • Sensitivity to abrupt, non-smooth, or highly non-stationary temporal events that are not amenable to low-rank modeling.
  • The need for motion-encoded or otherwise redundant measurement schemes when samples evolve negligibly.

Nonetheless, the framework proves robust across diverse biological, material, and optical platforms and supports extension to full 3D dynamic tomography and extreme spectral regimes.

7. Summary Table: Key Metrics for Video-Rate Ptychography Implementations

| System & Algorithm | SBP (gigapixels) | Frame Rate (Hz) | Field Size | Spatial Resolution | Reconstruction Latency |
|---|---|---|---|---|---|
| Neural-field (dual-MLP) (Wang et al., 8 Nov 2025) | >1 | 30+ | 1.5 cm² | 308 nm | Real-time (per frame) |
| Deep Conv. Net (PtychoNN) (Cherukara et al., 2020) | 0.1–0.3 | >1000* | 10–100 µm² | 60–600 nm | 1 ms/frame |
| Parallel FPM (Zhou et al., 2022) | 1 | 1* | 1 mm² | 400 nm | 5–8 s/frame |
| Prox. TV (PPTV) (Liu et al., 2023) | 0.1–0.5 | 30–100* | 0.25–1 mm² | 350–700 nm | 10–30 ms/frame* |

*Frame rates marked with * refer to throughput assuming ideal hardware scaling or batched inference, as detailed in the respective references.


In summary, video-rate ptychography, as defined by neural space–time field approaches, total-variation-regularized coded geometries, deep convolutional predictors, and optimized physical sensor designs, constitutes a fully practical, extensible strategy for high-throughput, lensless, dynamic complex-field imaging at gigapixel spatial–temporal bandwidths.
