
FlowLens: Extending FoV for Autonomous Systems

Updated 31 December 2025
  • FlowLens is a system architecture that reconstructs unseen regions beyond the physical field-of-view using sequential video inpainting techniques.
  • It integrates explicit optical flow-based feature propagation with clip-recurrent transformers to efficiently align features and fuse spatiotemporal context.
  • FlowLens improves high-level perception tasks, achieving state-of-the-art metrics in semantic segmentation and object detection for autonomous applications.

The FlowLens system is an architecture developed to extend scene visibility for autonomous vehicles and robotics by reconstructing regions beyond the physical field-of-view (FoV) of vision sensors through online video inpainting. By leveraging sequential video streams, FlowLens recovers critical contextual information in real-time, enhancing both low-level frame quality and high-level perception tasks such as semantic segmentation and object detection. FlowLens integrates explicit optical flow-based feature propagation with implicit clip-recurrent transformer mechanisms, enabling both efficient alignment and robust spatiotemporal context fusion. Its design addresses hardware and cost constraints in sensor deployment, providing practical solutions for safety-critical autonomous systems (Shi et al., 2022).

1. System Motivation and Architectural Overview

Modern autonomous platforms—vehicles, robots, and infrastructure—deploy cameras with restricted FoV due to physical and cost limitations, resulting in incomplete observation of dynamic environments. This constraint is particularly detrimental for safety-critical applications: traffic participants or regulatory signals may fall outside the sensor's boundary, impairing downstream perception modules.

FlowLens extends the effective visual coverage by “inpainting” unseen areas using temporal information available from past and current video frames. The architecture operates in an online, real-time setting and comprises four sequential stages:

Stage | Function | Inputs/Outputs
Convolutional Stem | Extracts shallow features from local frames (LF) and past reference frames | $f_i, f_j \in \mathbb{R}^{H/4 \times W/4 \times C}$
Explicit Flow-Guided Propagation | Aligns and warps LF features using completed optical flow | $\hat{V}_{i\to j}$, $\tilde f_i$
Clip-Recurrent Transformer | Fuses spatiotemporal features via the Clip-Recurrent Hub, DDCA, and MixF3N | $\hat{\mathbf{Z}}'_{i+1}$
Output Convolutions | Reconstructs full-FoV frames | $\hat{Y}^t$
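
The four stages map naturally onto a modular pipeline. Below is a minimal PyTorch-style sketch of this data flow, assuming the documented H/4 x W/4 stem resolution; all module names, channel widths, and the placeholder propagation/transformer blocks are illustrative and not the authors' implementation.

```python
# Minimal sketch of the four-stage FlowLens-style pipeline (hypothetical
# module names and sizes; the real propagation and transformer blocks are
# described in the following sections).
import torch
import torch.nn as nn

class FlowLensPipelineSketch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Stage 1: convolutional stem producing H/4 x W/4 features.
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Stages 2-3: placeholders for explicit flow-guided propagation and
        # the implicit clip-recurrent transformer.
        self.propagation = nn.Identity()
        self.transformer = nn.Identity()
        # Stage 4: output convolutions reconstructing full-FoV RGB frames.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) local frames plus past reference frames.
        b, t, _, h, w = clip.shape
        feats = self.stem(clip.flatten(0, 1))   # (B*T, C, H/4, W/4)
        feats = self.propagation(feats)         # explicit flow-based alignment
        feats = self.transformer(feats)         # implicit spatiotemporal fusion
        frames = self.decoder(feats)            # (B*T, 3, H, W)
        return frames.view(b, t, 3, h, w)

clip = torch.randn(1, 5, 3, 256, 256)
print(FlowLensPipelineSketch()(clip).shape)  # torch.Size([1, 5, 3, 256, 256])
```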

Motivated by the realization that historical video frames encode spatial cues about regions beyond the instantaneous FoV, FlowLens enhances not only visibility outside the FoV but also semantic context within the observed frame—contributing to improved segmentation and detection performance (Shi et al., 2022).

2. Clip-Recurrent Hub and 3D-Decoupled Cross Attention

Central to the implicit feature propagation of FlowLens, the Clip-Recurrent Hub caches projected keys/values from previous clips and enables the current clip’s queries to attend across temporal boundaries:

$$
\begin{aligned}
\bar{\mathbf{K}}^t_i, \bar{\mathbf{V}}^t_i &= \mathrm{SG}(\mathbf{K}^t_i, \mathbf{V}^t_i) \\
\bar{\mathbf{Z}}'_{i+1} &= \mathrm{DDCA}\big(\mathbf{Q}_{i+1}, \mathcal{P}_{kv}(\bar{\mathbf{K}}_i, \bar{\mathbf{V}}_i)\big) \\
\hat{\mathbf{Z}}'_{i+1} &= \mathbf{Z}'_{i+1} + \mathcal{P}_{fuse}\big[\bar{\mathbf{Z}}'_{i+1} \,\Vert\, \mathbf{Z}'_{i+1}\big]
\end{aligned}
$$
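
A compact sketch of this hub mechanism is shown below, assuming $\mathrm{SG}$ corresponds to detaching the cached keys/values from the gradient graph; the projection layers, dimensions, and the plain cross-attention used in place of DDCA are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of a clip-recurrent hub: cache K/V from the previous clip (detached),
# let the current clip's queries cross-attend to them, then fuse the result
# with the current features. Layer names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipRecurrentHubSketch(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.p_k = nn.Linear(dim, dim)         # re-projects cached keys  (P_kv)
        self.p_v = nn.Linear(dim, dim)         # re-projects cached values (P_kv)
        self.p_fuse = nn.Linear(2 * dim, dim)  # fuses cross-clip and current features
        self.cache = None                      # detached (K, V) from the previous clip

    def forward(self, q, k, v, z):
        # q, k, v: (B, N, dim) attention tensors of the current clip;
        # z: (B, N, dim) current-clip features to be enriched.
        out = z
        if self.cache is not None:
            k_prev, v_prev = self.cache
            k_bar, v_bar = self.p_k(k_prev), self.p_v(v_prev)
            # Stand-in for DDCA: plain scaled dot-product cross attention.
            z_bar = F.scaled_dot_product_attention(q, k_bar, v_bar)
            out = z + self.p_fuse(torch.cat([z_bar, z], dim=-1))
        # Cache the current clip's keys/values with stop-gradient (detach).
        self.cache = (k.detach(), v.detach())
        return out

hub = ClipRecurrentHubSketch()
tokens = torch.randn(2, 196, 128)
_ = hub(tokens, tokens, tokens, tokens)    # first clip: nothing cached yet
out = hub(tokens, tokens, tokens, tokens)  # second clip: attends to cached K/V
print(out.shape)  # torch.Size([2, 196, 128])
```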

The 3D-Decoupled Cross Attention (DDCA) mechanism partitions the attention operation into three orthogonal domains—temporal, horizontal, and vertical—via:

  • Temporal attention: $\mathrm{Attn}_t(Q_t, K_t, V_t)$ attends globally over time, capturing cross-frame context.
  • Horizontal strip attention: keys are partitioned into non-overlapping horizontal strips $K_h^l$ and pooled into global strip keys $K_h^g$, followed by strip-wise attention; vertical strip attention is analogous.
  • Projection and summation: outputs are concatenated and projected as $Z = \mathcal{P}_t(Z_t) + \mathcal{P}_{h,w}[Z_h \Vert Z_v]$.

DDCA improves spatiotemporal coherence, facilitating robust retrieval and fusion of cross-clip information, thus addressing misalignment and providing reliable cues for scene completion (Shi et al., 2022).
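
To make the decoupling concrete, the toy module below runs standard scaled dot-product attention separately along the temporal axis, horizontal strips, and vertical strips, and sums the projected outputs as in the equation above; the pooled global keys $K_h^g$, $K_v^g$ are omitted, and all shapes and strip sizes are illustrative (requires PyTorch 2.x for `scaled_dot_product_attention`).

```python
# Toy axis-decoupled attention over (B, T, H, W, C) tokens: temporal attention
# plus horizontal- and vertical-strip attention, fused by two projections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttentionSketch(nn.Module):
    def __init__(self, dim: int = 64, strip: int = 4):
        super().__init__()
        self.strip = strip
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.proj_t = nn.Linear(dim, dim)       # P_t
        self.proj_hw = nn.Linear(2 * dim, dim)  # P_{h,w}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h, w, c = x.shape
        s = self.strip
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        def attend(view):
            # `view` maps (B, T, H, W, C) to (groups, ..., tokens, C).
            return F.scaled_dot_product_attention(view(q), view(k), view(v))

        # Temporal: every spatial location attends across all T frames.
        t_view = lambda z: z.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        z_t = attend(t_view).reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)

        # Horizontal strips: rows grouped into bands of height `strip`.
        h_view = lambda z: z.reshape(b * t, h // s, s, w, c).flatten(2, 3)
        z_h = attend(h_view).reshape(b, t, h, w, c)

        # Vertical strips: columns grouped into bands of width `strip`.
        v_view = lambda z: (z.permute(0, 1, 3, 2, 4)
                             .reshape(b * t, w // s, s, h, c).flatten(2, 3))
        z_v = attend(v_view).reshape(b, t, w, h, c).permute(0, 1, 3, 2, 4)

        return self.proj_t(z_t) + self.proj_hw(torch.cat([z_h, z_v], dim=-1))

x = torch.randn(1, 4, 16, 16, 64)
print(DecoupledAttentionSketch()(x).shape)  # torch.Size([1, 4, 16, 16, 64])
```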

3. Mix Fusion Feed Forward Network (MixF3N)

Following DDCA, the Mix Fusion Feed Forward Network (MixF3N) refines local feature propagation by enforcing multi-scale context flow. Token blocks $A \in \mathbb{R}^{N \times d}$ are split channel-wise and processed by depthwise convolutions ($3 \times 3$ and $5 \times 5$), supporting the fine-grained detail restoration that is essential near FoV boundaries:

$$Z = \mathrm{MLP}\big(\mathrm{GELU}\big[\mathrm{SS}\big([\mathcal{C}_{3 \times 3}(A_{:d/2}),\, \mathcal{C}_{5 \times 5}(A_{d/2:})]\big)\big]\big)$$

$\mathrm{SS}$ denotes soft split (overlapping patch embedding), and $\mathrm{MLP}$ restores the channel dimension after the convolutions. This dual-branch fusion mechanism yields effective recovery of unseen spatial features, complementing the global context harvested by the transformer components (Shi et al., 2022).
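
The block below is a simplified sketch of such a dual-branch feed-forward network: tokens are reshaped into a feature map, split channel-wise, passed through $3 \times 3$ and $5 \times 5$ depthwise convolutions, and projected back. The soft-split step is approximated only by the kernel overlap and is otherwise omitted, and the layer names and widths are assumptions rather than the paper's configuration.

```python
# Simplified dual-branch depthwise-convolution FFN in the spirit of MixF3N:
# split channels, apply 3x3 and 5x5 depthwise convolutions, re-merge, then MLP.
# (The paper's soft split / overlapping patch embedding is not reproduced here.)
import torch
import torch.nn as nn

class MixFFNSketch(nn.Module):
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        half = hidden // 2
        self.fc1 = nn.Linear(dim, hidden)
        self.dw3 = nn.Conv2d(half, half, 3, padding=1, groups=half)  # 3x3 depthwise
        self.dw5 = nn.Conv2d(half, half, 5, padding=2, groups=half)  # 5x5 depthwise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)  # restores the channel dimension

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, dim) with N = h * w spatial tokens.
        b, n, _ = tokens.shape
        x = self.fc1(tokens)                                 # (B, N, hidden)
        x = x.transpose(1, 2).reshape(b, -1, h, w)           # tokens -> feature map
        lo, hi = x.chunk(2, dim=1)                           # channel-wise split
        x = torch.cat([self.dw3(lo), self.dw5(hi)], dim=1)   # multi-scale mixing
        x = x.reshape(b, -1, n).transpose(1, 2)              # feature map -> tokens
        return self.fc2(self.act(x))                         # (B, N, dim)

tokens = torch.randn(2, 16 * 16, 64)
print(MixFFNSketch()(tokens, 16, 16).shape)  # torch.Size([2, 256, 64])
```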

4. Explicit Optical Flow for Feature Propagation

FlowLens employs explicit optical flow computation and feature warping to align content across consecutive frames, critical for accurate scene reconstruction beyond the FoV:

  • Flow Completion: downsampled frames $d_4(X^i), d_4(X^j)$ are processed by the flow-completion network $\mathcal{F}$ to estimate $\hat{V}_{i\to j}$.
  • Warping: first-order warping $\tilde f_i(x) = f_j(x + \hat V_{i\to j}(x))$ aligns features using the predicted flow.
  • DCN Compensation: residual misalignment is mitigated by deformable convolution, whose offsets $o_{i\to j}$ and modulations $m_{i\to j}$ are derived from the flow and the features.
  • Second-order Propagation: features are fused via $1\times 1$ convolutions, yielding $\hat f_i$ with improved consistency.

Supervision for flow completion uses an $L_1$ loss $\mathcal{L}_{flow}$ against ground-truth flow fields, which is essential for minimizing propagation errors (Shi et al., 2022).
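
To illustrate the first-order warping step, the helper below performs generic flow-guided backward warping with bilinear sampling: features of frame $j$ are sampled at the flow-displaced coordinates $x + \hat V_{i\to j}(x)$. This is a standard implementation of flow warping, not the authors' exact code, and the deformable-convolution compensation stage is not included.

```python
# Generic flow-guided backward warping: sample f_j at x + V_hat(x) to obtain a
# feature map aligned with frame i (bilinear interpolation, zero padding).
import torch
import torch.nn.functional as F

def flow_warp(feat_j: torch.Tensor, flow_i_to_j: torch.Tensor) -> torch.Tensor:
    # feat_j: (B, C, H, W); flow_i_to_j: (B, 2, H, W) in pixels (x, then y).
    b, _, h, w = feat_j.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_j.device, dtype=feat_j.dtype),
        torch.arange(w, device=feat_j.device, dtype=feat_j.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow_i_to_j[:, 0]   # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow_i_to_j[:, 1]
    # Normalize sampling coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        [2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0], dim=-1
    )
    return F.grid_sample(feat_j, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

feat_j = torch.randn(1, 64, 64, 64)
flow = torch.randn(1, 2, 64, 64)
print(flow_warp(feat_j, flow).shape)  # torch.Size([1, 64, 64, 64])
```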

5. Training Objectives and FoV Mask Supervision

FlowLens optimizes three core objectives: reconstruction, adversarial, and flow losses:

$$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{flow}\,\mathcal{L}_{flow}$$

with weights $\lambda_{rec}=0.01$, $\lambda_{adv}=0.01$, and $\lambda_{flow}=1$.

  • Reconstruction loss ($\mathcal{L}_{rec}$): $L_1$ error between predicted and ground-truth full-FoV frames.
  • Adversarial loss ($\mathcal{L}_{adv}$): employs a video GAN discriminator $D$ to penalize perceptual discrepancies.
  • FoV-mask supervision: binary masks $M\in\{0,1\}^{H\times W}$ indicate missing or out-of-FoV regions; masks are sampled at expansion rates of 5%, 10%, and 20% for both inner- and outer-FoV scenarios. Supervision matches predicted completions to the ground truth across both visible and reconstructed regions (Shi et al., 2022).
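
A minimal sketch of combining these objectives with the reported weights is given below; `pred`, `target`, `flow_pred`, `flow_gt`, and `disc_fake_logits` are hypothetical tensors, and the adversarial term is reduced to a simple non-saturating generator loss rather than the paper's full video GAN formulation.

```python
# Minimal sketch of the weighted objective
# L = 0.01 * L_rec + 0.01 * L_adv + 1.0 * L_flow (weights as reported above).
import torch
import torch.nn.functional as F

def flowlens_style_loss(pred, target, flow_pred, flow_gt, disc_fake_logits):
    # Reconstruction: L1 error between predicted and ground-truth full-FoV frames.
    l_rec = F.l1_loss(pred, target)
    # Adversarial: non-saturating generator loss on discriminator logits
    # for the completed video (stand-in for the video GAN discriminator D).
    l_adv = F.softplus(-disc_fake_logits).mean()
    # Flow completion: L1 error against ground-truth flow fields.
    l_flow = F.l1_loss(flow_pred, flow_gt)
    return 0.01 * l_rec + 0.01 * l_adv + 1.0 * l_flow

loss = flowlens_style_loss(
    torch.rand(1, 5, 3, 64, 64), torch.rand(1, 5, 3, 64, 64),
    torch.randn(1, 2, 64, 64), torch.randn(1, 2, 64, 64),
    torch.randn(1, 5, 1),
)
print(float(loss))
```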

6. KITTI360 FoV-Masked Dataset and Benchmark Protocols

A novel dataset derived from KITTI360 enables rigorous benchmarking for beyond-FoV inpainting and perception tasks:

  • Data composition: 76,000 pinhole and 76,000 spherical frames with calibrated intrinsics.
  • FoV Masking Models: outer-FoV masking for pinhole images uses the $f\text{-}\tan\theta$ projection model; inner-FoV masking for spherical images uses the $f\text{-}\theta$ model.
  • Expansion rates: 5%, 10%, 20% augmentation.
  • Testing protocol: Sequence “seq10” uses only historical frames, emulating real-time application conditions.
  • Annotations: Ground-truth full-FoV frames and pseudo-labels (via pretrained SegFormer) for semantic segmentation over reconstructed regions (Shi et al., 2022).
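
The two masking models correspond to the standard pinhole projection ($r = f\tan\theta$) and the equidistant (spherical) projection ($r = f\theta$). As a general illustration of these camera models, not of the dataset's masking code, the snippet below compares the image-plane radius each model assigns to the same incidence angle with an arbitrary example focal length.

```python
# Image-plane radius of a ray at incidence angle theta under the two projection
# models referenced above: pinhole (r = f * tan(theta)) and equidistant
# (r = f * theta). The focal length here is an arbitrary example value.
import math

def radius_pinhole(f_px: float, theta_rad: float) -> float:
    return f_px * math.tan(theta_rad)

def radius_equidistant(f_px: float, theta_rad: float) -> float:
    return f_px * theta_rad

f_px = 500.0  # example focal length in pixels, not KITTI360's calibration
for deg in (20, 40, 60, 80):
    th = math.radians(deg)
    print(f"{deg:2d} deg   pinhole r = {radius_pinhole(f_px, th):7.1f} px   "
          f"equidistant r = {radius_equidistant(f_px, th):6.1f} px")
```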

7. Quantitative Evaluation and Perception Enhancement

FlowLens demonstrates state-of-the-art performance in both holistic frame inpainting and high-level perception metrics:

  • Video Inpainting: On KITTI360, outer expansion yields PSNR = 20.13 dB (vs. 19.45 dB for E2FGVI); SSIM = 0.9314 (vs. 0.9229). Inner expansion: PSNR = 36.69 dB, SSIM = 0.9916. On YouTube-VOS (offline), PSNR = 33.89 dB and SSIM = 0.9722.
  • Semantic Segmentation: for pinhole images, unseen-region mIoU improves from 26.05% (baseline) and 40.89% (E2FGVI) to 45.04% with FlowLens, with a further +2.3% gain even within the original FoV.
  • Object Detection: FlowLens recovers out-of-view vehicles, cyclists, and traffic signs (using Faster R-CNN), extending perception boundaries with enhanced accuracy.

Metrics employed include PSNR, SSIM, VFID (Video Fréchet Inception Distance), and the flow warping error $E_{warp}$, the last reflecting temporal consistency (Shi et al., 2022).
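
As a reference for how two of these metrics are typically computed, the sketch below implements PSNR and a simplified flow warping error (mean absolute residual between a frame and its flow-warped neighbor); the occlusion masking that $E_{warp}$ normally includes is omitted, and the warping itself can be performed with the `flow_warp` helper sketched in Section 4.

```python
# Reference-style PSNR and a simplified flow warping error E_warp
# (occlusion masking omitted for brevity).
import torch
import torch.nn.functional as F

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    mse = F.mse_loss(pred, target)
    return float(10.0 * torch.log10(torch.tensor(max_val ** 2) / mse))

def warping_error(frame_t: torch.Tensor, warped_next: torch.Tensor) -> float:
    # Mean absolute residual between frame t and frame t+1 warped back to t.
    return float((frame_t - warped_next).abs().mean())

clean = torch.rand(1, 3, 64, 64)
noisy = (clean + 0.01 * torch.randn_like(clean)).clamp(0, 1)
print(psnr(noisy, clean))             # roughly 40 dB for 1% noise
print(warping_error(clean, noisy))
```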

A plausible implication is that integrating FlowLens-like architectures in real-world systems could lead to expanded perceptual coverage and safer operational envelopes in autonomous navigation and urban robotics.
