- The paper introduces EmambaIR, a visual state space model that leverages TSAM and GSSM to overcome CNN and ViT limitations in event-guided image reconstruction.
- It implements a Top-k Sparse Attention Module to selectively fuse cross-modal features, significantly reducing memory usage and computational FLOPs.
- Experimental results show superior PSNR and SSIM across deblurring, deraining, and HDR tasks, while maintaining a lower model size and higher throughput.
EmambaIR: Efficient Visual State Space Model for Event-Guided Image Reconstruction
Introduction
The paper presents EmambaIR, an efficient visual state space model for event-guided image reconstruction, addressing limitations of CNNs and ViTs in processing event data for challenging image restoration tasks. Traditional CNN-based methods struggle with long-range dependencies due to restricted receptive fields, leading to noise sensitivity and local ambiguities. Although ViT-based approaches improve global feature aggregation, their quadratic complexity (O(n^2)) is prohibitive for high-resolution scenarios. The proposed EmambaIR leverages state space models (SSMs) to provide both global context modeling and linear computational complexity, tailored for the spatial sparsity and temporal continuity characteristic of event streams.
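To make the scaling gap concrete, the sketch below counts the query-key pairs scored by dense self-attention against the recurrence steps of a single linear-time scan over the same flattened image. The function names and the 256x256 resolution are illustrative assumptions and do not come from the paper.

```python
def dense_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs scored by dense self-attention over n = H*W tokens."""
    n = height * width
    return n * n


def linear_scan_steps(height: int, width: int) -> int:
    """Recurrence steps taken by one linear-time scan over n = H*W tokens."""
    return height * width


if __name__ == "__main__":
    h, w = 256, 256  # hypothetical resolution, chosen only for illustration
    pairs = dense_attention_pairs(h, w)
    steps = linear_scan_steps(h, w)
    print(f"dense attention pairs: {pairs}")
    print(f"linear scan steps:     {steps}")
    print(f"ratio:                 {pairs // steps}")
```

The ratio between the two counts is n itself, so the gap widens in direct proportion to the number of pixels.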
Architectural Contributions
EmambaIR features two principal modules: a Top-k Sparse Attention Module (TSAM) and a Gated State-Space Module (GSSM). The TSAM applies top-k selection attention to efficiently and selectively fuse cross-modal (event and image) features at the pixel level, suppressing noise and computational redundancy by focusing on only the k most relevant event-image correspondences per location. The GSSM extends the vanilla visual SSM (VSSM) with a nonlinear gated unit, enhancing the temporal representation and enabling robust global context modeling with O(n) complexity.
The overall architecture adopts a UNet-based hierarchical encoder-decoder backbone, integrating RLFBs for local feature aggregation and skip connections to facilitate multi-scale information propagation. This design supports efficient high-resolution processing and explicit exploitation of event data during reconstruction.
Top-k Sparse Attention Module (TSAM)
TSAM operationalizes sparse cross-modal feature fusion by computing pixel-wise correlations between image queries and event keys, dynamically selecting the top k interactions per query via a cosine similarity metric. All non-top-k entries in the attention matrix are zeroed out, and sparse matrix multiplication with the event value tensor is performed, dramatically reducing both memory usage and computational FLOPs without significant accuracy loss. The value of k is a tunable hyperparameter, balancing accuracy and efficiency. Ablation demonstrates that even with small k, TSAM yields competitive PSNR and SSIM while maintaining substantial speed and memory advantages.
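The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, features are small Python lists rather than tensors, and the softmax over the kept scores is an assumption, since the summary only states that non-top-k entries are zeroed before multiplying with the event values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def topk_sparse_attention(queries, keys, values, k):
    """For each image query, attend only to the k most similar event keys."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dim = len(values[0])
    fused = []
    for q in queries:
        scores = [cosine(q, key) for key in keys]
        # Indices of the k largest scores; all other entries count as zero.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Assumption: normalize the kept scores with a softmax.
        top = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - top) for j in keep}
        total = sum(weights.values())
        # Sparse product: only the k kept event values contribute.
        fused.append(
            [sum(weights[j] / total * values[j][d] for j in keep) for d in range(dim)]
        )
    return fused


if __name__ == "__main__":
    image_queries = [[1.0, 0.0]]
    event_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    event_values = [[10.0], [20.0], [30.0]]
    print(topk_sparse_attention(image_queries, event_keys, event_values, k=1))
    print(topk_sparse_attention(image_queries, event_keys, event_values, k=3))
```

Setting k equal to the number of keys recovers dense attention in this sketch, which is one way to see k as the accuracy-efficiency dial mentioned above.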
Gated State-Space Module (GSSM)
The GSSM employs channel-wise gating with a nonlinear activation (GELU) to enhance the standard SSM, which is grounded in continuous-time linear systems theory. The added nonlinearity matters for capturing the non-stationary temporal dynamics that arise from irregular event generation and varying real-world conditions. By aggregating multi-scale context and normalizing via global average pooling, the gating operation enables the robust long-range dependency modeling needed for high-fidelity restoration under rapid motion, low light, or adverse atmospheric conditions.
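As a rough illustration of the idea, the sketch below pairs a scalar discrete-time linear recurrence, h_t = a*h_{t-1} + b*x_t with y_t = c*h_t, with a GELU gate computed from the input. Everything here is a simplifying assumption made for illustration: the scalar parameters a, b, c, and w_gate are hypothetical, the paper's multi-scale aggregation and global average pooling are omitted, and the real module operates on multi-channel 2D feature maps.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ssm_scan(xs, a, b, c):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, in one O(n) pass."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def gated_ssm(xs, a=0.5, b=1.0, c=1.0, w_gate=1.0):
    """Multiply the linear SSM output by a GELU gate computed from the input."""
    ys = ssm_scan(xs, a, b, c)
    return [y * gelu(w_gate * x) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    sequence = [1.0, 0.0, 0.0, 2.0]  # toy 1D signal
    print(ssm_scan(sequence, a=0.5, b=1.0, c=1.0))
    print(gated_ssm(sequence))
```

The scan visits each element once, which is where the O(n) cost comes from; the gate is the only nonlinear step, so removing it leaves a purely linear system.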
Experimental Results
EmambaIR is evaluated on three core event-guided reconstruction tasks: motion deblurring (GoPro, H2D), deraining (Adobe240), and HDR enhancement (SDSD), with both synthetic and real-world event streams. Across all tasks and datasets, EmambaIR attains the highest PSNR and SSIM, clearly surpassing both state-of-the-art image-only and existing event-guided methods.
Quantitatively, EmambaIR achieves:
- Motion Deblurring (GoPro): PSNR 35.74 dB, SSIM 0.9735 (improving on EFNet by +0.28 dB PSNR, +0.015 SSIM)
- HDR Enhancement (SDSD): PSNR 24.15 dB, SSIM 0.8164 (outperforming Evlight and RetinexFormer)
- Deraining (Adobe240): PSNR 34.63 dB, SSIM 0.9027 (first event-guided deraining method; +0.84 dB PSNR over best baseline)
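To put the dB gains above in perspective, the following sketch implements the standard PSNR definition, 10*log10(MAX^2 / MSE), and converts a PSNR gain into the implied reduction in mean squared error. The function names are hypothetical, and the 8-bit peak value of 255 is an assumption; the summary does not state which pixel range the paper uses.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    for gain in (0.28, 0.84):  # PSNR gains quoted in the list above
        reduction = 1.0 - mse_ratio_for_gain(gain)
        print(f"+{gain} dB PSNR -> {reduction:.1%} lower MSE")
```

Because PSNR is logarithmic, a +0.28 dB gain corresponds to roughly 6% lower MSE and a +0.84 dB gain to roughly 18% lower MSE.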
Resource-wise, EmambaIR has a model size of 6.25M parameters and a computational cost of 5.87G FLOPs per image, significantly lower than prominent attention-based and event-guided restoration networks. Training throughput is 20% higher than comparator methods on identical hardware.
Qualitative analysis reinforces these findings: EmambaIR demonstrates superior restoration of fine details, sharper edges, and suppression of artifacts, especially under severe blur, dense rainfall, or extreme lighting. Visualization of ablation further confirms that both TSAM and GSSM are critical: removal of either results in a marked drop in restoration quality, validating the synergy of the combined architectural approach.
Theoretical and Practical Implications
EmambaIR demonstrates that carefully designed state space models can subsume both the local inductive biases of CNNs and the global modeling power of attention, while scaling efficiently to high-resolution dense vision. The effective exploitation of both the sparsity and temporal continuity of event streams in TSAM and GSSM is a key innovation, enabling trade-offs between efficiency and reconstruction fidelity that previous architectures could not attain.
The formalism is general: TSAM's dynamic sparse attention extends to any cross-modal image reconstruction paradigm where auxiliary modalities are spatially sparse. GSSM's nonlinear gating naturally accommodates domains requiring temporal adaptation and non-stationarity robustness.
Future Directions
Given that the current work focuses on frame-based image reconstruction, a logical extension is to event-guided video restoration tasks (e.g., video deblurring, HDR video, frame interpolation), leveraging the same SSM-based benefits for long-sequence temporal modeling. Additionally, adaptation to other imaging modalities with sparse but informative auxiliary data streams, such as medical imaging, remote sensing, or multimodal robotic perception, would further validate and extend the framework's applicability.
From a theoretical perspective, exploring tighter integration with advanced SSM variants (e.g., bidirectional or hierarchical SSMs) or hybridizing with frequency-domain and equivariant attention formulations could further boost efficiency and generalization.
Conclusion
EmambaIR substantiates the effectiveness of SSM-driven architectures, specifically through TSAM and GSSM, for efficient and high-quality event-guided image reconstruction. By focusing on sparse pixel-level cross-modal interactions and robust nonlinear temporal modeling, the method delivers state-of-the-art quantitative and qualitative performance across multiple challenging reconstruction tasks, all while reducing memory footprint and compute cost. The framework fundamentally advances the practical feasibility of event-based vision in high-resolution scenarios, with clear paths toward broader applicability in future AI architectures.
Reference: "EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction" (2605.08073)