I2V-GAN: Unpaired Infrared-to-Visible Video Translation (2108.00913v2)

Published 2 Aug 2021 in cs.CV

Abstract: Human vision is often adversely affected by complex environmental factors, especially in night vision scenarios. Thus, infrared cameras are often leveraged to help enhance the visual effects via detecting infrared radiation in the surrounding environment, but the infrared videos are undesirable due to the lack of detailed semantic information. In such a case, an effective video-to-video translation method from the infrared domain to the visible light counterpart is strongly needed to overcome the intrinsic huge gap between the infrared and visible fields. To address this challenging problem, we propose an infrared-to-visible (I2V) video translation method, I2V-GAN, to generate fine-grained and spatial-temporal consistent visible light videos from given unpaired infrared videos. Technically, our model capitalizes on three types of constraints: 1) an adversarial constraint to generate synthetic frames that are similar to the real ones, 2) cyclic consistency with an introduced perceptual loss for effective content conversion as well as style preservation, and 3) similarity constraints across and within domains to enhance the content and motion consistency in both spatial and temporal spaces at a fine-grained level. Furthermore, the currently available public infrared and visible light datasets are mainly used for object detection or tracking, and some consist of discontinuous images which are not suitable for video tasks. Thus, we provide a new dataset for I2V video translation, named IRVI. Specifically, it has 12 consecutive video clips of vehicle and monitoring scenes, and both the infrared and visible light videos can be split into 24,352 frames. Comprehensive experiments validate that I2V-GAN is superior to the compared SOTA methods in the translation of I2V videos, with higher fluency and finer semantic details. The code and IRVI dataset are available at https://github.com/BIT-DA/I2V-GAN.

Authors (6)
  1. Shuang Li (203 papers)
  2. Bingfeng Han (4 papers)
  3. Zhenjie Yu (2 papers)
  4. Chi Harold Liu (43 papers)
  5. Kai Chen (512 papers)
  6. Shuigen Wang (4 papers)
Citations (38)

Summary

  • The paper introduces I2V-GAN, which leverages novel constraint mechanisms to achieve spatio-temporal consistency in translating infrared videos to the visible domain.
  • It demonstrates superior performance with lower FID and higher PSNR and SSIM scores on the new IRVI dataset compared to existing models.
  • The approach offers practical benefits for autonomous driving and security surveillance and sets a foundation for future multi-domain video translation research.

An Expert Evaluation of "I2V-GAN: Unpaired Infrared-to-Visible Video Translation"

The paper "I2V-GAN: Unpaired Infrared-to-Visible Video Translation" introduces a novel approach for unpaired video translation between infrared (IR) and visible light (VI) domains, leveraging a model termed I2V-GAN. The proposed method aims to generate spatial-temporal consistent visible light videos from infrared videos, addressing the inherent challenges posed by the semantic information gap between these two domains. This research holds significant value in fields such as autonomous driving and security surveillance, where visibility in various light conditions is crucial.

The novelty of I2V-GAN lies in its integration of three constraint mechanisms to ensure effective video translation: adversarial constraint, cyclic consistency with an added perceptual loss, and similarity constraints both within and across domains. These constraints are designed to enhance the content and motion consistency and ensure fine-grained video synthesis.
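
To make the interplay of these terms concrete, the following is a minimal, hypothetical sketch of how such a generator objective could be assembled. The loss weights, tensor names, and specific loss forms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
l1 = nn.L1Loss()

def generator_objective(d_fake_logits, real_ir, reconstructed_ir,
                        perceptual_feats_real, perceptual_feats_fake,
                        similarity_loss, w_cyc=10.0, w_perc=1.0, w_sim=1.0):
    # 1) Adversarial constraint: push generated frames toward the real manifold
    #    (least-squares GAN form, one common choice).
    adv = mse(d_fake_logits, torch.ones_like(d_fake_logits))

    # 2) Cyclic consistency plus perceptual loss: an IR frame translated to VI
    #    and back should match the input pixel-wise and in feature space.
    cyc = l1(reconstructed_ir, real_ir)
    perc = sum(l1(fr, ff) for fr, ff in zip(perceptual_feats_real,
                                            perceptual_feats_fake))

    # 3) Similarity constraints across/within domains (e.g. an InfoNCE-style
    #    patch loss), computed elsewhere and passed in as a scalar tensor.
    return adv + w_cyc * cyc + w_perc * perc + w_sim * similarity_loss
```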

Numerical Results and Dataset

The authors introduce a new dataset, IRVI, for infrared-to-visible video translation, comprising 12 consecutive video clips with both IR and VI videos divided into 24,352 frames. I2V-GAN is evaluated against state-of-the-art methods on this dataset and shows superior translation fluency and semantic detail preservation. Performance is benchmarked with Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM): I2V-GAN achieves lower FID scores, indicating that its synthesized videos lie closer to the distribution of real VI videos and are therefore more realistic, while its higher PSNR and SSIM values underscore its robustness in preserving video quality and structural integrity.
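
As a point of reference, PSNR and SSIM can be computed per frame pair and averaged over a clip; the sketch below uses scikit-image and assumes aligned uint8 RGB frames. FID, being a distribution-level metric over deep Inception features, is omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frames(real_frames, fake_frames):
    """real_frames, fake_frames: equal-length lists of HxWx3 uint8 arrays."""
    psnrs, ssims = [], []
    for real, fake in zip(real_frames, fake_frames):
        psnrs.append(peak_signal_noise_ratio(real, fake, data_range=255))
        ssims.append(structural_similarity(real, fake, channel_axis=-1,
                                           data_range=255))
    # Report clip-level averages of the per-frame scores.
    return float(np.mean(psnrs)), float(np.mean(ssims))
```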

Technical Approach

The I2V-GAN architecture builds on generative adversarial networks (GANs) and extends the conventional cycle-consistency loss with a perceptual loss that strengthens content and style translation accuracy. The perceptual loss compares features extracted by a VGG network, helping to preserve content faithfully during conversion. Additionally, a similarity loss framework with external (cross-domain) and internal (within-domain) variants reinforces spatial-temporal coherence within and across synthesized video frames. These similarity losses use the InfoNCE mechanism to maximize mutual information between corresponding patches of input and generated frames, which mitigates typical artifacts such as flickering.
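
The following is a minimal sketch of an InfoNCE-style patch loss in the spirit described above: each generated-frame patch embedding is pulled toward the co-located source patch (positive) and pushed away from other patches (negatives). The embedding extraction, tensor shapes, and temperature are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def patch_infonce(query, positive, negatives, temperature=0.07):
    """query, positive: (N, D) patch embeddings; negatives: (N, K, D)."""
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity of each query to its positive and to K negatives.
    pos_logit = (query * positive).sum(dim=-1, keepdim=True)            # (N, 1)
    neg_logits = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)  # (N, K)

    # Cross-entropy with the positive at index 0 acts as a lower-bound
    # surrogate for mutual information between corresponding patches.
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```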

Implications and Future Directions

Practically, the advancements in I2V-GAN enhance capabilities in critical vision-based applications under adverse conditions, providing an invaluable tool in scenarios where traditional imaging fails. Theoretically, this work enriches the domain of video-to-video translation, suggesting pathways for future research involving other domain translations or the inclusion of multi-modal inputs.

Looking ahead, potential developments may involve optimizing model architectures for real-time applications, as the current approach, while promising in accuracy, may encounter computational limitations. Moreover, exploring broader applications beyond IR-VI translation, such as in medical imaging or augmented reality, could be promising ventures leveraging the foundational advances presented in this work.

In conclusion, the paper presents a significant contribution to the multimedia and computer vision community by addressing the critical challenge of translating unpaired infrared videos to the visible spectrum while preserving essential video characteristics. The proposed I2V-GAN model with its intricate constraint mechanisms sets a new standard and opens avenues for further innovation in video translation tasks.
