Active RGB-NIR Imaging
- Active RGB-NIR Imaging is a method that combines visible and near-infrared sensing with active illumination to capture complementary color and structural information.
- It employs diverse architectures—from pixel-aligned stereo systems to single-sensor setups—to address low-light enhancement, depth estimation, and material analysis.
- Recent research focuses on advanced fusion techniques, spectral translation models, and precise calibration to overcome cross-spectral inconsistencies.
Active RGB-NIR imaging denotes imaging systems that combine visible RGB sensing with near-infrared sensing under controlled acquisition, often with active NIR illumination or actively controlled scene illumination, in order to exploit the complementarity between visible-spectrum color information and NIR-derived structure, shading, reflectance, or spectral cues. The term covers several distinct but related regimes: pixel-aligned RGB-NIR cameras with 850 nm illuminators for robot vision, gated RGB-NIR stereo with 808 nm flood illumination, single-sensor VIS-NIR-mix capture without an IR-cut filter, actively illuminated RGB plus NIR flash inverse-rendering platforms, and jointly acquired RGB with NIR hyperspectral imaging for scientific phenotyping (Kim et al., 2024, Brucker et al., 2024, Lv et al., 2020, Chung et al., 28 May 2026, Engstrøm et al., 23 Apr 2025). Across these regimes, the central premise is stable: RGB retains visible appearance and chromatic semantics, while NIR can provide higher SNR in darkness, more stable active illumination, or wavelength-specific material response that is either unavailable or difficult to isolate in passive RGB alone.
1. Spectral scope and physical rationale
The NIR component in active RGB-NIR imaging is not tied to a single band or sensor model. Reported systems include the NIR band of a dual-CCD camera in road-scene colorization, 850 nm in robot vision with the Advanced Illumination AL295-150850IC illuminator, approximately 808 nm flood illumination in cross-spectral gated stereo, the NIR band used for NIR-to-RGB spectral domain translation, and the spectral range of the Specim FX17 line-scan hyperspectral camera (Limmer et al., 2016, Kim et al., 2024, Brucker et al., 2024, Yang et al., 2023, Engstrøm et al., 23 Apr 2025). This diversity reflects different objectives: low-light structure sensing, depth from active gating, spectral translation, and chemometric analysis are not constrained to the same optical band.
The underlying motivation is likewise task-dependent. In low-light fusion and denoising, NIR images under invisible near-infrared flash or built-in NIR illumination retain edges and shading when RGB is dominated by Poisson-Gaussian noise or high-ISO degradation (Jin et al., 2023, Xu et al., 2024). In inverse rendering, the key advantage is that the NIR flash lies outside the RGB sensor’s spectral passband and is imperceptible to humans, so the NIR channel can be dominated by a controlled point-light term while RGB remains passive and records ambient visible appearance (Chung et al., 28 May 2026). In 24-hour colorful imaging with a single silicon sensor, removing the IR-cut filter exposes the Bayer channels to both VIS and NIR, creating mixed measurements that can be computationally separated and then re-integrated using NIR-guided enhancement (Lv et al., 2020). In agricultural phenotyping, concurrent RGB and NIR-HSI acquisition under the same illumination enables morphology, reflectance, and pseudo-absorbance to be analyzed jointly over time (Engstrøm et al., 23 Apr 2025).
Several papers make explicit that the complementarity is conditional rather than absolute. NIR can be structurally informative but spectrally inconsistent with visible RGB, because materials may show different contrast, shadows, or reflectance in the two domains (Jin et al., 2023, Li et al., 2024, Xu et al., 2024). This directly challenges any assumption that active RGB-NIR imaging is merely “RGB plus a cleaner grayscale channel.” The literature instead treats cross-spectral inconsistency as a first-order modeling problem.
2. Acquisition architectures, synchronization, and calibration
Representative active RGB-NIR systems differ sharply in acquisition geometry. Pixel-aligned designs use prism-based or dichroic beam-splitter cameras so that RGB and NIR share an optical path and are co-axial at the sensor level. The robot-vision system with two JAI FS‑1600D‑10GE RGB-NIR cameras forms a stereo pair, each camera being paired with an NIR illuminator, and supplements the imaging stack with an Ouster OS1 LiDAR and PTP synchronization (Kim et al., 2024). The inverse-rendering platform similarly uses a prism-based RGB-NIR camera, JAI FS-1600, synchronized with an Advanced Illumination AL295 NIR flash and mounted on a robotic arm atop a wheeled mobile base, thereby enabling dense multi-view RGB-NIR capture with flash-on/flash-off subtraction (Chung et al., 28 May 2026).
Other systems are explicitly non-prismatic but still actively controlled. The barley germination dataset uses a Basler Ace 2 Pro RGB line-scan camera and a Specim FX17 line-scan hyperspectral camera, both operating simultaneously while a conveyor belt moves a Petri dish plate carrying chessboards and PTFE white references under six 20 W halogen bulbs (Engstrøm et al., 23 Apr 2025). The hardware is active in the sense that both illumination and geometry are controlled, and RGB and NIR-HSI are acquired simultaneously under the same illumination. In contrast, Real-NAID uses a Huawei X2381‑VG surveillance camera with a built-in NIR illuminator; captures are sequential and restricted to static scenes so that RGB and NIR remain aligned without explicit geometric calibration (Xu et al., 2024). The single-sensor VIS-NIR-MIX system uses a Bitran CS-63C camera without an IR-cut filter and a motorized filter wheel for dataset construction, with a xenon lamp and an 880 nm band-pass filter used to emulate active NIR LED illumination at night (Lv et al., 2020).
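Radiometric standardization is central to these systems. In the barley setup, white and dark correction are applied row-wise and per-channel via
$$R = \frac{I_{\text{raw}} - D}{W - D},$$
where $I_{\text{raw}}$ is the raw measurement, $W$ the white reference, and $D$ the dark reference, with $D = 0$ for RGB and a shutter-closed dark reference for NIR-HSI; PTFE foil provides the white reference, and size correction is then performed by bilinear interpolation using chessboard geometry (Engstrøm et al., 23 Apr 2025). In robot vision, the per-channel image-formation model explicitly includes active illumination only in NIR: the active-illumination term is zero for the RGB channels and is supplied by the 850 nm LED for the NIR channel (Kim et al., 2024). In inverse rendering, the NIR flash-only radiance is isolated by
$$L^{\text{NIR}}_{\text{flash}} = L^{\text{NIR}}_{\text{on}} - L^{\text{NIR}}_{\text{amb}},$$
the difference between the flash-on and ambient-only NIR frames, which removes ambient NIR and yields a point-light shading term suitable for BRDF and geometry estimation (Chung et al., 28 May 2026).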
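The two radiometric steps described above, white/dark correction and flash-on minus ambient subtraction, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the cited authors' code: the function names, the `(rows, cols, channels)` layout, the per-column reference lines, the `eps` guard, and the clamping of negative differences are all hypothetical choices.

```python
import numpy as np


def white_dark_correct(raw, white, dark=None, eps=1e-6):
    """Reflectance-style correction (raw - dark) / (white - dark).

    raw:   (rows, cols, channels) line-scan image.
    white: (cols, channels) white-reference line (e.g. averaged PTFE scan lines).
    dark:  (cols, channels) dark-reference line, or None to assume a zero dark level.
    """
    raw = np.asarray(raw, dtype=np.float64)
    white = np.asarray(white, dtype=np.float64)
    if dark is None:
        dark = np.zeros_like(white)  # assumption: no dark frame available
    else:
        dark = np.asarray(dark, dtype=np.float64)
    denom = np.maximum(white - dark, eps)  # guard against division by zero
    # Broadcasting applies the same reference line to every image row.
    return (raw - dark) / denom


def flash_only(nir_flash_on, nir_ambient):
    """Isolate the flash contribution by subtracting the ambient-only NIR frame."""
    diff = (np.asarray(nir_flash_on, dtype=np.float64)
            - np.asarray(nir_ambient, dtype=np.float64))
    return np.clip(diff, 0.0, None)  # clamp negatives caused by sensor noise
```

Because the reference lines are broadcast over rows, the same correction applies to every scan line, which matches the row-wise, per-channel description; a full-frame reference would only require passing arrays of shape `(rows, cols, channels)` instead.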
Geometric calibration ranges from identity alignment in prism cameras to explicit affine or projective registration in heterogeneous systems. The barley dataset estimates RGB-to-RGB alignment across sessions with ArUco-based RANSAC affine transformation and RGB-to-HSI alignment with chessboard centers, enabling the same physical kernel to be cropped in both modalities at the same time point (Engstrøm et al., 23 Apr 2025). Cross-modal registration more generally is benchmarked by RGB-NIR-IRegis, where aligned RGB-NIR pairs are available through a monocular RGB-NIR camera and cross-view homographies are annotated for unaligned pairs, with evaluation based on corner reprojection error and AUC@3px, AUC@5px, and AUC@10px (Li et al., 2024).
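The registration metrics named above can be illustrated with a short NumPy sketch. It assumes one common convention: corner error is the mean distance between the four image corners warped by the estimated and ground-truth homographies, and AUC@k is the area under the cumulative error curve up to k pixels, normalised by k. The function names and this exact AUC convention are assumptions, not taken from the RGB-NIR-IRegis evaluation code.

```python
import numpy as np


def corner_reprojection_error(H_est, H_gt, width, height):
    """Mean distance (pixels) between image corners warped by H_est and H_gt."""
    corners = np.array(
        [[0, 0, 1], [width - 1, 0, 1], [width - 1, height - 1, 1], [0, height - 1, 1]],
        dtype=np.float64,
    ).T  # shape (3, 4), homogeneous coordinates

    def warp(H):
        p = np.asarray(H, dtype=np.float64) @ corners
        return (p[:2] / p[2]).T  # back to inhomogeneous (4, 2)

    return float(np.mean(np.linalg.norm(warp(H_est) - warp(H_gt), axis=1)))


def auc_at(errors, threshold):
    """Area under the cumulative error curve up to `threshold`, scaled to [0, 1]."""
    errs = np.sort(np.asarray(errors, dtype=np.float64))
    recall = np.arange(1, len(errs) + 1) / len(errs)
    errs = np.concatenate(([0.0], errs))
    recall = np.concatenate(([0.0], recall))
    last = np.searchsorted(errs, threshold)
    x = np.concatenate((errs[:last], [threshold]))
    y = np.concatenate((recall[:last], [recall[last - 1]]))
    # Trapezoidal integration written out explicitly.
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return float(area / threshold)
```

A benchmark summary would then call `auc_at(errors, 3)`, `auc_at(errors, 5)`, and `auc_at(errors, 10)` on the per-pair corner errors.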
3. Core computational paradigms
A dominant paradigm is RGB-NIR fusion under explicit inconsistency modeling. “DarkVisionNet: Low-Light Imaging via RGB-NIR Fusion with Deep Inconsistency Prior” formulates fusion around deep structures extracted from RGB and NIR feature spaces and a Deep Inconsistency Prior computed from those structures, so that inconsistent NIR structures are suppressed during fusion (Jin et al., 2023). “NIR-Assisted Image Denoising: A Selective Fusion Approach and A Real-World Benchmark Dataset” instead introduces a Selective Fusion Module that factorizes the NIR/RGB fusion weights into global and local components and performs complementary softmax gating for both modalities before reconstruction (Xu et al., 2024). Both approaches directly reject the naïve assumption that NIR structure should always be injected into RGB.
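The idea of complementary gating can be made concrete with a small NumPy sketch. It is an illustration only and not the published Selective Fusion Module: the additive combination of per-channel (global) and per-pixel (local) logits, the tensor shapes, and the function name are assumptions. The only property it is meant to show is that a softmax over the two modalities yields weights that sum to one at every position, so down-weighting NIR automatically up-weights RGB.

```python
import numpy as np


def selective_fuse(feat_rgb, feat_nir, local_rgb, local_nir, global_rgb, global_nir):
    """Fuse two feature maps with modality weights that sum to one everywhere.

    feat_*:   (channels, height, width) feature maps.
    local_*:  (1, height, width) per-pixel logits.
    global_*: (channels, 1, 1) per-channel logits.
    """
    # Assumed factorisation: adding logits multiplies the unnormalised weights.
    logit_rgb = np.asarray(local_rgb) + np.asarray(global_rgb)
    logit_nir = np.asarray(local_nir) + np.asarray(global_nir)
    logits = np.stack([logit_rgb, logit_nir])            # (2, C, H, W)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)        # softmax over modalities
    fused = weights[0] * feat_rgb + weights[1] * feat_nir
    return fused, weights
```

In a learned system the logits would be predicted by small networks from the features themselves; here they are plain inputs so that the gating arithmetic stays visible.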
A second paradigm is spectral-domain translation or colorization. “Multi-scale Progressive Feature Embedding for Accurate NIR-to-RGB Spectral Domain Translation” decomposes the problem into a NIR-to-grayscale domain-translation stage and a grayscale-to-RGB colorization stage, with multi-scale SCCM supervision and feature-level adversarial alignment, and reports that MPFNet outperforms state-of-the-art counterparts in PSNR on the VCIP2020 dataset (Yang et al., 2023). The earlier “Infrared Colorization Using Deep Convolutional Neural Networks” likewise treats NIR-to-RGB estimation as low-frequency RGB regression plus deterministic high-frequency NIR detail reinjection, using a multi-scale CNN with a low-frequency bypass and joint bilateral filtering on the output (Limmer et al., 2016). These methods are not fusion in the strict sense; they generate RGB-like imagery from NIR inputs and therefore inherit the one-to-many ambiguity of cross-spectral color assignment.
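The split between low-frequency color and high-frequency NIR detail can be sketched as follows. This is a simplified stand-in rather than the cited pipeline: a box blur replaces the joint bilateral filter, the low-frequency RGB estimate is taken as given rather than regressed by a CNN, and the function names, the `radius` parameter, and the `[0, 1]` value range are assumptions.

```python
import numpy as np


def box_blur(img, radius):
    """Separable box blur with edge replication; img has shape (H, W)."""
    k = 2 * radius + 1
    padded = np.pad(np.asarray(img, dtype=np.float64), radius, mode="edge")
    # Horizontal running mean via cumulative sums.
    c = np.cumsum(np.pad(padded, ((0, 0), (1, 0))), axis=1)
    rows = (c[:, k:] - c[:, :-k]) / k
    # Vertical running mean via cumulative sums.
    c = np.cumsum(np.pad(rows, ((1, 0), (0, 0))), axis=0)
    return (c[k:, :] - c[:-k, :]) / k


def reinject_detail(rgb_low, nir, radius=2):
    """Add NIR high-frequency detail to each channel of a low-frequency RGB estimate.

    rgb_low: (H, W, 3) smooth color estimate in [0, 1].
    nir:     (H, W) NIR image in [0, 1].
    """
    nir = np.asarray(nir, dtype=np.float64)
    detail = nir - box_blur(nir, radius)  # high-pass component of NIR
    out = np.asarray(rgb_low, dtype=np.float64) + detail[..., None]
    return np.clip(out, 0.0, 1.0)
```

Because the detail layer is added identically to all three channels, it changes luminance structure without altering the predicted chrominance, which is why the color ambiguity of the low-frequency estimate carries through unchanged.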
A third paradigm centers on geometric inference. In robot vision, pixel-aligned RGB-NIR stereo supports both image-level fusion compatible with RGB-pretrained models and feature-level fusion integrated into RAFT-Stereo, where alternating Fusion–NIR cost volumes produced the best depth RMSE relative to RGB-only, NIR-only, or simpler fusion baselines (Kim et al., 2024). In cross-spectral gated stereo, active NIR time-gated slices, passive high-resolution RCCB stereo, physics-based gated reconstruction, and LiDAR supervision are fused in a CREStereo-style architecture with pose refinement and attention-based cross-modal feature fusion; the method improves MAE over the next best method within the evaluated depth range (Brucker et al., 2024). In inverse rendering, a three-stage pipeline initializes geometry from ambient RGB using 2D Gaussian splatting, refines geometry and NIR BRDF under flash-only NIR, and then solves for RGB diffuse albedo and the RGB environment map with cross-spectral sharing of roughness and metallic parameters (Chung et al., 28 May 2026).
Cross-modality registration has become an independent front-end problem rather than a pre-processing detail. “Towards RGB-NIR Cross-modality Image Registration and Beyond” argues that inconsistent local features have a toxic impact on registration quality and proposes SGFormer, which injects high-level semantic guidance into a LoFTR-style matcher via a Semantic Injection Module and a Semantic Triplet Loss (Li et al., 2024). This suggests that active RGB-NIR imaging increasingly depends on semantics-aware correspondence, not only on photometric or descriptor-level similarity.
4. Datasets, benchmarks, and evaluation regimes
The dataset ecosystem is heterogeneous because active RGB-NIR imaging spans scientific imaging, surveillance, robot vision, inverse rendering, and low-light restoration. The barley germination dataset provides RGB images, NIR-HSI images, segmentation masks, full-dish imagery, grid coordinates, and mean pseudo-absorbance spectra for 2242 individual barley kernels, each imaged pre-moisture and then every 24 hours for five consecutive days, for a total of six sessions per kernel (Engstrøm et al., 23 Apr 2025). It is notable for simultaneous RGB and NIR-HSI acquisition, explicit segmentation via Otsu’s method, and time-series labeling of germination day.
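Otsu's method, mentioned above as the segmentation step, chooses the threshold that maximises the between-class variance of the intensity histogram. The sketch below is a generic NumPy implementation offered for illustration; the bin count, the function name, and the convention that pixels above the threshold are foreground are assumptions and do not describe the dataset's actual preprocessing code.

```python
import numpy as np


def otsu_threshold(gray, bins=256):
    """Return the intensity threshold that maximises between-class variance."""
    values = np.asarray(gray, dtype=np.float64).ravel()
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()                      # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                          # probability of the lower class
    mu = np.cumsum(p * centers)                # cumulative first moment
    mu_total = mu[-1]
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)                # skip splits with an empty class
    between = np.zeros_like(w0)
    between[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return float(centers[np.argmax(between)])
```

A mask is then obtained with `gray > otsu_threshold(gray)`; whether kernels are brighter or darker than the background depends on the imaging setup, so the comparison direction may need to be flipped.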
For low-light fusion, the Dark Vision Dataset (DVD) is described as the first public RGB-NIR fusion benchmark and contains 5k aligned RGB-NIR reference pairs cropped for training, 1k reference pairs for testing, and 10 additional real noisy pairs (Jin et al., 2023). Real-NAID complements this with real noisy RGB plus clean RGB and clean NIR guidance: 100 static scenes, each with three noisy RGB images, one clean RGB image, and one clean NIR image, for 300 RGB-NIR noisy/clean instances split into 70 training scenes and 30 test scenes (Xu et al., 2024). These two datasets occupy different positions in the literature: DVD emphasizes aligned RGB-NIR fusion under synthetic and real low-light noise, whereas Real-NAID emphasizes real surveillance-style capture with an active built-in NIR illuminator.
For robot vision, the pixel-aligned RGB-NIR stereo dataset includes 39 training videos with approximately 73,000 frames and 4 test videos with approximately 7,000 frames, for a total of approximately 80,000 frames across 43 scenes, with synchronized LiDAR and per-sensor exposure times (Kim et al., 2024). RGB-NIR-IRegis addresses registration rather than reconstruction, providing 25 scene sequences and 260 unaligned image pairs with viewpoint variation, plus aligned pairs from a monocular RGB-NIR camera, thereby enabling fair evaluation of cross-modality registration under both aligned and cross-view settings (Li et al., 2024). Cross-spectral gated stereo extends the evaluation regime to automotive-scale depth, using a test set of 2463 frames and a further 655 frames with accumulated LiDAR maps, each evaluated up to a fixed maximum distance (Brucker et al., 2024).
Inverse rendering introduces yet another benchmark structure. The active RGB-NIR inverse-rendering dataset is described as the first multi-view RGB-NIR inverse-rendering dataset captured across multiple ambient illumination conditions; each object–environment pair contains over 100 synchronized RGB, ambient NIR, and flash-on NIR frames, plus masks and camera poses (Chung et al., 28 May 2026). By contrast, the VIS-NIR-MIX dataset for 24-hour colorful imaging contains 102 scenes and 714 images, organized as seven captures per scene across day and night conditions, with filter-wheel-based alignment and long-exposure VIS references at night (Lv et al., 2020). The coexistence of such different benchmarks is itself informative: active RGB-NIR imaging has no single canonical protocol because the sensing objectives vary from reflectance inversion to robot depth to spectral chemometrics.
5. Major application domains
Low-light restoration is the most established application area. DarkVisionNet reports PSNR and SSIM on DVD at several noise levels, with the best PSNR/SSIM at the heavier noise levels, outperforming fusion and denoising baselines especially under heavy noise (Jin et al., 2023). Real-NAID shows that SFM-equipped denoisers improve most strongly at higher noise; for example, NIR‑Restormer improves in PSNR, SSIM, and LPIPS at the high-noise setting (Xu et al., 2024). The integrated single-sensor 24-hour imaging pipeline goes further by addressing both daytime NIR contamination and nighttime chrominance absence, and is evaluated by PSNR, SSIM, and Colorfulness on both the daytime and nighttime portions of its test set (Lv et al., 2020).
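To make the evaluation setting concrete, the following NumPy sketch simulates Poisson-Gaussian noise on a clean image and scores the result with PSNR. It is a generic illustration under assumptions: the `photons` and `read_sigma` parameters, their default values, and the `[0, 1]` intensity range are hypothetical and do not correspond to the noise levels used in the cited benchmarks.

```python
import numpy as np


def add_poisson_gaussian(clean, photons=50.0, read_sigma=0.02, rng=None):
    """Simulate signal-dependent shot noise plus signal-independent read noise.

    clean:      array with values in [0, 1].
    photons:    photon count corresponding to intensity 1.0 (lower = darker scene).
    read_sigma: standard deviation of the additive Gaussian read noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.clip(np.asarray(clean, dtype=np.float64), 0.0, 1.0)
    shot = rng.poisson(clean * photons) / photons
    noisy = shot + rng.normal(0.0, read_sigma, size=clean.shape)
    return np.clip(noisy, 0.0, 1.0)


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in decibels."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Lowering `photons` mimics a darker scene or shorter exposure: shot-noise variance grows relative to the signal, so the PSNR of the unprocessed input drops, which is the regime where NIR guidance is reported to help most.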
Depth estimation and robot perception constitute a second major domain. In pixel-aligned robot stereo, learned HSV-brightness fusion improves YOLOv8 object-detection mAP on fused images relative to RGB without retraining, while feature fusion improves stereo depth RMSE relative to RGB-only RAFT-Stereo (Kim et al., 2024). Cross-spectral gated stereo extends active RGB-NIR to long-range automotive depth, reporting night-time RMSE, MAE, ARD, and threshold-accuracy metrics, as well as MAE for both night and day conditions (Brucker et al., 2024). The paper explicitly emphasizes applications such as autonomous driving and “lost cargo” detection.
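The depth metrics named above follow standard definitions, sketched here in NumPy. The function name, the treatment of non-positive ground truth as invalid, the optional `max_depth` cutoff, and the inclusion of the conventional threshold accuracies (fraction of pixels with max(pred/gt, gt/pred) below 1.25, 1.25², 1.25³) are assumptions for illustration; each benchmark defines its own valid-pixel mask and range.

```python
import numpy as np


def depth_metrics(pred, gt, max_depth=None):
    """Compute RMSE, MAE, absolute relative difference, and threshold accuracies.

    pred, gt: arrays of predicted and ground-truth depth (same shape, same units).
    Pixels with gt <= 0 (e.g. missing LiDAR returns) are ignored.
    Predictions are assumed to be strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    if max_depth is not None:
        valid &= gt <= max_depth
    p, g = pred[valid], gt[valid]
    err = p - g
    ratio = np.maximum(p / g, g / p)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "ard": float(np.mean(np.abs(err) / g)),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

RMSE penalises large outliers more heavily than MAE, so a method can improve one without the other; reporting both, together with a relative measure such as ARD, gives a fuller picture across near and far ranges.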
Inverse rendering and reflectance recovery form a third domain in which active RGB-NIR is used not to improve photographic appearance but to stabilize physical decomposition under uncontrolled ambient light. The active RGB-NIR inverse-rendering method is evaluated on RGB diffuse albedo (PSNR/SSIM/LPIPS), roughness RMSE, normal MAE, and relighting PSNR, with optimization time reported in hours per scene (Chung et al., 28 May 2026). Here the principal benefit is not merely low-light robustness but ambient robustness: NIR flash provides stable point-light shading that does not contaminate RGB acquisition.
Further domains include NIR-to-RGB visualization and application-specific classification. MPFNet is evaluated by PSNR, SSIM, AE, and LPIPS on the VCIP test set and improves PSNR over ATCycleGAN (Yang et al., 2023). The older deep colorization model, trained on 38,495 RGB-NIR road-scene pairs from a multi-CCD camera, is evaluated by RMSE and S-CIELAB, with results reported for its best topology (Limmer et al., 2016). In night-time fire detection, grayscale surrogates used as NIR-like data are evaluated with mAP-based metrics and F1, and a two-stage pipeline using YOLOv11n and EfficientNetV2-B0 is designed to reduce false positives from bright artificial lights (Khai et al., 29 Dec 2025). In plant science, the barley dataset explicitly supports classification of germinated versus non-germinated kernels, multimodal time-series analysis, and exploratory chemometric analysis via cleaned pseudo-absorbance spectra (Engstrøm et al., 23 Apr 2025).
6. Limitations, misconceptions, and research directions
A recurrent misconception is that pixel alignment or simultaneous acquisition eliminates the substantive difficulty of RGB-NIR fusion. The literature shows the opposite. Cross-spectral inconsistency arises because NIR can contain shadows absent in RGB, RGB can contain color patterns absent in NIR, and local gradient statistics can differ enough to destabilize feature matching and fusion (Jin et al., 2023, Li et al., 2024, Xu et al., 2024). Pixel alignment removes a geometric nuisance variable, but it does not resolve modality-dependent reflectance or semantics.
A second misconception is that NIR-to-RGB translation recovers ground-truth visible color. “Multi-scale Progressive Feature Embedding for Accurate NIR-to-RGB Spectral Domain Translation” states that the mapping is intrinsically ambiguous because NIR reflectance is non-overlapping with the visible spectrum and therefore induces a one-to-many color mapping (Yang et al., 2023). The earlier road-scene colorization work likewise notes failures when visible signals are absent from NIR, such as LED traffic lights (Limmer et al., 2016). These results imply that colorized NIR is best interpreted as a learned RGB surrogate, not as a spectrally faithful reconstruction.
Active NIR itself is also not uniformly robust. Outdoor sunlight contains strong ambient NIR, which reduces the headroom of flash-after-subtraction in inverse rendering and degrades gated SNR in automotive depth estimation (Chung et al., 28 May 2026, Brucker et al., 2024). Strong solar IR, bright artificial lights, glare, fog, rain, and material-specific NIR reflectance all remain failure modes in the surveyed systems (Khai et al., 29 Dec 2025, Brucker et al., 2024). A plausible implication is that “active” should not be conflated with “ambient-invariant”; rather, it improves controllability within a bounded radiometric regime.
Dataset-specific caveats are equally important. The barley germination dataset is explicitly biased because glue residue from the 3D-printed grid strongly inhibited germination, so transfer to standard malting conditions requires caution (Engstrøm et al., 23 Apr 2025). Real-NAID is restricted to static scenes with sequential capture (Xu et al., 2024). The VIS-NIR-MIX dataset uses sequential filter-wheel acquisition and leaves dynamic scenes for future work (Lv et al., 2020). The 2016 road-scene colorization dataset is summer daylight only and therefore does not cover night or active illumination despite its relevance to active systems (Limmer et al., 2016). These constraints underscore that benchmark diversity in active RGB-NIR imaging still lags the diversity of intended deployments.
Current research directions in the cited works are relatively consistent. They include physics-informed constraints and richer spectral priors for NIR-to-RGB translation (Yang et al., 2023), improved alignment and temporally consistent fusion (Jin et al., 2023), adaptive NIR illumination control and learned geometric warping for denoising (Xu et al., 2024), tighter LiDAR integration and generative RGB-NIR modeling for robot vision (Kim et al., 2024), coded or structured NIR illumination and joint RGB-NIR BRDF learning for inverse rendering (Chung et al., 28 May 2026), and broader integration of patch-based detection and temporal flame dynamics for surveillance (Khai et al., 29 Dec 2025). Taken together, these directions suggest that the field is moving from simple two-image fusion toward tightly coupled spectral, geometric, temporal, and physical inference.