V-RGBX: Multimodal RGB-plus-X Modeling

Updated 4 July 2026

V-RGBX is a multimodal framework that integrates visible RGB with auxiliary sensor or intrinsic channels (e.g., thermal, depth) to improve visual perception in adverse conditions.
It employs diverse fusion strategies—from pixel-level concatenation to reasoning-based cross-attention—to combine complementary evidence for enhanced detection, tracking, and editing.
Empirical studies show that careful modality-aware interaction boosts performance, addressing challenges like modality misalignment and variable sensor reliability.

V-RGBX denotes RGB-plus-$X$ multimodal modeling in which a visible RGB stream is paired with an auxiliary modality $X$. In the cited literature, $X$ ranges from sensor modalities such as thermal infrared, depth, event data, and polarimetric imagery to intrinsic scene-property layers such as albedo, normal, material, irradiance, and diffuse irradiance. Across these formulations, the recurring premise is that RGB provides fine appearance, color, and semantic detail, while $X$ provides complementary evidence when RGB is unreliable or insufficient; this premise has been instantiated in video object detection, tracking, visual grounding, low-light enhancement, diffusion-based detection, joint RGBX generation, and intrinsic-aware video editing (Tu et al., 2023, Tang et al., 2022, Zhao et al., 31 Dec 2025, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025, Wu et al., 31 Jan 2026).

1. Concept and scope

Within V-RGBX, the symbol $X$ does not denote a single fixed modality. In sensor-centric settings, it denotes a second imaging stream aligned with RGB, most commonly thermal infrared, depth, event, or polarimetric data. In intrinsic-centric settings, it denotes latent or physically meaningful layers generated jointly with RGB, including albedo, surface normal, depth/disparity, diffuse irradiance, and material-related channels (Orfaig et al., 5 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025).

This yields two major interpretations of the field. The first is multimodal perception under heterogeneous sensing, where RGB is fused with a second observation channel to improve robustness in adverse illumination, weak contrast, motion-heavy scenes, or geometry-sensitive situations. The second is joint RGB+$X$ scene modeling, where RGB and intrinsic maps are treated as mutually constraining outputs or conditions in a generative pipeline rather than as primary and auxiliary sensor streams (Tu et al., 2023, Dirik et al., 19 Apr 2025).

A common operational assumption is cross-modal correspondence. RGB-thermal video detection uses manually aligned paired streams; RT-X Net assumes paired, approximately aligned/co-registered, co-located RGB-T observations; RGBX-DiffusionDet assumes aligned paired 2D modalities; RGBX-Grounding explicitly states that each RGB image is paired with the subsequent $X$ image and that they are spatially and temporally aligned (Tu et al., 2023, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Wu et al., 31 Jan 2026). This reliance on correspondence is one of the defining practical constraints of V-RGBX.

Taken together, these formulations suggest that V-RGBX is better understood as a family of coupled RGB+$X$ representations than as a single task. What remains invariant is not the downstream objective, but the design problem: how to exploit complementary structure across modalities whose statistics, reliability, and failure modes are different.

2. Task formulations

The literature represented here spans both discriminative and generative formulations. In RGBT video object detection, the detector receives paired RGB and thermal streams and predicts object categories and bounding boxes for every frame in the sequence; at inference for frame $t$, EINet uses RGB frames $(t-1,t,t+1)$ and thermal frames $(t-1,t,t+1)$, six images in total (Tu et al., 2023). In RGB-TIR visual grounding, the model receives an aligned RGB-TIR image pair and a sentence, and outputs a regressed box (Zhao et al., 31 Dec 2025). RGBX-R1 extends this to multi-image grounding, formulating prediction as

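$$\{\hat{b}_i\}_{i=1}^{N} = \mathcal{F}\left(q,\, T_{\mathrm{RGB}},\, T_{X},\, \{S_i\}_{i=1}^{N}\right)$$

where, in schematic notation, $q$ is the language query, $T_{\mathrm{RGB}}$ and $T_{X}$ are the RGB and $X$-modality templates, $\{S_i\}_{i=1}^{N}$ is the sequence of search images, and $\hat{b}_i$ is the box predicted for search image $S_i$ (Wu et al., 31 Jan 2026).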
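The six-image EINet input described above (three RGB frames plus three thermal frames around frame $t$) can be sketched as follows. This is a minimal illustration, not the authors' data loader: the function names are hypothetical, and clamping indices at the sequence boundaries is an assumption, since the source does not say how the first and last frames are handled.

```python
def temporal_window(t, num_frames, radius=1):
    """Frame indices t-radius..t+radius, clamped to the valid range (assumed policy)."""
    return [min(max(i, 0), num_frames - 1) for i in range(t - radius, t + radius + 1)]


def gather_inputs(rgb_frames, thermal_frames, t):
    """Collect the RGB and thermal frames for the same temporal window around t."""
    idx = temporal_window(t, len(rgb_frames))
    return [rgb_frames[i] for i in idx] + [thermal_frames[i] for i in idx]


# Stand-in frame identifiers; real code would hold image arrays here.
rgb_frames = [f"rgb_{i}" for i in range(10)]
thermal_frames = [f"thermal_{i}" for i in range(10)]
inputs = gather_inputs(rgb_frames, thermal_frames, t=5)
```

The same window is applied to both streams, which presupposes the frame-level pairing that the cited work relies on.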

Low-level restoration and generation broaden the scope further. RT-X Net takes a low-light RGB image and a corresponding thermal image and outputs an enhanced visible image (Jha et al., 30 May 2025). PRISM supports text-to-RGBX generation, RGB-to-$X$ decomposition, and $X$-to-RGBX conditional generation by jointly modeling RGB, albedo, normal, depth/disparity, and diffuse irradiance (Dirik et al., 19 Apr 2025). The framework explicitly named V-RGBX performs video inverse rendering into intrinsic channels, photorealistic video synthesis from those channels, and keyframe-based intrinsic-conditioned video editing (Fang et al., 12 Dec 2025).

Task family	Typical $X$	Output
Video object detection	Thermal	Per-frame categories and bounding boxes
Visual grounding	Thermal, depth, event	Referred-object bounding box or box sequence
Visual tracking	Thermal	Target box over time
Low-light enhancement	Thermal	Enhanced RGB image
Diffusion-based detection	Depth, polarimetric, infrared	2D detections
Joint RGBX generation/editing	Albedo, normal, depth/disparity, irradiance, material	RGB and $X$ maps, or edited RGB video

These task definitions show that V-RGBX is not restricted to “RGB plus an extra channel” detection. It includes temporal localization, language-conditioned grounding, low-level restoration, and bidirectional RGB$\leftrightarrow X$ generation. A common misconception is to equate the field with high-level fusion only; the cited work shows that restoration and intrinsic-aware editing are also central subdomains (Jha et al., 30 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025).

3. Fusion and conditioning paradigms

One recurring design question is where interaction between RGB and $X$ should occur. The literature spans pixel-level fusion, feature-level fusion, decision-level fusion, shared latent-token modeling, and interleaved conditioning.

In RGBT tracking, the paper behind DFAT systematically compares pixel-, feature-, and decision-level fusion and reports that decision-level fusion is strongest under limited RGB-T training data. Its late-fusion mechanism uses dynamic weighting of RGB and TIR classification outputs together with linear template update, and the authors explicitly identify cross-modal score bias as a key issue when the backbone is trained on RGB data only (Tang et al., 2022). This is an important counterpoint to the common assumption that earlier fusion is necessarily better.
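A minimal sketch of decision-level fusion with dynamic weighting, assuming each branch yields a classification score map over the same search grid. The per-map standardization (a simple way to remove a constant score offset between branches) and the peak-based weighting rule are illustrative assumptions, not DFAT's actual formulas; all function names are hypothetical.

```python
import numpy as np


def standardize(score):
    """Zero-mean, unit-variance rescaling to remove a per-branch score offset."""
    return (score - score.mean()) / (score.std() + 1e-8)


def peak_confidence(score):
    """Illustrative reliability proxy: how far the peak stands above the mean."""
    return float(score.max() - score.mean())


def fuse_scores(rgb_score, tir_score):
    """Late fusion: weight each standardized map by its relative confidence."""
    rgb_n, tir_n = standardize(rgb_score), standardize(tir_score)
    c_rgb, c_tir = peak_confidence(rgb_n), peak_confidence(tir_n)
    w_rgb = c_rgb / (c_rgb + c_tir + 1e-8)
    fused = w_rgb * rgb_n + (1.0 - w_rgb) * tir_n
    return fused, w_rgb


rng = np.random.default_rng(0)
rgb = rng.normal(0.0, 1.0, (17, 17))   # noisy RGB scores with no clear peak
tir = rng.normal(5.0, 1.0, (17, 17))   # thermal scores with a constant offset
tir[8, 8] += 12.0                      # clear target response in thermal
fused, w_rgb = fuse_scores(rgb, tir)
peak = tuple(int(i) for i in np.unravel_index(fused.argmax(), fused.shape))
```

Without the standardization step, the branch with the larger raw score range would dominate regardless of which modality is more reliable, which is the kind of cross-modal score bias the paragraph above refers to.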

Several systems instead adopt middle fusion but with modality-aware interaction. EINet is a dual-stream, multi-frame, middle-fusion detector built on YOLOX with a Temporal Proximity Enhancement module and an Erasure-based Interaction mechanism. Its most distinctive idea is asymmetric cross-modal feature erasure: thermal helps erase RGB noise, but RGB does not erase thermal. The mechanism is grounded in a proposed negative SiLU, intended to preserve negative response values associated with background/noise regions while suppressing foreground responses in the thermal branch, then to use those thermal inactive features as a denoising guide for RGB (Tu et al., 2023). RT-X Net follows a different middle-fusion design: modality-specific self-attention is applied first, then multi-head cross-attention is used so that thermal-induced feature maps refine and guide the RGB feature space (Jha et al., 30 May 2025). RGBT-VGNet uses a CLIP-based grounding architecture with Asymmetric Modality Adaptation, which assigns higher LoRA rank to the thermal branch than the RGB branch, and Language-Aware Visual Synergy, which uses the text as a semantic query to refine each modality before cross-attention between modalities (Zhao et al., 31 Dec 2025).
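The thermal-guided cross-attention pattern described for RT-X Net can be illustrated with a minimal sketch in which RGB tokens act as queries and thermal tokens supply keys and values. This is a simplification under stated assumptions: it uses a single head rather than the multi-head attention the paper describes, random projection weights, and hypothetical names and token shapes.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(rgb_tokens, thermal_tokens, w_q, w_k, w_v):
    """RGB tokens query thermal tokens; the result refines the RGB features."""
    q = rgb_tokens @ w_q                       # (N_rgb, d)
    k = thermal_tokens @ w_k                   # (N_thermal, d)
    v = thermal_tokens @ w_v                   # (N_thermal, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return rgb_tokens + attn @ v, attn         # residual connection keeps RGB content


rng = np.random.default_rng(0)
d = 32
rgb_tokens = rng.normal(size=(64, d))
thermal_tokens = rng.normal(size=(64, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
refined, attn = cross_attention(rgb_tokens, thermal_tokens, w_q, w_k, w_v)
```

The asymmetry is in the roles: only the RGB stream is updated, while thermal features act purely as guidance.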

Diffusion-based detection introduces another pattern. RGBX-DiffusionDet keeps the original DiffusionDet decoder unchanged and modifies only the encoder. It uses separate ResNet-50 backbones and FPNs for RGB and $X$, fuses paired feature maps with Dynamic Channel Reduction within CBAM, and then aggregates multilevel proposal features with DMLAB before feeding them to the original diffusion decoder (Orfaig et al., 5 May 2025). The stated design principle is to preserve decoder complexity while adding an adaptive multimodal encoder.

Generative V-RGBX systems replace explicit cross-modal fusion with joint latent modeling or structured conditioning. PRISM expands the number of image tokens so that the sequence holds a token set for each modality, concatenates tokens from all modalities into one shared Diffusion Transformer, and supports arbitrary-subset conditioning through modality-token overriding during sampling (Dirik et al., 19 Apr 2025). V-RGBX, the intrinsic-aware video editing framework, constructs an interleaved intrinsic conditioning sequence

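$$\mathcal{C} = \left(x_1^{(m_1)},\, x_2^{(m_2)},\, \dots,\, x_T^{(m_T)}\right), \qquad m_t \in \mathcal{M},$$

where, in schematic notation, $x_t^{(m_t)}$ is the intrinsic map of modality $m_t$ at frame $t$ and $\mathcal{M}$ is the set of available intrinsic modalities. The sequence is built by choosing one intrinsic modality per frame and injecting temporal-aware modality embeddings so that the model can interpret the modality identity even after temporal compression (Fang et al., 12 Dec 2025).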
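The one-modality-per-frame rule can be sketched as a small sampler. This is a hedged illustration only: the modality list is taken from the intrinsic channels named in this article, uniform random selection is an assumption, and the function names are hypothetical rather than part of the V-RGBX codebase.

```python
import random

# Illustrative modality set; the actual channel list is an assumption here.
INTRINSIC_MODALITIES = ("albedo", "normal", "material", "irradiance")


def sample_interleaved_conditioning(num_frames, modalities=INTRINSIC_MODALITIES, seed=None):
    """Pick exactly one intrinsic modality for every frame index."""
    rng = random.Random(seed)
    return [(t, rng.choice(modalities)) for t in range(num_frames)]


def modality_ids(sequence, modalities=INTRINSIC_MODALITIES):
    """Integer ids that a learned modality-embedding table could index."""
    lookup = {name: i for i, name in enumerate(modalities)}
    return [lookup[name] for _, name in sequence]


sequence = sample_interleaved_conditioning(num_frames=8, seed=0)
ids = modality_ids(sequence)
```

Carrying an explicit per-frame modality id is what lets a model keep track of which channel each frame came from once frames are compressed along time.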

A further shift appears in RGBX-R1, where the cross-modal bridge is not primarily architectural but reasoning-based. Its Understand–Associate–Validate prompting scaffold constructs a Visual Modality Chain-of-Thought that first anchors target semantics in RGB, then establishes spatial correspondence in $X$, and finally validates target location under cross-modal degradation and complementarity (Wu et al., 31 Jan 2026). This suggests a distinct V-RGBX paradigm: transferring RGB competence to $X$ through explicit multimodal reasoning rather than only through feature fusion.

4. Benchmarks, datasets, and evaluation

V-RGBX research is unusually benchmark-driven because the value of $X$ is highly condition-dependent. VT-VOD50 is presented as the first dedicated benchmark for RGBT video object detection. It contains 50 pairs of RGB-thermal video sequences, totaling 9449 RGBT image pairs, collected in real traffic scenarios; the split is 38 paired sequences for training and 12 for testing, with seven road-scene classes and metrics reported as AP50, AP, FPS, parameter count, and GFLOPs (Tu et al., 2023).

RGBT-Ground is described as the first large-scale RGB-TIR visual grounding benchmark for complex real-world scenarios. It contains 21,535 RGBT image pairs and 38,760 grounding instances, derived from RefFLIR, RefM$^3$FD, and RefMFAD, with scene-level, environment-level, and object-level annotations. The benchmark defines TestA for normal-size targets in normal/strong light, TestB for nighttime/weak-light/very-weak-light, and TestC for small-size targets, and evaluates using [email protected] (Zhao et al., 31 Dec 2025). Its structure is especially relevant to V-RGBX because it directly operationalizes daylight, low-light, and small-object robustness.

RGBX-Grounding converts RGBX tracking data into a multi-image grounding benchmark built from VisEvent, DepthTrack, RGBD2022, RGBT234, and LasHeR. It reports 7,432 samples and more than 58k/59k images, typically with 2 template images and 6 search images per sample, and evaluates both modality-known and modality-unknown settings using sequence-level [email protected] (Wu et al., 31 Jan 2026). This benchmark is notable because it evaluates whether a model can reason about thermal, depth, or event imagery rather than merely exploit them as fixed channels.

For restoration, V-TIEE provides 50 indoor and outdoor scenes captured under diverse nighttime conditions with a co-located visible camera, a co-located thermal camera, a gold dichroic mirror setup, and homography refinement; it is used as a real-world generalization benchmark for visible-thermal image enhancement (Jha et al., 30 May 2025). For detection, RGBX-DiffusionDet evaluates on KITTI RGB-Depth, an annotated RGB-Polarimetric dataset, and M$^3$FD RGB-Infrared (Orfaig et al., 5 May 2025). For tracking, DFAT evaluates on GTOT, VOT-RGBT2019, and VOT-RGBT2020 (Tang et al., 2022).

The metric landscape is correspondingly heterogeneous. Detection papers use AP50, AP, and AP$_{50:95}$; grounding papers use [email protected]; restoration papers use PSNR, SSIM, and LPIPS; video synthesis/editing papers report PSNR, SSIM, LPIPS, FVD, FID, and Smoothness (Tu et al., 2023, Zhao et al., 31 Dec 2025, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Fang et al., 12 Dec 2025). This diversity reflects the fact that V-RGBX is a multimodal design problem crossing recognition, restoration, and generation.
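Box-based metrics such as AP50 rest on an intersection-over-union (IoU) test between a predicted and a reference box, with 0.5 as the usual threshold for AP50. The sketch below shows only that building block, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the helper names are hypothetical and the full AP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, target, threshold=0.5):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return box_iou(pred, target) >= threshold


target = (10, 10, 50, 50)
good = (12, 12, 52, 52)   # small offset from the target
bad = (40, 40, 80, 80)    # overlaps only a corner of the target
```

Here `is_hit(good, target)` passes the 0.5 threshold while `is_hit(bad, target)` does not, which is the per-box decision that threshold-based detection and grounding scores aggregate.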

5. Representative systems and empirical findings

A consistent empirical result is that $X$ augments RGB but does not replace it. In the VT-VOD50 backbone comparison, RGB-only Darknet53 reaches AP50 41.28 and AP 21.27, whereas thermal-only reaches AP50 27.73 and AP 11.93, confirming that thermal is not a replacement for RGB even though it is more robust under adverse illumination (Tu et al., 2023). The same paper also shows that naive multimodal fusion can be harmful: when mainstream VOD methods are modified with naive pixel-wise RGB+thermal summation, performance drops, and baselines such as DFF, FGFA, SELSA, Temporal ROI Align, and TransVOD++ all degrade under unsophisticated multimodal injection (Tu et al., 2023).

Carefully designed multimodal interaction, however, produces measurable gains. EINet reaches AP50 46.32, AP 23.96, at 92.59 FPS on VT-VOD50, while its RGB-only variant “EINet w/o MI” reaches AP50 44.04, AP 22.55, at 204.2 FPS; the ablations attribute the improvement to the combination of a three-frame temporal window and the Erasure-based Interaction module (Tu et al., 2023). In tracking, DFAT reports EAO 0.3986 on VOT-RGBT2019, above the best feature-level result 0.3788 and the best pixel-level result 0.3481, and reaches EAO 0.4178 on VOT-RGBT2020, reinforcing the paper’s conclusion that late calibrated fusion is strongest in this regime (Tang et al., 2022).

Language-conditioned V-RGBX models show the same pattern. RGBT-VGNet reaches 72.65 on RefFLIR test, 74.34 on RefM$^3$FD test, and 66.63 on RefMFAD test, and the benchmark analyses show especially strong behavior in nighttime/weak-light and small-object subsets, though some adapted RGB+TIR baselines remain competitive on particular subsets (Zhao et al., 31 Dec 2025). RGBX-R1-7B reaches an average [email protected] of 46.53 on RGBX-Grounding under modality-known evaluation, compared with 31.04 for Qwen2.5-VL-7B-sft, and the paper reports superiority by 22.71% on three RGBX grounding tasks; importantly, its X-only performance improves far more than standard SFT, supporting the claim that VM-CoT supervision and MuST reward teach actual non-RGB modality understanding (Wu et al., 31 Jan 2026).

In low-level enhancement, RT-X Net reaches 27.75 PSNR / 0.85 SSIM on LLVIP and 0.12 LPIPS / 0.71 SSIM on V-TIEE, outperforming the strongest reported visible-only baselines. Its ablation establishes the hierarchy “Self-Attention (Only RGB)” < “Thermal channel concatenation” < “Cross-Attention”, demonstrating that the gain is not merely due to adding a fourth channel but to structured cross-modal interaction (Jha et al., 30 May 2025).

RGBX-DiffusionDet extends this empirical message to diffusion-based detection. It raises AP$_{50:95}$ from 67.0 to 69.2 on KITTI RGB-D, from 52.8 to 54.2 on RGB-P, and from 54.1 to 58.1 on M$^3$FD RGB-IR, and on M$^3$FD it surpasses several prior fusion methods, reaching 58.1 AP$_{50:95}$ and 88.8 AP$_{50}$ (Orfaig et al., 5 May 2025). The strongest gain occurs in RGB-IR, consistent with the broader claim that V-RGBX helps most when RGB quality degrades.
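The claim that the RGB-IR setting shows the largest gain follows directly from the three before/after pairs quoted above. A short check, using only those numbers (the variable names are illustrative):

```python
# Before/after scores as quoted in the text for the three benchmarks.
before = {"KITTI RGB-D": 67.0, "RGB-P": 52.8, "M3FD RGB-IR": 54.1}
after = {"KITTI RGB-D": 69.2, "RGB-P": 54.2, "M3FD RGB-IR": 58.1}

# Absolute gain in AP points and gain relative to the starting score.
gains = {name: round(after[name] - before[name], 1) for name in before}
relative = {name: round(100.0 * gains[name] / before[name], 1) for name in before}
largest = max(gains, key=gains.get)

for name in before:
    print(f"{name}: +{gains[name]:.1f} AP points ({relative[name]:.1f}% relative)")
```

The absolute gains are 2.2, 1.4, and 4.0 points respectively, so the RGB-IR benchmark leads on both the absolute and the relative measure.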

Generative RGBX systems add a different empirical lesson: joint modeling can improve both cross-modal consistency and controllability. PRISM’s joint model outperforms modality-specific variants on HyperSim, and its reconstruction-from-predicted-albedo-times-irradiance experiment yields RMSE 0.0849, PSNR 22.38, LPIPS 0.15, substantially better than both a single-channel PRISM variant and RGB↔X, indicating stronger alignment between jointly generated intrinsic layers (Dirik et al., 19 Apr 2025). V-RGBX, the intrinsic-aware video editing framework, reports forward-rendering performance of 22.42 PSNR, 0.7952 SSIM, 0.1930 LPIPS, and 367.89 FVD with reference conditioning, as well as strong RGB$\to X \to$RGB cycle consistency on Evermotion and RealEstate10K (Fang et al., 12 Dec 2025).
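The albedo-times-irradiance consistency check can be illustrated with a minimal sketch: multiply the two predicted layers per pixel and compare the product against the RGB image. The code assumes linear values in $[0, 1]$ and synthetic data; the paper's exact color space, tone mapping, and evaluation protocol are not specified here, so the helper names and setup are assumptions.

```python
import numpy as np


def reconstruct_rgb(albedo, irradiance):
    """Diffuse intrinsic model: per-pixel product of albedo and irradiance."""
    return np.clip(albedo * irradiance, 0.0, 1.0)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def psnr(a, b, peak=1.0):
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.9, size=(32, 32, 3))       # synthetic ground truth
irradiance = rng.uniform(0.1, 1.0, size=(32, 32, 3))
reference = reconstruct_rgb(albedo, irradiance)

# Simulate an imperfect albedo prediction and measure the reconstruction error.
noisy_albedo = np.clip(albedo + rng.normal(0, 0.05, albedo.shape), 0, 1)
estimate = reconstruct_rgb(noisy_albedo, irradiance)
error_rmse, error_psnr = rmse(estimate, reference), psnr(estimate, reference)
```

Because errors in either layer propagate multiplicatively into the product, a low reconstruction error is evidence that the two generated layers are mutually consistent, not just individually plausible.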

Taken together, these results support three field-wide conclusions. First, thermal, depth, event, polarimetric, and intrinsic channels are most valuable when RGB is degraded or underconstrained. Second, naive fusion is often inferior to modality-aware interaction. Third, the meaning of “fusion” in V-RGBX has expanded from early concatenation to asymmetric denoising, decision calibration, language-guided cross-attention, shared latent diffusion, and chain-of-thought-mediated modality transfer.

6. Limitations, misconceptions, and open directions

The strongest recurring limitation is dependence on pairing and alignment. EINet relies on manual alignment between RGB and thermal streams; RT-X Net assumes co-located paired RGB-T inputs with homography refinement; RGBX-DiffusionDet assumes aligned paired 2D modalities; RGBX-R1 builds on datasets where RGB and $X$ are spatially and temporally aligned (Tu et al., 2023, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Wu et al., 31 Jan 2026). Several works explicitly note the absence of dedicated misalignment robustness studies. This makes registration error, asynchronous capture, and missing-modality robustness unresolved issues rather than peripheral implementation details.

A second limitation is dataset scale and domain concentration. VT-VOD50 contains 50 paired sequences; V-TIEE contains 50 scenes; RGBT-Ground is built from converted RGB-T detection datasets; RGBX-Grounding is built from tracking datasets; PRISM is trained primarily on synthetic and rendered indoor scenes with an internal real RGB-only dataset; V-RGBX is trained on an internal synthetic dataset rendered from 127 Evermotion interior scenes (Tu et al., 2023, Jha et al., 30 May 2025, Zhao et al., 31 Dec 2025, Wu et al., 31 Jan 2026, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025). These benchmark choices are valuable, but they also constrain claims about generalization.

A third limitation is modality specialization. EINet’s one-directional thermal-to-RGB erasure is justified by thermal sensor physics, and the paper itself notes that the exact erasure design may need to be reversed, made adaptive, or made bidirectional for other modalities such as depth, event streams, polarization, or radar (Tu et al., 2023). Similarly, PRISM’s $X$ is intrinsic-focused rather than sensor-focused, and V-RGBX’s conditioning sampler chooses exactly one intrinsic modality per frame, which the paper states limits handling of complex multi-attribute keyframe edits (Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025).

A fourth issue is reproducibility. RT-X Net does not clearly specify the number of transformer blocks, attention heads, feature dimensions, exact PCA implementation, or exact reconstruction-head architecture; RGBT-VGNet leaves the exact localization loss unspecified in the visible text; RGBX-R1 gives very little low-level architectural detail about how non-RGB images are handled inside the base MLLM (Jha et al., 30 May 2025, Zhao et al., 31 Dec 2025, Wu et al., 31 Jan 2026). These omissions matter because V-RGBX systems are often sensitive to preprocessing, synchronization, and modality encoding choices.

Several misconceptions are corrected by the literature. Thermal is not a replacement for RGB; naive multimodal fusion is not automatically beneficial; and V-RGBX is not confined to detection or tracking. It also includes low-light enhancement, joint RGBX generation, inverse rendering, and intrinsic-aware video editing (Tu et al., 2023, Jha et al., 30 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025).

The open directions stated or implied across the cited work are coherent. One line seeks better efficiency and larger datasets, as in the planned expansion beyond 500 pairs of RGBT videos for video detection and the real-time ambitions stated for nighttime enhancement (Tu et al., 2023, Jha et al., 30 May 2025). Another line seeks broader vision-language tasks and more robust grounding, including referring expression segmentation, VQA, cross-modal retrieval, dynamic modality reliability estimation, stronger registration-aware fusion, and larger vision-language backbones trained natively on thermal data (Zhao et al., 31 Dec 2025). A third line, suggested by PRISM and V-RGBX, is to treat RGB and $X$ as a joint distribution rather than a one-way prediction target, enabling decomposition, completion, conditional generation, and editable scene control within the same model family (Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025).

In that sense, V-RGBX is not merely a multimodal supplement to RGB vision. It is a broader research program concerned with how RGB and auxiliary modalities should be co-modeled, when asymmetry between modalities should be explicit, and whether the auxiliary channel should add evidence, remove noise, regularize reasoning, or define a physically interpretable control space.