
RIS-FUSION: Diverse Methods in Data Integration

Updated 4 July 2026
  • RIS-FUSION is a multifaceted framework that encompasses various fusion approaches, including restricted-information sanitization, text-driven infrared-visible fusion, and wireless decision fusion.
  • It unifies disparate methodologies by structuring the integration of heterogeneous data sources under explicit constraints on utility, secrecy, semantics, and channel awareness.
  • The framework leverages joint optimization techniques such as adversarial risk minimization, cascaded architectures, and covariance-aware inference to enhance performance across diverse applications.

RIS-FUSION is a recurrent but non-unified label in the technical literature. In the cited works, it denotes several distinct formulations: a sanitization framework for fusing classified information under zero inferential leakage constraints, a text-conditioned infrared-visible image fusion method supervised by referring image segmentation, and multiple wireless-communications schemes in which heterogeneous path, channel, or local-decision information is fused for localization, inference, control, or feedback recovery (Jändel et al., 2017, Ma et al., 16 Sep 2025, Zhang et al., 2023). The shared motif is not a single algorithmic lineage, but the structured composition of multiple sources under explicit constraints on utility, semantics, secrecy, or channel awareness.

1. Nomenclature and scope

A common misconception is that RIS-FUSION denotes a single method. In the cited literature, the label is attached to several technically unrelated frameworks.

| Usage of RIS-FUSION | Core formulation | Representative source |
| --- | --- | --- |
| Restricted-information fusion | $R^* = S \circ F(DB_1,\dots,DB_m)$ with $I(R^*;SI)=0$ | (Jändel et al., 2017) |
| Referring-image-segmentation-driven IVIF | Cascaded text-driven infrared-visible fusion and RIS with LangGatedFusion | (Ma et al., 16 Sep 2025) |
| RIS-assisted wireless fusion | WLS pseudo-position fusion, large-array decision fusion, distributed RIS control, or CSI fusion | (Zhang et al., 2023) |

The earliest formulation in the supplied corpus treats RIS-FUSION as “fusing restricted information,” where the central problem is that a conventional fusion result inherits the strictest input classification and therefore becomes difficult to disseminate (Jändel et al., 2017). A later computer-vision use recasts RIS-FUSION as “Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” where RIS stands for referring image segmentation rather than reconfigurable intelligent surfaces (Ma et al., 16 Sep 2025). In wireless papers, the same label is used in summaries to describe fusion of pathwise localization estimates, sensor decisions, CSI components, or local neural proposals at distributed RIS controllers (Zhang et al., 2023, Ciuonzo et al., 27 Feb 2026, Alexandropoulos et al., 21 Dec 2025, Guo et al., 2020).

This multiplicity matters because the acronym “RIS” changes meaning across subfields. In (Ma et al., 16 Sep 2025), RIS stands for referring image segmentation; in (Zhang et al., 2023), (Ciuonzo et al., 27 Feb 2026), and (Alexandropoulos et al., 21 Dec 2025), it stands for reconfigurable intelligent surface. “RIS-FUSION” therefore functions as a context-dependent label rather than a universally standardized term.

2. Restricted-information sanitization

In “Fusing restricted information,” the fusion problem is formalized over a totally ordered set of security labels $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$, with source databases $DB_1,\dots,DB_m$ whose records carry their own classification labels (Jändel et al., 2017). A conventional fusion operator

$$F:(DB_1,\dots,DB_m)\mapsto R$$

produces a fused result $R$ whose classification is

$$\operatorname{class}(R)=\max\{\operatorname{class}(DB_i): i=1,\dots,m\}.$$

When the fusion process is a classifier, the model $K$ may be exposed in white-box or black-box form.
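
The classification rule for a conventional fusion result can be illustrated with a minimal sketch. The label names below are hypothetical; the source only specifies a totally ordered set of security labels, and the fused result inherits the strictest one.

```python
# Hypothetical label names, listed from least to most restrictive.
# The source only assumes a total order c_1 < c_2 < ... < c_L.
LEVELS = ["UNCLASSIFIED", "RESTRICTED", "SECRET", "TOP_SECRET"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def fused_classification(source_labels):
    """Return the strictest label among the inputs (the max under the total order)."""
    if not source_labels:
        raise ValueError("at least one source database is required")
    return max(source_labels, key=RANK.__getitem__)


# Fusing three databases: the result inherits the highest input label.
result = fused_classification(["RESTRICTED", "UNCLASSIFIED", "SECRET"])
print(result)
```

This is the dissemination problem that motivates sanitization: a single highly classified input raises the label of the whole fused product.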

The objective of RIS-FUSION in this setting is to construct a sanitized result $R^*$ or classifier $K^*$ that can be disseminated at a lower classification level while preventing inference of any sensitive input. The formulation is

I(R;SI)=0I(R^*;SI)=02

subject to the non-leakage constraint

$$I(R^*;SI)=0.$$

An equivalent adversarial formulation requires the risk of any strategy I(R;SI)=0I(R^*;SI)=04 using I(R;SI)=0I(R^*;SI)=05 to equal the risk of an oracle that never sees I(R;SI)=0I(R^*;SI)=06 (Jändel et al., 2017). The same paper also introduces the leakage bound

I(R;SI)=0I(R^*;SI)=07

with I(R;SI)=0I(R^*;SI)=08 for perfect secrecy, and notes that some extensions may impose a differential-privacy analogue on model parameters I(R;SI)=0I(R^*;SI)=09.
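
To make the zero-leakage condition concrete, the following sketch computes the mutual information between a released result and the sensitive information from a small joint probability table. The two tables are made-up toy distributions, not data from the paper; they only illustrate what $I(R^*;SI)=0$ means numerically.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Rows index the released result R*, columns index the sensitive information SI.
independent = [[0.25, 0.25], [0.25, 0.25]]  # R* reveals nothing about SI
leaky = [[0.5, 0.0], [0.0, 0.5]]            # R* determines SI exactly

mi_independent = mutual_information(independent)
mi_leaky = mutual_information(leaky)
print(mi_independent, mi_leaky)
```

A release satisfying the non-leakage constraint corresponds to the first table; the second table is the worst case for a binary secret, leaking one full bit.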

The construction uses three abstract operators—build (C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}0), filter (C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}1), and fuse (C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}2)—whose permitted orderings are generated by a context-sensitive grammar indexed by the sensitivity descriptor. Pipelines are then pruned by trust and competence constraints: the classifier producer is trusted to see C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}3, the end-user receives only sanitized C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}4, and a certifier reviews the pipeline (Jändel et al., 2017). The final selection rule is

C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}5

The paper’s two-database SVM example illustrates the mechanism. First, SVMs C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}6 are trained on C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}7; next, a composite classifier C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}8 is built; finally, support vectors in the region where C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}9 predicts the sensitive label DB1,,DBmDB_1,\dots,DB_m0 are removed by moving them across the decision boundary until no support vector lies in that cell (Jändel et al., 2017). The theoretical claim is strong: by construction, the grammar forbids any pipeline that would expose correlated secret patterns, so DB1,,DBmDB_1,\dots,DB_m1; in the black-box setting, queries cannot increase mutual information about DB1,,DBmDB_1,\dots,DB_m2 beyond zero.

The practical constraints are equally explicit. The number of possible build/filter/fuse permutations grows combinatorially with DB1,,DBmDB_1,\dots,DB_m3; utility loss is inevitable if the sensitive region corresponds to high-information cells; the example covers only binary SVMs and pairwise sensitivity; and the framework assumes that the adversary has no side-channel beyond a black-box API and that the sensitivity descriptor correctly captures which correlations are secret (Jändel et al., 2017).

3. Referring-image-segmentation-driven infrared-visible fusion

In “RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” the key observation is that text-driven infrared and visible image fusion (IVIF) and referring image segmentation (RIS) share the same semantic goal: highlighting exactly the object described by a natural-language expression (Ma et al., 16 Sep 2025). RIS-FUSION is therefore cast as a two-stage cascaded architecture in which fusion and segmentation are jointly optimized.

In the first stage, a dual-stream pyramid encoder extracts visible features DB1,,DBmDB_1,\dots,DB_m4 from the luminance DB1,,DBmDB_1,\dots,DB_m5 channel and infrared features DB1,,DBmDB_1,\dots,DB_m6 at four spatial resolutions DB1,,DBmDB_1,\dots,DB_m7. The first two scales are fused by simple addition, while the deeper two scales pass through LangGatedFusion, which injects a text embedding DB1,,DBmDB_1,\dots,DB_m8 obtained by a frozen BERT. A U-Net decoder produces a fused luminance image DB1,,DBmDB_1,\dots,DB_m9, which is recombined with the original chroma channels F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R0 to synthesize the RGB fused image F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R1 (Ma et al., 16 Sep 2025).

LangGatedFusion consists of language-guided attention, a soft spatial gate, and FiLM modulation. At each deep scale, the context map is computed as

F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R2

from which the model predicts a gate F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R3 and FiLM parameters F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R4. The fused feature is

F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R5

This permits pixel-wise, text-conditioned control over the contribution of visible and infrared features (Ma et al., 16 Sep 2025).
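
A minimal sketch of how a soft spatial gate and FiLM modulation can be combined is given below. The specific form (a sigmoid gate forming a convex mix of the two modalities, followed by a scale-and-shift) is an assumption chosen for illustration and is not taken from the paper; the function and argument names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_film_fuse(vis, ir, gate_logits, gamma, beta):
    """Per-pixel convex mix of two feature maps followed by a FiLM affine transform.

    vis, ir, gate_logits: 2-D lists (H x W) for a single channel.
    gamma, beta: scalar FiLM scale and shift for that channel (assumed form).
    """
    fused = []
    for v_row, i_row, g_row in zip(vis, ir, gate_logits):
        row = []
        for v, i, g in zip(v_row, i_row, g_row):
            w = sigmoid(g)                 # soft spatial gate in (0, 1)
            mixed = w * v + (1.0 - w) * i  # gate weighs visible against infrared
            row.append(gamma * mixed + beta)
        fused.append(row)
    return fused


vis = [[1.0, 1.0]]
ir = [[3.0, 3.0]]
logits = [[0.0, 50.0]]  # first pixel: equal mix; second pixel: almost all visible
out = gated_film_fuse(vis, ir, logits, gamma=2.0, beta=0.5)
print(out)
```

In a text-conditioned model the gate logits and FiLM parameters would be predicted from the language-guided context map, which is what gives the expression pixel-wise control over each modality's contribution.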

In the second stage, F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R6 and the same text embedding are fed into a Swin-Transformer-based encoder-decoder. Text is injected via Language-Guided Attention in multiple encoder layers and concatenated in the decoder. The RIS head outputs a binary mask F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R7 corresponding to the referred object. Crucially, gradients are not detached at the fusion output; the segmentation loss back-propagates through the fusion backbone, so perceptual fusion quality and target-object retention are optimized jointly (Ma et al., 16 Sep 2025).

Training uses a Dice segmentation loss,

F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R8

a multi-term fusion reconstruction loss with weights F:(DB1,,DBm)RF:(DB_1,\dots,DB_m)\mapsto R9, RR0, RR1, RR2, and RR3, and a total objective

RR4

with RR5 set via cross-validation (Ma et al., 16 Sep 2025).
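
For reference, a standard soft Dice loss can be sketched as follows. The smoothing constant `eps` and the flattening of masks into lists are assumptions; the paper's exact variant may differ.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss, 1 - 2|P∩G| / (|P| + |G|), over flattened masks.

    pred holds probabilities in [0, 1]; target holds 0/1 labels.
    eps is an assumed smoothing term that avoids division by zero.
    """
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
disjoint = dice_loss([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1])
print(perfect, disjoint)
```

Because the loss is differentiable in `pred`, a segmentation head trained with it can pass gradients back into whatever produced its input, which is the mechanism the cascaded design relies on.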

The benchmark introduced for this setting is MM-RIS, built on M$^3$FD and MSRS. It contains 12,500 training and 3,500 testing triplets, each consisting of an infrared-visible image pair, a fine-grained binary mask, and a referring expression. Expressions average 5–10 tokens and use color, shape, size, or spatial cues; annotations are pixel-wise masks rather than bounding boxes (Ma et al., 16 Sep 2025).

Reported performance is state of the art on MM-RIS. RIS-FUSIONRR7 achieves 53.00% mIoU, a RR8 absolute gain over the strongest baseline, OmniFuse+RISRR9, at 47.71%, with improvements of class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.0 to class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.1 in class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.2 thresholds. Ablation results attribute a class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.3 mIoU increase to joint optimization alone, and a further increase to 53.00 after adding LangGatedFusion (Ma et al., 16 Sep 2025). The paper also notes limitations: MM-RIS currently covers urban traffic scenes, the cascaded Swin-Transformer RIS head and BERT encoder add overhead, and future work is identified around broader domains, lightweight architectures, open-vocabulary expressions, and evaluation beyond mIoU.
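
The headline metric can be sketched as below, assuming mIoU is the mean of per-sample intersection-over-union between predicted and ground-truth binary masks (a common convention in referring segmentation, stated here as an assumption). The masks are toy inputs; only the two mIoU figures in the last lines come from the text above.

```python
def iou(pred, target):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # two empty masks count as a match


def mean_iou(pairs):
    """Average per-sample IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


toy_pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU = 1/2
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # IoU = 1
]
toy_miou = mean_iou(toy_pairs)

# Absolute gap between the two reported mIoU values, in percentage points.
gain = round(53.00 - 47.71, 2)
print(toy_miou, gain)
```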

4. RIS-aided positioning and pathwise estimate fusion

In “Approximate Maximum-Likelihood RIS-Aided Positioning,” RIS-FUSION denotes a fusion method for localization in which a user equipment position and clock bias are inferred from estimated channel parameters associated with a direct LOS path and class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.4 RIS-reflection paths (Zhang et al., 2023). The system model places the base station at class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.5, the user at unknown class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.6, and the class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.7-th RIS panel at class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.8, under an OFDM system with class(R)=max{class(DBi):i=1m}.\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.9 subcarriers and bandwidth KK0.

The effective MIMO channel on subcarrier KK1 is

KK2

and the unknown parameter vector KK3 includes path gains, delays, and angular parameters. Rather than solve the full nonconvex MLE jointly, the paper decomposes estimation into subproblems: BS-side AoD by subspace fitting, delays by MUSIC, UE-side AoA by analogous MUSIC, and path gains by least squares (Zhang et al., 2023).

Each path yields a pseudo-position estimate without clock-bias adjustment: KK4 with covariance matrices derived from Fisher-information approximations. These pseudo-positions are fused by weighted least squares over the LOS and reflection paths, jointly estimating KK5 and KK6. For fixed KK7, the estimator has the closed form

KK8

followed by a one-dimensional search over KK9 (Zhang et al., 2023).
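
The covariance-weighted combination step can be sketched as follows. This is a generic weighted-least-squares fusion of 2-D estimates, $\hat{p} = (\sum_i C_i^{-1})^{-1} \sum_i C_i^{-1} p_i$, written under the assumption that the pseudo-positions have already been compensated for a fixed candidate clock bias; it does not reproduce the paper's estimator, and all numbers are made up.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("singular covariance")
    return [[d / det, -b / det], [-c / det, a / det]]


def wls_fuse(positions, covariances):
    """Fuse 2-D estimates p_i with covariances C_i by inverse-covariance weighting."""
    info = [[0.0, 0.0], [0.0, 0.0]]  # accumulated information matrix
    vec = [0.0, 0.0]                 # accumulated weighted estimates
    for p, cov in zip(positions, covariances):
        w = inv2(cov)
        for r in range(2):
            for c in range(2):
                info[r][c] += w[r][c]
            vec[r] += w[r][0] * p[0] + w[r][1] * p[1]
    fused_cov = inv2(info)
    fused = [fused_cov[r][0] * vec[0] + fused_cov[r][1] * vec[1] for r in range(2)]
    return fused, fused_cov


# One accurate estimate and one noisy estimate (illustrative values only).
positions = [[10.0, 5.0], [12.0, 7.0]]
covariances = [[[0.1, 0.0], [0.0, 0.1]], [[0.9, 0.0], [0.0, 0.9]]]
fused, fused_cov = wls_fuse(positions, covariances)
print(fused, fused_cov)
```

The noisy estimate receives one ninth of the weight of the accurate one, so the fused position sits close to the accurate estimate. A search over the clock bias would wrap this step, re-evaluating the weighted residual for each candidate value.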

The theoretical claim is asymptotic: by the Extended Invariance Principle, the WLS estimator is approximately equivalent to the MLE as the number of samples grows, provided pathwise estimates are approximately independent. The paper also derives the Fisher information matrix, the transformed FIM on RR^*0, the position error bound

RR^*1

and the clock-bias error bound

RR^*2

RIS phase-shift design is then posed as an expected reflection-gain maximization over angular spans and solved approximately by SVD after discretization (Zhang et al., 2023).

Numerically, the proposed WLS fusion RMSE versus SNR nearly attains the PEB over RR^*3–RR^*4 dB. Increasing the number of samples RR^*5, RIS elements RR^*6, or RIS count RR^*7 lowers RMSE, and the method is robust to actual RR^*8, unlike naive bias-ignoring methods. The fusion step is explicitly covariance-aware: low-SNR paths are downweighted (Zhang et al., 2023).

5. Large-array decision fusion and adversarial spectrum sensing

A separate line of work treats RIS-FUSION as channel-aware decision aggregation at a fusion center. In “Massive MIMO Channel-aware Decision Fusion Aided by Reconfigurable Intelligent Surfaces,” RR^*9 sensors transmit local BPSK-mapped binary decisions to an KK^*0-antenna fusion center assisted by an KK^*1-element RIS (Ciuonzo et al., 27 Feb 2026). The received vector is

KK^*2

The NP-optimal LLR has complexity KK^*3, so the paper derives three simpler large-array rules: MRC, modified MRC (mMRC-1 and mMRC-2), and ZFC.
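
As an illustration of the simplest of these rules, the sketch below uses the form of MRC commonly adopted in channel-aware decision fusion, $\Lambda = \Re(a^H y)$ with $a = H\mathbf{1}$, where $H$ stands for the composite sensor-to-fusion-center channel. That form, the zero threshold, and the channel values are assumptions for illustration rather than expressions quoted from the paper.

```python
def mrc_statistic(H, y):
    """Lambda = Re(a^H y) with a = H @ 1 (sum of the sensors' channel columns)."""
    a = [sum(row) for row in H]  # a_n = sum_k H[n][k]
    return sum((an.conjugate() * yn).real for an, yn in zip(a, y))


def received(H, x):
    """Noise-free y = H x for BPSK symbols x_k in {-1, +1}."""
    return [sum(h * xk for h, xk in zip(row, x)) for row in H]


# 3 antennas, 2 sensors; channel entries are made up for illustration.
H = [[1 + 1j, 0.5 - 0.2j],
     [0.3 - 1j, 1 + 0.4j],
     [-0.6 + 0.1j, 0.8 + 0.9j]]

stat_h1 = mrc_statistic(H, received(H, [+1, +1]))  # both sensors report "1"
stat_h0 = mrc_statistic(H, received(H, [-1, -1]))  # both sensors report "0"
decision_h1 = stat_h1 > 0.0                        # assumed threshold of zero
print(stat_h1, stat_h0, decision_h1)
```

The rule needs only one inner product per decision, which is why it is attractive for large arrays; its weakness is that it ignores interference between sensors, which the modified rules are designed to address.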

The underlying analytical point is that conventional favorable propagation is altered by the RIS. Without RIS, KK^*4 is diagonal, but with RIS,

KK^*5

which is non-diagonal and random, so inter-sensor interference persists even as KK^*6 (Ciuonzo et al., 27 Feb 2026). The modified rules therefore whiten or invert KK^*7, and RIS phases are optimized from long-term channel statistics through a problem KK^*8 solved by majorization–minimization.

Simulation results show clear separation between the rules. For KK^*9 sensors and I(R;SI)=0I(R^*;SI)=000 RIS elements, mMRC-1+RIS-LTD reaches I(R;SI)=0I(R^*;SI)=001 at I(R;SI)=0I(R^*;SI)=002 and I(R;SI)=0I(R^*;SI)=003 at I(R;SI)=0I(R^*;SI)=004; ZFC+RIS-LTD yields I(R;SI)=0I(R^*;SI)=005 at I(R;SI)=0I(R^*;SI)=006 and I(R;SI)=0I(R^*;SI)=007 at I(R;SI)=0I(R^*;SI)=008; conventional MRC remains flat around I(R;SI)=0I(R^*;SI)=009; and mMRC-2 improves as the Rician factor strengthens, approaching I(R;SI)=0I(R^*;SI)=010 at I(R;SI)=0I(R^*;SI)=011 dB (Ciuonzo et al., 27 Feb 2026). The practical recommendation is a two-timescale design: slow RIS optimization from long-term statistics and fast FC-side linear combining.

The security counterpart appears in “Byzantine Attacks in RIS-Enhanced Cooperative Spectrum Sensing: A Decision Fusion Perspective,” where one primary user, I(R;SI)=0I(R^*;SI)=012 multi-antenna secondary users, one passive RIS with I(R;SI)=0I(R^*;SI)=013 reflecting elements, and I(R;SI)=0I(R^*;SI)=014 decode-and-forward relay hops feed local hard decisions to a fusion center (Zhang et al., 31 Oct 2025). Each secondary user performs energy detection with statistic

I(R;SI)=0I(R^*;SI)=015

and local false-alarm and detection probabilities determined by instantaneous SNR I(R;SI)=0I(R^*;SI)=016, which itself depends on the RIS phase matrix I(R;SI)=0I(R^*;SI)=017.
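
A bare-bones energy detector is sketched below. It uses the unnormalized sum of squared sample magnitudes and a hypothetical threshold; the paper's statistic may include normalization or antenna combining that is not reproduced here.

```python
def energy_statistic(samples):
    """Sum of squared magnitudes of the received complex samples."""
    return sum(abs(s) ** 2 for s in samples)


def local_decision(samples, threshold):
    """Hard decision: 1 if the primary user is declared present, else 0."""
    return 1 if energy_statistic(samples) > threshold else 0


# Illustrative sample vectors and threshold (not from the paper).
noise_only = [0.1 + 0.1j, -0.2j, 0.05, -0.1 + 0.1j]
with_signal = [1.0 + 0.5j, -0.9 + 0.2j, 1.1j, 0.8 - 0.7j]
threshold = 1.0

d_noise = local_decision(noise_only, threshold)
d_signal = local_decision(with_signal, threshold)
print(d_noise, d_signal)
```

In the RIS-assisted setting, the phase configuration changes the received signal power and hence how far the statistic sits from the threshold when the primary user is active.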

The fusion center constructs a channel- and attack-aware LLR by combining branchwise terms I(R;SI)=0I(R^*;SI)=018 and I(R;SI)=0I(R^*;SI)=019, where local reports may be flipped by Byzantine nodes with probabilities

I(R;SI)=0I(R^*;SI)=020

The striking result is that the optimal attack depends only on the Byzantine fraction I(R;SI)=0I(R^*;SI)=021, not on global instantaneous CSI. For I(R;SI)=0I(R^*;SI)=022, the optimum is Always-False with I(R;SI)=0I(R^*;SI)=023, hence I(R;SI)=0I(R^*;SI)=024. For I(R;SI)=0I(R^*;SI)=025, any strategy satisfying I(R;SI)=0I(R^*;SI)=026 blinds the fusion center exactly, producing I(R;SI)=0I(R^*;SI)=027 (Zhang et al., 31 Oct 2025). Under the large-scale random-flip attack, BER tends to I(R;SI)=0I(R^*;SI)=028, I(R;SI)=0I(R^*;SI)=029, and mutual information tends to zero for all SNRs and system sizes.
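
The blinding effect can be reproduced with the classical symmetric-flipping model, sketched below under stated assumptions: a fraction `alpha` of nodes is Byzantine, each flips its hard decision with the same probability under both hypotheses, and the fusion center sees a random mixture of honest and Byzantine reports. This is a textbook simplification, not the paper's attack model, and the local probabilities are made up.

```python
def effective_probs(p_d, p_f, alpha, flip_prob):
    """Detection and false-alarm probabilities of a report as seen by the fusion
    center when a fraction alpha of nodes flips decisions with probability flip_prob."""
    q = alpha * flip_prob  # chance that a randomly chosen report is flipped
    p_d_eff = (1.0 - q) * p_d + q * (1.0 - p_d)
    p_f_eff = (1.0 - q) * p_f + q * (1.0 - p_f)
    return p_d_eff, p_f_eff


honest = effective_probs(0.9, 0.1, alpha=0.0, flip_prob=1.0)
blinded = effective_probs(0.9, 0.1, alpha=0.5, flip_prob=1.0)
print(honest, blinded)
```

When `alpha * flip_prob` reaches one half, the effective detection and false-alarm probabilities coincide, so each report is statistically identical under both hypotheses regardless of how good the local detectors are.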

Taken together, these papers show two complementary points. RIS-assisted decision fusion can be substantially improved by large-array linear processing and long-term phase design (Ciuonzo et al., 27 Feb 2026), yet improved sensing and diversity do not, by themselves, immunize the fusion center against universal Byzantine strategies (Zhang et al., 31 Oct 2025).

6. Distributed RIS control and cooperative CSI fusion

In “RIS-Enabled Smart Wireless Environments: Fundamentals and Distributed Optimization,” RIS-FUSION is presented as a Hybrid Distributed-and-Fusion (HDF) machine-learning architecture for joint configuration of multiple Beyond-Diagonal RISs and multi-user precoding in a MISO downlink (Alexandropoulos et al., 21 Dec 2025). A single BS with I(R;SI)=0I(R^*;SI)=030 antennas serves I(R;SI)=0I(R^*;SI)=031 users within the area of influence of I(R;SI)=0I(R^*;SI)=032 BD-RISs, each with I(R;SI)=0I(R^*;SI)=033 tunable unit elements and a general linear transformation I(R;SI)=0I(R^*;SI)=034.

The optimization target is the instantaneous sum-rate

I(R;SI)=0I(R^*;SI)=035

subject to discrete feasible sets for RIS profiles and BS precoders. Because the discrete formulation is NP-hard, the paper replaces per-block optimization with a policy-learning problem and then factorizes it in HDF form: each RIS has a local mapping I(R;SI)=0I(R^*;SI)=036 based on local CSI plus direct BS–UE CSI, while the BS hosts a fusion mapping I(R;SI)=0I(R^*;SI)=037 to select final beamformers (Alexandropoulos et al., 21 Dec 2025).

Each RIS branch runs a Multi-Branch Attention Convolutional Neural Network with shared weights I(R;SI)=0I(R^*;SI)=038. Inputs are the real and imaginary parts of I(R;SI)=0I(R^*;SI)=039, I(R;SI)=0I(R^*;SI)=040, and I(R;SI)=0I(R^*;SI)=041. Self-attention extracts inter-element channel correlations; convolutional layers compress the representations; MLP-1 outputs discrete I(R;SI)=0I(R^*;SI)=042; MLP-2 outputs softmax scores over a precoder codebook for each user, yielding an index set I(R;SI)=0I(R^*;SI)=043. A small BS-side feed-forward network with parameters I(R;SI)=0I(R^*;SI)=044 merges I(R;SI)=0I(R^*;SI)=045 into I(R;SI)=0I(R^*;SI)=046, which maps to the final precoder I(R;SI)=0I(R^*;SI)=047 (Alexandropoulos et al., 21 Dec 2025).

Training uses a neuroevolutionary CoSyNE procedure on the average episode sum-rate I(R;SI)=0I(R^*;SI)=048, with selection, crossover, and mutation over a population of weight vectors for approximately 25 generations. At run time, inference is non-iterative: local CSI estimation, parallel local inference, few-bit index signaling to the BS, and BS-side fusion inference. The paper reports latency of a few milliseconds on a modest GPU, near-optimal sum-rate performance, and sum-rate loss I(R;SI)=0I(R^*;SI)=049 relative to exhaustive discrete search (Alexandropoulos et al., 21 Dec 2025).

A related cooperative fusion problem appears in “Deep Learning-based CSI Feedback for RIS-assisted Multi-user Systems,” where the supplied summary identifies the RIS-CoCsiNet framework as RIS-FUSION (Guo et al., 2020). Here the central idea is to exploit correlation among proximate user equipments by splitting angular-domain RIS–UE CSI magnitude into shared and user-specific parts,

I(R;SI)=0I(R^*;SI)=050

encoding each user locally, and reconstructing shared CSI and individual refinements at the BS through a shared decoder, individual decoders, and combination networks (Guo et al., 2020).

The framework also separates magnitude and phase feedback via two magnitude-dependent phase feedback strategies. MDPF-1 uses a magnitude-weighted phase loss

I(R;SI)=0I(R^*;SI)=051

while MDPF-2 augments the phase encoder input with instantaneous magnitude. For multi-antenna users, three stacked LSTM layers exploit antenna-wise correlation (Guo et al., 2020).

Reported gains are concrete. At I(R;SI)=0I(R^*;SI)=052, I(R;SI)=0I(R^*;SI)=053, and BPD I(R;SI)=0I(R^*;SI)=054, the LSTM decoder yields up to I(R;SI)=0I(R^*;SI)=055 dB NMSE gain over FC-only decoding. At BPD I(R;SI)=0I(R^*;SI)=056, MDPF-1 and MDPF-2 provide approximately I(R;SI)=0I(R^*;SI)=057 dB NMSE improvement over a uniform-loss phase baseline. At BPD I(R;SI)=0I(R^*;SI)=058, the cooperative scheme reduces NMSE by approximately I(R;SI)=0I(R^*;SI)=059–I(R;SI)=0I(R^*;SI)=060 dB versus a per-UE autoencoder, and at BPD I(R;SI)=0I(R^*;SI)=061 on QuaDRiGa channels, compressed sensing gives about I(R;SI)=0I(R^*;SI)=062 dB NMSE, a single autoencoder about I(R;SI)=0I(R^*;SI)=063 dB, and RIS-Fusion-2UE about I(R;SI)=0I(R^*;SI)=064 dB (Guo et al., 2020). The additional complexity is modest: total FLOPs are approximately I(R;SI)=0I(R^*;SI)=065 million, less than I(R;SI)=0I(R^*;SI)=066 above a single autoencoder, with the extra burden shifted to the BS.
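
The NMSE figures quoted in dB follow the usual definition, $10\log_{10}\big(\lVert H-\hat{H}\rVert^2 / \lVert H\rVert^2\big)$. A minimal sketch with a made-up channel vector and a 10 % amplitude error:

```python
import math


def nmse_db(h_true, h_est):
    """Normalized mean squared error between two complex vectors, in dB."""
    err = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    power = sum(abs(a) ** 2 for a in h_true)
    return 10.0 * math.log10(err / power)


h = [1 + 1j, -1 + 0.5j, 0.5 - 1j]  # illustrative channel coefficients
h_hat = [x * 1.1 for x in h]       # reconstruction with 10 % amplitude error
value = nmse_db(h, h_hat)
print(value)
```

A 10 % amplitude error corresponds to roughly -20 dB, so more negative values indicate a more accurate reconstruction.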

7. Adjacent uses in remote sensing and image fusion

The supplied literature also associates RIS-FUSION terminology with neighboring multimodal data-fusion problems. In “FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models,” the detailed summary states that FusionRS realizes the RIS-FUSION approach by building a large dual-modal dataset, training contrastive and generative models, evaluating retrieval and captioning, and ablating IR-aware captions (Han et al., 15 Jun 2026).

FusionRS contains 600,000 aligned RGB–IR–text triplets, split into 580,000 train, 10,000 validation, and 10,000 test samples. The infrared modality is synthetic: each I(R;SI)=0I(R^*;SI)=067 RGB tile is translated by DiffV2IR using the provided after_phase_2.ckpt checkpoint into a three-channel IR-style image. Conventional captions are taken from RS5M, SkyScript, NWPU, RSICD, and RSITMD after rule-based cleaning, while IR-aware captions are generated by Qwen2.5-VL-72B-Instruct and explicitly describe infrared cues. The IR-aware subset sizes are 49,068 train, 416 validation, and 10,000 test (Han et al., 15 Jun 2026).

The tri-modal CLIP objective averages RGB–text, IR–text, and RGB–IR InfoNCE losses,

I(R;SI)=0I(R^*;SI)=068

and the captioning model is trained with token-wise cross-entropy

I(R;SI)=0I(R^*;SI)=069
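
The averaging of three pairwise contrastive terms can be sketched as follows. The symmetric form of InfoNCE, the temperature of 0.07, and the toy two-dimensional embeddings are assumptions for illustration; only the idea of averaging the RGB–text, IR–text, and RGB–IR losses comes from the text above.

```python
import math


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings."""
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]  # the matching pair sits on the diagonal
        return total / len(rows)

    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))


def tri_modal_loss(rgb, ir, text):
    """Average of the RGB-text, IR-text, and RGB-IR contrastive losses."""
    return (info_nce(rgb, text) + info_nce(ir, text) + info_nce(rgb, ir)) / 3.0


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = tri_modal_loss(aligned, aligned, aligned)
loss_mismatch = tri_modal_loss(aligned, aligned, swapped)
print(loss_aligned, loss_mismatch)
```

The loss is near zero when all three modalities agree on which samples belong together and grows sharply when the text embeddings are paired with the wrong images.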

On the 10,000 test triplets with original captions, OpenAI CLIP ViT-L/14 achieves IRI(R;SI)=0I(R^*;SI)=070text I(R;SI)=0I(R^*;SI)=071, I(R;SI)=0I(R^*;SI)=072, I(R;SI)=0I(R^*;SI)=073, RGBI(R;SI)=0I(R^*;SI)=074IR I(R;SI)=0I(R^*;SI)=075, IRI(R;SI)=0I(R^*;SI)=076RGB I(R;SI)=0I(R^*;SI)=077, and mean recall approximately I(R;SI)=0I(R^*;SI)=078. After IR-aware fine-tuning in the C-mixed setting, ViT-L/14 improves to mean recall I(R;SI)=0I(R^*;SI)=079, while ViT-B/32 improves from I(R;SI)=0I(R^*;SI)=080. In a Qwen2.5-VL-7B ablation, B-iraware yields Cap.Auto I(R;SI)=0I(R^*;SI)=081, Cap.IR I(R;SI)=0I(R^*;SI)=082, Cap.R-L I(R;SI)=0I(R^*;SI)=083, and IR-cue QA I(R;SI)=0I(R^*;SI)=084, compared with A-original at I(R;SI)=0I(R^*;SI)=085, I(R;SI)=0I(R^*;SI)=086, I(R;SI)=0I(R^*;SI)=087, and I(R;SI)=0I(R^*;SI)=088, respectively (Han et al., 15 Jun 2026). The conclusion is explicit: modality-specific textual supervision is crucial for RGB-infrared remote sensing vision-language representation learning.

Another adjacent case is “SIRF: Simultaneous Image Registration and Fusion in A Unified Framework,” which the supplied summary presents as “RIS-FUSION (a.k.a. SIRF)” (Chen et al., 2014). SIRF solves a joint convex program

I(R;SI)=0I(R^*;SI)=089

where the regularizer is a dynamic gradient sparsity term

I(R;SI)=0I(R^*;SI)=090

The method alternates FISTA-based updates of the fused high-resolution multispectral image I(R;SI)=0I(R^*;SI)=091 and gradient-descent updates of the transform I(R;SI)=0I(R^*;SI)=092, achieving linear complexity in the output image size per iteration (Chen et al., 2014).

On 158 simulated patches from Quickbird, GeoEye, SPOT, and IKONOS, the supplied summary reports that SIRF reaches ERGAS I(R;SI)=0I(R^*;SI)=093, QAVE I(R;SI)=0I(R^*;SI)=094, RASE I(R;SI)=0I(R^*;SI)=095, SAM I(R;SI)=0I(R^*;SI)=096, PSNR I(R;SI)=0I(R^*;SI)=097, MSSIM I(R;SI)=0I(R^*;SI)=098, and RMSE I(R;SI)=0I(R^*;SI)=099, outperforming PCA, IHS, Wavelet, Brovey, P+XS, and AVWP (Chen et al., 2014). Runtime on a C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}00 image is reported as C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}01 s, compared with C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}02 s for AVWP and C={c1c2cL}C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}03 s for P+XS. This suggests that some summaries use RIS-FUSION as a broad descriptor for registration-and-fusion pipelines, even when the canonical paper title uses a different acronym.

Across these adjacent cases, the label marks structured multimodal fusion rather than a single established method family. The unifying pattern is architectural rather than terminological: heterogeneous modalities are aligned, selectively combined, and constrained by downstream objectives such as caption grounding, registration fidelity, or information leakage control.
