RIS-FUSION: Diverse Methods in Data Integration

Updated 4 July 2026

RIS-FUSION is a label applied to several distinct fusion approaches, including restricted-information sanitization, text-driven infrared-visible fusion, and wireless decision fusion.
What these approaches share is the structured integration of heterogeneous data sources under explicit constraints on utility, secrecy, semantics, or channel awareness.
Depending on the setting, they rely on techniques such as adversarial risk minimization, cascaded architectures with joint optimization, and covariance-aware inference.

RIS-FUSION is a recurrent but non-unified label in the technical literature. In the cited works, it denotes several distinct formulations: a sanitization framework for fusing classified information under zero inferential leakage constraints, a text-conditioned infrared-visible image fusion method supervised by referring image segmentation, and multiple wireless-communications schemes in which heterogeneous path, channel, or local-decision information is fused for localization, inference, control, or feedback recovery (Jändel et al., 2017, Ma et al., 16 Sep 2025, Zhang et al., 2023). The shared motif is not a single algorithmic lineage, but the structured composition of multiple sources under explicit constraints on utility, semantics, secrecy, or channel awareness.

1. Nomenclature and scope

A common misconception is that RIS-FUSION denotes a single method. In the cited literature, the label is attached to several technically unrelated frameworks.

Usage of RIS-FUSION	Core formulation	Representative source
Restricted-information fusion	$R^* = S \circ F(DB_1,\dots,DB_m)$ with $I(R^*;SI)=0$	(Jändel et al., 2017)
Referring-image-segmentation-driven IVIF	Cascaded text-driven infrared-visible fusion and RIS with LangGatedFusion	(Ma et al., 16 Sep 2025)
RIS-assisted wireless fusion	WLS pseudo-position fusion, large-array decision fusion, distributed RIS control, or CSI fusion	(Zhang et al., 2023)

The earliest formulation in the supplied corpus treats RIS-FUSION as “fusing restricted information,” where the central problem is that a conventional fusion result inherits the strictest input classification and therefore becomes difficult to disseminate (Jändel et al., 2017). A later computer-vision use recasts RIS-FUSION as “Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” where RIS stands for referring image segmentation rather than reconfigurable intelligent surfaces (Ma et al., 16 Sep 2025). In wireless papers, the same label is used in summaries to describe fusion of pathwise localization estimates, sensor decisions, CSI components, or local neural proposals at distributed RIS controllers (Zhang et al., 2023, Ciuonzo et al., 27 Feb 2026, Alexandropoulos et al., 21 Dec 2025, Guo et al., 2020).

This multiplicity matters because the acronym “RIS” changes meaning across subfields. In (Ma et al., 16 Sep 2025), RIS is referring image segmentation; in (Zhang et al., 2023), (Ciuonzo et al., 27 Feb 2026), and (Alexandropoulos et al., 21 Dec 2025), RIS is reconfigurable intelligent surface. This suggests that “RIS-FUSION” functions as a context-dependent label rather than a universally standardized term.

2. Restricted-information sanitization

In “Fusing restricted information,” the fusion problem is formalized over a totally ordered set of security labels $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$, with source databases $DB_1,\dots,DB_m$ whose records carry their own classification labels (Jändel et al., 2017). A conventional fusion operator

$F:(DB_1,\dots,DB_m)\mapsto R$

produces a fused result $R$ whose classification is

$\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$

When the fusion process is a classifier, the model $K$ may be exposed in white-box or black-box form.
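The inheritance rule above, in which a fused result takes the strictest classification among its inputs, can be sketched in a few lines. The label names and the helper `fused_classification` are hypothetical and chosen only for illustration; the source specifies only a totally ordered label set.

```python
from enum import IntEnum


class Label(IntEnum):
    """Hypothetical totally ordered security labels (c_1 < c_2 < ... < c_L)."""
    UNCLASSIFIED = 1
    RESTRICTED = 2
    SECRET = 3


def fused_classification(source_labels):
    """Return class(R), the maximum classification label among the inputs."""
    labels = list(source_labels)
    if not labels:
        raise ValueError("at least one source database is required")
    return max(labels)


# One SECRET input is enough to make the whole fused result SECRET.
result = fused_classification([Label.UNCLASSIFIED, Label.SECRET, Label.RESTRICTED])
print(result.name)
```

This is the dissemination problem the sanitization framework targets: a single highly classified source raises the label of the whole fused product.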

The objective of RIS-FUSION in this setting is to construct a sanitized result $R^*$ or classifier $K^*$ that can be disseminated at a lower classification level, while preventing inference of any sensitive input $SI$. The formulation is

$R^* = S \circ F(DB_1,\dots,DB_m)$

subject to the non-leakage constraint

$I(R^*;SI)=0.$

An equivalent adversarial formulation requires the risk of any strategy $I(R^*;SI)=0$ 4 using $I(R^*;SI)=0$ 5 to equal the risk of an oracle that never sees $I(R^*;SI)=0$ 6 (Jändel et al., 2017). The same paper also introduces the leakage bound

$I(R^*;SI)=0$ 7

with $I(R^*;SI)=0$ 8 for perfect secrecy, and notes that some extensions may impose a differential-privacy analogue on model parameters $I(R^*;SI)=0$ 9.
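For discrete variables, the non-leakage constraint $I(R^*;SI)=0$ can be checked numerically from a joint distribution. The following is a minimal sketch, assuming the joint probability mass function of the sanitized output and the sensitive input is known; the function name and the two toy distributions are illustrative and do not come from the cited paper.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:                     # 0 * log(0) is taken as 0
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi


# Sanitized output independent of the sensitive bit: zero leakage.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Output that copies the sensitive bit: one full bit of leakage.
copying = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(copying))
```

In practice the joint distribution is rarely available in closed form, which is one reason the framework argues for secrecy by construction instead of by measurement.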

The construction uses three abstract operators—build ( $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 0), filter ( $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 1), and fuse ( $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 2)—whose permitted orderings are generated by a context-sensitive grammar indexed by the sensitivity descriptor. Pipelines are then pruned by trust and competence constraints: the classifier producer is trusted to see $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 3, the end-user receives only sanitized $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 4, and a certifier reviews the pipeline (Jändel et al., 2017). The final selection rule is

$C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 5

The paper’s two-database SVM example illustrates the mechanism. First, SVMs $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 6 are trained on $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 7; next, a composite classifier $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 8 is built; finally, support vectors in the region where $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 9 predicts the sensitive label $DB_1,\dots,DB_m$ 0 are removed by moving them across the decision boundary until no support vector lies in that cell (Jändel et al., 2017). The theoretical claim is strong: by construction, the grammar forbids any pipeline that would expose correlated secret patterns, so $DB_1,\dots,DB_m$ 1; in the black-box setting, queries cannot increase mutual information about $DB_1,\dots,DB_m$ 2 beyond zero.

The practical constraints are equally explicit. The number of possible build/filter/fuse permutations grows combinatorially with $DB_1,\dots,DB_m$ 3; utility loss is inevitable if the sensitive region corresponds to high-information cells; the example covers only binary SVMs and pairwise sensitivity; and the framework assumes that the adversary has no side-channel beyond a black-box API and that the sensitivity descriptor correctly captures which correlations are secret (Jändel et al., 2017).

3. Referring-image-segmentation-driven infrared-visible fusion

In “RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” the key observation is that text-driven infrared and visible image fusion (IVIF) and referring image segmentation (RIS) share the same semantic goal: highlighting exactly the object described by a natural-language expression (Ma et al., 16 Sep 2025). RIS-FUSION is therefore cast as a two-stage cascaded architecture in which fusion and segmentation are jointly optimized.

In the first stage, a dual-stream pyramid encoder extracts visible features $DB_1,\dots,DB_m$ 4 from the luminance $DB_1,\dots,DB_m$ 5 channel and infrared features $DB_1,\dots,DB_m$ 6 at four spatial resolutions $DB_1,\dots,DB_m$ 7. The first two scales are fused by simple addition, while the deeper two scales pass through LangGatedFusion, which injects a text embedding $DB_1,\dots,DB_m$ 8 obtained by a frozen BERT. A U-Net decoder produces a fused luminance image $DB_1,\dots,DB_m$ 9, which is recombined with the original chroma channels $F:(DB_1,\dots,DB_m)\mapsto R$ 0 to synthesize the RGB fused image $F:(DB_1,\dots,DB_m)\mapsto R$ 1 (Ma et al., 16 Sep 2025).

LangGatedFusion consists of language-guided attention, a soft spatial gate, and FiLM modulation. At each deep scale, the context map is computed as

$F:(DB_1,\dots,DB_m)\mapsto R$ 2

from which the model predicts a gate $F:(DB_1,\dots,DB_m)\mapsto R$ 3 and FiLM parameters $F:(DB_1,\dots,DB_m)\mapsto R$ 4. The fused feature is

$F:(DB_1,\dots,DB_m)\mapsto R$ 5

This permits pixel-wise, text-conditioned control over the contribution of visible and infrared features (Ma et al., 16 Sep 2025).

In the second stage, $F:(DB_1,\dots,DB_m)\mapsto R$ 6 and the same text embedding are fed into a Swin-Transformer-based encoder-decoder. Text is injected via Language-Guided Attention in multiple encoder layers and concatenated in the decoder. The RIS head outputs a binary mask $F:(DB_1,\dots,DB_m)\mapsto R$ 7 corresponding to the referred object. Crucially, gradients are not detached at the fusion output; the segmentation loss back-propagates through the fusion backbone, so perceptual fusion quality and target-object retention are optimized jointly (Ma et al., 16 Sep 2025).

Training uses a Dice segmentation loss,

$F:(DB_1,\dots,DB_m)\mapsto R$ 8

a multi-term fusion reconstruction loss with weights $F:(DB_1,\dots,DB_m)\mapsto R$ 9, $R$ 0, $R$ 1, $R$ 2, and $R$ 3, and a total objective

$R$ 4

with $R$ 5 set via cross-validation (Ma et al., 16 Sep 2025).
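As a point of reference for the segmentation term, a common soft Dice loss can be written as follows. This is a sketch of the standard form, not necessarily the exact variant used in the paper; the smoothing constant `eps` and the flattened-list input format are assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss 1 - 2*sum(p*g) / (sum(p) + sum(g)) on flattened masks.

    pred holds predicted foreground probabilities in [0, 1];
    target holds the binary ground-truth mask.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


print(dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0]))  # perfect overlap
print(dice_loss([1.0, 0.0], [0, 1]))                  # disjoint masks
```

In the cascaded setting described above, this loss would be evaluated on the mask predicted from the fused image, so its gradient reaches the fusion stage as well.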

The benchmark introduced for this setting is MM-RIS, built on M$^3$FD and MSRS. It contains 12,500 training and 3,500 testing triplets, each consisting of an infrared-visible image pair, a fine-grained binary mask, and a referring expression. Expressions average 5–10 tokens and use color, shape, size, or spatial cues; annotations are pixel-wise masks rather than bounding boxes (Ma et al., 16 Sep 2025).

Reported performance is state of the art on MM-RIS. RIS-FUSION $R$ 7 achieves 53.00% mIoU, a 5.29-percentage-point absolute gain over the strongest baseline, OmniFuse+RIS $R$ 9, at 47.71%, with improvements of $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 0 to $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 1 in $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 2 thresholds. Ablation results attribute a $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 3 mIoU increase to joint optimization alone, and a further increase to 53.00 after adding LangGatedFusion (Ma et al., 16 Sep 2025). The paper also notes limitations: MM-RIS currently covers urban traffic scenes, the cascaded Swin-Transformer RIS head and BERT encoder add overhead, and future work is identified around broader domains, lightweight architectures, open-vocabulary expressions, and evaluation beyond mIoU.

4. RIS-aided positioning and pathwise estimate fusion

In “Approximate Maximum-Likelihood RIS-Aided Positioning,” RIS-FUSION denotes a fusion method for localization in which a user equipment position and clock bias are inferred from estimated channel parameters associated with a direct LOS path and $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 4 RIS-reflection paths (Zhang et al., 2023). The system model places the base station at $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 5, the user at unknown $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 6, and the $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 7-th RIS panel at $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 8, under an OFDM system with $\operatorname{class}(R)=\max\{\operatorname{class}(DB_i):i=1\dots m\}.$ 9 subcarriers and bandwidth $K$ 0.

The effective MIMO channel on subcarrier $K$ 1 is

$K$ 2

and the unknown parameter vector $K$ 3 includes path gains, delays, and angular parameters. Rather than solve the full nonconvex MLE jointly, the paper decomposes estimation into subproblems: BS-side AoD by subspace fitting, delays by MUSIC, UE-side AoA by analogous MUSIC, and path gains by least squares (Zhang et al., 2023).

Each path yields a pseudo-position estimate without clock-bias adjustment: $K$ 4 with covariance matrices derived from Fisher-information approximations. These pseudo-positions are fused by weighted least squares over the LOS and reflection paths, jointly estimating $K$ 5 and $K$ 6. For fixed $K$ 7, the estimator has the closed form

$K$ 8

followed by a one-dimensional search over $K$ 9 (Zhang et al., 2023).
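The covariance-weighted fusion of per-path estimates can be illustrated with a simplified sketch. It assumes the pseudo-positions are already bias-corrected and mutually independent, and it omits the one-dimensional clock-bias search; the function name and the numbers are illustrative and are not taken from the paper.

```python
import numpy as np


def fuse_estimates(positions, covariances):
    """Inverse-covariance (WLS) fusion of independent position estimates."""
    dim = len(positions[0])
    info = np.zeros((dim, dim))       # accumulated information matrix
    weighted = np.zeros(dim)          # accumulated information-weighted estimates
    for p, c in zip(positions, covariances):
        w = np.linalg.inv(np.asarray(c, dtype=float))
        info += w
        weighted += w @ np.asarray(p, dtype=float)
    fused_cov = np.linalg.inv(info)
    return fused_cov @ weighted, fused_cov


# One accurate path (small covariance) and one noisy path (large covariance).
positions = [[1.0, 2.0], [3.0, 2.0]]
covariances = [0.01 * np.eye(2), 1.0 * np.eye(2)]
fused, fused_cov = fuse_estimates(positions, covariances)
print(fused)      # pulled strongly toward the accurate path
print(fused_cov)  # tighter than either input covariance
```

The noisy path still contributes, but with a weight about one hundred times smaller, which is the downweighting behavior expected of a covariance-aware fusion step.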

The theoretical claim is asymptotic: by the Extended Invariance Principle, the WLS estimator is approximately equivalent to the MLE as the number of samples grows, provided pathwise estimates are approximately independent. The paper also derives the Fisher information matrix, the transformed FIM on $R^*$ 0, the position error bound

$R^*$ 1

and the clock-bias error bound

$R^*$ 2

RIS phase-shift design is then posed as an expected reflection-gain maximization over angular spans and solved approximately by SVD after discretization (Zhang et al., 2023).

Numerically, the proposed WLS fusion RMSE versus SNR nearly attains the PEB over $R^*$ 3– $R^*$ 4 dB. Increasing the number of samples $R^*$ 5, RIS elements $R^*$ 6, or RIS count $R^*$ 7 lowers RMSE, and the method is robust to actual $R^*$ 8, unlike naive bias-ignoring methods. The fusion step is explicitly covariance-aware: low-SNR paths are downweighted (Zhang et al., 2023).

5. Large-array decision fusion and adversarial spectrum sensing

A separate line of work treats RIS-FUSION as channel-aware decision aggregation at a fusion center. In “Massive MIMO Channel-aware Decision Fusion Aided by Reconfigurable Intelligent Surfaces,” $R^*$ 9 sensors transmit local BPSK-mapped binary decisions to an $K^*$ 0-antenna fusion center assisted by an $K^*$ 1-element RIS (Ciuonzo et al., 27 Feb 2026). The received vector is

$K^*$ 2

The NP-optimal LLR has complexity $K^*$ 3, so the paper derives three simpler large-array rules: MRC, modified MRC (mMRC-1 and mMRC-2), and ZFC.

The underlying analytical point is that conventional favorable propagation is altered by the RIS. Without RIS, $K^*$ 4 is diagonal, but with RIS,

$K^*$ 5

which is non-diagonal and random, so inter-sensor interference persists even as $K^*$ 6 (Ciuonzo et al., 27 Feb 2026). The modified rules therefore whiten or invert $K^*$ 7, and RIS phases are optimized from long-term channel statistics through a problem $K^*$ 8 solved by majorization–minimization.
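The difference between matched combining and a rule that inverts the channel Gram matrix can be seen in a small numerical sketch. It assumes a generic linear model in which the received vector is the channel matrix times the BPSK decision vector, with any RIS effect folded into the channel matrix `H`. The array sizes, the random channel, and the function names are hypothetical, and the paper's modified MRC rules are not reproduced here.

```python
import numpy as np


def mrc_statistic(H, y):
    """MRC-style rule: sum over sensors of Re(h_k^H y)."""
    return float(np.real(H.conj().T @ y).sum())


def zf_statistic(H, y):
    """ZF-style rule: invert H^H H before summing the per-sensor outputs."""
    gram = H.conj().T @ H
    return float(np.real(np.linalg.solve(gram, H.conj().T @ y)).sum())


rng = np.random.default_rng(0)
n_antennas, n_sensors = 8, 3          # hypothetical sizes
H = (rng.standard_normal((n_antennas, n_sensors))
     + 1j * rng.standard_normal((n_antennas, n_sensors))) / np.sqrt(2)
x = np.array([1.0, -1.0, 1.0])        # BPSK-mapped local decisions
y = H @ x                             # noise-free received vector

print("MRC:", mrc_statistic(H, y))    # contaminated by inter-sensor terms
print("ZF :", zf_statistic(H, y))     # recovers the sum of decisions without noise
```

When the Gram matrix is not close to diagonal, the MRC output mixes the sensors' contributions, which is the motivation for the whitening and inversion steps mentioned above.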

Simulation results show clear separation between the rules. For $K^*$ 9 sensors and $I(R^*;SI)=0$ 00 RIS elements, mMRC-1+RIS-LTD reaches $I(R^*;SI)=0$ 01 at $I(R^*;SI)=0$ 02 and $I(R^*;SI)=0$ 03 at $I(R^*;SI)=0$ 04; ZFC+RIS-LTD yields $I(R^*;SI)=0$ 05 at $I(R^*;SI)=0$ 06 and $I(R^*;SI)=0$ 07 at $I(R^*;SI)=0$ 08; conventional MRC remains flat around $I(R^*;SI)=0$ 09; and mMRC-2 improves as the Rician factor strengthens, approaching $I(R^*;SI)=0$ 10 at $I(R^*;SI)=0$ 11 dB (Ciuonzo et al., 27 Feb 2026). The practical recommendation is a two-timescale design: slow RIS optimization from long-term statistics and fast FC-side linear combining.

The security counterpart appears in “Byzantine Attacks in RIS-Enhanced Cooperative Spectrum Sensing: A Decision Fusion Perspective,” where one primary user, $I(R^*;SI)=0$ 12 multi-antenna secondary users, one passive RIS with $I(R^*;SI)=0$ 13 reflecting elements, and $I(R^*;SI)=0$ 14 decode-and-forward relay hops feed local hard decisions to a fusion center (Zhang et al., 31 Oct 2025). Each secondary user performs energy detection with statistic

$I(R^*;SI)=0$ 15

and local false-alarm and detection probabilities determined by instantaneous SNR $I(R^*;SI)=0$ 16, which itself depends on the RIS phase matrix $I(R^*;SI)=0$ 17.
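A textbook energy detector gives a concrete picture of the local sensing step. This sketch assumes an unnormalized sum of squared magnitudes and a fixed threshold; the paper's statistic may include normalization by noise variance or sample count, and the helper names are illustrative.

```python
def energy_statistic(samples):
    """Sum of squared magnitudes of the received (possibly complex) samples."""
    return sum(abs(s) ** 2 for s in samples)


def local_decision(samples, threshold):
    """Hard decision: 1 if the primary user is declared present, else 0."""
    return 1 if energy_statistic(samples) > threshold else 0


received = [1 + 1j, 2.0, -1j]            # toy received samples
print(energy_statistic(received))         # approximately 7
print(local_decision(received, threshold=5.0))
```

The hard decisions produced this way are what the relays forward to the fusion center, and what a Byzantine node may flip.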

The fusion center constructs a channel- and attack-aware LLR by combining branchwise terms $I(R^*;SI)=0$ 18 and $I(R^*;SI)=0$ 19, where local reports may be flipped by Byzantine nodes with probabilities

$I(R^*;SI)=0$ 20

The striking result is that the optimal attack depends only on the Byzantine fraction $I(R^*;SI)=0$ 21, not on global instantaneous CSI. For $I(R^*;SI)=0$ 22, the optimum is Always-False with $I(R^*;SI)=0$ 23, hence $I(R^*;SI)=0$ 24. For $I(R^*;SI)=0$ 25, any strategy satisfying $I(R^*;SI)=0$ 26 blinds the fusion center exactly, producing $I(R^*;SI)=0$ 27 (Zhang et al., 31 Oct 2025). Under the large-scale random-flip attack, BER tends to $I(R^*;SI)=0$ 28, $I(R^*;SI)=0$ 29, and mutual information tends to zero for all SNRs and system sizes.
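The blinding effect can be reproduced in a deliberately simplified model. The sketch assumes identical sensors, a Byzantine fraction `alpha`, and a symmetric flip probability `p_flip`; these names and the symmetric-flip assumption are illustrative, and the paper's exact attack condition is not reproduced. Under these assumptions, once the overall flip probability `alpha * p_flip` reaches one half, a received report is equally likely to be 1 under both hypotheses.

```python
def effective_report_probability(p_one, alpha, p_flip):
    """P(received report = 1) when a fraction alpha of nodes flips with prob p_flip."""
    q = alpha * p_flip                      # overall probability that a report is flipped
    return p_one * (1.0 - q) + (1.0 - p_one) * q


p_detect, p_false_alarm = 0.9, 0.1          # hypothetical local sensing quality

for alpha in (0.0, 0.25, 0.5):
    pd_eff = effective_report_probability(p_detect, alpha, p_flip=1.0)
    pf_eff = effective_report_probability(p_false_alarm, alpha, p_flip=1.0)
    print(alpha, round(pd_eff, 3), round(pf_eff, 3))
```

At `alpha = 0.5` with always-flipping attackers, the two effective probabilities coincide at 0.5, so the reports carry no information about the hypothesis regardless of how good the honest sensors are.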

Taken together, these papers show two complementary points. RIS-assisted decision fusion can be substantially improved by large-array linear processing and long-term phase design (Ciuonzo et al., 27 Feb 2026), yet improved sensing and diversity do not, by themselves, immunize the fusion center against universal Byzantine strategies (Zhang et al., 31 Oct 2025).

6. Distributed RIS control and cooperative CSI fusion

In “RIS-Enabled Smart Wireless Environments: Fundamentals and Distributed Optimization,” RIS-FUSION is presented as a Hybrid Distributed-and-Fusion (HDF) machine-learning architecture for joint configuration of multiple Beyond-Diagonal RISs and multi-user precoding in a MISO downlink (Alexandropoulos et al., 21 Dec 2025). A single BS with $I(R^*;SI)=0$ 30 antennas serves $I(R^*;SI)=0$ 31 users within the area of influence of $I(R^*;SI)=0$ 32 BD-RISs, each with $I(R^*;SI)=0$ 33 tunable unit elements and a general linear transformation $I(R^*;SI)=0$ 34.

The optimization target is the instantaneous sum-rate

$I(R^*;SI)=0$ 35

subject to discrete feasible sets for RIS profiles and BS precoders. Because the discrete formulation is NP-hard, the paper replaces per-block optimization with a policy-learning problem and then factorizes it in HDF form: each RIS has a local mapping $I(R^*;SI)=0$ 36 based on local CSI plus direct BS–UE CSI, while the BS hosts a fusion mapping $I(R^*;SI)=0$ 37 to select final beamformers (Alexandropoulos et al., 21 Dec 2025).
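The sum-rate objective has the familiar form of a sum over users of $\log_2(1+\mathrm{SINR}_k)$. A minimal sketch follows, assuming each user's effective channel vector already includes the RIS-reflected paths and that the received signal for user $k$ from precoder $v_j$ is $h_k^H v_j$; the function name and the toy values are illustrative.

```python
import math


def sum_rate(channels, precoders, noise_power):
    """Sum over users of log2(1 + SINR_k) for a MISO downlink."""
    total = 0.0
    for k, h in enumerate(channels):
        # Power received by user k from every user's precoder.
        gains = [abs(sum(hc.conjugate() * w for hc, w in zip(h, v))) ** 2
                 for v in precoders]
        signal = gains[k]
        interference = sum(gains) - signal
        total += math.log2(1.0 + signal / (interference + noise_power))
    return total


# Two users with orthogonal effective channels and matched precoders.
channels = [[1 + 0j, 0j], [0j, 1 + 0j]]
precoders = [[1 + 0j, 0j], [0j, 1 + 0j]]
print(sum_rate(channels, precoders, noise_power=1.0))
```

With discrete RIS profiles and a precoder codebook, evaluating this objective for every candidate combination is what makes exhaustive search expensive and motivates the learned policy.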

Each RIS branch runs a Multi-Branch Attention Convolutional Neural Network with shared weights $I(R^*;SI)=0$ 38. Inputs are the real and imaginary parts of $I(R^*;SI)=0$ 39, $I(R^*;SI)=0$ 40, and $I(R^*;SI)=0$ 41. Self-attention extracts inter-element channel correlations; convolutional layers compress the representations; MLP-1 outputs discrete $I(R^*;SI)=0$ 42; MLP-2 outputs softmax scores over a precoder codebook for each user, yielding an index set $I(R^*;SI)=0$ 43. A small BS-side feed-forward network with parameters $I(R^*;SI)=0$ 44 merges $I(R^*;SI)=0$ 45 into $I(R^*;SI)=0$ 46, which maps to the final precoder $I(R^*;SI)=0$ 47 (Alexandropoulos et al., 21 Dec 2025).

Training uses a neuroevolutionary CoSyNE procedure on the average episode sum-rate $I(R^*;SI)=0$ 48, with selection, crossover, and mutation over a population of weight vectors for approximately 25 generations. At run time, inference is non-iterative: local CSI estimation, parallel local inference, few-bit index signaling to the BS, and BS-side fusion inference. The paper reports latency of a few milliseconds on a modest GPU, near-optimal sum-rate performance, and sum-rate loss $I(R^*;SI)=0$ 49 relative to exhaustive discrete search (Alexandropoulos et al., 21 Dec 2025).

A related cooperative fusion problem appears in “Deep Learning-based CSI Feedback for RIS-assisted Multi-user Systems,” where the supplied summary identifies the RIS-CoCsiNet framework as RIS-FUSION (Guo et al., 2020). Here the central idea is to exploit correlation among proximate user equipments by splitting angular-domain RIS–UE CSI magnitude into shared and user-specific parts,

$I(R^*;SI)=0$ 50

encoding each user locally, and reconstructing shared CSI and individual refinements at the BS through a shared decoder, individual decoders, and combination networks (Guo et al., 2020).

The framework also separates magnitude and phase feedback via two magnitude-dependent phase feedback strategies. MDPF-1 uses a magnitude-weighted phase loss

$I(R^*;SI)=0$ 51

while MDPF-2 augments the phase encoder input with instantaneous magnitude. For multi-antenna users, three stacked LSTM layers exploit antenna-wise correlation (Guo et al., 2020).

Reported gains are concrete. At $I(R^*;SI)=0$ 52, $I(R^*;SI)=0$ 53, and BPD $I(R^*;SI)=0$ 54, the LSTM decoder yields up to $I(R^*;SI)=0$ 55 dB NMSE gain over FC-only decoding. At BPD $I(R^*;SI)=0$ 56, MDPF-1 and MDPF-2 provide approximately $I(R^*;SI)=0$ 57 dB NMSE improvement over a uniform-loss phase baseline. At BPD $I(R^*;SI)=0$ 58, the cooperative scheme reduces NMSE by approximately $I(R^*;SI)=0$ 59– $I(R^*;SI)=0$ 60 dB versus a per-UE autoencoder, and at BPD $I(R^*;SI)=0$ 61 on QuaDRiGa channels, compressed sensing gives about $I(R^*;SI)=0$ 62 dB NMSE, a single autoencoder about $I(R^*;SI)=0$ 63 dB, and RIS-Fusion-2UE about $I(R^*;SI)=0$ 64 dB (Guo et al., 2020). The additional complexity is modest: total FLOPs are approximately $I(R^*;SI)=0$ 65 million, less than $I(R^*;SI)=0$ 66 above a single autoencoder, with the extra burden shifted to the BS.
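The NMSE figures above are expressed in decibels. For reference, the usual definition can be sketched as follows; the helper name is illustrative, and the exact averaging over samples used in the paper is not specified here.

```python
import math


def nmse_db(true, estimate):
    """Normalized mean-squared error, 10*log10(||h - h_hat||^2 / ||h||^2), in dB."""
    error = sum(abs(t - e) ** 2 for t, e in zip(true, estimate))
    reference = sum(abs(t) ** 2 for t in true)
    if reference == 0.0:
        raise ValueError("reference signal has zero energy")
    if error == 0.0:
        return float("-inf")              # perfect reconstruction
    return 10.0 * math.log10(error / reference)


true_csi = [1.0, 1.0, 1.0, 1.0]
reconstructed = [1.1, 0.9, 1.1, 0.9]      # 10% amplitude error on every entry
print(nmse_db(true_csi, reconstructed))   # about -20 dB
```

On this scale, lower (more negative) values indicate more accurate CSI reconstruction.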

7. Adjacent uses in remote sensing and image fusion

The supplied literature also associates RIS-FUSION terminology with neighboring multimodal data-fusion problems. In “FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models,” the detailed summary states that FusionRS realizes the RIS-FUSION approach by building a large dual-modal dataset, training contrastive and generative models, evaluating retrieval and captioning, and ablating IR-aware captions (Han et al., 15 Jun 2026).

FusionRS contains 600,000 aligned RGB–IR–text triplets, split into 580,000 train, 10,000 validation, and 10,000 test samples. The infrared modality is synthetic: each $I(R^*;SI)=0$ 67 RGB tile is translated by DiffV2IR using the provided after_phase_2.ckpt checkpoint into a three-channel IR-style image. Conventional captions are taken from RS5M, SkyScript, NWPU, RSICD, and RSITMD after rule-based cleaning, while IR-aware captions are generated by Qwen2.5-VL-72B-Instruct and explicitly describe infrared cues. The IR-aware subset sizes are 49,068 train, 416 validation, and 10,000 test (Han et al., 15 Jun 2026).

The tri-modal CLIP objective averages RGB–text, IR–text, and RGB–IR InfoNCE losses,

$I(R^*;SI)=0$ 68

and the captioning model is trained with token-wise cross-entropy

$I(R^*;SI)=0$ 69
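The tri-modal contrastive objective, an average of three pairwise InfoNCE terms, can be sketched as follows. The symmetric form of each term, the temperature value, and the function names are assumptions made for illustration; the source states only that RGB–text, IR–text, and RGB–IR InfoNCE losses are averaged.

```python
import numpy as np


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between row-aligned embedding batches a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                 # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                   # matched pairs lie on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def tri_modal_loss(rgb, ir, text, temperature=0.07):
    """Average of RGB-text, IR-text, and RGB-IR InfoNCE terms."""
    return (info_nce(rgb, text, temperature)
            + info_nce(ir, text, temperature)
            + info_nce(rgb, ir, temperature)) / 3.0


aligned = np.eye(3)                        # toy embeddings, perfectly aligned triplets
print(tri_modal_loss(aligned, aligned, aligned))
print(tri_modal_loss(aligned, aligned, aligned[::-1]))  # misaligned captions
```

Misaligning one modality raises only the two terms that involve it, which shows how each pairwise term contributes separately to the average.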

On the 10,000 test triplets with original captions, OpenAI CLIP ViT-L/14 achieves IR $I(R^*;SI)=0$ 70text $I(R^*;SI)=0$ 71, $I(R^*;SI)=0$ 72, $I(R^*;SI)=0$ 73, RGB $I(R^*;SI)=0$ 74IR $I(R^*;SI)=0$ 75, IR $I(R^*;SI)=0$ 76RGB $I(R^*;SI)=0$ 77, and mean recall approximately $I(R^*;SI)=0$ 78. After IR-aware fine-tuning in the C-mixed setting, ViT-L/14 improves to mean recall $I(R^*;SI)=0$ 79, while ViT-B/32 improves from $I(R^*;SI)=0$ 80. In a Qwen2.5-VL-7B ablation, B-iraware yields Cap.Auto $I(R^*;SI)=0$ 81, Cap.IR $I(R^*;SI)=0$ 82, Cap.R-L $I(R^*;SI)=0$ 83, and IR-cue QA $I(R^*;SI)=0$ 84, compared with A-original at $I(R^*;SI)=0$ 85, $I(R^*;SI)=0$ 86, $I(R^*;SI)=0$ 87, and $I(R^*;SI)=0$ 88, respectively (Han et al., 15 Jun 2026). The conclusion is explicit: modality-specific textual supervision is crucial for RGB-infrared remote sensing vision-language representation learning.

Another adjacent case is “SIRF: Simultaneous Image Registration and Fusion in A Unified Framework,” which the supplied summary presents as “RIS-FUSION (a.k.a. SIRF)” (Chen et al., 2014). SIRF solves a joint convex program

$I(R^*;SI)=0$ 89

where the regularizer is a dynamic gradient sparsity term

$I(R^*;SI)=0$ 90

The method alternates FISTA-based updates of the fused high-resolution multispectral image $I(R^*;SI)=0$ 91 and gradient-descent updates of the transform $I(R^*;SI)=0$ 92, achieving linear complexity in the output image size per iteration (Chen et al., 2014).

On 158 simulated patches from Quickbird, GeoEye, SPOT, and IKONOS, the supplied summary reports that SIRF reaches ERGAS $I(R^*;SI)=0$ 93, QAVE $I(R^*;SI)=0$ 94, RASE $I(R^*;SI)=0$ 95, SAM $I(R^*;SI)=0$ 96, PSNR $I(R^*;SI)=0$ 97, MSSIM $I(R^*;SI)=0$ 98, and RMSE $I(R^*;SI)=0$ 99, outperforming PCA, IHS, Wavelet, Brovey, P+XS, and AVWP (Chen et al., 2014). Runtime on a $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 00 image is reported as $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 01 s, compared with $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 02 s for AVWP and $C=\{c_1 \prec c_2 \prec \cdots \prec c_L\}$ 03 s for P+XS. This suggests that some summaries use RIS-FUSION as a broad descriptor for registration-and-fusion pipelines, even when the canonical paper title uses a different acronym.

Across these adjacent cases, the label marks structured multimodal fusion rather than a single established method family. The unifying pattern is architectural rather than terminological: heterogeneous modalities are aligned, selectively combined, and constrained by downstream objectives such as caption grounding, registration fidelity, or information leakage control.