RIS-FUSION: Diverse Methods in Data Integration
- RIS-FUSION is a label attached to several distinct fusion approaches, including restricted-information sanitization, text-driven infrared-visible fusion, and wireless decision fusion.
- These approaches share a common structure: heterogeneous data sources are integrated under explicit constraints on utility, secrecy, semantics, or channel awareness.
- Depending on the setting, the methods rely on adversarial risk minimization, cascaded architectures with joint optimization, or covariance-aware inference.
RIS-FUSION is a recurrent but non-unified label in the technical literature. In the cited works, it denotes several distinct formulations: a sanitization framework for fusing classified information under zero inferential leakage constraints, a text-conditioned infrared-visible image fusion method supervised by referring image segmentation, and multiple wireless-communications schemes in which heterogeneous path, channel, or local-decision information is fused for localization, inference, control, or feedback recovery (Jändel et al., 2017, Ma et al., 16 Sep 2025, Zhang et al., 2023). The shared motif is not a single algorithmic lineage, but the structured composition of multiple sources under explicit constraints on utility, semantics, secrecy, or channel awareness.
1. Nomenclature and scope
A common misconception is that RIS-FUSION denotes a single method. In the cited literature, the label is attached to several technically unrelated frameworks.
| Usage of RIS-FUSION | Core formulation | Representative source |
|---|---|---|
| Restricted-information fusion | Sanitized fusion of classified sources with zero inferential leakage | (Jändel et al., 2017) |
| Referring-image-segmentation-driven IVIF | Cascaded text-driven infrared-visible fusion and RIS with LangGatedFusion | (Ma et al., 16 Sep 2025) |
| RIS-assisted wireless fusion | WLS pseudo-position fusion, large-array decision fusion, distributed RIS control, or CSI fusion | (Zhang et al., 2023) |
The earliest formulation in the supplied corpus treats RIS-FUSION as “fusing restricted information,” where the central problem is that a conventional fusion result inherits the strictest input classification and therefore becomes difficult to disseminate (Jändel et al., 2017). A later computer-vision use recasts RIS-FUSION as “Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” where RIS stands for referring image segmentation rather than reconfigurable intelligent surfaces (Ma et al., 16 Sep 2025). In wireless papers, the same label is used in summaries to describe fusion of pathwise localization estimates, sensor decisions, CSI components, or local neural proposals at distributed RIS controllers (Zhang et al., 2023, Ciuonzo et al., 27 Feb 2026, Alexandropoulos et al., 21 Dec 2025, Guo et al., 2020).
This multiplicity matters because the acronym “RIS” changes meaning across subfields. In (Ma et al., 16 Sep 2025), RIS is referring image segmentation; in (Zhang et al., 2023), (Ciuonzo et al., 27 Feb 2026), and (Alexandropoulos et al., 21 Dec 2025), RIS is reconfigurable intelligent surface. This suggests that “RIS-FUSION” functions as a context-dependent label rather than a universally standardized term.
2. Restricted-information sanitization
In “Fusing restricted information,” the fusion problem is formalized over a totally ordered set of security labels, with source databases whose records carry their own classification labels (Jändel et al., 2017). A conventional fusion operator produces a fused result whose classification is the strictest, that is the maximum, of the input classifications. When the fusion process is a classifier, the model may be exposed in white-box or black-box form.
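The label-inheritance rule for conventional fusion can be illustrated with a short sketch. The label names and the `fused_classification` helper are hypothetical and chosen only for illustration; the one property taken from the text is that the labels are totally ordered and the fused result takes the strictest input label.

```python
from enum import IntEnum


class Label(IntEnum):
    # Hypothetical totally ordered security labels (names are illustrative).
    UNCLASSIFIED = 0
    RESTRICTED = 1
    SECRET = 2


def fused_classification(input_labels):
    """Conventional fusion: the result inherits the strictest input label."""
    labels = list(input_labels)
    if not labels:
        raise ValueError("at least one input label is required")
    return max(labels)


result = fused_classification(
    [Label.UNCLASSIFIED, Label.SECRET, Label.RESTRICTED]
)
```

A single highly classified input is enough to raise the label of the whole fused product, which is the dissemination problem that sanitization is meant to address.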
The objective of RIS-FUSION in this setting is to construct a sanitized result or classifier that can be disseminated at a lower level 0, while preventing inference of any sensitive input 1. The formulation is
2
subject to the non-leakage constraint
3
An equivalent adversarial formulation requires the risk of any strategy 4 using 5 to equal the risk of an oracle that never sees 6 (Jändel et al., 2017). The same paper also introduces the leakage bound
7
with 8 for perfect secrecy, and notes that some extensions may impose a differential-privacy analogue on model parameters 9.
The construction uses three abstract operators—build (0), filter (1), and fuse (2)—whose permitted orderings are generated by a context-sensitive grammar indexed by the sensitivity descriptor. Pipelines are then pruned by trust and competence constraints: the classifier producer is trusted to see 3, the end-user receives only sanitized 4, and a certifier reviews the pipeline (Jändel et al., 2017). The final selection rule is
5
The paper’s two-database SVM example illustrates the mechanism. First, SVMs 6 are trained on 7; next, a composite classifier 8 is built; finally, support vectors in the region where 9 predicts the sensitive label 0 are removed by moving them across the decision boundary until no support vector lies in that cell (Jändel et al., 2017). The theoretical claim is strong: by construction, the grammar forbids any pipeline that would expose correlated secret patterns, so 1; in the black-box setting, queries cannot increase mutual information about 2 beyond zero.
The practical constraints are equally explicit. The number of possible build/filter/fuse permutations grows combinatorially with 3; utility loss is inevitable if the sensitive region corresponds to high-information cells; the example covers only binary SVMs and pairwise sensitivity; and the framework assumes that the adversary has no side-channel beyond a black-box API and that the sensitivity descriptor correctly captures which correlations are secret (Jändel et al., 2017).
3. Referring-image-segmentation-driven infrared-visible fusion
In “RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation,” the key observation is that text-driven infrared and visible image fusion (IVIF) and referring image segmentation (RIS) share the same semantic goal: highlighting exactly the object described by a natural-language expression (Ma et al., 16 Sep 2025). RIS-FUSION is therefore cast as a two-stage cascaded architecture in which fusion and segmentation are jointly optimized.
In the first stage, a dual-stream pyramid encoder extracts visible features 4 from the luminance 5 channel and infrared features 6 at four spatial resolutions 7. The first two scales are fused by simple addition, while the deeper two scales pass through LangGatedFusion, which injects a text embedding 8 obtained by a frozen BERT. A U-Net decoder produces a fused luminance image 9, which is recombined with the original chroma channels 0 to synthesize the RGB fused image 1 (Ma et al., 16 Sep 2025).
LangGatedFusion consists of language-guided attention, a soft spatial gate, and FiLM modulation. At each deep scale, the context map is computed as
2
from which the model predicts a gate 3 and FiLM parameters 4. The fused feature is
5
This permits pixel-wise, text-conditioned control over the contribution of visible and infrared features (Ma et al., 16 Sep 2025).
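The combination of a soft spatial gate and FiLM modulation can be sketched in NumPy. This is a minimal illustration under assumptions, not the paper's exact layer: the convex blend `g * f_vis + (1 - g) * f_ir`, the tensor shapes, and the function name `gated_film_fusion` are all assumed here, and the gate logits and FiLM parameters are taken as given inputs rather than predicted from a text-conditioned context map.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_film_fusion(f_vis, f_ir, gate_logits, gamma, beta):
    """Blend two feature maps with a soft spatial gate, then apply FiLM.

    f_vis, f_ir : (C, H, W) visible and infrared features
    gate_logits : (1, H, W) per-pixel logits (assumed text-conditioned)
    gamma, beta : (C, 1, 1) FiLM scale and shift (assumed text-conditioned)
    """
    g = sigmoid(gate_logits)                # gate value in (0, 1) per pixel
    blended = g * f_vis + (1.0 - g) * f_ir  # pixel-wise convex combination
    return gamma * blended + beta           # channel-wise affine modulation


rng = np.random.default_rng(0)
f_vis = rng.standard_normal((8, 4, 4))
f_ir = rng.standard_normal((8, 4, 4))
# Saturated gate, identity FiLM: the output should reduce to the visible stream.
out = gated_film_fusion(
    f_vis, f_ir,
    gate_logits=np.full((1, 4, 4), 50.0),
    gamma=np.ones((8, 1, 1)),
    beta=np.zeros((8, 1, 1)),
)
```

The gate decides, per pixel, how much each modality contributes, while the FiLM parameters rescale and shift whole channels.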
In the second stage, 6 and the same text embedding are fed into a Swin-Transformer-based encoder-decoder. Text is injected via Language-Guided Attention in multiple encoder layers and concatenated in the decoder. The RIS head outputs a binary mask 7 corresponding to the referred object. Crucially, gradients are not detached at the fusion output; the segmentation loss back-propagates through the fusion backbone, so perceptual fusion quality and target-object retention are optimized jointly (Ma et al., 16 Sep 2025).
Training uses a Dice segmentation loss, a multi-term fusion reconstruction loss with five weighting coefficients, and a total objective that combines the segmentation and fusion terms through a trade-off weight set via cross-validation (Ma et al., 16 Sep 2025).
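For reference, the standard soft Dice loss can be written in a few lines of NumPy. This is the textbook form, assumed here for illustration; the smoothing constant `eps` and the function name are not taken from the paper, which may use a different variant.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|), with a small smoothing term."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)


perfect = dice_loss([1, 1, 0, 0], [1, 1, 0, 0])   # full overlap
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])  # no overlap
```

The loss is close to 0 for a perfect mask and close to 1 when the prediction and the ground truth do not overlap.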
The benchmark introduced for this setting is MM-RIS, built on M3FD and MSRS. It contains 12,500 training and 3,500 testing triplets, each consisting of an infrared-visible image pair, a fine-grained binary mask, and a referring expression. Expressions average 5–10 tokens and use color, shape, size, or spatial cues; annotations are pixel-wise masks rather than bounding boxes (Ma et al., 16 Sep 2025).
Reported performance is state of the art on MM-RIS. RIS-FUSION7 achieves 53.00% mIoU, a 5.29-point absolute gain over the strongest baseline, OmniFuse+RIS9, at 47.71%, with improvements of 0 to 1 in 2 thresholds. Ablation results attribute a 3 mIoU increase to joint optimization alone, and a further increase to 53.00 after adding LangGatedFusion (Ma et al., 16 Sep 2025). The paper also notes limitations: MM-RIS currently covers urban traffic scenes, the cascaded Swin-Transformer RIS head and BERT encoder add overhead, and future work is identified around broader domains, lightweight architectures, open-vocabulary expressions, and evaluation beyond mIoU.
4. RIS-aided positioning and pathwise estimate fusion
In “Approximate Maximum-Likelihood RIS-Aided Positioning,” RIS-FUSION denotes a fusion method for localization in which a user equipment position and clock bias are inferred from estimated channel parameters associated with a direct LOS path and 4 RIS-reflection paths (Zhang et al., 2023). The system model places the base station at 5, the user at unknown 6, and the 7-th RIS panel at 8, under an OFDM system with 9 subcarriers and bandwidth 0.
The effective MIMO channel on subcarrier 1 is
2
and the unknown parameter vector 3 includes path gains, delays, and angular parameters. Rather than solve the full nonconvex MLE jointly, the paper decomposes estimation into subproblems: BS-side AoD by subspace fitting, delays by MUSIC, UE-side AoA by analogous MUSIC, and path gains by least squares (Zhang et al., 2023).
Each path yields a pseudo-position estimate without clock-bias adjustment: 4 with covariance matrices derived from Fisher-information approximations. These pseudo-positions are fused by weighted least squares over the LOS and reflection paths, jointly estimating 5 and 6. For fixed 7, the estimator has the closed form
8
followed by a one-dimensional search over 9 (Zhang et al., 2023).
The theoretical claim is asymptotic: by the Extended Invariance Principle, the WLS estimator is approximately equivalent to the MLE as the number of samples grows, provided pathwise estimates are approximately independent. The paper also derives the Fisher information matrix, the transformed FIM on 0, the position error bound
1
and the clock-bias error bound
2
RIS phase-shift design is then posed as an expected reflection-gain maximization over angular spans and solved approximately by SVD after discretization (Zhang et al., 2023).
Numerically, the proposed WLS fusion RMSE versus SNR nearly attains the PEB over 3–4 dB. Increasing the number of samples 5, RIS elements 6, or RIS count 7 lowers RMSE, and the method is robust to actual 8, unlike naive bias-ignoring methods. The fusion step is explicitly covariance-aware: low-SNR paths are downweighted (Zhang et al., 2023).
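The covariance-aware weighting can be illustrated with a generic inverse-covariance fusion of independent estimates. This is a simplified sketch: it fuses direct estimates of the same position and deliberately omits the clock-bias term and the one-dimensional search described above. The function name and the toy numbers are assumptions.

```python
import numpy as np


def fuse_estimates(estimates, covariances):
    """Inverse-covariance (weighted least squares) fusion of independent estimates."""
    dim = len(estimates[0])
    info_matrix = np.zeros((dim, dim))
    info_vector = np.zeros(dim)
    for x, cov in zip(estimates, covariances):
        weight = np.linalg.inv(np.asarray(cov, dtype=float))
        info_matrix += weight                      # accumulate information
        info_vector += weight @ np.asarray(x, dtype=float)
    fused_cov = np.linalg.inv(info_matrix)
    return fused_cov @ info_vector, fused_cov


# Toy example: an accurate path and a noisier path (three times the variance).
fused, fused_cov = fuse_estimates(
    estimates=[[0.0, 0.0], [2.0, 2.0]],
    covariances=[np.eye(2), 3.0 * np.eye(2)],
)
```

The fused point lands much closer to the low-variance estimate, which is the downweighting behavior attributed to low-SNR paths.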
5. Large-array decision fusion and adversarial spectrum sensing
A separate line of work treats RIS-FUSION as channel-aware decision aggregation at a fusion center. In “Massive MIMO Channel-aware Decision Fusion Aided by Reconfigurable Intelligent Surfaces,” 9 sensors transmit local BPSK-mapped binary decisions to an 0-antenna fusion center assisted by an 1-element RIS (Ciuonzo et al., 27 Feb 2026). The received vector is
2
The NP-optimal LLR has complexity 3, so the paper derives three simpler large-array rules: MRC, modified MRC (mMRC-1 and mMRC-2), and ZFC.
The underlying analytical point is that conventional favorable propagation is altered by the RIS. Without RIS, 4 is diagonal, but with RIS,
5
which is non-diagonal and random, so inter-sensor interference persists even as 6 (Ciuonzo et al., 27 Feb 2026). The modified rules therefore whiten or invert 7, and RIS phases are optimized from long-term channel statistics through a problem 8 solved by majorization–minimization.
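The difference between matched-filter and zero-forcing style combining can be sketched for a generic linear model `y = H x + w` with BPSK decisions `x`. These are textbook forms used under assumptions: `H` stands in for the effective sensor-to-FC channel (including any RIS contribution), and the functions below are not the paper's exact MRC, mMRC, or ZFC rules.

```python
import numpy as np


def mrc_statistic(H, y):
    """Matched-filter style statistic: real part of the sum of H^H y."""
    return float(np.real(np.sum(H.conj().T @ y)))


def zf_statistic(H, y):
    """Zero-forcing style statistic: invert the channel, then sum the decisions."""
    return float(np.real(np.sum(np.linalg.pinv(H) @ y)))


rng = np.random.default_rng(0)
n_antennas, n_sensors = 64, 4
H = (rng.standard_normal((n_antennas, n_sensors))
     + 1j * rng.standard_normal((n_antennas, n_sensors))) / np.sqrt(2)
x = np.array([1.0, 1.0, -1.0, 1.0])   # BPSK-mapped local decisions
y_noiseless = H @ x

zf_value = zf_statistic(H, y_noiseless)    # recovers sum(x) without noise
mrc_value = mrc_statistic(H, y_noiseless)  # still contains cross-sensor terms
```

Without noise, the zero-forcing statistic recovers the sum of the transmitted decisions, while the matched-filter statistic retains cross terms from the off-diagonal entries of the Gram matrix. This is the interference that the modified rules are designed to handle.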
Simulation results show clear separation between the rules. For 9 sensors and 00 RIS elements, mMRC-1+RIS-LTD reaches 01 at 02 and 03 at 04; ZFC+RIS-LTD yields 05 at 06 and 07 at 08; conventional MRC remains flat around 09; and mMRC-2 improves as the Rician factor strengthens, approaching 10 at 11 dB (Ciuonzo et al., 27 Feb 2026). The practical recommendation is a two-timescale design: slow RIS optimization from long-term statistics and fast FC-side linear combining.
The security counterpart appears in “Byzantine Attacks in RIS-Enhanced Cooperative Spectrum Sensing: A Decision Fusion Perspective,” where one primary user, 12 multi-antenna secondary users, one passive RIS with 13 reflecting elements, and 14 decode-and-forward relay hops feed local hard decisions to a fusion center (Zhang et al., 31 Oct 2025). Each secondary user performs energy detection with statistic
15
and local false-alarm and detection probabilities determined by instantaneous SNR 16, which itself depends on the RIS phase matrix 17.
The fusion center constructs a channel- and attack-aware LLR by combining branchwise terms 18 and 19, where local reports may be flipped by Byzantine nodes with probabilities
20
The striking result is that the optimal attack depends only on the Byzantine fraction 21, not on global instantaneous CSI. For 22, the optimum is Always-False with 23, hence 24. For 25, any strategy satisfying 26 blinds the fusion center exactly, producing 27 (Zhang et al., 31 Oct 2025). Under the large-scale random-flip attack, BER tends to 28, 29, and mutual information tends to zero for all SNRs and system sizes.
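The blinding effect can be reproduced with a generic symmetric-flip model from the Byzantine-attack literature. This sketch is an assumption-laden illustration rather than the paper's model: it uses a single flip probability for both hypotheses, and the names `alpha`, `p_flip`, and `report_prob` are hypothetical.

```python
def report_prob(p_local, alpha, p_flip):
    """P(report = 1) seen by the fusion center.

    p_local : probability that an honest local decision equals 1
              (detection probability under H1, false-alarm probability under H0)
    alpha   : fraction of Byzantine nodes
    p_flip  : probability that a Byzantine node flips its decision
    """
    q = alpha * p_flip  # effective flip probability of a random report
    return (1.0 - q) * p_local + q * (1.0 - p_local)


p_detect, p_false_alarm = 0.9, 0.1

# Half of the nodes always flip: reports look identical under both hypotheses.
blind_h1 = report_prob(p_detect, alpha=0.5, p_flip=1.0)
blind_h0 = report_prob(p_false_alarm, alpha=0.5, p_flip=1.0)

# No attackers: reports keep the local detection statistics.
clean_h1 = report_prob(p_detect, alpha=0.0, p_flip=1.0)
```

In this simplified model, when the effective flip probability reaches one half, the report distribution no longer depends on the hypothesis, so no fusion rule can extract information from the reports regardless of the local detection quality.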
Taken together, these papers show two complementary points. RIS-assisted decision fusion can be substantially improved by large-array linear processing and long-term phase design (Ciuonzo et al., 27 Feb 2026), yet improved sensing and diversity do not, by themselves, immunize the fusion center against universal Byzantine strategies (Zhang et al., 31 Oct 2025).
6. Distributed RIS control and cooperative CSI fusion
In “RIS-Enabled Smart Wireless Environments: Fundamentals and Distributed Optimization,” RIS-FUSION is presented as a Hybrid Distributed-and-Fusion (HDF) machine-learning architecture for joint configuration of multiple Beyond-Diagonal RISs and multi-user precoding in a MISO downlink (Alexandropoulos et al., 21 Dec 2025). A single BS with 30 antennas serves 31 users within the area of influence of 32 BD-RISs, each with 33 tunable unit elements and a general linear transformation 34.
The optimization target is the instantaneous sum-rate
35
subject to discrete feasible sets for RIS profiles and BS precoders. Because the discrete formulation is NP-hard, the paper replaces per-block optimization with a policy-learning problem and then factorizes it in HDF form: each RIS has a local mapping 36 based on local CSI plus direct BS–UE CSI, while the BS hosts a fusion mapping 37 to select final beamformers (Alexandropoulos et al., 21 Dec 2025).
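The sum-rate objective for a multi-user MISO downlink can be sketched in its standard form. The sketch assumes that the RIS configuration has already been folded into an effective channel matrix; the argument names, the unit noise power, and the row and column conventions are illustrative assumptions.

```python
import numpy as np


def sum_rate(H_eff, W, noise_power=1.0):
    """Sum-rate in bits/s/Hz for a MISO downlink.

    H_eff : (K, M) effective channels, row k for user k
    W     : (M, K) precoders, column k for user k
    """
    gains = np.abs(H_eff @ W) ** 2            # gains[k, j]: power of stream j at user k
    signal = np.diag(gains)
    interference = gains.sum(axis=1) - signal
    sinr = signal / (interference + noise_power)
    return float(np.sum(np.log2(1.0 + sinr)))


# Two users with orthogonal channels and unit-power matched precoders.
rate = sum_rate(np.eye(2), np.eye(2))
```

A discrete search would evaluate this quantity for every candidate pair of RIS profile and precoder, which is what motivates replacing per-block optimization with a learned policy.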
Each RIS branch runs a Multi-Branch Attention Convolutional Neural Network with shared weights 38. Inputs are the real and imaginary parts of 39, 40, and 41. Self-attention extracts inter-element channel correlations; convolutional layers compress the representations; MLP-1 outputs discrete 42; MLP-2 outputs softmax scores over a precoder codebook for each user, yielding an index set 43. A small BS-side feed-forward network with parameters 44 merges 45 into 46, which maps to the final precoder 47 (Alexandropoulos et al., 21 Dec 2025).
Training uses a neuroevolutionary CoSyNE procedure on the average episode sum-rate 48, with selection, crossover, and mutation over a population of weight vectors for approximately 25 generations. At run time, inference is non-iterative: local CSI estimation, parallel local inference, few-bit index signaling to the BS, and BS-side fusion inference. The paper reports latency of a few milliseconds on a modest GPU, near-optimal sum-rate performance, and sum-rate loss 49 relative to exhaustive discrete search (Alexandropoulos et al., 21 Dec 2025).
A related cooperative fusion problem appears in “Deep Learning-based CSI Feedback for RIS-assisted Multi-user Systems,” where the supplied summary identifies the RIS-CoCsiNet framework as RIS-FUSION (Guo et al., 2020). Here the central idea is to exploit correlation among proximate user equipments by splitting angular-domain RIS–UE CSI magnitude into shared and user-specific parts,
50
encoding each user locally, and reconstructing shared CSI and individual refinements at the BS through a shared decoder, individual decoders, and combination networks (Guo et al., 2020).
The framework also separates magnitude and phase feedback via two magnitude-dependent phase feedback strategies. MDPF-1 uses a magnitude-weighted phase loss
51
while MDPF-2 augments the phase encoder input with instantaneous magnitude. For multi-antenna users, three stacked LSTM layers exploit antenna-wise correlation (Guo et al., 2020).
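The idea behind a magnitude-weighted phase loss can be sketched as follows. The exact weighting used in the paper is not reproduced here; the squared-magnitude weights, the phase wrapping, and the function name are assumptions made for illustration.

```python
import numpy as np


def magnitude_weighted_phase_loss(phase_true, phase_pred, magnitude):
    """Phase error weighted by squared magnitude (assumed weighting)."""
    phase_true = np.asarray(phase_true, dtype=float)
    phase_pred = np.asarray(phase_pred, dtype=float)
    weights = np.asarray(magnitude, dtype=float) ** 2
    # Wrap the phase difference to (-pi, pi] so that 2*pi offsets cost nothing.
    diff = np.angle(np.exp(1j * (phase_pred - phase_true)))
    return float(np.sum(weights * diff ** 2) / np.sum(weights))


magnitude = [1.0, 0.1]  # one strong and one weak coefficient
loss_strong = magnitude_weighted_phase_loss([0.0, 0.0], [0.5, 0.0], magnitude)
loss_weak = magnitude_weighted_phase_loss([0.0, 0.0], [0.0, 0.5], magnitude)
```

The same phase error costs far more on the strong coefficient than on the weak one, so the feedback budget is spent where phase accuracy affects the reconstructed channel most.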
Reported gains are concrete. At 52, 53, and BPD 54, the LSTM decoder yields up to 55 dB NMSE gain over FC-only decoding. At BPD 56, MDPF-1 and MDPF-2 provide approximately 57 dB NMSE improvement over a uniform-loss phase baseline. At BPD 58, the cooperative scheme reduces NMSE by approximately 59–60 dB versus a per-UE autoencoder, and at BPD 61 on QuaDRiGa channels, compressed sensing gives about 62 dB NMSE, a single autoencoder about 63 dB, and RIS-Fusion-2UE about 64 dB (Guo et al., 2020). The additional complexity is modest: total FLOPs are approximately 65 million, less than 66 above a single autoencoder, with the extra burden shifted to the BS.
7. Adjacent uses in remote sensing and image fusion
The supplied literature also associates RIS-FUSION terminology with neighboring multimodal data-fusion problems. In “FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models,” the detailed summary states that FusionRS realizes the RIS-FUSION approach by building a large dual-modal dataset, training contrastive and generative models, evaluating retrieval and captioning, and ablating IR-aware captions (Han et al., 15 Jun 2026).
FusionRS contains 600,000 aligned RGB–IR–text triplets, split into 580,000 train, 10,000 validation, and 10,000 test samples. The infrared modality is synthetic: each 67 RGB tile is translated by DiffV2IR using the provided after_phase_2.ckpt checkpoint into a three-channel IR-style image. Conventional captions are taken from RS5M, SkyScript, NWPU, RSICD, and RSITMD after rule-based cleaning, while IR-aware captions are generated by Qwen2.5-VL-72B-Instruct and explicitly describe infrared cues. The IR-aware subset sizes are 49,068 train, 416 validation, and 10,000 test (Han et al., 15 Jun 2026).
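The tri-modal CLIP objective averages RGB–text, IR–text, and RGB–IR InfoNCE losses,
$$\mathcal{L}_{\mathrm{CLIP}} = \tfrac{1}{3}\left(\mathcal{L}_{\mathrm{RGB\text{-}text}} + \mathcal{L}_{\mathrm{IR\text{-}text}} + \mathcal{L}_{\mathrm{RGB\text{-}IR}}\right),$$
and the captioning model is trained with token-wise cross-entropy
$$\mathcal{L}_{\mathrm{cap}} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, x\right),$$
where $y_t$ is the $t$-th caption token and $x$ denotes the model's image input.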
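The averaged tri-modal contrastive objective can be sketched in NumPy. The symmetric form of InfoNCE, the temperature of 0.07, and the function names are assumptions; the only element taken from the text is that the three pairwise losses are averaged.

```python
import numpy as np


def _log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive match for row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    loss_ab = -np.mean(np.diag(_log_softmax(logits, axis=1)))
    loss_ba = -np.mean(np.diag(_log_softmax(logits, axis=0)))
    return 0.5 * (loss_ab + loss_ba)


def trimodal_loss(rgb, ir, text, temperature=0.07):
    """Average of the RGB-text, IR-text, and RGB-IR contrastive losses."""
    return (info_nce(rgb, text, temperature)
            + info_nce(ir, text, temperature)
            + info_nce(rgb, ir, temperature)) / 3.0


embeddings = np.eye(4)                    # four perfectly separable samples
shuffled_text = embeddings[[1, 2, 3, 0]]  # captions paired with the wrong images

aligned = trimodal_loss(embeddings, embeddings, embeddings)
misaligned = trimodal_loss(embeddings, embeddings, shuffled_text)
```

Mismatched captions raise the two text-related terms while leaving the RGB-IR term unchanged, which shows how each pairing contributes separately to the average.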
On the 10,000 test triplets with original captions, OpenAI CLIP ViT-L/14 achieves IR70text 71, 72, 73, RGB74IR 75, IR76RGB 77, and mean recall approximately 78. After IR-aware fine-tuning in the C-mixed setting, ViT-L/14 improves to mean recall 79, while ViT-B/32 improves from 80. In a Qwen2.5-VL-7B ablation, B-iraware yields Cap.Auto 81, Cap.IR 82, Cap.R-L 83, and IR-cue QA 84, compared with A-original at 85, 86, 87, and 88, respectively (Han et al., 15 Jun 2026). The conclusion is explicit: modality-specific textual supervision is crucial for RGB-infrared remote sensing vision-language representation learning.
Another adjacent case is “SIRF: Simultaneous Image Registration and Fusion in A Unified Framework,” which the supplied summary presents as “RIS-FUSION (a.k.a. SIRF)” (Chen et al., 2014). SIRF solves a joint convex program
89
where the regularizer is a dynamic gradient sparsity term
90
The method alternates FISTA-based updates of the fused high-resolution multispectral image 91 and gradient-descent updates of the transform 92, achieving linear complexity in the output image size per iteration (Chen et al., 2014).
On 158 simulated patches from Quickbird, GeoEye, SPOT, and IKONOS, the supplied summary reports that SIRF reaches ERGAS 93, QAVE 94, RASE 95, SAM 96, PSNR 97, MSSIM 98, and RMSE 99, outperforming PCA, IHS, Wavelet, Brovey, P+XS, and AVWP (Chen et al., 2014). Runtime on a 00 image is reported as 01 s, compared with 02 s for AVWP and 03 s for P+XS. This suggests that some summaries use RIS-FUSION as a broad descriptor for registration-and-fusion pipelines, even when the canonical paper title uses a different acronym.
Across these adjacent cases, the label marks structured multimodal fusion rather than a single established method family. The unifying pattern is architectural rather than terminological: heterogeneous modalities are aligned, selectively combined, and constrained by downstream objectives such as caption grounding, registration fidelity, or information leakage control.