Steganography in Multimodal Inputs
- Steganographic embedding in multimodal inputs is a technique that covertly encodes data using neural implicit representations and adaptive carrier allocation to minimize detectability.
- It leverages transform-domain and residual learning methods such as DCT, DWT, and U-Net architectures to balance signal fidelity, robustness, and embedding capacity.
- Innovative approaches integrate hardware-accelerated and neuromorphic pipelines that achieve high cover fidelity (PSNR, SSIM) while resisting steganalysis and signal distortions.
Steganographic embedding in multimodal inputs denotes the process of covertly encoding information from one or more modalities (e.g., image, audio, video, text, 3D data) within cover data of potentially different modalities, such that the existence of the message is imperceptible to external observers and robust against steganalysis. Advanced steganographic systems exploit cross-modal neural representations, adaptive embedding algorithms, signal transforms, and novel hardware-centric pipelines to maximize embedding capacity, imperceptibility, and robustness.
1. Foundational Approaches and Modality-General Embedding
Cross-modal steganography historically confronted difficulties due to domain incompatibilities, distinct statistical structures, and the risk of detectability when mapping payloads between heterogeneous carriers. Classic techniques included transform-domain embedding (e.g., DWT, DCT) and block-based spatial algorithms but were limited to unimodal or weakly multimodal settings (Das et al., 2012). Recent developments introduce modality-agnostic neural representations, adaptive carrier allocation, and direct raster-domain quantization.
A notable advance is the use of Implicit Neural Representations (INRs) to encode arbitrary-shaped high-dimensional data (image, audio, video, 3D) as a continuous mapping parameterized by a compact MLP. These INRs are represented by trainable weight vectors which, once fitted to the secret modality, can be quantized and embedded in a cover (typically an image), forming a stego-image (Han et al., 2023, Song et al., 2023). The flexibility arises because recovery of the secret only requires reconstructing the INR from these weights, then decoding the secret via querying the MLP with suitable coordinates.
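The quantize-and-embed step can be illustrated with a minimal sketch: uniformly quantize the fitted INR weight vector to 8-bit codes, write the codes into a cover image's pixels, then dequantize on recovery. The function names, the 8-bit choice, and the direct-overwrite embedding are illustrative assumptions, not the exact procedure of any cited method.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Uniformly quantize a float weight vector to `bits`-bit integer codes.
    Returns the codes plus the (lo, scale) pair needed for dequantization."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_weights(codes, lo, scale):
    """Invert quantize_weights up to one quantization step of error."""
    return codes.astype(np.float32) * scale + lo

# Toy INR weight vector (in practice: the fitted MLP's flattened parameters)
rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)

codes, lo, scale = quantize_weights(w)

# "Embed" the codes into a cover image channel (here: a direct overwrite;
# real systems spread codes to preserve cover statistics)
cover = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
stego = cover.copy()
stego.flat[:codes.size] = codes

# Recovery: read the codes back out of the stego image and dequantize
w_rec = dequantize_weights(stego.flat[:codes.size], lo, scale)
assert np.max(np.abs(w - w_rec)) <= scale  # error bounded by one step
```

The recovered weights then re-parameterize the MLP, from which the secret is decoded by coordinate queries.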
Glyph perturbation cardinality (GPC) extends steganographic embedding to the raster space of text, robustly mapping multimodal payloads into quantized interior glyph pixel perturbations with cardinality encoding (Kandala, 25 Dec 2025). Multiresolution wavelet-based schemes preprocess any payload into a noise-equivalent format and exploit the redundancy of image decomposition for block-selective embedding, providing generalization across image, audio, and text (Das et al., 2012).
2. Cross-Modal Neural and Adaptive Reasoning Frameworks
Deep neural architectures enable learnable transformations between modalities—most prominently, residual architectures that superpose encoded secret representations onto host media in the transform domain. PixInWav, for example, learns a spectrogram-domain residual for arbitrary images, additively combining this with the short-time DCT of an audio signal for robust image-in-audio hiding (Geleta et al., 2021). Fully convolutional (U-Net–style) encoder-decoder pairs are standard, optimized to minimize task-weighted reconstruction loss over secret and cover metrics.
Adaptive frameworks generalize this logic: Intelligent Carrier Allocation (ICA) employs a Cross-Modal Reasoning (CMR) engine that computes a quantitative suitability per carrier via a unified reliability score R_i, aggregating entropy, signal complexity, imperceptibility, and robustness (Das et al., 12 Nov 2025). Bits are allocated in proportion to these scores,

b_i = B · R_i / Σ_j R_j,

where B is the total payload size and i indexes carrier modalities (e.g., image, audio, text), thereby maximizing security and minimizing overall detectability. Modality-specific embedding is delegated to dedicated routines, such as spatial-LSB for images, echo-hiding for audio, or synonym substitution for text.
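The proportional allocation rule can be sketched in a few lines; the function name, the reliability values, and the remainder-handling policy are illustrative assumptions.

```python
def allocate_bits(total_bits, reliability):
    """Distribute a payload of `total_bits` across carriers in proportion
    to their reliability scores (b_i proportional to R_i)."""
    total_r = sum(reliability.values())
    alloc = {k: int(total_bits * r / total_r) for k, r in reliability.items()}
    # Hand any rounding remainder to the most reliable carrier
    best = max(reliability, key=reliability.get)
    alloc[best] += total_bits - sum(alloc.values())
    return alloc

# Example: a 10,000-bit payload over three carriers
bits = allocate_bits(10_000, {"image": 0.9, "audio": 0.6, "text": 0.3})
assert sum(bits.values()) == 10_000
```

More reliable carriers receive proportionally larger payload shares, which concentrates the message where detectability is lowest.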
The design of modern frameworks incorporates quantization-aware training (QAT) and loss balancing. In cross-modal INR frameworks, quantization of INR weights to fit image channels is explicitly modeled during training updates; this is essential to maintaining both secret and cover fidelity (e.g., QAT increases secret recovery PSNR from 17.73 dB to 29.98 dB in video-in-image embedding (Han et al., 2023)).
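QAT is commonly realized by inserting a quantize-dequantize ("fake quantization") step into the forward pass so the network learns weights that survive the eventual low-bit export; gradients pass through the rounding via a straight-through estimator. A minimal sketch of the forward-pass operation, under the assumption of 8-bit uniform quantization:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Quantize-dequantize in a single step, as applied in the forward
    pass during quantization-aware training (the backward pass treats
    the rounding as identity, i.e., straight-through)."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((w - lo) / scale) * scale + lo

w = np.linspace(-1.0, 1.0, 1000)
wq = fake_quantize(w, bits=8)
# Training against wq means the 8-bit export introduces no additional
# mismatch at embedding time; per-element error stays within one step.
assert np.max(np.abs(w - wq)) <= (w.max() - w.min()) / 255
```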
3. Transform and Residual Domain Techniques
Transform-domain and residual-based paradigms remain central to steganographic embedding in multimodal contexts. Standard approaches leverage analytic transforms (DCT, STFT, DWT) to exploit the perceptual and statistical masking properties of different frequency bands:
- Block-based DWT: Payload embedding is concentrated in the homogeneous blocks of LL and HH subbands for enhanced imperceptibility and redundancy. Homogeneous blocks are identified by ordering 8×8 subblocks in ascending order of sample variance (Das et al., 2012).
- Residual learning in audio: Both PixInWav and subsequent robust extensions embed image information in the spectrogram domain of audio via additive or learned residuals, often exploiting pixel-shuffle and tiling mechanisms for dimensionality matching and redundancy (Geleta et al., 2021, Ros et al., 2023). Advanced variants (e.g., WS-Replicate, luma buffering) increase robustness by allowing multiple watermark copies and error-correction without sacrificing capacity or transparency.
- STFT-based models: U-Net–style architectures simultaneously process magnitude and phase channels with separate or fused network heads, with empirical results showing robustness both to partial data loss and to lossy compression, while maintaining high-fidelity recovery (e.g., SNR >30 dB, image SSIM up to 0.91) (Ros et al., 2023).
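The block-selection step of the DWT scheme above can be sketched with a one-level Haar transform and variance ordering; the Haar filters and the tiling helper are illustrative stand-ins for the cited scheme's exact decomposition.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar transform: returns LL, LH, HL, HH subbands."""
    a = (x[0::2, :] + x[1::2, :]) / 2    # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2    # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh

def homogeneous_blocks(subband, block=8):
    """Split a subband into block x block tiles and return their top-left
    indices ordered by ascending sample variance (most homogeneous first)."""
    h, w = subband.shape
    variances, idx = [], []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            variances.append(subband[i:i+block, j:j+block].var())
            idx.append((i, j))
    order = np.argsort(variances)
    return [idx[k] for k in order]

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
ll, lh, hl, hh = haar_dwt2(img)      # four 32x32 subbands
candidates = homogeneous_blocks(ll)  # 8x8 blocks, most homogeneous first
```

Payload bits would then be written into the lowest-variance blocks of the LL and HH subbands, where perturbations are statistically best masked.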
Performance metrics typically include PSNR, SSIM, LPIPS for images, SNR/AE for audio, and structural similarity for text and shape data.
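Two of these metrics are simple enough to state directly; a minimal numpy sketch of PSNR (cover fidelity) and bit error rate (payload robustness):

```python
import numpy as np

def psnr(cover, stego, peak=255.0):
    """Peak signal-to-noise ratio in dB between cover and stego arrays."""
    mse = np.mean((cover.astype(np.float64) - stego.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak**2 / mse)

def ber(sent, received):
    """Bit error rate between two equal-length bit sequences."""
    sent, received = np.asarray(sent), np.asarray(received)
    return float(np.mean(sent != received))

cover = np.full((8, 8), 128, dtype=np.uint8)
stego = cover.copy()
stego[0, 0] += 1                 # a single one-level pixel change
assert psnr(cover, stego) > 40   # above the usual imperceptibility bar
assert ber([1, 0, 1, 1], [1, 0, 0, 1]) == 0.25
```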
4. Security, Robustness, and Steganalysis
Steganographic systems in multimodal scenarios are evaluated by their imperceptibility, robustness against common signal distortions, and resistance to advanced steganalysis:
- Statistical and perceptual invisibility: Modern systems achieve PSNR >40 dB and SSIM >0.97 on cover modalities (e.g., SteganoSNN, INRSteg, ICA). Kullback–Leibler divergence and LPIPS are additional metrics for visual similarity (Song et al., 2023, Sahoo et al., 9 Nov 2025, Das et al., 12 Nov 2025).
- Robustness to noise and attacks: Methods employing redundancy (e.g., tiled or replicated embedding), luma buffering, or spread across robust carriers yield significant resilience to compression, noise, and partial data loss (SM BER ≈ 11.2% vs. ICA BER ≈ 2.9% post-attack; image-in-audio maintains SSIM ≈ 0.8 even under heavy additive noise) (Ros et al., 2023, Das et al., 12 Nov 2025).
- Steganalysis resistance: INRSteg demonstrates perfect undetectability (50% accuracy) by CNN-based detectors SiaStegNet and XuNet, in contrast to classical deep stego methods (>90% detection) (Song et al., 2023). SteganoSNN similarly yields low statistical detector scores (SPA_G, WS_G), validating the protection afforded by neuromorphic embedding (Sahoo et al., 9 Nov 2025).
- Countermeasures: Preprocessing (JPEG, Gaussian smoothing), neural detection, attention regularization, and adversarial training are effective as multi-layered defenses, reducing attack success rates with minimal perceptual impact (Pathade, 30 Jul 2025).
5. Specialized Hardware and Neuromorphic Methods
Hardware-accelerated and neuromorphic pipelines serve applications requiring real-time, low-power, and high-capacity embedding. SteganoSNN exemplifies this trend by transducing audio into spike-train codes using leaky integrate-and-fire (LIF) neurons, encrypting via modulo-based mapping, and embedding the output at 8 bpp across RGBA image channels (Sahoo et al., 9 Nov 2025). Deployed on a PYNQ-Z2 FPGA, this hybrid analog/digital architecture achieves full-HD capacity with 100% bitwise recovery, while maintaining image fidelity of 40.42–41.35 dB PSNR and SSIM >0.97. The neuromorphic approach confers increased robustness and resistance to bit-level and statistical attacks at low hardware resource cost.
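The LIF transduction stage can be illustrated with a minimal discrete-time neuron model; the threshold, leak factor, and reset-to-zero behavior are generic textbook choices, not SteganoSNN's published parameters.

```python
import numpy as np

def lif_spikes(signal, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire encoding: integrate the input sample by
    sample into a leaky membrane potential, emit a spike (1) whenever the
    potential crosses threshold, and reset it afterwards."""
    v, spikes = 0.0, []
    for s in signal:
        v = leak * v + s
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return np.array(spikes, dtype=np.uint8)

t = np.linspace(0, 1, 100)
audio = 0.3 * (np.sin(2 * np.pi * 5 * t) + 1)  # normalized toy waveform
train = lif_spikes(audio)
# The binary spike train is what gets encrypted (e.g., via a modulo-based
# mapping) and then written into the image channels.
```

The audio is thus represented as a binary sequence before any encryption or embedding takes place, which is what makes the downstream LSB-style channel packing straightforward.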
6. Multimodal AI Chains and Linguistic Steganography
Advances in generative and multimodal AI enable novel embedding paradigms operating entirely outside the signal/transform space. Chain-of-multimodal-AI architectures conceal payloads beyond physical (spatial, temporal) data representations by controlling the generation of transcript text from audiovisual content. Secret messages modulate the token sampling process of an LLM via a key-derived keyword set, biasing paraphrase choices in the linguistic space (Chang et al., 25 Feb 2025). The stego-modified transcript is then synthesized into audio and lip-synced video, yielding modified media where the message is present only in the distribution of words in the transcript. Fidelity (face, voice, semantic content) remains high (cosine similarity ≥0.9 in video/audio), secrecy is retained under Zipf/perplexity evaluation, and robustness encompasses compression and deepfakes (true-positive rates >0.93). However, embedding capacity is modest (a few bits per second), and detection by stylometric AI-watermarking remains possible.
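The key-derived biasing of word choices can be illustrated with a toy parity scheme: sender and receiver share a key, each paraphrase candidate maps to a bit via a keyed hash, and the sender picks a candidate whose parity equals the secret bit. This is a deliberately simplified stand-in for the cited sampling mechanism; all names and the parity construction are illustrative.

```python
import hashlib

def keyed_parity(word, key):
    """Map a candidate word to one secret bit via a keyed hash parity."""
    return hashlib.sha256((key + word).encode()).digest()[0] & 1

def embed_bit(bit, candidates, key):
    """Pick the first paraphrase candidate whose keyed parity equals the
    secret bit; fall back to the first candidate if none matches."""
    for w in candidates:
        if keyed_parity(w, key) == bit:
            return w
    return candidates[0]

key = "shared-secret"            # illustrative shared key
synonyms = ["fast", "quick", "rapid", "swift"]
parities = {w: keyed_parity(w, key) for w in synonyms}

# Round-trip: whenever a candidate of the right parity exists, the
# receiver recovers the bit by recomputing the keyed parity.
for bit in (0, 1):
    if bit in parities.values():
        assert keyed_parity(embed_bit(bit, synonyms, key), key) == bit
```

Because the transcript remains fluent natural language, the message survives re-synthesis into audio and video; only a key holder can recompute the parities.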
A distinct security concern arises in the context of vision-language models (VLMs): steganographic prompt injection attacks. Here, hybrid spatial, frequency, and neural techniques embed textual instructions into images, which are then interpreted by VLMs as behavioral prompts (Pathade, 30 Jul 2025). Success rates reach 24.3% across leading models (with neural methods up to 31.8%), while maintaining imperceptibility (PSNR >38 dB, SSIM >0.94). Architectural vulnerabilities stem from the propagation of stego perturbations in ViT-based encoders and attention fusion, with defense strategies spanning input sanitization, steganalysis, and behavioral monitoring.
7. Comparison of State-of-the-Art Multimodal Steganographic Frameworks
| Method / Reference | Modalities Handled | Core Principle | PSNR/SSIM (Cover) | Robustness | Security Against Steganalysis |
|---|---|---|---|---|---|
| INRSteg (Song et al., 2023) | Image ↔ Audio/Video/3D | MLP INR, block-diagonal masking | >62 dB / high SSIM | Very high | Perfect (50% detection rate) |
| Deep INR (Han et al., 2023) | Video/Audio/3D→Image | INR weight-to-image embedding/QAT | 19.77–29.98 dB | High (QAT) | Not explicitly quantified |
| SteganoSNN (Sahoo et al., 9 Nov 2025) | Audio→Image (FPGA + SNN) | SNN, time-to-LSB mapping | 40.4–41.35 dB / >0.97 | High, real-time | Low SPA/Wavelet-SS, 100% bitwise audio |
| ICA (Das et al., 12 Nov 2025) | Image, Audio, Text (all directions) | Adaptive reliability allocation | 45 dB (image), 30 dB | BER ↓ 74% | Improved over static, not full immunity |
| GPC (Kandala, 25 Dec 2025) | Text, Image, Audio, Video→Text | Glyph perturbation cardinality | ~40–44 dB | Medium | Dependent on raster pipeline integrity |
| PixInWav (Geleta et al., 2021) | Image→Audio | Residual DCT, cover-independent | 27.4 dB / SSIM 0.92 | ~1 Mbit/s | Not discussed |
Each approach tailors the embedding transformation, error-correction, and modality adaptation to the requirements and vulnerabilities of the cover/secret pairing.
Steganographic embedding in multimodal inputs has achieved significant generalization, capacity, and security advances through neural implicit representations, robust adaptive resource allocation, engineered redundancy, hardware-algorithm co-design, and the adoption of linguistic/semantic embedding channels. Yet the field continues to evolve in response to adversarial generative AI, requiring ongoing innovation in carrier analysis, embedding optimization, and defense.