Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

Published 12 Apr 2026 in cs.CV, cs.AI, cs.CR, and cs.ET | (2604.10460v1)

Abstract: The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper proposes a forensic pipeline with cryptographic steganographic watermarking to embed verifiable identifiers into AI-generated images.
It evaluates multiple watermarking techniques, showing that wavelet-domain spread spectrum methods deliver optimal imperceptibility and robustness under attacks.
A CLIP-based multimodal classifier achieves 95% validation accuracy and a 0.99 AUC-ROC, enabling precise detection of harmful image-text combinations.

Steganographic Attribution and Multimodal Harm Detection for Accountable AI-Generated Content

Introduction and Motivation

The widespread adoption of generative AI, particularly diffusion-based models, has resulted in an unprecedented volume of synthetic imagery being disseminated on social platforms. This paradigm shift has introduced new threats to digital forensics, moderation, and accountability pipelines, especially due to the contextual harm introduced when benign AI-generated images are paired with misleading or malicious text. Traditional moderation approaches that evaluate only unimodal content are no longer tenable, as toxic or deceptive intent frequently emerges at the intersection of modalities.

In response to this, the paper presents a comprehensive forensic framework targeting the forensic attribution of AI-generated content on social platforms, addressing the lack of persistent metadata and device signatures in such content. The core concept is to embed cryptographically verifiable steganographic identifiers into AI-generated images at creation, enabling reliable post hoc attribution when multimodal harmful deployments are detected. The framework integrates a robust steganographic gateway with a state-of-the-art CLIP-based multimodal harm detection model for accountability-driven content tracing.

Figure 1: Overview of the Steganographic Security Gateway—a two-stage system integrating cryptographic marking with multimodal harm detection.

Technical Framework

The method formalizes an end-to-end attribution pipeline instantiated as a Steganographic Security Gateway. The entire flow comprises cryptographic identifier embedding at image generation, post-distribution multimodal harm detection, and (when triggered) cryptographic extraction and verification for user-level attribution.

Steganographic Embedding Methodology

The paper evaluates five watermarking techniques distinguished by embedding domains:

LSB (Least Significant Bit) spatial-domain embedding
DCT (Discrete Cosine Transform) domain embedding
DWT (Discrete Wavelet Transform) domain embedding
Spatial spread spectrum watermarking
DWT-domain spread spectrum watermarking (DWT-SS)

These methods are unified by a cryptographic layer: identifiers are encoded as RSA signatures, with compact SHA-256-derived fingerprints adapted for domains with lower capacity or where correlation-based detection is preferable.

The embedding process is designed to optimize for robust recoverability post-distribution, prioritizing resistance to typical operations such as compression, resizing, geometric transforms, and blurring, which are pervasive on social platforms.

Figure 2: Flowchart of the Steganographic Enabled Image Tracing Pipeline, showing data flow from embedding to multimodal analysis and attribution.

Multimodal Harm Detection

Harmful deployment detection is operationalized via a two-layer MLP classifier built on a frozen pre-trained CLIP encoder. Image-text pairs are mapped into a fused semantic embedding incorporating modality-specific vectors, element-wise differences, element-wise products, and overall cosine similarity, forming a $\mathbb{R}^{4d+1}$ fusion. Classification uses binary cross-entropy loss, with only the classifier being trainable to ensure deployment efficiency.

This fusion enables detection of harmfulness that emerges only in context (image-text union), supporting precise attribution triggers.

Experimental Results

Watermark Robustness

Empirical evaluation on diverse PNG, JPEG, and BMP corpora quantifies both perceptual imperceptibility and recovery rates under post-processing attacks. Robustness benchmarks include:

No Attack: LSB achieves 100% fidelity but is catastrophically fragile to minor perturbations. Spread spectrum methods (Spatial SS and DWT-SS) maintain 98.8% and 98.3% robustness, respectively.
Gaussian Blur: LSB, DCT, and DWT exhibit complete collapse (0% verification success). Spread spectrum methods show only minor degradation (Spatial SS: ~98.8%, DWT-SS: ~98.3% success).
JPEG Compression (Q=50): Bit-wise approaches are nonfunctional. Spatial SS maintains 96.8% robustness; DWT-SS degrades to 77.4%.
Resizing/Center Crop: Spatial SS and DWT-SS retain over 98% recovery rates.

TrustMark baselines are also evaluated, showing notable degradation under blur and resize but superior to bit-wise methods under attack conditions.

Figure 3: Visual comparison of watermarking methods on a natural color image under no attack and common perturbation scenarios, highlighting the imperceptibility/robustness tradeoff.

Qualitatively, LSB is visually lossless; DCT introduces faint block artifacts, DWT creates mild grain/brightness shifts, and spread spectrum methods superimpose minimal noise—DWT-SS yields the best balance between imperceptibility and robustness.

Multimodal Classifier

Training and validation on the benchmark dataset (from [wang-etal-2025-cant]) demonstrates:

Near-zero generalization gap with validation accuracy converging to 95%
AUC-ROC of 0.99 for harmfulness classification, outperforming canonical baselines

The model shows strong separation between harmful and benign multimodal inputs, accurately reflecting contextual harm not detectable in isolated modalities.

Figure 4: Training and validation curves for the CLIP-based multimodal harm detector, indicating stable convergence and minimal overfitting.

Implications and Future Directions

The integration of robust watermark-based steganographic attribution with high-performance multimodal harm detection yields a practical pipeline for platform-level accountability. Key implications:

Forensic utility: Investigators can trace harmful synthetic imagery back to originators even when conventional metadata/device evidence is absent.
Deterrence: The presence of user-level attribution may inhibit malicious use, complementing reactive moderation processes.
Deployment feasibility: The system is model-agnostic and can be incorporated during AI image generation, scaling to existing multimedia platforms.

Several critical limitations and future research vectors are highlighted:

The bit-exactness requirement for cryptographic signature verification in transform-domain embedding is a bottleneck, especially under lossy workflows. Error-tolerant signature schemes and soft-decision decoding are proposed as potential mitigations.
Cross-platform identity resolution remains nontrivial; better integration with account linkage/graph analytics could further augment attribution accuracy and adversarial resilience.
The scheme’s resilience against advanced removal attacks (e.g., adversarial erasure, adversarial harmonization) and adaptability to video/audio modalities remains open for extension.

Conclusion

This paper advances the state of accountable AI-generated content moderation by coupling asymmetric cryptographically verifiable steganographic marking with robust multimodal harm detection. The framework empirically demonstrates that wavelet-domain spread spectrum watermarking delivers high resilience against post-processing attacks, while a CLIP-based multimodal classifier achieves near-optimal separation of contextually harmful deployments. This system provides a quantifiable advance towards forensic-grade attribution and provenance tracking in social media ecosystems, with direct applications for regulatory compliance, forensic investigation, and platform integrity initiatives (2604.10460).

Markdown Report Issue