
SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal (2505.05088v1)

Published 8 May 2025 in cs.MM, cs.CV, and eess.IV

Abstract: Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.

Summary

  • The paper introduces a self-supervised hybrid network that synthesizes reference watermark-free images from noisy inputs to enable effective watermark and noise removal.
  • It leverages a dual-decoder architecture combining a CNN-based denoising branch with a Transformer-enhanced branch for capturing both local and global image features.
  • The approach achieves state-of-the-art results on metrics like PSNR, SSIM, and LPIPS using a mixed loss function, eliminating the need for paired clean data.

Visible watermark removal from images is a challenging task, particularly when the images are also affected by noise. Traditional methods often rely on supervised learning, which requires paired datasets of watermarked and watermark-free images. Obtaining such paired data is frequently impractical in real-world scenarios. The paper "SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal" (2505.05088) addresses this limitation by proposing a self-supervised learning approach that does not require explicit ground truth watermark-free images.

The core idea behind SSH-Net is to synthesize reference watermark-free images ($Y_w$) directly from the watermarked noisy input images ($X_{wn}$). This is achieved by adding additional watermarks to the input $X_{wn}$. The synthesized image $Y_w$ is then used as an unbiased estimator of the clean image $Y$ in a self-supervised training framework. The model learns to map the noisy watermarked input $X_{wn}$ to the synthesized $Y_w$, effectively learning the inverse mapping that removes both the original watermark and the noise.
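To make the pair construction concrete, below is a minimal PyTorch sketch of how a noisy watermarked input $X_{wn}$ and its self-supervised reference $Y_w$ could be assembled. The helper names, the alpha-blending scheme, and the noise/opacity values are illustrative assumptions rather than the authors' exact pipeline.

```python
import torch

def add_gaussian_noise(img: torch.Tensor, sigma: float) -> torch.Tensor:
    """Add zero-mean Gaussian noise with standard deviation sigma/255 (image in [0, 1])."""
    return (img + sigma / 255.0 * torch.randn_like(img)).clamp(0.0, 1.0)

def add_watermark(img: torch.Tensor, logo: torch.Tensor, alpha: float) -> torch.Tensor:
    """Alpha-blend a watermark logo onto the image at a random location (hypothetical helper)."""
    _, h, w = img.shape
    _, lh, lw = logo.shape
    top = torch.randint(0, h - lh + 1, (1,)).item()
    left = torch.randint(0, w - lw + 1, (1,)).item()
    out = img.clone()
    region = out[:, top:top + lh, left:left + lw]
    out[:, top:top + lh, left:left + lw] = (1 - alpha) * region + alpha * logo
    return out

def make_training_triplet(clean: torch.Tensor, logo: torch.Tensor,
                          sigma: float = 25.0, alpha: float = 0.3):
    """Build (X_wn, X_w, Y_w): X_w = clean image + watermark, X_wn = X_w + Gaussian noise,
    and Y_w = X_wn with an additional watermark stamped on it (the self-supervised target)."""
    x_w = add_watermark(clean, logo, alpha)   # watermarked image X_w
    x_wn = add_gaussian_noise(x_w, sigma)     # noisy watermarked input X_wn
    y_w = add_watermark(x_wn, logo, alpha)    # reference Y_w used as the training target
    return x_wn, x_w, y_w
```

In the paper's setting the watermark transparency, coverage, scale, and noise level are varied during synthesis; a single draw is shown here for clarity.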

The network architecture, illustrated in Figure 1, is a hybrid dual-network design comprising four main components:

  1. Shared Encoder (SE): This initial part of the network uses convolutional layers (specifically, NAFBlocks (2505.05088)) and downsampling layers to extract multi-scale features from the input noisy watermarked image $X_{wn}$. These features are shared between the two subsequent decoder branches, promoting efficient feature reuse and reducing redundant computation compared to using separate encoders.

    $F_{se} = H_{SE}(\text{Conv}(X_{wn}))$

  2. Noise Removal Decoder (NRD): This upper branch is designed specifically for the simpler task of noise removal. It consists of a lightweight CNN-based U-Net structure using NAFBlocks. It operates on the features from the SE and aims to produce an intermediate output $Y_n$ that is primarily noise-free but may still contain watermarks. This serves as an auxiliary task to help the shared encoder focus on relevant features.
  3. Watermark and Noise Removal Decoder (WNRD): This lower branch is responsible for the more complex task of simultaneously removing both watermarks and noise. To effectively handle the structured patterns of watermarks and capture long-range dependencies, it incorporates a Sparse Transformer U-Net, as depicted in Figure 2. This hybrid CNN-Transformer structure lets the model combine the local feature extraction of CNNs with the global context modeling of Transformers. The Sparse Transformer U-Net employs Sparse Self-Attention (SSA) layers within its blocks. Standard Transformer attention has quadratic computational complexity in the spatial dimensions, making it expensive for high-resolution images. SSA, based on the MDTA layer (2505.05088), addresses this by applying attention across the channel dimension (linear complexity) and further improves efficiency with a top-$k$ selection strategy that focuses computation on the most relevant elements according to their attention scores, which is particularly effective for sparsely distributed watermark patterns. A minimal sketch of this block is given after this list.

    $F'_{l} = F_{l-1} + \text{SSA}(\text{LN}(F_{l-1}))$

    $F_{l} = F'_{l} + \text{FFN}(\text{LN}(F'_{l}))$

    The WNRD produces an intermediate output $Y_{wm}$ that aims to remove both degradations.

  4. Feature Fusion Unit (FFU): The features from the NRD ($F_n$) and WNRD ($F_{wn}$) are combined in the FFU using a learned gating mechanism. This gating mechanism, learned from $F_{wn}$, adaptively modulates the contribution of the noise-removed features $F_n$ before adding them to $F_{wn}$.

    $F_{fuse} = \text{NAFBlock}(F_{wn} + \text{Gating}(F_{wn}) \odot F_n)$

    This allows the network to dynamically balance the information from the two branches based on local image characteristics. The final restored image $\hat{Y}$ is then produced from the fused features, incorporating a residual connection with the initial input $X_{wn}$ for improved fidelity. (The gated fusion is included in the sketch below.)
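As a companion to the description above, here is a minimal PyTorch sketch of a Sparse Transformer block with top-$k$ channel attention, together with the gated fusion used in the FFU. The head count, top-$k$ ratio, GroupNorm stand-in for LayerNorm, and the 3x3 convolution standing in for a NAFBlock are all assumptions; the paper's exact layer configuration is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseChannelAttention(nn.Module):
    """Channel-wise self-attention (linear in spatial size) with a top-k mask that
    keeps only the strongest channel-to-channel affinities before the softmax."""
    def __init__(self, dim: int, num_heads: int = 4, top_k_ratio: float = 0.5):
        super().__init__()
        self.num_heads = num_heads
        self.top_k_ratio = top_k_ratio
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        split = lambda t: t.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = split(q), split(k), split(v)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (b, heads, c_head, c_head)
        # top-k selection: mask out all but the k_keep largest scores in each row
        k_keep = max(1, int(attn.shape[-1] * self.top_k_ratio))
        topk_vals, _ = attn.topk(k_keep, dim=-1)
        threshold = topk_vals[..., -1:].expand_as(attn)
        attn = attn.masked_fill(attn < threshold, float('-inf')).softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)

class SparseTransformerBlock(nn.Module):
    """Pre-norm block: F'_l = F_{l-1} + SSA(LN(F_{l-1})); F_l = F'_l + FFN(LN(F'_l))."""
    def __init__(self, dim: int, num_heads: int = 4, expansion: int = 2):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # LayerNorm-style normalization for 2D maps
        self.attn = SparseChannelAttention(dim, num_heads)
        self.norm2 = nn.GroupNorm(1, dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1), nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))

class GatedFusion(nn.Module):
    """F_fuse = NAFBlock(F_wn + Gating(F_wn) * F_n); the NAFBlock is stood in for by a
    3x3 convolution and the gate by a 1x1 conv + sigmoid (both assumptions)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f_wn: torch.Tensor, f_n: torch.Tensor) -> torch.Tensor:
        return self.merge(f_wn + self.gate(f_wn) * f_n)
```

Masking the attention matrix before the softmax keeps only the strongest channel affinities per row, which is the intent of the top-$k$ selection described above; the sigmoid gate lets the WNRD features decide how much of the denoised NRD features to admit at each location.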

The training of SSH-Net employs a mixed loss function that combines a structural loss ($L_s$) and a texture loss ($L_t$). The structural loss, based on the L1 difference, is applied to the outputs of both decoders ($Y_n$ vs. $X_w$, $Y_{wm}$ vs. $Y_w$) and to the final output ($\hat{Y}$ vs. $Y_w$) to ensure pixel-level accuracy and preserve image structure. The texture loss, based on features from a pre-trained VGG network, is applied to $Y_{wm}$ vs. $Y_w$ and $\hat{Y}$ vs. $Y_w$ to ensure perceptual quality and realistic texture reconstruction.

$L = L_s + \alpha L_t$
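A minimal sketch of this mixed objective, assuming a torchvision VGG-16 feature extractor and an illustrative weight $\alpha$; the exact VGG layers and loss weight used in the paper are not specified here.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MixedLoss(nn.Module):
    """L = L_s + alpha * L_t: L1 structural terms on both decoder outputs and the final
    prediction, plus a VGG-feature texture term (layer choice and alpha are assumptions)."""
    def __init__(self, alpha: float = 0.01, vgg_layer: int = 16):
        super().__init__()
        vgg = vgg16(weights="IMAGENET1K_V1").features[:vgg_layer].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg          # ImageNet normalization omitted here for brevity
        self.alpha = alpha
        self.l1 = nn.L1Loss()

    def texture(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return self.l1(self.vgg(pred), self.vgg(target))

    def forward(self, y_n, y_wm, y_hat, x_w, y_w):
        # structural (L1) terms: Y_n vs X_w, Y_wm vs Y_w, Y_hat vs Y_w
        l_s = self.l1(y_n, x_w) + self.l1(y_wm, y_w) + self.l1(y_hat, y_w)
        # texture (VGG) terms: Y_wm vs Y_w, Y_hat vs Y_w
        l_t = self.texture(y_wm, y_w) + self.texture(y_hat, y_w)
        return l_s + self.alpha * l_t
```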

For implementation, the authors use NAFBlocks as the convolutional building block, which are designed for efficiency without traditional non-linear activations. The network depth (number of blocks) and channel dimensions are configured to balance performance and computational cost. The Sparse Transformer U-Net employs specific attention head counts and FFN expansion ratios. Training is performed using the ADAM optimizer with a learning rate schedule, optimizing the mixed loss function.
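A hypothetical training step tying the pieces together, assuming the `MixedLoss` sketch above, a placeholder `SSHNet` module returning the three outputs $(Y_n, Y_{wm}, \hat{Y})$, and a data loader yielding $(X_{wn}, X_w, Y_w)$ triplets built as in the synthesis sketch; the learning rate and cosine schedule are illustrative, not the paper's exact settings.

```python
import torch

num_epochs = 100                                   # illustrative
model = SSHNet().cuda()                            # placeholder for the full SSH-Net model
criterion = MixedLoss(alpha=0.01).cuda()           # mixed structural + texture loss (see above)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    for x_wn, x_w, y_w in loader:                  # 'loader' yields synthesized triplets
        y_n, y_wm, y_hat = model(x_wn.cuda())      # NRD output, WNRD output, fused output
        loss = criterion(y_n, y_wm, y_hat, x_w.cuda(), y_w.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```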

The datasets used for training and evaluation are based on PASCAL VOC, where watermarks with varying transparency, coverage, and scale, along with different levels of Gaussian noise, are synthesized and added to clean images to create the watermarked noisy inputs $X_{wn}$ and the synthesized references $Y_w$. Performance is evaluated using standard image quality metrics: PSNR, SSIM, and LPIPS.
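These metrics can be computed with standard packages; the following short sketch assumes scikit-image for PSNR/SSIM and the `lpips` package for LPIPS.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # LPIPS expects tensors scaled to [-1, 1]

def evaluate(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred/target: HxWx3 float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(target)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```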

Experimental results demonstrate that SSH-Net achieves state-of-the-art performance compared to various image restoration and watermark removal methods, including PSLNet (2505.05088), particularly excelling in scenarios with varying noise levels and watermark transparencies (Tables 1-5). It also shows strong performance in pure denoising (Table 6) and pure watermark removal (Table 7) tasks when trained appropriately.

In terms of computational complexity, SSH-Net has a higher parameter count than some CNN-based baselines but achieves fewer FLOPs than several compared methods, including PSLNet (Table 8). However, practical runtime and GPU memory consumption can be higher than purely convolutional networks due to the sequential nature of some operations and the overhead of the dual-branch structure (Table 12). The visualization of the gating signal (Figure 11) provides insight into the network's learned behavior, showing that the NRD output is used complementarily and adaptively based on the image content.

Ablation studies (Tables 9, 10, 11) confirm the importance of the proposed components:

  • The hybrid WNRD with Transformers is more effective than a purely CNN-based decoder.
  • The full SSH-Net architecture leveraging both NRD and WNRD with FFU performs best.
  • The shared encoder is more efficient and provides slightly better performance than separate encoders.
  • The Sparse Self-Attention mechanism significantly improves performance compared to using standard attention or MDTA layers.

In summary, SSH-Net offers a practical self-supervised solution for the challenging task of noisy image watermark removal. Its hybrid CNN-Transformer architecture, tailored dual decoders, and efficient attention mechanism enable it to achieve superior restoration quality across diverse noise and watermark conditions without relying on expensive paired clean data, making it highly applicable to real-world scenarios where such data is scarce.
