
SLENet: Semantic Localization and Enhancement Network

Updated 6 September 2025
  • SLENet is a neural framework that detects camouflaged objects underwater by enhancing multi-scale features and integrating spatial guidance.
  • The network employs a Gamma-Asymmetric Enhancement module, Localization Guidance Branch, and Multi-Scale Supervised Decoder to refine segmentation.
  • Empirical results on benchmarks like DeepCamo demonstrate significant improvements in metrics such as MAE, S-measure, and F-measure over prior methods.

Semantic Localization and Enhancement Network (SLENet) is an advanced neural framework for camouflaged object detection in challenging underwater environments. Its design tackles the precise differentiation, localization, and enhancement of objects that are visually ambiguous due to optical distortions, water turbidity, and intrinsic camouflage traits of marine organisms. SLENet introduces modular mechanisms for feature augmentation and spatial guidance, delivering high generality for the broader camouflaged object detection (COD) domain. The approach was originally developed for the DeepCamo benchmark dataset, where it establishes new state-of-the-art performance.

1. Motivation and Task Definition

Underwater Camouflaged Object Detection (UCOD) is defined as identifying and segmenting objects that seamlessly blend with underwater backgrounds, often eluding recognition due to environmental artifacts such as light scattering, partial occlusion, and texture confusion. The principal difficulties arise from the lack of distinctive boundaries, low-contrast edges, and variable scale characteristics. SLENet addresses these by jointly learning to enhance multi-scale features and by embedding explicit localization cues that guide downstream segmentation.

2. Network Architecture

SLENet is constructed upon a parameter-efficient encoder derived from SAM2, employing trainable Adapter modules for domain adaptation. The core architecture comprises three interlocked modules:

A. Gamma-Asymmetric Enhancement (GAE) Module

  • Expands the receptive field while preserving fine detail through a multi-branch cascade. Each input scale $X_i$ is reduced by a $1 \times 1$ convolution, then processed in a chain of asymmetric convolutions and max pooling, followed by a dilated convolution ($r=2$).
  • Mathematical formulation for branch $r$:

$$C_i^r = C_{asy}\big(\mathrm{Cat}(X_i^r, \mathrm{Up}(D_i^{(r-1)}, X_i^r))\big), \quad D_i^r = C_{dil}\big(\mathrm{AMP}_{\times (4-r)}(C_i^r)\big)$$

Final enhancement:

$$F_i = \gamma \big(D_i^4 \otimes CA(D_i^4)\big)$$

where $\gamma$ is a learnable scalar and $CA$ denotes channel attention.
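As a concrete illustration, the GAE cascade can be sketched in PyTorch. This is a minimal sketch, not the reference implementation: channel widths, the asymmetric-kernel layout, and the channel-attention design are assumptions, and all branches are kept at one resolution so the per-branch upsampling reduces to reuse of the previous branch output.

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    """Sketch of the Gamma-Asymmetric Enhancement module: 1x1 reduction,
    a cascade of asymmetric convolutions + max pooling + dilated
    convolution (r=2), channel attention, and a learnable gamma scalar."""

    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)            # 1x1 reduction
        # asymmetric convolution pair (1x3 followed by 3x1)
        self.asym = nn.Sequential(
            nn.Conv2d(mid_ch * 2, mid_ch, (1, 3), padding=(0, 1)),
            nn.Conv2d(mid_ch, mid_ch, (3, 1), padding=(1, 0)),
        )
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)     # AMP stand-in
        self.dil = nn.Conv2d(mid_ch, mid_ch, 3, padding=2, dilation=2)
        # channel attention: global pooling -> 1x1 conv -> sigmoid gate
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_ch, mid_ch, 1),
            nn.Sigmoid(),
        )
        self.gamma = nn.Parameter(torch.ones(1))             # learnable scalar

    def forward(self, x):
        x = self.reduce(x)
        d = x
        for _ in range(3):                                   # cascaded branches
            c = self.asym(torch.cat([x, d], dim=1))          # Cat(X, D_prev)
            d = self.dil(self.pool(c))                       # D = C_dil(AMP(C))
        return self.gamma * (d * self.ca(d))                 # F = gamma*(D (x) CA(D))
```

A 1×1 reduction followed by the cheap 1×3/3×1 pair keeps the parameter count low while the dilation supplies long-range context.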

B. Localization Guidance Branch (LGB)

  • Generates a coarse localization map $M$ by recursively fusing shallow-to-deep features. Each stage compresses the feature dimension and fuses it in a bottom-up fashion:

$$F_2^l = GAE\big(\mathrm{Down}(X_1^l) \oplus X_2^l\big), \quad F_i^l = GAE\big(\mathrm{Down}(F_{i-1}^l) \oplus X_i^l\big), \quad i=3,4$$

The final location map is generated:

$$M = CBR_{1 \times 1}(F_4^l)$$

with $CBR$ denoting a convolution, batch normalization, and ReLU.
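The bottom-up fusion above can be sketched as follows. The GAE stand-in (a plain 3×3 convolution here), channel counts, and bilinear interpolation for the downsampling step are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGB(nn.Module):
    """Sketch of the Localization Guidance Branch: recursive bottom-up
    fusion of shallow-to-deep features ending in a CBR_1x1 head that
    emits a single-channel coarse localization map M."""

    def __init__(self, ch=64, n_scales=4):
        super().__init__()
        # stand-in for GAE at each fusion stage (illustrative only)
        self.enhance = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n_scales - 1)
        )
        # CBR_1x1: 1x1 convolution -> batch norm -> ReLU
        self.head = nn.Sequential(
            nn.Conv2d(ch, 1, 1), nn.BatchNorm2d(1), nn.ReLU()
        )

    def forward(self, feats):
        # feats: [X_1, ..., X_4], shallow (high-res) to deep (low-res)
        f = feats[0]
        for i, x in enumerate(feats[1:]):
            f = F.interpolate(f, size=x.shape[-2:], mode="bilinear",
                              align_corners=False)           # Down(.)
            f = self.enhance[i](f + x)                       # GAE(Down(F) + X)
        return self.head(f)                                  # M = CBR_1x1(F_4)
```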

C. Multi-Scale Supervised Decoder (MSSD)

  • Fuses feature maps across scales in a top-down manner. Features $F_i$ are concatenated with upsampled prior predictions, producing $S_i'$ via consecutive $3 \times 3$ convolutions. Modulation with the localization map $M$ is realized by generating affine parameters $\alpha(M)$ and $\beta(M)$, yielding:

$$S_i'' = BN(S_i') \otimes \alpha(M) \oplus \beta(M)$$

After spatial attention and residual linking:

$$R_i = \big(\mathrm{Sigmoid}(SA(S_i'')) \otimes S_i''\big) \oplus S_i''$$

Final outputs:

$$P_i = Conv_{1 \times 1}(R_i)$$

Multi-scale supervision is imposed during learning.
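One decoder stage can be sketched as below. The affine modulation by $\alpha(M)$ and $\beta(M)$ follows the formulas above; kernel sizes for the spatial-attention and parameter-generating convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSDStage(nn.Module):
    """Sketch of one Multi-Scale Supervised Decoder stage: top-down
    fusion, affine modulation by the localization map M, spatial
    attention with a residual link, and a 1x1 prediction head."""

    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Sequential(                  # consecutive 3x3 convs
            nn.Conv2d(ch * 2, ch, 3, padding=1),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.bn = nn.BatchNorm2d(ch, affine=False)  # plain BN, then affine from M
        self.alpha = nn.Conv2d(1, ch, 3, padding=1) # alpha(M)
        self.beta = nn.Conv2d(1, ch, 3, padding=1)  # beta(M)
        self.sa = nn.Conv2d(ch, 1, 7, padding=3)    # spatial attention map
        self.pred = nn.Conv2d(ch, 1, 1)             # P_i head

    def forward(self, f, prev, m):
        # upsample the prior-stage features and the localization map
        prev = F.interpolate(prev, size=f.shape[-2:], mode="bilinear",
                             align_corners=False)
        m = F.interpolate(m, size=f.shape[-2:], mode="bilinear",
                          align_corners=False)
        s = self.fuse(torch.cat([f, prev], dim=1))            # S'_i
        s = self.bn(s) * self.alpha(m) + self.beta(m)         # S''_i
        r = torch.sigmoid(self.sa(s)) * s + s                 # R_i
        return r, self.pred(r)                                # features, P_i
```

Returning both the refined features and the prediction lets the next (shallower) stage consume `r` while `P_i` receives its own supervision signal.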

3. Loss Functions and Training Strategy

A composite loss balances the location map and boundary-sensitive predictions using a linear decay coefficient:

$$\omega_m = \max\big(\mu(1-\text{epoch}/\text{epochs}),\ 0.1\big)$$

$$L_{total} = \omega_m L_{BCE}(G, M) + \frac{1-\omega_m}{3} \sum_{i=1}^{3} L(G, P_i)$$

Binary Cross-Entropy ($L_{BCE}$) penalizes localization errors, and $L(G, P_i)$ measures prediction quality per scale.
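The decay schedule and loss combination follow directly from the formulas; `mu` stands for the hyperparameter $\mu$, whose value the schedule leaves open.

```python
def loss_weight(epoch, total_epochs, mu=1.0, floor=0.1):
    """Linearly decayed weight omega_m for the localization-map loss.
    Early in training the coarse map M dominates; the weight then
    decays toward the floor of 0.1. The default mu is an assumption."""
    return max(mu * (1.0 - epoch / total_epochs), floor)

def total_loss(bce_loc, scale_losses, w):
    """L_total = w * L_BCE(G, M) + (1 - w)/3 * sum_i L(G, P_i),
    where scale_losses holds the three per-scale losses L(G, P_i)."""
    return w * bce_loc + (1.0 - w) / len(scale_losses) * sum(scale_losses)
```

For example, at epoch 0 of 100 the localization term carries full weight ($\omega_m = 1.0$), decaying linearly to the $0.1$ floor by the final epochs.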

4. Empirical Performance

SLENet has been benchmarked on the DeepCamo, COD10K, CAMO-Test, and CHAMELEON datasets. It reports leading values, notably an S-measure ($S_\alpha$) of $0.869$, E-measure ($E_\phi$) of $0.930$, weighted F-measure ($F^w_\beta$) of $0.800$, and MAE of $0.022$ on DeepCamo. Compared to prior methods (e.g., SAM2-UNet, SINet), SLENet achieves a $9.5\%$ reduction in MAE on COD10K, consistently improving localization accuracy and boundary recovery while suppressing false negatives.
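Of the reported metrics, MAE is the simplest to reproduce. A standard definition for segmentation maps (mean absolute pixel difference between a prediction in $[0,1]$ and the binary ground truth) is:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a prediction map in [0, 1] and a
    binary ground-truth mask, averaged over all pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.abs(pred - gt).mean())
```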

5. Enhancement Mechanisms and Feature Fusion

The cascade of asymmetric convolutions plus dilated convolution in GAE enables both local detail preservation and long-range context aggregation. The channel attention (CA) selectively amplifies informative features. The bottom-up LGB supplies global semantic guidance by integrating features across scales, which are subsequently used in MSSD for affine modulation (via $\alpha, \beta$) and spatial attention, anchoring pixel-level predictions to semantically plausible locations. This architecture is empirically shown to outperform naive fusion or simple attention-only regimes.

6. Generalization and Practical Applications

While tailored for underwater camouflage, SLENet’s design is generalizable to broader COD scenarios due to its modular feature enhancement, semantic-guided localization, and efficient encoder adaptation (Adapters on SAM2 backbone). Applications include marine ecology (species monitoring, population assessment), surveillance, and medical image segmentation tasks demanding fine detail discrimination and robust localization. The Adapter mechanism allows rapid re-training for domain transfer with minimal parameter overhead.
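A bottleneck adapter of the kind commonly attached to frozen transformer backbones can be sketched as follows; the bottleneck width, activation, and zero initialization are conventional parameter-efficient-tuning choices, not details taken from SLENet's SAM2 adapters.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    residual connection. Zero-initializing the up-projection makes the
    adapter start as an identity mapping, so the frozen backbone's
    behavior is preserved at the beginning of fine-tuning."""

    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```

Only the adapter parameters are trained, which is what keeps the re-training overhead for domain transfer small.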

7. Mathematical Summary of Core Modules

| Module | Key Operations | Role |
|---|---|---|
| GAE | Cascaded asymmetric/dilated convolutions, channel attention, learnable $\gamma$ | Multi-scale enhancement, context/detail fusion |
| LGB (Guidance) | Recursive feature fusion, $1 \times 1$ convolution, GAE | Localization map generation, semantic anchoring |
| MSSD (Decoder) | Top-down fusion, affine modulation ($\alpha, \beta$ from $M$), spatial attention | Prediction refinement, multi-scale supervision |

Specific formulas:

  • $F_i = \gamma\big(D_i^4 \otimes CA(D_i^4)\big)$
  • $M = CBR_{1 \times 1}(F_4^l)$
  • $S_i'' = BN(S_i') \otimes \alpha(M) \oplus \beta(M)$

Conclusion

SLENet integrates a Gamma-Asymmetric Enhancement module for context/detail fusion, a Localization Guidance Branch for spatial semantic anchoring, and a Multi-Scale Supervised Decoder for robust, scale-adaptive prediction. Empirical results on benchmark datasets demonstrate substantial improvements over existing methods in accuracy and granularity, with architectural components that support generalization to other camouflage and segmentation tasks. The network's use of parameter-efficient adaptation and multi-scale guidance mechanisms positions it as a versatile approach to semantic localization and enhancement in the COD domain (Wang et al., 4 Sep 2025).
