Semantic-Agnostic Salience Labels
- Semantic-agnostic salience labels are supervisory signals that use low-level visual cues to highlight spatially important regions without relying on semantic priors.
- They enable weakly supervised image manipulation localization by leveraging edge maps and prompt tokens to extract precise boundary information.
- In scene graph generation, these labels decouple geometric cues from predicate biases, improving spatial coherence and debiased recall.
Semantic-agnostic salience labels are supervisory signals or intermediary representations that highlight visually or structurally important regions, entities, or relationships within data—such as images or scene graphs—without reliance on high-level semantic, categorical, or predicate information. In contrast to standard salience annotations driven by task or class labels, semantic-agnostic salience is typically inferred from low-level features (e.g., boundaries, objectness) or geometric/spatial cues, ensuring that model learning and evaluation are not corrupted by priors encoded in annotated semantics. Semantic-agnostic salience labels have recently emerged as critical components in weakly supervised image forensics and unbiased scene graph generation, enabling methodologies that improve generalization and spatial fidelity.
1. Concept and Motivation
Semantic-agnostic salience labels are defined and utilized independently of explicit class, object, or predicate semantics. Their principal purpose is to emphasize local information—such as edges in image manipulation or spatial proximity in entity pairs—enabling models to avoid overfitting to semantic shortcuts or dataset biases.
In weakly supervised image manipulation localization, semantic-agnostic means that prompt signals or boundary representations do not encode object category names; instead, they are driven purely by low-level cues such as edge energy or boundary contrast, focusing the model’s attention on areas likely corresponding to manipulations regardless of object identity (Wang et al., 9 Jan 2026). In scene graph generation, semantic-agnosticity refers to binary entity pair salience labels determined exclusively by geometric alignment (such as intersection-over-union), decoupling the supervision signal from predicate frequency or class imbalances (Qu et al., 13 Jan 2026).
This approach is motivated by the observation that semantic priors frequently degrade spatial localization and underrepresent visually rare or boundary-centric phenomena. Thus, semantic-agnostic salience labels are designed to recover spatial or structural fidelity absent in purely semantic learning pipelines.
2. Semantic-Agnostic Salience in Weakly Supervised Image Manipulation Localization
The Semantic-Agnostic Prompt Learning (SAPL) framework, built on CLIP, exemplifies the use of semantic-agnostic salience for manipulation localization (Wang et al., 9 Jan 2026). Unlike previous methods that rely on pixelwise annotations, or that use image-level binary labels only to drive global feature learning, SAPL operates under strict weak supervision: only binary image-level "manipulated vs. pristine" labels are available, and the model is equipped with modules to extract and leverage boundary-centric cues:
Edge-Aware Contextual Prompt Learning (ECPL):
- Input images are converted to edge maps via classical operators (e.g., Sobel, Canny).
- $K$ learnable prompt tokens $\{p_k\}_{k=1}^{K}$ are combined according to edge-based attention weights $\alpha_k$ derived from the edge map: $p_{\text{edge}} = \sum_{k=1}^{K} \alpha_k\, p_k$.
- $p_{\text{edge}}$ acts as a boundary prototype in CLIP's text embedding space. This formulation is explicitly semantic-agnostic, as the prompt embedding depends solely on the edge map.
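The two ECPL steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pooled edge descriptor, the projection `proj` that produces attention logits, and the token count are all assumptions standing in for SAPL's learned components.

```python
import numpy as np

def sobel_edge_map(img):
    """Edge magnitude via Sobel filtering (a semantic-agnostic low-level cue)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    H, W = img.shape
    gx, gy = np.zeros((H, W)), np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

def boundary_prototype(edge_map, prompt_tokens, proj):
    """Combine K learnable prompt tokens (K, D) via edge-based attention.
    `proj` (K, 3) maps a pooled edge descriptor to K logits; a softmax over
    those logits gives the attention weights alpha."""
    desc = np.array([edge_map.mean(), edge_map.std(), edge_map.max()])
    logits = proj @ desc                       # (K,) edge-conditioned logits
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                       # attention weights alpha_k
    return alpha @ prompt_tokens               # (D,) boundary prototype
```

Because the weights depend only on edge statistics, the resulting prototype carries no object-category information, matching the semantic-agnostic constraint.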
Hierarchical Edge Contrastive Learning (HECL):
- Salient patches centered at high-intensity edge pixels are sampled across multiple scales and labeled as “positive” (manipulated) or “negative” (pristine).
- A contrastive loss enforces separation in CLIP embedding space purely by local appearance.
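A compact sketch of the HECL sampling and contrastive objective, under stated assumptions: the patch descriptor, the InfoNCE-style loss form, and the temperature `tau` are common choices assumed here, not SAPL's exact formulation.

```python
import numpy as np

def sample_edge_patches(edge_map, scales=(3, 5), top_k=4):
    """Pick the top-k highest-intensity edge pixels and pair each center
    with every patch scale, yielding multi-scale salient patch specs."""
    flat = np.argsort(edge_map, axis=None)[::-1][:top_k]
    centers = np.stack(np.unravel_index(flat, edge_map.shape), axis=1)
    return [(tuple(c), s) for c in centers for s in scales]

def contrastive_loss(pos, neg, tau=0.1):
    """InfoNCE-style loss over patch embeddings: pull 'positive'
    (manipulated) embeddings toward an anchor, push 'negative' (pristine)
    ones away, using local appearance only."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    pos, neg = norm(pos), norm(neg)
    anchor = pos[0]
    logits = np.concatenate([pos[1:] @ anchor, neg @ anchor]) / tau
    log_p = logits - np.log(np.exp(logits).sum())
    return -log_p[: len(pos) - 1].mean()       # positives come first
```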
Dense similarity between the learned boundary prototype and per-pixel CLIP image features yields a salience map centered on manipulation edges rather than semantic regions. Post-processing refines this map into masks or continuous maps, usable for downstream manipulation localization. The entire pipeline is notable for requiring no semantic supervision at the prompt, feature, or mask level.
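The dense-similarity step reduces to a cosine map between the prototype and the per-pixel feature grid; a minimal sketch (the min-max normalisation is an assumption, as the paper's post-processing is not specified here):

```python
import numpy as np

def dense_salience(pixel_feats, prototype):
    """Cosine similarity between a boundary prototype and per-pixel features.
    pixel_feats: (H, W, D) image-feature grid; prototype: (D,).
    Returns a salience map rescaled to [0, 1]."""
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    sim = f @ p                                   # (H, W) cosine similarities
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
```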
3. Semantic-Agnostic Spatial Salience in Scene Graph Generation
Salience-SGG (Qu et al., 13 Jan 2026) introduces semantic-agnostic salience within scene graph generation—specifically to support unbiased learning in contexts where predicate distributions are long-tailed and semantic debiasing leads to spatially incoherent graph edges.
Salience Label Construction:
- For each detected entity pair $(s_i, o_j)$, a binary salience label $y_{ij}$ is assigned based solely on spatial overlap with ground-truth subject–object pairs: $y_{ij} = \mathbb{1}\big[\mathrm{IoU}(b_{ij}, b^{\mathrm{gt}}) \geq \tau\big]$, where $\tau$ is a geometric threshold on the intersection-over-union.
- No information about the predicate label is included, making the supervision strictly spatial.
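The label construction above can be sketched directly. The pairwise matching rule (both subject and object boxes must clear the threshold against the same ground-truth pair) and the threshold value `tau=0.5` are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def salience_labels(pred_pairs, gt_pairs, tau=0.5):
    """Binary, predicate-free labels: a predicted (subject, object) box pair
    is salient iff both boxes align with some ground-truth pair above tau."""
    labels = []
    for ps, po in pred_pairs:
        hit = any(iou(ps, gs) >= tau and iou(po, go) >= tau
                  for gs, go in gt_pairs)
        labels.append(int(hit))
    return np.array(labels)
```

Note that no predicate class enters the computation at any point, which is exactly what makes the supervision semantic-agnostic.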
Use in Iterative Salience Decoder (ISD):
- The ISD module refines entity representations across layers, incorporating Geometry-Enhanced Self-Attention (G-ESA) and Predicate-Enhanced Cross-Attention (P-ECA).
- The running salience score matrix is refined across decoder layers, with updates that blend geometry-aware affinity with feature similarity.
- The ISD is trained with a focal loss on the semantic-agnostic labels, supplemented by standard object detection and predicate debiasing losses.
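A minimal sketch of the focal-loss term on the binary salience labels. The standard binary focal loss is shown with its common defaults (`gamma=2.0`, `alpha=0.25`); these hyperparameters are assumed rather than taken from the paper.

```python
import numpy as np

def focal_loss(scores, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss on semantic-agnostic salience labels.
    Down-weights easy pairs so training focuses on hard spatial matches."""
    p = 1.0 / (1.0 + np.exp(-scores))            # sigmoid salience scores
    pt = np.where(labels == 1, p, 1.0 - p)       # probability of true label
    w = np.where(labels == 1, alpha, 1.0 - alpha)
    return float((-w * (1.0 - pt) ** gamma * np.log(pt + 1e-8)).mean())
```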
Significance: This procedure enforces that model attention is directed to spatially plausible and visually salient connections rather than those favored by semantic priors, yielding improvements in both spatial localization metrics and debiased recall, particularly on rare but geometrically valid triplets.
4. Quantitative Outcomes and Empirical Analysis
Empirical results from both domains indicate the utility of semantic-agnostic salience signals.
Scene Graph Generation (Salience-SGG) (Qu et al., 13 Jan 2026):
- On Visual Genome, Salience-SGG outperforms strong recent baselines, including Mag-RMPN and Hydra-SGG, on recall-based metrics.
- Pairwise Localization AP (pl-AP) is simultaneously optimized, countering the trade-off in semantic-debiased models such as TDE and IETrans which often improve mean recall at the expense of spatial precision.
- Ablations confirm that bottom-up (purely geometric) semantic-agnostic labels outperform various top-down (semantically driven) strategies in both mR@100 and F@100.
| Label Type | R@100 | mR@100 | F@100 |
|---|---|---|---|
| top_down(gt) | 32.0 | 18.4 | 23.3 |
| top_down(entity) | 33.0 | 19.6 | 24.6 |
| top_down(triplet) | 32.5 | 19.1 | 24.1 |
| bottom_up (Ours) | 33.4 | 21.6 | 26.2 |
Image Manipulation Localization (SAPL) (Wang et al., 9 Jan 2026):
- SAPL outperforms prior weakly supervised approaches across multiple public benchmarks (CASIA v2, COVERAGE, IMD2020, MFC) in metrics including pixelwise F1, IoU, and edge-centric PR/AUC.
- The framework’s elimination of semantic or class-driven prompt learning is critical for this generalization.
5. Methodological Integration and Broader Applicability
Semantic-agnostic salience labeling approaches can be directly integrated into a variety of visual recognition and spatial reasoning frameworks:
- In image-level weak supervision, edge-centric prototypes derived from boundary energy replace class-token prompts, leading to spatially precise salience maps.
- In scene graph generation, bottom-up entity-pair labels computed from geometry can be used to re-rank triplets proposed by any baseline SGG model, simply by inserting a salience decoder module and augmenting with a salience-driven loss term.
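As a sketch of the re-ranking idea, a learned pairwise salience score can simply modulate each triplet's predicate confidence. The multiplicative blending below is a hypothetical stand-in for the salience decoder's learned scoring:

```python
def rerank_triplets(triplets, salience):
    """Re-rank baseline SGG triplets by blending predicate confidence with a
    pairwise salience score.
    triplets: list of (subj, pred, obj, confidence);
    salience: dict keyed by (subj, obj) with scores in [0, 1]."""
    scored = [(s, p, o, c * salience.get((s, o), 0.0))
              for s, p, o, c in triplets]
    return sorted(scored, key=lambda t: t[3], reverse=True)
```

This lets a geometrically plausible but lower-confidence triplet outrank a high-confidence one whose entity pair is spatially implausible.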
Both architectures—SAPL and Salience-SGG—are compatible with prevailing backbones (CLIP, DETR-derived detectors) and demonstrate that spatial fidelity and generalization are promoted when semantic-agnostic supervision is prioritized.
6. Impact, Limitations, and Research Directions
Semantic-agnostic salience labels mitigate spatial or structural degradation caused by semantic dataset biases, rare class underrepresentation, and annotation expense. Empirical findings demonstrate improved precision in manipulation localization and spatially coherent triplet ranking. Key limitations include potential instability when low-level cues are themselves ambiguous (e.g., weak boundary contrast, crowded scenes) and the constraint that geometric alignment may not suffice for all task domains.
A plausible implication is that future research could explore hierarchical or semi-agnostic labeling hybrids, integrating multi-modal cues (e.g., temporal consistency, motion) or unsupervised structural priors. Rigorous analysis of failure modes and cross-domain transferability, particularly in non-visual or multimodal settings, remains an open area for investigation.