Iterative Salience Decoder (ISD)
- Iterative Salience Decoder (ISD) is a module that iteratively refines candidate sets via spatial or support salience for enhanced recovery and localization.
- In scene graph generation, ISD integrates geometry-enhanced self-attention and predicate-enhanced cross-attention to re-rank candidate triplets effectively.
- For compressive sensing, ISD employs threshold-based support estimation and truncated ℓ1 minimization to achieve faster reconstruction with lower recovery errors.
The Iterative Salience Decoder (ISD) is an architectural module designed to iteratively highlight spatially salient relationships in scene graph generation (SGG), specifically addressing the salience insensitivity of standard debiasing approaches. In parallel, ISD also denotes an iterative support detection mechanism for sparse signal reconstruction, demonstrating the same core principle of self-correcting iterative refinement in both vision and signal recovery contexts. In either domain, ISD re-ranks candidates by geometric or support salience, yielding improved localization or recovery, and is compatible with existing backbone pipelines (0909.4359, Qu et al., 13 Jan 2026).
1. Conceptual Foundations of ISD
ISD in vision (scene graph generation) is motivated by limitations in Unbiased-SGG backbones, which address predicate-class imbalance but often lose sensitivity to spatial cues. Here, ISD propagates pairwise spatial salience—defined by geometric overlaps of subject-object pairs—via an iterative message-passing decoder. In compressive sensing, ISD estimates a support set from an initial signal reconstruction, solves a truncated optimization excluding this support, then refines support estimates through iteration. The unifying theme of ISD is iterative refinement of candidate sets (entities or support indices), guided by salience or magnitude, moving towards exact structure recovery.
2. Iterative Decoding and Optimization Procedure
Scene Graph Case
After object detection, initial entity features, category scores, and bounding boxes are extracted. ISD maintains subject-side and object-side hidden queries for all entities, and a salience score matrix quantifying subject-object pair salience. Decoding proceeds iteratively:
- Initialization: The combined entity feature is linearly projected to initialize the subject-side queries, the object-side queries, and the salience score matrix.
- Layerwise Decoding:
- Geometry-Enhanced Self-Attention leverages IoU between boxes for spatially-aware propagation within subject/object branches.
- Predicate-Enhanced Cross-Attention uses predicted predicate logits as bias signals for subject-object interactions.
- Feedforward networks update state.
- Salience Refinement: Updated salience scores are computed elementwise in logit space, aggregating the inverse-sigmoid of the previous scores with that of the new subject-object query affinity before re-applying the sigmoid.
- Termination: After a fixed number of decoder layers, the final salience matrix is used to re-rank candidate triplets.
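The loop above can be sketched numerically. The following is a minimal, illustrative NumPy sketch, not the architecture of Qu et al.: the function names, shapes, shared-weight queries, and single-branch update are all assumptions for exposition. It shows the two mechanisms the text describes, IoU as an additive attention bias and salience fusion in logit space.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)   # guard against saturated probabilities
    return np.log(p) - np.log1p(-p)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def iou_matrix(boxes):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    n = len(boxes)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x1 = max(boxes[i][0], boxes[j][0]); y1 = max(boxes[i][1], boxes[j][1])
            x2 = min(boxes[i][2], boxes[j][2]); y2 = min(boxes[i][3], boxes[j][3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            ai = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            aj = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            G[i, j] = inter / (ai + aj - inter)
    return G

def decode_salience(feats, boxes, layers=3):
    """Toy ISD-style loop: IoU-biased attention, then logit-space salience fusion."""
    q_s, q_o = feats.copy(), feats.copy()        # subject/object queries, shared init
    S = np.full((len(boxes), len(boxes)), 0.5)   # start from uninformative salience
    G = iou_matrix(boxes)                        # geometric bias for self-attention
    for _ in range(layers):
        attn = softmax(q_s @ q_o.T / np.sqrt(feats.shape[1]) + G)
        q_s = q_s + attn @ q_o                   # geometry-biased message passing
        affinity = sigmoid(q_s @ q_o.T)          # fresh pairwise affinity
        S = sigmoid(logit(S) + logit(affinity))  # elementwise logit-space update
    return S
```

In the actual module, separate learned projections, predicate-enhanced cross-attention, and feedforward updates replace the shared-weight shortcuts used here.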
Compressive Sensing Case
Each iteration comprises:
- Support Estimation: From the current reconstruction $x^{(s)}$, select the index set $T^{(s)} = \{\, i : |x_i^{(s)}| > \epsilon^{(s)} \,\}$ by thresholding.
- Truncated Minimization: Solve $\min_x \|x_{(T^{(s)})^c}\|_1$ subject to $Ax = b$, where $(T^{(s)})^c$ is the complement of the detected support $T^{(s)}$.
- Convergence Criteria: Stop if support stabilizes or solution changes below a specified tolerance; typically 4–10 iterations are sufficient.
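A minimal sketch of this loop, with the truncated ℓ1 subproblem cast as a linear program via the standard x = u − v split. The solver choice, the fixed-fraction threshold (a simplified stand-in for the first-significant-jump rule), and all constants are assumptions, not the reference implementation:

```python
import numpy as np
from scipy.optimize import linprog

def truncated_l1(A, b, T):
    """Solve min sum_{i not in T} |x_i| s.t. Ax = b, via x = u - v with u, v >= 0."""
    m, n = A.shape
    w = np.ones(n)
    w[np.asarray(T, dtype=int)] = 0.0      # detected support is not penalized
    res = linprog(np.concatenate([w, w]),  # same weights on u and v
                  A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=[(0, None)] * (2 * n), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

def isd(A, b, iters=4, cut=0.25):
    x = truncated_l1(A, b, T=[])           # iteration 0 is plain basis pursuit
    for _ in range(iters):
        T = np.flatnonzero(np.abs(x) > cut * np.abs(x).max())  # support estimate
        x = truncated_l1(A, b, T)          # re-solve, excluding T from the penalty
    return x
```

Since successive subproblems differ only in their penalty weights, a solver that supports warm starts makes the repeated solves cheap in practice.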
3. Mathematical Formulation and Salience Labeling
Scene Graph ISD
The inputs are the detected bounding boxes, entity category scores, entity features, and predicate logits. Subject and object salience queries interact via geometry-enhanced self-attention (G-ESA) and predicate-enhanced cross-attention (P-ECA).
Class-agnostic binary salience labels are constructed by thresholding the IoU of candidate boxes against ground truth: a candidate pair is labeled salient iff there exists a GT pair whose subject and object boxes both exceed the IoU threshold.
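This labeling rule amounts to a short check over candidate and ground-truth boxes. A sketch, where the 0.5 default threshold is a conventional choice assumed for illustration, not a value taken from the paper:

```python
import numpy as np

def iou(b1, b2):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(b1) + area(b2) - inter)

def salience_labels(boxes, gt_pairs, thr=0.5):
    """L[i, j] = 1 iff some GT (subject_box, object_box) pair matches candidates
    i and j with IoU above thr on BOTH boxes; classes are ignored entirely."""
    n = len(boxes)
    L = np.zeros((n, n), dtype=int)
    for gs, go in gt_pairs:
        for i in range(n):
            for j in range(n):
                if iou(boxes[i], gs) >= thr and iou(boxes[j], go) >= thr:
                    L[i, j] = 1
    return L
```

Because the rule never consults predicate or object classes, the resulting supervision is decoupled from predicate frequency statistics.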
Compressive Sensing ISD
At each iteration $s$:
- Support detection: $T^{(s)} = \{\, i : |x_i^{(s)}| > \epsilon^{(s)} \,\}$.
- Truncated minimization: $x^{(s+1)} = \arg\min_x \{\, \|x_{(T^{(s)})^c}\|_1 : Ax = b \,\}$.
The detection threshold $\epsilon^{(s)}$ is chosen at the first significant jump in the sorted magnitudes $|x^{(s)}_{[1]}| \le \cdots \le |x^{(s)}_{[n]}|$, i.e., the smallest sorted index whose gap to the next magnitude exceeds a jump tolerance, which is set empirically and reduced slowly across iterations.
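The first-significant-jump rule can be sketched as follows; the jump tolerance `tau` is left as a caller-supplied parameter, since the paper's schedule for it is not reproduced here:

```python
import numpy as np

def first_jump_threshold(x, tau):
    """Threshold at the first 'significant jump' in sorted magnitudes:
    the smallest sorted magnitude whose gap to the next one exceeds tau."""
    mags = np.sort(np.abs(x))
    jumps = np.diff(mags)
    i = int(np.argmax(jumps > tau))   # first index where the gap exceeds tau
    if jumps[i] <= tau:               # no significant jump found
        return mags[-1]               # detect nothing this iteration
    return mags[i]
```

The detected support is then the set of indices with magnitude above the returned threshold; entries below the jump are treated as noise-level and remain penalized.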
4. Theoretical Guarantees and Recovery Analysis
ISD in compressive sensing is rigorously analyzed under the truncated Null Space Property (t-NSP). For measurement matrices satisfying the t-NSP, truncated basis pursuit recovers the true signal exactly provided the detected support retains the true nonzeros. Random Gaussian matrices are shown to satisfy the t-NSP with high probability, providing quantifiable recovery thresholds in terms of measurement counts and nonzero distributions. Each improvement of the detected support weakens the t-NSP requirement, advancing the recovery process (0909.4359).
5. Integration with Existing Pipelines and Empirical Results
Vision
ISD is modular and can be attached to backbone SGG frameworks (e.g., TDE, IETrans, one-stage models) without architectural alteration. The module operates with the backbone frozen during joint training, optimizing entity, predicate, and salience loss terms simultaneously. At inference, triplet scoring multiplies the semantic predicate probability by the learned spatial salience, and the top-ranked triplets are selected accordingly. Empirically, ISD improves Pairwise Localization Average Precision (pl-AP), defined by IoU thresholds of predicted subject/object boxes against ground-truth pairs, and also raises overall scene graph F-scores (Qu et al., 13 Jan 2026).
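The inference-time fusion-and-ranking step amounts to an elementwise product followed by a top-k selection. A sketch, with illustrative array names and shapes:

```python
import numpy as np

def rank_triplets(pred_probs, salience, k=5):
    """Fuse semantic predicate probability with spatial salience, take top-k.
    pred_probs[i, j, p]: probability of predicate p for subject i, object j;
    salience[i, j]: learned spatial salience for the pair (names are assumed)."""
    score = pred_probs * salience[:, :, None]     # elementwise fusion
    flat = score.ravel()
    top = np.argsort(flat)[::-1][:k]              # indices of the k best scores
    return [np.unravel_index(t, score.shape) + (flat[t],) for t in top]
```

Each returned tuple is (subject index, object index, predicate index, fused score), so downstream evaluation can consume the re-ranked triplet list directly.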
Compressive Sensing
Numerical experiments using MATLAB/YALL1 solvers on random Gaussian and partial DCT matrices demonstrate ISD's competitiveness with IRL1 and IRLS in terms of measurement requirements, with runtime approaching that of a single BP solve. ISD exhibits up to 10× faster computation than IRL1/IRLS, with consistently lower recovery errors. The threshold-ISD variant further exploits rapidly decaying signal distributions for support refinement, outperforming single-shot BP on signals whose nonzeros decay quickly.
| Method | Measurements for 90% success (n=600, k=40) | CPU Time (s) |
|---|---|---|
| BP | 200 | 0.08 |
| IRL1 | 180 | 0.5 |
| IRLS | 140 | 1.2 |
| ISD | 140 | 0.1 |
6. Comparison with Related Approaches
ISD contrasts with classical one-shot ℓ1 minimization by introducing a dynamic, self-correcting support/salience loop. Unlike Orthogonal Matching Pursuit (OMP), ISD's candidate set need not grow monotonically and is updated globally rather than one index at a time. Compared with iteratively reweighted minimization techniques (IRL1, IRLS), ISD typically requires fewer iterations and less computation, while matching or surpassing their measurement efficiency. In vision, prior Unbiased-SGG methods debias predicate frequencies at the expense of localization acuity; ISD restores geometric discrimination through explicit, spatially driven decoding (0909.4359, Qu et al., 13 Jan 2026).
7. Practical Recommendations and Implementation Notes
For compressive sensing, ISD benefits from fast solvers that support warm-starting and is robust when nonzeros exhibit rapid decay; the “first significant jump” rule for thresholding is empirically recommended. Iteration counts of 4–10 balance speed with reconstruction accuracy. For vision, ISD is compatible with any backbone and requires only standard entity/predicate features and bounding boxes. Class-agnostic label generation ensures that spatial salience is learned independently of predicate frequencies, and scoring at inference seamlessly fuses spatial and semantic cues. Signals with known structure (group sparsity, tree structure, total variation) may leverage specialized support detection in ISD's loop.
In summary, ISD operationalizes iterative, self-correcting detection or decoding to enhance recovery or localization performance in structured prediction tasks, maintaining efficiency and modularity while restoring spatial or support sensitivity lost in conventional or debiased pipelines.