Local Continuity Module (LCM)
- LCM denotes a pair of locality-aware designs that use domain-specific local correspondence and compact modeling to capture fine-grained spatial and inter-image relationships.
- In co-salient object detection, LCM uses multi-stage pairwise correlation and 3D convolutions to fuse local and global features, significantly improving accuracy.
- For point cloud masked modeling, LCM leverages a locally constrained encoder and a Mamba-based decoder to reduce computational cost while boosting reconstruction fidelity.
The Local Continuity Module (LCM) designates two distinct, high-impact architectural strategies for modeling fine-grained local relationships: (1) Local Correspondence Modeling in co-salient object detection, and (2) Locally Constrained Compact Models for efficient masked point modeling. Both lines of work replace or augment standard attention frameworks with domain-specific locality-aware components to encode spatial or inter-image affinities, achieving improvements in both accuracy and computational efficiency. The principal designs are exemplified by the LCM in GLNet for co-salient object detection (Cong et al., 2022), and the Locally Constrained Compact Model for point-cloud masked modeling (Zha et al., 27 May 2024).
1. LCM in Co-Salient Object Detection: Architecture and Operations
In the context of co-salient object detection (CoSOD), the Local Correspondence Modeling (LCM) module is a core component of the global-and-local collaborative learning architecture (GLNet), engineered to explicitly capture local inter-image correspondence for robust co-saliency prediction (Cong et al., 2022).
The LCM operates on a feature map $F_i \in \mathbb{R}^{H \times W \times C}$ for each image $I_i$ in a group of $N$ images, typically extracted from a VGG16-based backbone. For each image $I_i$, LCM computes pairwise local correspondences with all other images $I_j$ ($j \neq i$) via a multi-stage Pairwise Correlation Transformation (PCT):
- Subspace Mapping: A 1×1 convolution projects each $F_i$ to a lower-dimensional embedding $\tilde{F}_i \in \mathbb{R}^{H \times W \times C'}$.
- Affinity Estimation: Each $\tilde{F}_i$ and $\tilde{F}_j$ is reshaped to $\mathbb{R}^{HW \times C'}$; their affinities are computed as the transposed matrix product $A_{ij} = \tilde{F}_i \tilde{F}_j^{\top} \in \mathbb{R}^{HW \times HW}$, measuring pixel-wise similarity.
- Score Pooling and Normalization: For image $I_i$, pooling of the row-wise local maxima of $A_{ij}$ followed by softmax normalization yields a weighting map $W_{ij}$.
- Feature Fusion with Attention: The weighting maps are broadcast and fused into residual-attention-weighted feature flows, $F_{ij}^{L} = F_i + W_{ij} \odot F_i$.
- Inter-image Aggregation: The local maps $\{F_{ij}^{L}\}_{j \neq i}$ for each image are stacked along a new depth dimension and passed through stacked 3D convolutions, yielding the local inter-image descriptor $F_i^{L}$.
Internal attention mechanisms (SE-based channel attention and CBAM-style spatial attention) refine both fusion and local context. Key architectural details include two 3D convolutions (approx. $9.4$M parameters), a 1×1 convolution, and attention modules, totaling $10$–$11$M parameters per image.
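The PCT steps above can be sketched in plain Python on toy flattened feature maps (each image a list of per-pixel channel vectors). The helper names and the exact residual-fusion form are illustrative assumptions, not the GLNet reference implementation.

```python
import math

# Minimal sketch of the Pairwise Correlation Transformation (PCT) in LCM.
# Feature maps are already flattened to HW x C' lists; shapes and the
# fusion rule F + w * F are illustrative, not the paper's exact code.

def affinity(fi, fj):
    """A[p][q] = <fi[p], fj[q]>: pixel-wise similarity across two images."""
    return [[sum(a * b for a, b in zip(p, q)) for q in fj] for p in fi]

def row_max(A):
    """Pool each pixel's best match in the other image (local maxima)."""
    return [max(row) for row in A]

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def pct_fuse(fi, fj):
    """Residual attention fusion: F + w * F, with w from the affinity map."""
    w = softmax(row_max(affinity(fi, fj)))
    return [[x + wp * x for x in p] for wp, p in zip(w, fi)]

# Two toy "images", each with 3 pixels and 2 channels.
fi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fj = [[1.0, 1.0], [0.0, 2.0], [0.5, 0.5]]
fused = pct_fuse(fi, fj)
```

In the full module this pairwise fusion runs for every other image in the group before the 3D-convolutional aggregation step.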
2. LCM Contribution to Global-and-Local Feature Fusion
The LCM’s output $F_i^{L}$ provides fine-grained pairwise local descriptors, which are fused with global group-level features from the Global Correspondence Modeling (GCM) module. Fusion occurs via the Global-and-Local Correspondence Aggregation (GLA):
- The concatenated global and local descriptors are fused using a 3D convolution, ReLU, and subsequent channel/spatial attention operations.
- This yields the final inter-image feature incorporated into downstream co-saliency prediction.
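The SE-style channel attention used inside these fusion stages can be illustrated with a stripped-down sketch: squeeze (global average pool per channel), excite (a tiny gating network with a sigmoid), then rescale. The per-channel scalar weights here stand in for the learned excitation MLP and are purely illustrative.

```python
import math

# Hedged sketch of SE-style channel attention: squeeze -> excite -> rescale.
# Real SE blocks use a learned two-layer bottleneck MLP; w1/w2 below are
# illustrative placeholder scalars, not trained parameters.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_channel_attention(fmap, w1, w2):
    """fmap: list of pixels, each a list of C channel values."""
    C = len(fmap[0])
    # squeeze: per-channel global average over all pixels
    z = [sum(p[c] for p in fmap) / len(fmap) for c in range(C)]
    # excite: toy per-channel gate (ReLU then sigmoid)
    gate = [sigmoid(w2 * max(0.0, w1 * zc)) for zc in z]
    # rescale each channel by its gate
    return [[p[c] * gate[c] for c in range(C)] for p in fmap]

fmap = [[1.0, 4.0], [3.0, 0.0]]  # 2 pixels, 2 channels
out = se_channel_attention(fmap, w1=1.0, w2=1.0)
```

Channels with stronger average activation receive gates closer to 1 and are suppressed less, which is the mechanism the fusion stages rely on.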
Ablation studies demonstrate that removing the LCM results in significant performance degradation: on the Cosal2015 dataset, the F-measure drops from $0.8936$ to $0.8550$ and MAE increases from $0.0648$ to $0.0783$, indicating a substantial loss of co-saliency discrimination, especially in groups with strong intra-class variance (Cong et al., 2022).
3. LCM in Point Cloud Modeling: Locally Constrained Compact Model Design
Separately, the Locally Constrained Compact Model (LCM) for masked point modeling (Zha et al., 27 May 2024) establishes a locality-driven alternative to quadratic-complexity Transformer frameworks, targeting redundancy reduction and linear scaling.
The architecture consists of two principal modules:
- Locally Constrained Compact Encoder (LCCE): Replaces global self-attention with local aggregation layers. Each patch token finds its $K$-nearest neighbors via geometric KNN on patch centers, aggregating local structure using concatenation and local MLPs, followed by channel-wise max-pooling. The static neighbor graph, computed once from patch-center coordinates, is shared across all encoder layers, enforcing locality and continuity.
- Locally Constrained Mamba-Based Decoder (LCMD): Integrates a linear-time State-Space Model (SSM, as in Mamba) with a locally constrained feed-forward network (LCFFN). The decoder preserves mutual information for masked patch reconstruction by ensuring that only geometric neighbors communicate, achieving robustness to patch ordering and high reconstruction fidelity.
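The LCCE aggregation step can be sketched as follows: a static geometric KNN on patch centers, then, per token, concatenation with each neighbor, a local transform, and channel-wise max-pooling. The "MLP" below is a toy channel-wise sum standing in for the learned network; names and shapes are illustrative assumptions.

```python
# Sketch of locally constrained aggregation (LCCE-style): static geometric
# KNN on patch centers + per-neighbor transform + channel-wise max-pool.
# The learned local MLP is replaced by a toy channel-wise sum.

def knn(centers, k):
    """Static geometric KNN on patch centers, shared by all encoder layers."""
    idx = []
    for i, ci in enumerate(centers):
        order = sorted(range(len(centers)),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(ci, centers[j])))
        idx.append([j for j in order if j != i][:k])
    return idx

def local_aggregate(tokens, centers, k=2):
    """t_i' = channel-wise max over neighbors j of MLP([t_i ; t_j])."""
    neigh = knn(centers, k)
    out = []
    for i, t in enumerate(tokens):
        # toy "MLP": channel-wise sum of the two concatenated halves
        cand = [[a + b for a, b in zip(t, tokens[j])] for j in neigh[i]]
        out.append([max(col) for col in zip(*cand)])  # channel-wise max-pool
    return out

# Two well-separated clusters of patch centers, 2-channel tokens.
centers = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
tokens = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
agg = local_aggregate(tokens, centers, k=1)
```

Because the neighbor graph depends only on geometry, tokens in one cluster never attend to the other, which is exactly the locality constraint the encoder enforces.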
This design compresses a Point-MAE backbone to $2.7$M parameters (from $22.1$M) and $1.3$G FLOPs (from $4.8$G), while increasing ScanObjectNN OBJ-BG accuracy from $92.67\%$ to $94.51\%$ and ScanNetV2 AP from $59.5$ to $64.7$ (+$5.2$) (Zha et al., 27 May 2024).
4. Mathematical Operations in LCM Modules
LCM in CoSOD (Cong et al., 2022):
- Affinity matrix between images $I_i$ and $I_j$: $A_{ij} = \tilde{F}_i \tilde{F}_j^{\top}$, with $\tilde{F}_i, \tilde{F}_j \in \mathbb{R}^{HW \times C'}$
- Global score and weighting: $s_{ij}(p) = \max_{q} A_{ij}(p, q)$, $W_{ij} = \mathrm{softmax}(s_{ij})$
- Fusion: $F_{ij}^{L} = F_i + W_{ij} \odot F_i$
- 3D Convolutional Aggregation: $F_i^{L} = \mathrm{Conv3D}\big(\mathrm{stack}\{F_{ij}^{L}\}_{j \neq i}\big)$
LCM in Point Cloud Masked Modeling (Zha et al., 27 May 2024):
- Local Aggregation Layer: for each token $t_i$ with geometric $K$-neighborhood $\mathcal{N}(i)$: $t_i' = \max_{j \in \mathcal{N}(i)} \mathrm{MLP}\big([\,t_i \,;\, t_j\,]\big)$, where $[\cdot\,;\,\cdot]$ denotes channel concatenation and the max is taken channel-wise.
- Mutual Information Guarantee (Decoder): the mutual information about the masked patches $X_m$ preserved by the Mamba SSM decoder satisfies $I(Z_{\mathrm{dec}}; X_m) \geq I(Z_{\mathrm{attn}}; X_m)$, due to the data processing inequality and the linear nature of the SSM.
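The argument leans on two standard information-theoretic facts, stated here in generic symbols as a sketch, not the paper's full proof:

```latex
% Data processing inequality: post-processing cannot create information.
% For a Markov chain X_m -> Z -> f(Z):
\[
  I\big(X_m; f(Z)\big) \;\le\; I\big(X_m; Z\big).
\]
% If f is invertible (e.g., a full-rank linear SSM step), equality holds:
\[
  f \text{ invertible} \;\Longrightarrow\; I\big(X_m; f(Z)\big) = I\big(X_m; Z\big),
\]
% so a linear, invertible decoder map loses none of the mutual information
% that the encoder features carry about the masked patches X_m.
```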
5. Computational Characteristics and Ablation Insights
Both forms of LCM dramatically reduce computational cost by confining feature interactions to local neighborhoods.
- Parameter Efficiency: In point cloud MPM, LCM reduces parameter count by roughly $88\%$ (from $22.1$M to $2.7$M) and FLOPs by roughly $73\%$ (from $4.8$G to $1.3$G).
- Accuracy Gains: LCM-based Point-MAE outperforms Transformer-based Point-MAE, e.g., by $+1.84$ points on OBJ-BG ($92.67\% \to 94.51\%$), with consistent gains on OBJ-ONLY and PB-T50-RS.
- Key Hyperparameters: Local aggregation with a small fixed neighborhood size $K$ provides the optimal accuracy/resource tradeoff; geometric KNN matches or slightly outperforms dynamic/feature-space KNN.
- Ablations: In the point-cloud domain, including both local aggregation and the FFN in the encoder yields the best PB-T50-RS accuracy, but local aggregation is the dominant driver, accounting for most of the performance improvement.
- Decoder Variants: The Mamba+LCFFN configuration yields the highest masked-reconstruction accuracy on ScanObjectNN PB-T50-RS.
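The efficiency figures above can be verified with back-of-envelope arithmetic from the reported parameter and FLOP counts:

```python
# Back-of-envelope check of the compression reported for LCM vs the
# Point-MAE Transformer backbone (22.1M -> 2.7M params, 4.8G -> 1.3G FLOPs).
params_transformer, params_lcm = 22.1e6, 2.7e6
flops_transformer, flops_lcm = 4.8e9, 1.3e9

param_reduction = 1 - params_lcm / params_transformer   # ~0.878, i.e. ~88%
flop_reduction = 1 - flops_lcm / flops_transformer      # ~0.729, i.e. ~73%
```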
| Model/Setting | Params | FLOPs | Accuracy/Metric |
|---|---|---|---|
| Transformer (Point-MAE) | 22.1M | 4.8G | OBJ-BG 92.67% |
| LCM (Point-MAE) | 2.7M | 1.3G | OBJ-BG 94.51% |
| GLNet w/ LCM | 10–11M per image | — | Cosal2015 F-measure 0.8936, MAE 0.0648 |
| GLNet w/o LCM | <10M | — | Cosal2015 F-measure 0.8550, MAE 0.0783 |
6. Theoretical Significance and Limitations
LCM-based designs replace non-local self-attention with constrained, neighborhood-preserving aggregation, underpinned by the principle that most relevant contextual information in highly structured domains (e.g., 3D space, local object saliency) is localized. In point cloud modeling, information-theoretic analysis shows that the locally constrained Mamba decoder retains at least as much mutual information about masked regions as a Transformer, while relying on linear operations.
A plausible implication is that for structured data with clear geometric or semantic neighborhoods, LCM-like modules can deliver superior efficiency–accuracy profiles compared to transformer-based paradigms, provided domain knowledge about locality is available. However, for modeling long-range dependencies or highly non-local relationships, purely local architectures may require auxiliary modules or hybrid fusion.
7. Impact and Current Use
LCMs have proven critical both for improved model performance and for making inference or pretraining feasible on larger, more realistic inputs without the quadratic cost of classical attention. In image co-saliency detection, LCM enables fine-grained correspondence learning between images, overcoming limitations of global feature pooling. In point cloud masked modeling, the Locally Constrained Compact Model supports scalable pretraining and robust transfer across 3D tasks, with empirical evidence showing roughly $88\%$ parameter reductions with no loss, and sometimes improvement, in downstream accuracy (Cong et al., 2022, Zha et al., 27 May 2024). In both cases, architectural modularity allows seamless integration with global contextual modeling, supporting a hierarchy of correspondence cues.