Rotation-Invariant Attention Convolution
- RIAttnConv is a rotation-invariant deep learning method for 3D point clouds that fuses local invariant features with shadow-informed global pose cues.
- It employs classic Point Pair Features augmented by novel shadow-difference descriptors to overcome issues like wing-tip collapse in symmetric structures.
- The operator integrates dynamic feature weighting with self-attention-based convolution, achieving state-of-the-art classification and segmentation results on standard benchmarks.
Rotation-invariant Attention Convolution (RIAttnConv) is an architectural paradigm within deep learning for 3D point cloud analysis, designed to guarantee invariance to arbitrary rotations while preserving global pose information. RIAttnConv integrates rotation-invariant local geometric descriptors with an attention-augmented convolutional operator that restores the ability to resolve spatially distinct but geometrically symmetric structures, a key limitation of prior rotation-invariant (RI) frameworks. Its technical instantiation notably advances the state of the art for fine-grained 3D shape classification and segmentation, and generalizes naturally to broader equivariant attention-convolution operators in vision.
1. Problem Motivation and Theoretical Foundations
Conventional deep learning approaches for 3D point clouds primarily focus on translation and permutation invariance, often neglecting or ineffectively addressing invariance to arbitrary SO(3) rotations. Earlier RI methods achieve invariance by replacing raw coordinates with handcrafted rotation-invariant (RI) features. However, such local descriptors strip away global spatial context, leading to phenomena termed "wing-tip collapse," where symmetric yet spatially distinct structures (e.g., left vs. right airplane wings) become indistinguishable under rotation. Standard attention-convolutional approaches, e.g., EdgeConv or Transformer variants, are not inherently RI unless extensively augmented (Guo et al., 11 Nov 2025).
RIAttnConv is constructed to achieve two key properties:
- Provable rotation invariance by exclusively using RI features in convolution and attention computation.
- Global pose awareness via the introduction of a globally consistent reference (“shadow”), so that both local and global geometric information are encoded.
2. Construction of Shadow-informed Pose Features (SiPFs)
The core input to RIAttnConv is the Shadow-informed Pose Feature (SiPF), which fuses classic local RI Point Pair Features (PPFs) with novel shadow-difference features anchored to a learned global orientation. The process is as follows:
- Local Reference Frame (LRF): For each point $p_i$, its $k$ nearest neighbors are gathered. A local orthonormal frame is established via Gram–Schmidt orthogonalization of the surface normal $n_i$ and the barycentric vector from $p_i$ to the neighborhood centroid.
- Classic PPF: For each neighbor $p_j$, the 4-D point pair feature is $\mathrm{PPF}(p_i, p_j) = \big(\lVert d\rVert_2,\ \angle(n_i, d),\ \angle(n_j, d),\ \angle(n_i, n_j)\big)$, with $d = p_j - p_i$.
- Shadow Construction: A global rotation $R_g \in SO(3)$ (learned via a task-adaptive shadow-locating module using the Bingham distribution over unit quaternions) is applied to each point to determine its "shadow" position $\bar p_i = R_g\, p_i$.
- Shadow-difference Feature (SiPPF): the 4-D point pair feature relating a point to its shadow, $\mathrm{SiPPF}(p_j) = \mathrm{PPF}(p_j, \bar p_j)$, which anchors each point to the globally consistent shadow reference.
- 8-D SiPF Vector: the concatenation $\mathrm{SiPF}(p_i, p_j) = \big[\mathrm{PPF}(p_i, p_j)\,\Vert\, \mathrm{SiPPF}(p_j)\big] \in \mathbb{R}^{8}$.
This procedure ensures that SiPFs capture both local geometric invariants and their global relationship to a consistent spatial reference.
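This construction is compact enough to express directly in code. Below is a minimal PyTorch sketch of the SiPF computation for one center point and its $k$ neighbors; the function names and the choice of pairing each neighbor with its own shadow are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def angle(u, v):
    # Unsigned angle between vectors; inputs broadcast over (..., 3).
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    return torch.acos((u * v).sum(-1).clamp(-1.0, 1.0))

def ppf(p_a, n_a, p_b, n_b):
    # Classic 4-D point pair feature; all inputs broadcast over (..., 3).
    d = p_b - p_a
    return torch.stack(
        [d.norm(dim=-1), angle(n_a, d), angle(n_b, d), angle(n_a, n_b)],
        dim=-1,
    )

def sipf(p_i, n_i, nbr_p, nbr_n, R_shadow):
    # p_i, n_i:     (3,)   center point and its normal
    # nbr_p, nbr_n: (k, 3) neighbor points and normals
    # R_shadow:     (3, 3) learned global shadow rotation
    local = ppf(p_i, n_i, nbr_p, nbr_n)                  # (k, 4) classic PPF
    shadow_p = nbr_p @ R_shadow.T                        # shadow positions
    shadow_n = nbr_n @ R_shadow.T                        # shadow normals
    shadow_diff = ppf(nbr_p, nbr_n, shadow_p, shadow_n)  # (k, 4) SiPPF
    return torch.cat([local, shadow_diff], dim=-1)       # (k, 8) SiPF
```

Every SiPF entry is a distance or an angle, so the local half is rotation-invariant by construction; the shadow half stays invariant provided the shadow-locating module predicts $R_g$ consistently with the input pose, while still separating symmetric parts such as left and right wing tips.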
3. RIAttnConv Operator: Attention-augmented RI Convolution
RIAttnConv implements a neighborhood attention mechanism over invariant features, described as follows:
- Dynamic Weighting: For each center–neighbor pair $(p_i, p_j)$, the SiPF is embedded by a small MLP $h_\theta$, producing dynamic weights $w_{ij} = h_\theta\big(\mathrm{SiPF}(p_i, p_j)\big)$.
- Self-attention Computation: Let $f_j \in \mathbb{R}^{d}$ be the feature of neighbor $p_j$, and let $\hat f_{ij} = w_{ij} \odot f_j$, where $\odot$ is elementwise multiplication; stacking the $\hat f_{ij}$ over the $k$ neighbors yields the neighbor feature matrix $\hat F_i \in \mathbb{R}^{k \times d}$. Standard scaled dot-product self-attention is then computed: $\mathrm{Attn}(\hat F_i) = \mathrm{softmax}\big(QK^\top/\sqrt{d}\big)V$, with $Q = \hat F_i W_Q$, $K = \hat F_i W_K$, $V = \hat F_i W_V$.
- Neighborhood Aggregation: The attended neighborhood features are pooled by taking the maximum along the neighbor axis: $g_i = \max_{j=1,\dots,k}\big[\mathrm{Attn}(\hat F_i)\big]_j$.
- Feature Fusion: The output feature $f_i'$ is generated by a fusion MLP applied to the pooled feature $g_i$.
The critical property is that since all quantities are ultimately derived from RI inputs, the entire operator is rotation-invariant. The use of attention across all neighbors (i.e., a $k \times k$ attention matrix) yields a receptive field that dynamically adapts and considers global pose cues via the shared shadow.
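A minimal PyTorch sketch of the operator follows. The single attention head, layer widths, and module layout (`weight_mlp`, `fuse`) are simplifying assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class RIAttnConv(nn.Module):
    # Minimal single-head sketch of the RIAttnConv operator.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight_mlp = nn.Sequential(            # 8-D SiPF -> dynamic weights
            nn.Linear(8, d_in), nn.ReLU(), nn.Linear(d_in, d_in)
        )
        self.q = nn.Linear(d_in, d_in, bias=False)
        self.k = nn.Linear(d_in, d_in, bias=False)
        self.v = nn.Linear(d_in, d_in, bias=False)
        self.fuse = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

    def forward(self, sipf, nbr_feat):
        # sipf:     (N, k, 8)    rotation-invariant SiPFs
        # nbr_feat: (N, k, d_in) features of the k neighbors
        w = self.weight_mlp(sipf)                   # dynamic weights
        f = w * nbr_feat                            # elementwise modulation
        q, k, v = self.q(f), self.k(f), self.v(f)
        attn = torch.softmax(                       # (N, k, k) attention
            q @ k.transpose(-2, -1) / f.shape[-1] ** 0.5, dim=-1
        )
        g = (attn @ v).max(dim=1).values            # max over neighbor axis
        return self.fuse(g)                         # (N, d_out)

# Example: 1024 points, 40 neighbors each, 64-D input features.
layer = RIAttnConv(d_in=64, d_out=128)
out = layer(torch.rand(1024, 40, 8), torch.rand(1024, 40, 64))  # (1024, 128)
```

Because the layer consumes only SiPFs and features derived from them, rotating the input cloud leaves its output unchanged; any pose information enters solely through the shared shadow.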
4. Comparison to Related RI and Attention Convolution Operators
RIAttnConv differs from previous methods in several essential respects:
| Property | Prior RI Conv (PPF, PaRI, etc.) | Standard Attention Conv | RIAttnConv |
|---|---|---|---|
| Local RI features | ✓ | ✗ | ✓ |
| Global pose awareness | ✗ | ✓ (only because non-RI) | ✓ (via shadow) |
| Rotation invariance | ✓ | ✗ | ✓ |
| Fine-grained symmetry discrimination | ✗ (wing-tip collapse) | ✓ (but not rotation-robust) | ✓ |
| Attention mechanism | Typically absent or pairwise | Arbitrary, not RI | RI self-attention on SiPFs |
Earlier work on "Affine Self-Convolution" (ASC) constructs attention-augmented convolutions that are translation or roto-translation equivariant in the image domain (Diaconu et al., 2019). However, these do not address arbitrary SO(3) invariance in 3D point clouds, nor resolve the global pose collapse inherent to local RI features (Guo et al., 11 Nov 2025). Recent surface-based RI operators, such as RISurConv (Zhang et al., 12 Aug 2024), integrate surface triangle invariants with attention for improved 3D RI convolution, but do not inject global pose via shadow features.
5. Training, Computational Complexity, and Implementation
Each RIAttnConv layer consists of two primary MLPs: a small network mapping the 8-D SiPF to dynamic weights, and a fusion MLP. For a point cloud with $N$ points and neighborhood size $k$:
- Complexity: kNN search costs $O(N \log N)$ per layer with a spatial index; SiPF computation is $O(Nk)$; attention is $O(k^2)$ per point, i.e., $O(Nk^2)$ overall. In typical use $k$ is a few tens (e.g., $40$), so for $N = 1024$ the attention involves only $Nk^2 \approx 1.6$M score entries, making the cost manageable.
- Memory Usage: Dominated by the $O(Nkd)$ Q/K/V tensors and the $O(Nk^2)$ attention-score storage.
- Parameter Count: Comparable with other adaptive RI convolution layers.
- Implementation: Task-adaptive global shadow locating is accomplished via a module leveraging the Bingham distribution over unit quaternions, allowing flexible, data-driven shadow orientation learning.
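A common way to realize such a Bingham-parameterized head is to predict a symmetric $4 \times 4$ matrix and take its dominant eigenvector as the mode quaternion of the distribution. The sketch below follows that recipe; the head architecture and its input are hypothetical stand-ins, not the paper's module.

```python
import torch
import torch.nn as nn

def quat_to_rotmat(q):
    # Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

class ShadowLocator(nn.Module):
    # Predicts the global shadow rotation as the mode of a Bingham
    # distribution over unit quaternions, parameterized by a learned
    # symmetric 4x4 matrix (a standard construction; details are assumptions).
    def __init__(self, d_global):
        super().__init__()
        self.head = nn.Linear(d_global, 10)      # upper triangle of a 4x4

    def forward(self, g):
        # g: (d_global,) pooled global descriptor of the point cloud.
        tri = self.head(g)
        A = g.new_zeros(4, 4)
        idx = torch.triu_indices(4, 4)
        A[idx[0], idx[1]] = tri
        A = (A + A.T) / 2                        # symmetric Bingham parameter
        _, vecs = torch.linalg.eigh(A)           # eigenvalues ascending
        q = vecs[:, -1]                          # mode = dominant eigenvector
        return quat_to_rotmat(q)                 # (3, 3) shadow rotation
```

The eigenvector of the largest eigenvalue maximizes $x^\top A x$ on the unit sphere, which is exactly the Bingham mode; the remaining eigenvalue gaps encode concentration and could serve as an uncertainty signal, though only the mode is needed to place shadows.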
6. Empirical Performance and Ablation Analysis
RIAttnConv demonstrates superior performance on standard benchmarks under arbitrary rotations, notably:
- ModelNet40 (SO(3)/SO(3), with normals): RIAttnConv achieves 92.6% overall accuracy, which surpasses all prior RI-centric models.
- ShapeNetPart (z/SO(3)): Class mIoU of 82.9% and instance mIoU of 85.0% with normals.
- Ablation Studies: Utilizing the full 8-D SiPF (vs. only the 4-D PPF) yields a gain of ~1.8% class mIoU; using the RIAttnConv layer (versus alternative SiPF aggregation schemes) gives a further 0.5–0.8% improvement. The architecture retains stable performance across various neighborhood sizes and loss weightings for shadow location.
Qualitative results demonstrate the ability of RIAttnConv to segment and classify symmetric, spatially distinct parts consistently under rotation, due to the preserved global pose cues. Previous RI methods (PPF-CNN, PaRI-Conv) fail in this regime due to local feature ambiguity (“wing-tip collapse”) (Guo et al., 11 Nov 2025).
7. Generalization and Broader Context
RIAttnConv generalizes the approach of attention-augmented convolution to the strict rotation-invariant setting for 3D data, establishing a blueprint for simultaneously achieving invariance and pose discrimination—traits not simultaneously realized by prior approaches. Related developments in rotation-equivariant attentional convolution for images via group convolution and affine attention suggest further avenues for generalizing such operators to other symmetry groups (Diaconu et al., 2019). The systematic integration of global reference anchors (shadows) for pose-awareness is a salient innovation compared to purely local invariant methods or attention schemes that lack symmetry constraints.
A plausible implication is that attention-based RI convolutions with global pose referencing mark a new direction for invariant deep learning, particularly for application regimes where fine-grained spatial part discrimination is essential under arbitrary transformations.