RoMAE: Rotary Masked Autoencoder
- RoMAE denotes a family of methods that extend the masked autoencoder paradigm, either with rotary positional embeddings for continuous multidimensional inputs or with rotation-invariant masking for 3D point clouds.
- It introduces a dual-stream masking approach—combining structured spatial grid and progressive semantic masking—to capture geometric and semantic features efficiently.
- Experimental evaluations demonstrate state-of-the-art performance improvements across various modalities with only a moderate increase in training overhead.
The Rotary Masked Autoencoder (RoMAE) is a family of methods extending the Masked Autoencoder (MAE) paradigm to accommodate either (1) multidimensional continuous positional information, such as in irregular time series or spatial data, via rotary positional encoding mechanisms, or (2) rotation invariance for point clouds via geometry- and semantics-aware masking schedules. RoMAE methods are distinguished by minimal changes to core Transformer or MAE architectures and by the introduction of new masking or positional schemes, allowing for improved generalization and robustness across modalities including time series, images, audio, and 3D point clouds (Zivanovic et al., 26 May 2025, Yin et al., 18 Sep 2025).
1. Architectural Foundations
RoMAE frameworks operate atop standard Transformer encoder–decoder stacks, typically following the original MAE pipeline: input patchification, random masking, encoding of the visible patches, and masked-patch reconstruction via a lightweight decoder.
For modality-agnostic RoMAE (continuous positions):
- Input: A $D$-dimensional array is divided into non-overlapping patches of fixed size along each axis, producing a sequence of $N$ patch tokens.
- Patch Embedding: Each patch $x_i$ is linearly projected to a token embedding $z_i = W x_i + b$. A continuous position vector $p_i \in \mathbb{R}^{D}$ is associated with each patch.
- Rotary Positional Embeddings (RoPE): Embeddings are augmented using Axial RoPE: blockwise 2D rotations parameterized by each coordinate dimension $p_{i,d}$, with rotations applied independently per axis.
- Attention: Before dot-product computation in multi-head self-attention blocks, queries and keys are replaced by their RoPE-rotated versions.
- Masking: Uniform random masking (typically at a 75% ratio) is applied to the patch sequence, regardless of modality; a minimal sketch of this front end follows below.
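The sketch below (NumPy; the function name `embed_and_mask`, the toy dimensions, and the random projection are illustrative assumptions rather than the papers' code) flattens patches, embeds them linearly, keeps the continuous positions alongside the tokens for the later RoPE step, and masks 75% of tokens uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_and_mask(patches, positions, d_model=64, mask_ratio=0.75):
    """Linearly embed patches and apply uniform random masking.

    patches:   (N, patch_dim) flattened patch contents
    positions: (N, D) continuous per-patch coordinates (kept for RoPE later)
    Returns visible embeddings, their positions, and the masked indices.
    """
    n, patch_dim = patches.shape
    # Hypothetical projection weights; a real model would learn W and b.
    W = rng.normal(0, 0.02, size=(patch_dim, d_model))
    b = np.zeros(d_model)
    tokens = patches @ W + b                       # (N, d_model)

    # Uniform random masking at the given ratio, independent of modality.
    n_mask = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return tokens[visible_idx], positions[visible_idx], masked_idx

# Toy usage: 16 patches of an irregularly sampled 1-D time series (D = 1).
patches = rng.normal(size=(16, 8))                 # 16 patches, 8 values each
positions = np.sort(rng.uniform(0, 10, size=(16, 1)), axis=0)  # real-valued times
vis_tok, vis_pos, masked = embed_and_mask(patches, positions)
print(vis_tok.shape, vis_pos.shape, masked.shape)  # (4, 64) (4, 1) (12,)
```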
For rotation-invariant RoMAE (3D point clouds):
- Input: An unordered point set under arbitrary orientation.
- Patchification: Farthest point sampling (FPS) selects a set of centroids; $k$-NN grouping around each centroid yields local point patches, which form the processing units for the encoder (see the sketch after this list).
- Masking Module: A dual-stream architecture orchestrates 3D Spatial Grid Masking (enforcing geometric priors) and Progressive Semantic Masking (clustering attention affinities), combined by a dynamic curriculum weight $\lambda(t)$.
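As a concrete illustration of the patchification step, the following NumPy sketch implements greedy farthest point sampling and $k$-NN grouping; the function names (`fps`, `knn_patches`), the centroid-centred patches, and the toy sizes are assumptions for illustration, not the papers' implementation.

```python
import numpy as np

def fps(points, n_centroids):
    """Greedy farthest point sampling: returns indices of n_centroids points."""
    chosen = [0]                                    # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_centroids - 1):
        idx = int(np.argmax(dist))                  # farthest from the chosen set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def knn_patches(points, centroid_idx, k=32):
    """Group the k nearest neighbours of each centroid into a local patch."""
    centroids = points[centroid_idx]                # (G, 3)
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)  # (G, N)
    nn_idx = np.argsort(d, axis=1)[:, :k]           # (G, k)
    return points[nn_idx] - centroids[:, None, :]   # patches centred on their centroid

# Toy usage: 1024 points, 64 centroids, 32-point local patches.
pts = np.random.default_rng(0).normal(size=(1024, 3))
cidx = fps(pts, 64)
patches = knn_patches(pts, cidx, k=32)
print(patches.shape)                                # (64, 32, 3)
```

Centring each patch on its centroid is a common convention in point-cloud MAE pipelines; the specific choice here is illustrative.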
These design choices guarantee plug-and-play integration into Transformer-based pipelines and allow RoMAE to operate across disparate data geometries with minimal engineering effort.
2. Rotary Positional Embeddings for Continuous Inputs
The core innovation in modality-agnostic RoMAE is the use of rotary positional embeddings to encode arbitrary, potentially continuous, multi-axis positions (e.g., real-valued time or spatial coordinates) into the self-attention mechanism.
Given a patch embedding $z_i \in \mathbb{R}^{d}$ and continuous position $p_i = (p_{i,1}, \ldots, p_{i,D})$:
- Split $z_i$ into $D$ disjoint subspaces and apply RoPE per axis: $\tilde{z}_i = R(p_i)\, z_i$ with $R(p_i) = \operatorname{blockdiag}\big(R_1(p_{i,1}), \ldots, R_D(p_{i,D})\big)$, where each $R_d(p_{i,d})$ is a block-diagonal matrix of 2D rotations through angles $p_{i,d}\,\theta_j$ over a fixed sequence of frequencies $\theta_1, \ldots, \theta_m$.
- No learned positional embeddings are needed; the rotations are deterministic, supporting irregular and continuous positional data.
In each attention head and layer, queries and keys are rotated before the dot product, so the pre-softmax score between patches $i$ and $j$ is
$$\mathrm{score}(i, j) = \frac{\big(R(p_i)\, q_i\big)^{\top} R(p_j)\, k_j}{\sqrt{d_h}} = \frac{q_i^{\top} R(p_j - p_i)\, k_j}{\sqrt{d_h}},$$
which depends only on the relative offset $p_j - p_i$.
This approach supports time-series, spatial, and hybrid data in a unified manner, with patchwise positions encoded for MAE-based pretext tasks, such as masked signal reconstruction.
RoMAE's positional encoding strategy achieves strong performance with standard random masking and, unlike absolute sinusoidal or learned embeddings, requires no modality-specific architectural specialization; a minimal sketch of the rotation step follows.
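The sketch below illustrates axial RoPE on continuous multi-axis positions: each coordinate axis rotates its own slice of the query/key vector through position-dependent angles, and attention scores are computed from the rotated vectors. The function name `axial_rope`, the rotate-half pairing convention, and the frequency base are assumptions; published RoPE implementations differ in these details.

```python
import numpy as np

def axial_rope(x, pos, base=10000.0):
    """Rotate feature vectors by their continuous multi-axis positions (axial RoPE).

    x:   (N, d) queries or keys; d must be divisible by 2 * D
    pos: (N, D) continuous coordinates (e.g., real-valued time, x/y location)
    Each of the D axes rotates its own d/D-dimensional slice of x.
    """
    n, d = x.shape
    n_axes = pos.shape[1]
    d_axis = d // n_axes                                # features allotted to each axis
    half = d_axis // 2
    out = np.empty_like(x)
    for a in range(n_axes):
        sl = x[:, a * d_axis:(a + 1) * d_axis]
        freqs = base ** (-np.arange(half) / half)       # decaying frequency ladder
        angles = pos[:, a:a + 1] * freqs[None, :]       # (N, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = sl[:, :half], sl[:, half:]             # pairwise 2-D sub-planes
        out[:, a * d_axis:(a + 1) * d_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out

# Attention scores then use the rotated queries and keys:
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16)); k = rng.normal(size=(5, 16))
pos = rng.uniform(0, 1, size=(5, 2))                    # D = 2 continuous coordinates
scores = axial_rope(q, pos) @ axial_rope(k, pos).T / np.sqrt(16)
print(scores.shape)                                     # (5, 5)
```

Because every rotation is a deterministic function of the raw coordinates, no positional parameters need to be learned, and irregular sampling poses no difficulty.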
3. Dual-Stream Masking for Rotation-Invariant Point Clouds
In rotation-invariant RoMAE for 3D point clouds, the masking module combines structured masking aligned to geometric and semantic invariants:
3D Spatial Grid Masking:
- Compute ordinal ranks along each spatial axis for all patch centroids $c_i \in \mathbb{R}^3$.
- Discretize the ranks into a checkerboard structure (cube corners), each cell indexed by its parity pattern along the three axes.
- Assign each grid type $g$ a predefined masking probability $p_g$.
- Mask patches in a structured pattern: patch $i$ is masked with probability $p_{g(i)}$, where $g(i)$ is its grid type (a sketch of this stream follows below).
- The scheme is invariant to rotations, providing geometric stability in feature learning.
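A minimal sketch of the grid stream under the parity-cell reading above; the helper name `spatial_grid_mask`, the cell indexing, and the probability values are illustrative assumptions.

```python
import numpy as np

def spatial_grid_mask(centroids, cell_probs, rng):
    """Checkerboard-style masking from per-axis ordinal ranks of patch centroids.

    centroids:  (G, 3) patch centroid coordinates
    cell_probs: mapping from the 8 parity cells (corner types) to mask probabilities
    Returns a boolean mask of shape (G,), True = patch is masked.
    """
    # Ordinal rank of every centroid along each axis (0 .. G-1).
    ranks = np.argsort(np.argsort(centroids, axis=0), axis=0)       # (G, 3)
    # Checkerboard cell = parity pattern of the three ranks (one of 8 corner types).
    cells = ranks % 2                                               # (G, 3) in {0, 1}
    cell_ids = cells[:, 0] * 4 + cells[:, 1] * 2 + cells[:, 2]      # 0 .. 7
    probs = np.array([cell_probs[c] for c in cell_ids])
    return rng.random(len(centroids)) < probs

# Toy usage: 64 centroids; assumed probabilities, higher for half of the corner types.
rng = np.random.default_rng(0)
cents = rng.normal(size=(64, 3))
cell_probs = {c: (0.9 if c % 2 == 0 else 0.6) for c in range(8)}    # illustrative values
mask = spatial_grid_mask(cents, cell_probs, rng)
print(mask.mean())                                                   # overall mask ratio
```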
Progressive Semantic Masking:
- Compute the self-attention affinity matrix $A^{(t)}$ over the currently visible patches at training iteration $t$.
- Threshold the affinities at $\tau_t$ (with $\tau_t$ increasing over training) to construct a sparse semantic graph.
- Cluster patches using EM into $K_t$ Gaussian components, capturing object-part semantics at varying granularity (the number of components shrinks from fine to coarse over training), modeling each patch feature $f_i$ as $p(f_i) = \sum_{k=1}^{K_t} \pi_k\, \mathcal{N}(f_i \mid \mu_k, \Sigma_k)$, where $\pi_k \ge 0$ and $\sum_k \pi_k = 1$.
- All patches within each component are then masked collectively via a per-component random probability (see the sketch below).
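The semantic stream might look roughly as follows; the choice of clustering features (rows of the thresholded affinity matrix), the diagonal-covariance Gaussian mixture from scikit-learn, and the per-component probability range are assumptions made for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def semantic_mask(attn, k_components, tau, rng):
    """Progressive semantic masking sketch: threshold attention affinities,
    cluster patches with an EM-fitted Gaussian mixture, and mask whole clusters.

    attn: (G, G) self-attention affinity matrix averaged over heads
    Returns a boolean mask of shape (G,), True = patch is masked.
    """
    # Sparse semantic graph: keep only affinities above the (growing) threshold tau.
    graph = np.where(attn >= tau, attn, 0.0)
    # Each patch is described by its row of retained affinities; EM clusters these
    # rows into k_components Gaussian components (coarse "object parts").
    gmm = GaussianMixture(n_components=k_components, covariance_type="diag",
                          random_state=0).fit(graph)
    labels = gmm.predict(graph)
    # Mask every patch of a component together, with one random probability per component.
    comp_prob = rng.uniform(0.5, 0.9, size=k_components)    # illustrative range
    return rng.random(len(labels)) < comp_prob[labels]

# Toy usage: a random symmetric affinity matrix over 64 patches.
rng = np.random.default_rng(0)
a = rng.random((64, 64)); attn = (a + a.T) / 2
mask = semantic_mask(attn, k_components=8, tau=0.6, rng=rng)
print(mask.sum(), "of", len(mask), "patches masked")
```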
Curriculum Learning Combination:
- The aggregated mask combines the two streams through a curriculum weight $\lambda(t)$, scheduled convexly over training so that early epochs emphasize geometric (grid) masking and later epochs focus on semantic part coherence (a minimal sketch of such a schedule follows below).
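One possible realization of the curriculum is sketched here; the specific convex form $(1 - \text{progress})^2$ and the per-patch blending rule are assumptions, since the source only states that a convex schedule shifting emphasis from geometric to semantic masking works best.

```python
import numpy as np

def curriculum_weight(epoch, total_epochs, power=2.0):
    """Convex schedule: starts near 1 (geometry-dominated), decays toward 0.

    The exact functional form is an assumption for this sketch.
    """
    progress = epoch / max(total_epochs - 1, 1)
    return (1.0 - progress) ** power

def combine_masks(grid_mask, semantic_mask, lam, rng):
    """Per patch, follow the grid stream with probability lam, else the semantic stream.

    This blending rule is illustrative; it only needs to shift emphasis from
    geometric to semantic masking as lam decays.
    """
    use_grid = rng.random(len(grid_mask)) < lam
    return np.where(use_grid, grid_mask, semantic_mask)

# Toy usage over a 300-epoch pretraining run.
rng = np.random.default_rng(0)
grid_m = rng.random(64) < 0.75
sem_m = rng.random(64) < 0.75
for epoch in (0, 150, 299):
    lam = curriculum_weight(epoch, 300)
    mask = combine_masks(grid_m, sem_m, lam, rng)
    print(f"epoch {epoch:3d}  lambda={lam:.2f}  mask ratio={mask.mean():.2f}")
```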
This dual-stream, curriculum-scheduled masking ensures features encode both pose-invariant structure and part-level semantics.
4. Objectives, Losses, and Optimization
A standard MAE-style reconstruction loss is used across both RoMAE settings:
$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{x}_i - x_i \rVert_2^2,$$
where $\mathcal{M}$ is the set of masked patches, $\hat{x}_i$ is the predicted patch content, and $x_i$ the original.
For point clouds, the loss is applied to the masked point patches; after pretraining, the encoder is frozen for downstream tasks.
No additional regularizers are required for rotation invariance when using a backbone with built-in invariance (e.g., via local reference frames, PCA, or handcrafted invariants).
Training uses the AdamW optimizer with typical hyperparameters (batch size 128, 300 epochs, cosine learning-rate decay) and mixed precision; a minimal sketch of this setup follows.
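The sketch below shows the masked reconstruction objective together with an AdamW plus cosine-decay setup; the learning rate, weight decay, and the stand-in `torch.nn.Linear` model are placeholders, and mixed precision is omitted for brevity.

```python
import torch

def masked_reconstruction_loss(pred, target, masked_idx):
    """Mean squared error over masked patches only (MAE-style objective).

    pred, target: (N, patch_dim) reconstructed and original patch contents
    masked_idx:   indices of the masked patches
    """
    return torch.mean((pred[masked_idx] - target[masked_idx]) ** 2)

# Placeholder model and optimizer setup mirroring the reported recipe
# (AdamW, 300 epochs, cosine decay); lr and weight decay are assumptions.
model = torch.nn.Linear(8, 8)                       # stand-in for the RoMAE encoder-decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

patches = torch.randn(16, 8)                        # original patch contents
masked_idx = torch.randperm(16)[:12]                # 75% of 16 patches
corrupted = patches.clone()
corrupted[masked_idx] = 0.0                         # crude stand-in for masking
for epoch in range(3):                              # a real run would use 300 epochs
    pred = model(corrupted)                         # stand-in for decoder reconstructions
    loss = masked_reconstruction_loss(pred, patches, masked_idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```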
5. Experimental Evaluation
5.1 Modality-agnostic RoMAE
RoMAE achieves state-of-the-art results or matches leading baselines across irregular time series, images, and audio tasks, all using a uniform architecture:
- DESC ELAsTiCC (irregular time series, F-score): Specialized ATAT, 0.6270; RoMAE-tiny, 0.6770.
- Pendulum regression (MSE, lower is better): RoMAE 3.32; S5 3.41; CRU 3.94; ContiFormer 4.63.
- Tiny ImageNet (F-score): RoMAE 0.345; MAE baseline 0.342.
- Audio, ESC-50 accuracy: RoMAE (AudioSet-20k) 83.4%; SSAST 82.2%.
Ablations confirm that a learned [CLS] token enables recovery of absolute positions under RoPE-based positioning; without it, RoPE provides only relative-position guarantees.
5.2 Rotation-invariant RoMAE
Plugging RoMAE masking into existing rotation-invariant MAE backbones consistently increases downstream accuracy, typically by a few tenths of a point (the table below follows the original paper):
| Method (ModelNet40, accuracy %) | A/A | A/R | Z/Z | Z/R | R/R |
|---|---|---|---|---|---|
| RI-MAE | 87.9 | 88.2 | 88.2 | 88.0 | 88.5 |
| +RoMAE | 88.3 | 88.5 | 88.7 | 88.5 | 88.6 |
Largest gains occur under SO(3) rotation during both pretraining and evaluation, confirming the value of masking strategies aligned with geometric and semantic object structure.
6. Ablations and Overhead
- A dynamic clustering schedule for semantic masking outperforms fixed component counts; best results are achieved with a convex schedule, transitioning from fine to coarse components over training.
- Disabling either the grid or the semantic masking stream reduces accuracy by $0.5$ points or more, demonstrating their complementarity.
- Training time increases by roughly $13$–$15\%$ relative to naive random masking (see table below), with no additional inference cost: all masking and clustering are used only during pretraining.
| Backbone | Random Masking | +RoMAE | Increase |
|---|---|---|---|
| RI-MAE | 21:36 h | 24:51 h | +15.0% |
| MaskLRF | 25:08 h | 28:21 h | +12.8% |
| HFBRI-MAE | 22:49 h | 26:01 h | +14.0% |
7. Strengths, Limitations, and Extensions
Strengths
- Unified handling of sequence, spatial, and multi-modal continuous input geometries with a single masking/encoding pipeline.
- No need for learned position embeddings; rotary/axial rotations are deterministic.
- Plug-and-play dual-stream masking improves performance on rotation-invariant point cloud benchmarks with only moderate extra computational cost.
- Outperforms or matches specialized architectures across diverse benchmarks after unsupervised pretraining.
Limitations
- Not suitable for extremely long sequences without further attention/decoder modifications.
- Standard RoMAE is bidirectional and not causally constrained, limiting extrapolation beyond the training range.
- Minor computational overhead when positions are non-static.
Potential Extensions
- Replacing the MAE decoder with linear-time attention or state-space blocks for scalability.
- Investigating anisotropic or learned RoPE frequency schedules.
- Applying dual-stream masking to time-sequenced 3D data, or extending progressive semantic masking to non-vision modalities.
RoMAE, by leveraging structured masking or continuous rotary encoding, generalizes Masked Autoencoder learning to settings requiring either geometric/semantic equivariant representations or flexible handling of continuous positions, and is extensible to emerging data modalities and architectures (Zivanovic et al., 26 May 2025, Yin et al., 18 Sep 2025).