RoMAE: Rotary Masked Autoencoder
- RoMAE denotes a family of methods that extend the masked autoencoder paradigm, either with rotary positional embeddings for continuous multidimensional inputs or with rotation-invariant masking for 3D point clouds.
- It introduces a dual-stream masking approach—combining structured spatial grid and progressive semantic masking—to capture geometric and semantic features efficiently.
- Experimental evaluations demonstrate state-of-the-art performance improvements across various modalities with only a moderate increase in training overhead.
The Rotary Masked Autoencoder (RoMAE) is a family of methods extending the Masked Autoencoder (MAE) paradigm to accommodate either (1) multidimensional continuous positional information, such as in irregular time series or spatial data, via rotary positional encoding mechanisms, or (2) rotation invariance for point clouds via geometry- and semantics-aware masking schedules. RoMAE methods are distinguished by minimal changes to core Transformer or MAE architectures and by the introduction of new masking or positional schemes, allowing for improved generalization and robustness across modalities including time series, images, audio, and 3D point clouds (Zivanovic et al., 26 May 2025, Yin et al., 18 Sep 2025).
1. Architectural Foundations
RoMAE frameworks operate atop standard Transformer encoder–decoder stacks, typically following the original MAE pipeline: input patchification, random masking, encoding of the visible patches, and masked-patch reconstruction via a lightweight decoder.
For modality-agnostic RoMAE (continuous positions):
- Input: A $D$-dimensional array is divided into non-overlapping patches of fixed size along each axis, producing a sequence of $N$ patch tokens.
- Patch Embedding: Each patch $x_i$ is linearly projected to a token embedding $z_i = W x_i + b$. A continuous position vector $p_i \in \mathbb{R}^{D}$ is associated with each patch.
- Rotary Positional Embeddings (RoPE): Embeddings are augmented using Axial RoPE: blockwise 2D rotations parameterized by each coordinate dimension $p_{i,d}$, with rotations applied independently per axis.
- Attention: Before dot-product computation in multi-head self-attention blocks, queries and keys are replaced by their RoPE-rotated versions.
- Masking: Uniform random masking (typically at a 75% ratio) is applied to the patch sequence, regardless of modality; a minimal sketch of this front end follows below.
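The sketch below (NumPy; the function name `embed_and_mask`, the toy dimensions, and the random projection are illustrative assumptions rather than the papers' code) flattens patches, embeds them linearly, keeps the continuous positions alongside the tokens for the later RoPE step, and masks 75% of tokens uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_and_mask(patches, positions, d_model=64, mask_ratio=0.75):
    """Linearly embed patches and apply uniform random masking.

    patches:   (N, patch_dim) flattened patch contents
    positions: (N, D) continuous per-patch coordinates (kept for RoPE later)
    Returns visible embeddings, their positions, and the masked indices.
    """
    n, patch_dim = patches.shape
    # Hypothetical projection weights; a real model would learn W and b.
    W = rng.normal(0, 0.02, size=(patch_dim, d_model))
    b = np.zeros(d_model)
    tokens = patches @ W + b                       # (N, d_model)

    # Uniform random masking at the given ratio, independent of modality.
    n_mask = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return tokens[visible_idx], positions[visible_idx], masked_idx

# Toy usage: 16 patches of an irregularly sampled 1-D time series (D = 1).
patches = rng.normal(size=(16, 8))                 # 16 patches, 8 values each
positions = np.sort(rng.uniform(0, 10, size=(16, 1)), axis=0)  # real-valued times
vis_tok, vis_pos, masked = embed_and_mask(patches, positions)
print(vis_tok.shape, vis_pos.shape, masked.shape)  # (4, 64) (4, 1) (12,)
```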
For rotation-invariant RoMAE (3D point clouds):
- Input: An unordered point set under arbitrary orientation.
- Patchification: Farthest point sampling (FPS) selects a set of centroids; $k$-NN grouping around each centroid yields local point patches, which form the processing units for the encoder (see the sketch after this list).
- Masking Module: A dual-stream architecture orchestrates 3D Spatial Grid Masking (enforcing geometric priors) and Progressive Semantic Masking (clustering attention affinities), combined by a dynamic curriculum weight $\lambda(t)$.
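As a concrete illustration of the patchification step, the following NumPy sketch implements greedy farthest point sampling and $k$-NN grouping; the function names (`fps`, `knn_patches`), the centroid-centred patches, and the toy sizes are assumptions for illustration, not the papers' implementation.

```python
import numpy as np

def fps(points, n_centroids):
    """Greedy farthest point sampling: returns indices of n_centroids points."""
    chosen = [0]                                    # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_centroids - 1):
        idx = int(np.argmax(dist))                  # farthest from the chosen set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def knn_patches(points, centroid_idx, k=32):
    """Group the k nearest neighbours of each centroid into a local patch."""
    centroids = points[centroid_idx]                # (G, 3)
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)  # (G, N)
    nn_idx = np.argsort(d, axis=1)[:, :k]           # (G, k)
    return points[nn_idx] - centroids[:, None, :]   # patches centred on their centroid

# Toy usage: 1024 points, 64 centroids, 32-point local patches.
pts = np.random.default_rng(0).normal(size=(1024, 3))
cidx = fps(pts, 64)
patches = knn_patches(pts, cidx, k=32)
print(patches.shape)                                # (64, 32, 3)
```

Centring each patch on its centroid is a common convention in point-cloud MAE pipelines; the specific choice here is illustrative.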
These design choices guarantee plug-and-play integration into Transformer-based pipelines and allow RoMAE to operate across disparate data geometries with minimal engineering effort.
2. Rotary Positional Embeddings for Continuous Inputs
The core innovation in modality-agnostic RoMAE is the use of rotary positional embeddings to encode arbitrary, potentially continuous, multi-axis positions (e.g., real-valued time or spatial coordinates) into the self-attention mechanism.
Given a patch embedding $z_i \in \mathbb{R}^{d}$ and continuous position $p_i = (p_{i,1}, \ldots, p_{i,D})$:
- Split $z_i$ into $D$ disjoint subspaces and apply RoPE per axis: $\tilde{z}_i = R(p_i)\, z_i$ with $R(p_i) = \operatorname{blockdiag}\big(R_1(p_{i,1}), \ldots, R_D(p_{i,D})\big)$, where each $R_d(p_{i,d})$ is a block-diagonal matrix of 2D rotations through angles $p_{i,d}\,\theta_j$ over a fixed sequence of frequencies $\theta_1, \ldots, \theta_m$.
- No learned positional embeddings are needed; the rotations are deterministic, supporting irregular and continuous positional data.
In each attention head and layer, queries and keys are rotated before the dot product, so the pre-softmax score between patches $i$ and $j$ is
$$\mathrm{score}(i, j) = \frac{\big(R(p_i)\, q_i\big)^{\top} R(p_j)\, k_j}{\sqrt{d_h}} = \frac{q_i^{\top} R(p_j - p_i)\, k_j}{\sqrt{d_h}},$$
which depends only on the relative offset $p_j - p_i$.
This approach supports time-series, spatial, and hybrid data in a unified manner, with patchwise positions encoded for MAE-based pretext tasks, such as masked signal reconstruction.
RoMAE's positional encoding strategy achieves strong performance with standard random masking and, unlike absolute sinusoidal or learned embeddings, requires no modality-specific architectural specialization; a minimal sketch of the rotation step follows.
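The sketch below illustrates axial RoPE on continuous multi-axis positions: each coordinate axis rotates its own slice of the query/key vector through position-dependent angles, and attention scores are computed from the rotated vectors. The function name `axial_rope`, the rotate-half pairing convention, and the frequency base are assumptions; published RoPE implementations differ in these details.

```python
import numpy as np

def axial_rope(x, pos, base=10000.0):
    """Rotate feature vectors by their continuous multi-axis positions (axial RoPE).

    x:   (N, d) queries or keys; d must be divisible by 2 * D
    pos: (N, D) continuous coordinates (e.g., real-valued time, x/y location)
    Each of the D axes rotates its own d/D-dimensional slice of x.
    """
    n, d = x.shape
    n_axes = pos.shape[1]
    d_axis = d // n_axes                                # features allotted to each axis
    half = d_axis // 2
    out = np.empty_like(x)
    for a in range(n_axes):
        sl = x[:, a * d_axis:(a + 1) * d_axis]
        freqs = base ** (-np.arange(half) / half)       # decaying frequency ladder
        angles = pos[:, a:a + 1] * freqs[None, :]       # (N, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = sl[:, :half], sl[:, half:]             # pairwise 2-D sub-planes
        out[:, a * d_axis:(a + 1) * d_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out

# Attention scores then use the rotated queries and keys:
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16)); k = rng.normal(size=(5, 16))
pos = rng.uniform(0, 1, size=(5, 2))                    # D = 2 continuous coordinates
scores = axial_rope(q, pos) @ axial_rope(k, pos).T / np.sqrt(16)
print(scores.shape)                                     # (5, 5)
```

Because every rotation is a deterministic function of the raw coordinates, no positional parameters need to be learned, and irregular sampling poses no difficulty.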
3. Dual-Stream Masking for Rotation-Invariant Point Clouds
In rotation-invariant RoMAE for 3D point clouds, the masking module combines structured masking aligned to geometric and semantic invariants:
3D Spatial Grid Masking:
- Compute ordinal ranks along each spatial axis for all patch centroids $c_i \in \mathbb{R}^3$.
- Discretize the ranks into a checkerboard structure (cube corners), each cell indexed by its parity pattern along the three axes.
- Assign each grid type $g$ a predefined masking probability $p_g$.
- Mask patches in a structured pattern: patch $i$ is masked with probability $p_{g(i)}$, where $g(i)$ is its grid type (a sketch of this stream follows below).
- The scheme is invariant to rotations, providing geometric stability in feature learning.
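A minimal sketch of the grid stream under the parity-cell reading above; the helper name `spatial_grid_mask`, the cell indexing, and the probability values are illustrative assumptions.

```python
import numpy as np

def spatial_grid_mask(centroids, cell_probs, rng):
    """Checkerboard-style masking from per-axis ordinal ranks of patch centroids.

    centroids:  (G, 3) patch centroid coordinates
    cell_probs: mapping from the 8 parity cells (corner types) to mask probabilities
    Returns a boolean mask of shape (G,), True = patch is masked.
    """
    # Ordinal rank of every centroid along each axis (0 .. G-1).
    ranks = np.argsort(np.argsort(centroids, axis=0), axis=0)       # (G, 3)
    # Checkerboard cell = parity pattern of the three ranks (one of 8 corner types).
    cells = ranks % 2                                               # (G, 3) in {0, 1}
    cell_ids = cells[:, 0] * 4 + cells[:, 1] * 2 + cells[:, 2]      # 0 .. 7
    probs = np.array([cell_probs[c] for c in cell_ids])
    return rng.random(len(centroids)) < probs

# Toy usage: 64 centroids; assumed probabilities, higher for half of the corner types.
rng = np.random.default_rng(0)
cents = rng.normal(size=(64, 3))
cell_probs = {c: (0.9 if c % 2 == 0 else 0.6) for c in range(8)}    # illustrative values
mask = spatial_grid_mask(cents, cell_probs, rng)
print(mask.mean())                                                   # overall mask ratio
```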
Progressive Semantic Masking:
- Compute the self-attention affinity matrix $A^{(t)}$ over the currently visible patches at training iteration $t$.
- Threshold the affinities at $\tau_t$ (with $\tau_t$ increasing over training) to construct a sparse semantic graph.
- Cluster patches using EM into $K_t$ Gaussian components, capturing object-part semantics at varying granularity (the number of components shrinks from fine to coarse over training), modeling each patch feature $f_i$ as $p(f_i) = \sum_{k=1}^{K_t} \pi_k\, \mathcal{N}(f_i \mid \mu_k, \Sigma_k)$, where $\pi_k \ge 0$ and $\sum_k \pi_k = 1$.
- All patches within each component are then masked collectively via a per-component random probability (see the sketch below).
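The semantic stream might look roughly as follows; the choice of clustering features (rows of the thresholded affinity matrix), the diagonal-covariance Gaussian mixture from scikit-learn, and the per-component probability range are assumptions made for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def semantic_mask(attn, k_components, tau, rng):
    """Progressive semantic masking sketch: threshold attention affinities,
    cluster patches with an EM-fitted Gaussian mixture, and mask whole clusters.

    attn: (G, G) self-attention affinity matrix averaged over heads
    Returns a boolean mask of shape (G,), True = patch is masked.
    """
    # Sparse semantic graph: keep only affinities above the (growing) threshold tau.
    graph = np.where(attn >= tau, attn, 0.0)
    # Each patch is described by its row of retained affinities; EM clusters these
    # rows into k_components Gaussian components (coarse "object parts").
    gmm = GaussianMixture(n_components=k_components, covariance_type="diag",
                          random_state=0).fit(graph)
    labels = gmm.predict(graph)
    # Mask every patch of a component together, with one random probability per component.
    comp_prob = rng.uniform(0.5, 0.9, size=k_components)    # illustrative range
    return rng.random(len(labels)) < comp_prob[labels]

# Toy usage: a random symmetric affinity matrix over 64 patches.
rng = np.random.default_rng(0)
a = rng.random((64, 64)); attn = (a + a.T) / 2
mask = semantic_mask(attn, k_components=8, tau=0.6, rng=rng)
print(mask.sum(), "of", len(mask), "patches masked")
```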
Curriculum Learning Combination:
- The aggregated mask combines the two streams through a curriculum weight $\lambda(t)$, scheduled convexly over training so that early epochs emphasize geometric (grid) masking and later epochs focus on semantic part coherence (a minimal sketch of such a schedule follows below).
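One possible realization of the curriculum is sketched here; the specific convex form $(1 - \text{progress})^2$ and the per-patch blending rule are assumptions, since the source only states that a convex schedule shifting emphasis from geometric to semantic masking works best.

```python
import numpy as np

def curriculum_weight(epoch, total_epochs, power=2.0):
    """Convex schedule: starts near 1 (geometry-dominated), decays toward 0.

    The exact functional form is an assumption for this sketch.
    """
    progress = epoch / max(total_epochs - 1, 1)
    return (1.0 - progress) ** power

def combine_masks(grid_mask, semantic_mask, lam, rng):
    """Per patch, follow the grid stream with probability lam, else the semantic stream.

    This blending rule is illustrative; it only needs to shift emphasis from
    geometric to semantic masking as lam decays.
    """
    use_grid = rng.random(len(grid_mask)) < lam
    return np.where(use_grid, grid_mask, semantic_mask)

# Toy usage over a 300-epoch pretraining run.
rng = np.random.default_rng(0)
grid_m = rng.random(64) < 0.75
sem_m = rng.random(64) < 0.75
for epoch in (0, 150, 299):
    lam = curriculum_weight(epoch, 300)
    mask = combine_masks(grid_m, sem_m, lam, rng)
    print(f"epoch {epoch:3d}  lambda={lam:.2f}  mask ratio={mask.mean():.2f}")
```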
This dual-stream, curriculum-scheduled masking ensures features encode both pose-invariant structure and part-level semantics.
4. Objectives, Losses, and Optimization
A standard MAE-style reconstruction loss is used across both RoMAE settings:
$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{x}_i - x_i \rVert_2^2,$$
where $\mathcal{M}$ is the set of masked patches, $\hat{x}_i$ is the predicted patch content, and $x_i$ the original.
For point clouds, the loss is applied to the masked point patches; after pretraining, the encoder is frozen for downstream tasks.
No additional regularizers are required for rotation invariance when using a backbone with built-in invariance (e.g., via local reference frames, PCA, or handcrafted invariants).
Training uses the AdamW optimizer with typical hyperparameters (batch size 128, 300 epochs, cosine learning-rate decay) and mixed precision; a minimal sketch of this setup follows.
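The sketch below shows the masked reconstruction objective together with an AdamW plus cosine-decay setup; the learning rate, weight decay, and the stand-in `torch.nn.Linear` model are placeholders, and mixed precision is omitted for brevity.

```python
import torch

def masked_reconstruction_loss(pred, target, masked_idx):
    """Mean squared error over masked patches only (MAE-style objective).

    pred, target: (N, patch_dim) reconstructed and original patch contents
    masked_idx:   indices of the masked patches
    """
    return torch.mean((pred[masked_idx] - target[masked_idx]) ** 2)

# Placeholder model and optimizer setup mirroring the reported recipe
# (AdamW, 300 epochs, cosine decay); lr and weight decay are assumptions.
model = torch.nn.Linear(8, 8)                       # stand-in for the RoMAE encoder-decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

patches = torch.randn(16, 8)                        # original patch contents
masked_idx = torch.randperm(16)[:12]                # 75% of 16 patches
corrupted = patches.clone()
corrupted[masked_idx] = 0.0                         # crude stand-in for masking
for epoch in range(3):                              # a real run would use 300 epochs
    pred = model(corrupted)                         # stand-in for decoder reconstructions
    loss = masked_reconstruction_loss(pred, patches, masked_idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```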
5. Experimental Evaluation
5.1 Modality-agnostic RoMAE
RoMAE achieves state-of-the-art results or matches leading baselines across irregular time series, images, and audio tasks, all using a uniform architecture:
- DESC ELAsTiCC (irregular time series, F-score): Specialized ATAT, 0.6270; RoMAE-tiny, 0.6770.
- Pendulum regression (MSE, lower is better): RoMAE 3.32; S5 3.41; CRU 3.94; ContiFormer 4.63.
- Tiny ImageNet (F-score): RoMAE 0.345; MAE baseline 0.342.
- Audio, ESC-50 accuracy: RoMAE (AudioSet-20k) 83.4%; SSAST 82.2%.
Ablations confirm that a learned [CLS] token enables recovery of absolute positions under RoPE-based positioning; without it, RoPE provides only relative-position guarantees.
5.2 Rotation-invariant RoMAE
Plugging RoMAE masking into existing rotation-invariant MAE backbones consistently increases downstream accuracy, typically by a few tenths of a point (the table below follows the original paper):
| Method (ModelNet40, accuracy %) | A/A | A/R | Z/Z | Z/R | R/R |
|---|---|---|---|---|---|
| RI-MAE | 87.9 | 88.2 | 88.2 | 88.0 | 88.5 |
| +RoMAE | 88.3 | 88.5 | 88.7 | 88.5 | 88.6 |
Largest gains occur under SO(3) rotation during both pretraining and evaluation, confirming the value of masking strategies aligned with geometric and semantic object structure.
6. Ablations and Overhead
- A dynamic clustering schedule for semantic masking outperforms fixed component counts; best results are achieved with a convex schedule, transitioning from fine to coarse components over training.
- Disabling either the grid or the semantic masking stream reduces accuracy by $0.5$ points or more, demonstrating their complementarity.
- Training time increases by roughly $13$–$15\%$ relative to naive random masking (see table below), with no additional inference cost: all masking and clustering are used only during pretraining.
| Backbone | Random Masking | +RoMAE | Increase |
|---|---|---|---|
| RI-MAE | 21:36 h | 24:51 h | +15.0% |
| MaskLRF | 25:08 h | 28:21 h | +12.8% |
| HFBRI-MAE | 22:49 h | 26:01 h | +14.0% |
7. Strengths, Limitations, and Extensions
Strengths
- Unified handling of sequence, spatial, and multi-modal continuous input geometries with a single masking/encoding pipeline.
- No need for learned position embeddings; rotary/axial rotations are deterministic.
- Plug-and-play dual-stream masking improves performance on rotation-invariant point cloud benchmarks with only moderate extra computational cost.
- Outperforms or matches specialized architectures across diverse benchmarks after unsupervised pretraining.
Limitations
- Not suitable for extremely long sequences without further attention/decoder modifications.
- Standard RoMAE is bidirectional and not causally constrained, limiting extrapolation beyond the training range.
- Minor computational overhead when positions are non-static.
Potential Extensions
- Replacing the MAE decoder with linear-time attention or state-space blocks for scalability.
- Investigating anisotropic or learned RoPE frequency schedules.
- Applying dual-stream masking to time-sequenced 3D data, or extending progressive semantic masking to non-vision modalities.
RoMAE, by leveraging structured masking or continuous rotary encoding, generalizes Masked Autoencoder learning to settings requiring either geometric/semantic equivariant representations or flexible handling of continuous positions, and is extensible to emerging data modalities and architectures (Zivanovic et al., 26 May 2025, Yin et al., 18 Sep 2025).