Rotational Equivariance in CoMIRs
- The paper introduces a framework that enforces rotational equivariance in CoMIRs through modified contrastive loss, ensuring predictable output transformations under rotations.
- It details the mechanisms used to enforce equivariance, including a C₄-constrained contrastive loss, equivariant convolutional architectures, and Zernike polynomial–based volumetric convolution, covering both 2D and 3D imaging tasks.
- Empirical results demonstrate that enforcing rotational equivariance significantly improves multimodal registration accuracy, with success rates rising from as low as 7-16% to 65-75% in challenging conditions.
Rotational equivariance in contrastive multimodal image representations (CoMIRs) refers to the property that a learned image representation transforms predictably and consistently under rotations of the input, which is essential for robust multimodal registration and cross-modal downstream tasks. This concept is foundational not only in 2D and 3D imaging but also across a range of biomedical and remote sensing applications where objects and samples may appear in arbitrary orientations. Equivariance-enforcing mechanisms and architectures for CoMIRs draw from invariant and equivariant deep learning, group theory, and signal processing, and they have seen practical realization in several contemporary works using methods such as C₄-constrained contrastive loss, equivariant convolution, Zernike polynomial–based volumetric convolution, and moment kernel parameterizations.
1. Mathematical Definition and Relevance in CoMIRs
Formally, a representation function $f$ is said to be rotationally equivariant with respect to a group $G$ if, for every input $x$ and every rotation $g \in G$, there exists a (possibly the same) rotation $g' \in G$ such that $f(g \cdot x) = g' \cdot f(x)$.
In the case of 2D images, $G$ is typically a discrete subgroup of $SO(2)$, e.g., $C_4$ for $90^\circ$-step rotations. For 3D data, $G$ can be $SO(3)$ or subgroups thereof.
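As a concrete illustration of this definition, the property can be checked numerically for $C_4$: apply $f$ to a rotated input and compare with the correspondingly rotated output. The sketch below is a minimal example (not taken from the cited works) that uses an isotropic Gaussian blur as a stand-in for an equivariant encoder:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def is_c4_equivariant(f, x, atol=1e-6):
    """Check f(rot90^k(x)) == rot90^k(f(x)) for all four elements of C4.

    f : callable mapping a 2D array to a 2D array of the same shape.
    x : 2D numpy array serving as a test image.
    """
    y = f(x)
    for k in range(4):
        lhs = f(np.rot90(x, k))   # rotate the input, then encode
        rhs = np.rot90(y, k)      # encode, then rotate the output
        if not np.allclose(lhs, rhs, atol=atol):
            return False
    return True

# An isotropic Gaussian blur commutes with 90-degree rotations, so it passes.
x = np.random.rand(64, 64)
print(is_c4_equivariant(lambda im: gaussian_filter(im, sigma=2), x))
```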
Rotational equivariance is crucial for multimodal registration, where the objective is to align images from distinct modalities (e.g., BF/SHG, RGB/NIR) that can be misaligned by an unknown rotation. Equivariant representations ensure that such misalignments in input space translate to predictable transformations in the representation space, allowing standard monomodal registration methods (SIFT, mutual information, rigid transform solvers) to be applied directly to the CoMIRs. If $f$ is equivariant, registration in the latent space is well-posed even in the presence of arbitrary orientation differences (Pielawski et al., 2020).
2. Mechanisms for Enforcing Rotational Equivariance
2.1. C₄-Equivariant Contrastive Loss
The CoMIRs framework enforces equivariance by modifying the InfoNCE loss to include random rotations drawn from $C_4$ (the group of $90^\circ$ in-plane rotations). For each training example, independent random rotations are sampled and applied in input space, and the corresponding group actions are applied in output space. By sampling all $16$ possible combinations of $C_4$-rotations on the two input modalities and matching their representations with the critic, the encoder is forced to learn equivariant maps (Pielawski et al., 2020).
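A minimal PyTorch-style sketch of this idea is given below, assuming two encoders `f1`, `f2` (one per modality) and inputs shaped (B, C, H, W); it uses a single shared random rotation per batch and a cosine-similarity critic for brevity, whereas the full method samples all 16 rotation combinations and supports other critics:

```python
import torch
import torch.nn.functional as F

def c4_equivariant_infonce(f1, f2, x1, x2, tau=0.1):
    """InfoNCE loss with a random C4 rotation applied in input space
    and the same group action applied in representation space."""
    B = x1.shape[0]
    k = int(torch.randint(0, 4, (1,)))            # random element of C4
    z1 = f1(torch.rot90(x1, k, dims=(2, 3)))      # rotate modality-1 input, then encode
    z2 = torch.rot90(f2(x2), k, dims=(2, 3))      # encode modality-2, then rotate its output
    z1 = F.normalize(z1.flatten(1), dim=1)
    z2 = F.normalize(z2.flatten(1), dim=1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    labels = torch.arange(B, device=x1.device)    # positive pairs on the diagonal
    return F.cross_entropy(logits, labels)
```

Minimizing this loss rewards representations for which rotating the input of one modality is equivalent to rotating the representation of the other, which is exactly the equivariance constraint.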
2.2. Equivariant Convolutional Architectures
Rotation-equivariant convolutions partition the image into conic sectors, each associated with a rotated copy of a canonical filter. All rotated filters share parameters, ensuring that outputs transform equivariantly under rotation:
- The convolution result on each conic sector uses the canonical filter rotated by that sector's angle.
- Output on the boundaries is pooled (e.g., via max pooling) to guarantee smoothness and equivariance.
- At the representation head, a discrete Fourier transform (DFT) and magnitude pooling produce globally rotation-invariant features if desired (Chidester et al., 2018).
In contrast, group convolutions or equivariant convolutional layers with explicit group structure (e.g., for $C_4$ or larger discrete rotation groups) are also applicable; in these, the feature maps themselves may be defined over the group structure.
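For intuition, parameter sharing across rotated filter copies can be sketched as a C₄ lifting convolution with DFT-magnitude pooling over the orientation axis; this is an illustrative simplification, not the conic-partition implementation of Chidester et al.:

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(x, weight):
    """Convolve x (B, C_in, H, W) with four rotated copies of one canonical filter bank.

    weight : (C_out, C_in, k, k) canonical filters, shared across orientations.
    Returns a tensor of shape (B, C_out, 4, H, W) with one slice per orientation.
    """
    outs = []
    for k in range(4):
        w_k = torch.rot90(weight, k, dims=(2, 3))          # parameter-shared rotated copy
        outs.append(F.conv2d(x, w_k, padding=weight.shape[-1] // 2))
    return torch.stack(outs, dim=2)

def orientation_dft_pool(feats):
    """Collapse the orientation axis with DFT magnitudes,
    which are invariant to the cyclic shifts induced by input rotation."""
    return torch.fft.fft(feats, dim=2).abs()
```

Rotating the input by 90° rotates each feature map spatially and cyclically permutes the orientation axis, so the DFT-magnitude pooling yields channel content that is unchanged by such rotations.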
2.3. 3D Volumetric Convolution on the Unit Ball
For 3D volumes, volumetric convolution using Zernike polynomial expansions on the unit ball $\mathbb{B}^3$ produces convolutional layers that are naturally equivariant under rotations and radial translations. By construction, this operation commutes with rotations and radial translations, provided the kernel is axially symmetric, leading to strict equivariance of the feature maps (Ramasinghe et al., 2019).
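A full Zernike-basis implementation is beyond a short snippet, but the commutation property itself is easy to check numerically. The sketch below uses an isotropic Gaussian kernel as a stand-in for a symmetric kernel and verifies that filtering and rotation approximately commute on a random volume (an illustrative check only, not the construction of Ramasinghe et al.):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

vol = np.random.rand(32, 32, 32)

# Rotate-then-filter vs. filter-then-rotate, about the z-axis.
a = gaussian_filter(rotate(vol, 30, axes=(0, 1), reshape=False), sigma=2)
b = rotate(gaussian_filter(vol, sigma=2), 30, axes=(0, 1), reshape=False)

# High correlation away from the boundary indicates approximate equivariance.
interior = (slice(8, -8),) * 3
print(np.corrcoef(a[interior].ravel(), b[interior].ravel())[0, 1])
```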
2.4. Moment Kernels for O(d)-Equivariance
Equivariant convolution kernels can be parameterized as "moment kernels," i.e., radially symmetric functions times monomials (or the identity) in the displacement vector, of the form $K(x) = \varphi(\lVert x \rVert)\, m(x)$ with a radial profile $\varphi$ and a monomial (or identity) $m$. This construction covers all possible $O(d)$-equivariant kernels and enables simple plug-and-play replacement of standard convolutional layers in any CoMIR backbone, while retaining standard deep learning efficiency (Schlamowitz et al., 27 May 2025).
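A minimal sketch of building such a kernel on a discrete 2D grid is shown below (the function name, the radial-profile lookup, and the monomial-degree convention are illustrative assumptions, not the reference code of Schlamowitz et al.):

```python
import numpy as np

def moment_kernel_2d(radial_profile, degree=(1, 0), size=7):
    """Build a kernel of the form phi(|x|) * x1^a * x2^b on a size x size grid.

    radial_profile : callable r -> phi(r); in practice a small set of learnable
                     parameters (e.g., values at discrete radii) would be used.
    degree         : exponents (a, b) of the monomial; (0, 0) gives a purely radial kernel.
    """
    half = size // 2
    x1, x2 = np.meshgrid(np.arange(-half, half + 1),
                         np.arange(-half, half + 1), indexing="ij")
    r = np.sqrt(x1**2 + x2**2)
    a, b = degree
    return radial_profile(r) * (x1.astype(float) ** a) * (x2.astype(float) ** b)

# Example: Gaussian radial profile times the first-coordinate monomial.
kern = moment_kernel_2d(lambda r: np.exp(-r**2 / 4.0), degree=(1, 0))
```

In a full moment-kernel layer, several such radial-profile/monomial components are combined so that the stacked kernel transforms correctly under the group action; the snippet only shows the basic building block.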
3. Empirical Quantification and Stability of Equivariance
To quantify equivariance, representations are probed using rotated input images $g \cdot x$ for $g \in C_4$, with outputs $f(g \cdot x)$. After "unrotating" by $g^{-1}$, the pixel-wise Pearson correlation between $g^{-1} \cdot f(g \cdot x)$ and $f(x)$ is computed. A near-constant, high correlation across all $g$ indicates high-fidelity equivariance (Pielawski et al., 2020).
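A sketch of this probe for a trained encoder `f` (a hypothetical helper, not the paper's evaluation script) is:

```python
import numpy as np

def equivariance_correlations(f, x):
    """Pearson correlation between f(x) and the un-rotated output f(rot^k(x)) for each k."""
    y = f(x)
    corrs = []
    for k in range(4):
        y_k = np.rot90(f(np.rot90(x, k)), -k)   # encode the rotated input, then unrotate
        corrs.append(np.corrcoef(y.ravel(), y_k.ravel())[0, 1])
    return corrs  # near-constant, high values indicate good C4-equivariance
```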
Stability of the CoMIRs encoding under different conditions is assessed by measuring the mean pairwise correlation of representations across independent runs while varying initialization seeds, choice of training images, contrastive temperature parameters, and critic choices (MSE, cosine, bilinear). Representations trained with the equivariant loss remain consistent (mean correlation of 93%) only under fixed seeds and fixed training data, indicating some sensitivity to initialization and data sampling outside these settings.
4. Implementation Strategies and Best Practices
4.1. Kernel Construction and Feature Design
- For 2D, use conic partitioning with shared filters per orientation or moment kernel parameterizations, enforcing equivariance to a discrete rotation group $C_n$ or to $O(2)$, respectively.
- For 3D volumes, use Zernike polynomial expansions and axially symmetric kernels to realize equivariance.
- Moment kernels in both 2D and 3D are realized as functions of $\lVert x \rVert$ times monomials in $x$, with efficient computation via precomputed offsets and mixing coefficients.
4.2. Integration with Contrastive Loss
- The contrastive loss (e.g., InfoNCE) needs no additional hyperparameters or architectural changes; equivariance is enforced solely via random group operations in input and output.
- For strictly equivariant pipelines, apply group actions to both the input and predicted representations during loss computation.
4.3. Feature Pooling and Symmetry Measures
- For invariant tasks, apply DFT magnitude pooling or norm/tracing of feature maps to collapse rotational degrees of freedom.
- Axial symmetry of 3D functions can be quantified using projections onto axially symmetric Zernike modes, providing an additional interpretable feature channel for tasks where anatomical symmetry is relevant (Ramasinghe et al., 2019).
4.4. Training and Data Augmentation Considerations
- Equivariant architectures still benefit from mild rotation/flip data augmentation, aiding generalization to unseen pose distributions.
- Initialization should align with broad radial support in moment kernel profiles. Pretraining with Gaussian-mimicking kernels is recommended for fast early convergence.
- Careful normalization (e.g., batch norm on log-magnitudes) stabilizes feature scales across layers, particularly in moment-kernel networks (Schlamowitz et al., 27 May 2025).
A summary table of key implementation differences:
| Approach | Rotation Group | Architectural Change | Loss Modification |
|---|---|---|---|
| C₄-equivariant CoMIRs | C₄ | None (any encoder) | Randomized group actions in InfoNCE |
| Conic-partition conv | Discrete Cₙ (approximating SO(2)) | Conic partitions | None |
| Zernike volumetric conv | SO(3) (axially symmetric kernels) | 3D Zernike basis | None / optional group-augmented |
| Moment kernels | O(2) / O(3) | Kernel parameterization | None |
5. Impact on Registration and Multimodal Alignment
Rotationally equivariant CoMIRs yield a step change in registration accuracy for challenging multimodal problems relative to standard mutual information–based registration, GAN-based image translation, and domain-specific baselines. For instance, on Bright-Field/SHG biomedical data subject to random rotations, only 7-16% of registrations using standard monomodal mutual information or CurveAlign succeed within a tight error threshold, compared with 65-75% for α-AMD and SIFT applied to CoMIRs (Pielawski et al., 2020). At a more relaxed error threshold, the success rate of CoMIR-based registration is higher still.
GAN-translation baselines (e.g., CycleGAN, pix2pix) fail to yield accurate registration, as intensity mismatches induced by generators prevent robust alignment in the monomodal space. In contrast, CoMIRs—via strict rotational equivariance—ensure that spatial transforms correspond exactly in representation space across modalities, enabling off-the-shelf rigid aligners to succeed.
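To make the off-the-shelf alignment step concrete, a hedged OpenCV sketch is shown below; it assumes two CoMIRs rendered as 8-bit grayscale images `comir_a` and `comir_b`, and follows a generic SIFT-plus-rigid-fit recipe rather than the exact pipeline of the paper:

```python
import cv2
import numpy as np

def register_rigid(comir_a, comir_b):
    """Estimate a similarity transform (rotation, translation, uniform scale)
    mapping comir_b onto comir_a, using SIFT features on the representations."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(comir_a, None)
    kp_b, des_b = sift.detectAndCompute(comir_b, None)

    # Lowe-ratio matching of SIFT descriptors.
    matches = cv2.BFMatcher().knnMatch(des_b, des_a, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_b[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Robust rigid/similarity fit with RANSAC.
    M, _inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return M  # 2x3 transform matrix, applicable to the original moving image
```

Because the representations of both modalities live in a shared, equivariant space, the estimated transform can be applied directly to the original moving image.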
6. Extensions, Limitations, and Theoretical Guarantees
- Volumetric convolution via Zernike expansions guarantees strict rotational (and radial) equivariance in 3D, extensible to arbitrary axes of symmetry and used for symmetry quantification.
- Moment kernel parameterization provides a universal framework for constructing O(d)-equivariant convolutional layers with minimal additional computational overhead and maximal flexibility for CoMIR backbones.
- For tasks demanding invariance rather than equivariance, a pooling or norm operation must be appended to collapse the representation space to a canonical form (Schlamowitz et al., 27 May 2025).
- Empirically, exact equivariance is achieved for grid-aligned rotations (multiples of 90°); other angles rely on interpolation and fine radial discretization for approximate equivariance.
- Stability analyses indicate some sensitivity of representation reproducibility to random initialization and data sampling; practical reproducibility may require fixed seeds in benchmarking setups (Pielawski et al., 2020).
A plausible implication is that the integration of moment kernels or Zernike-based volumetric convolution into CoMIRs architectures reduces sample complexity and enhances convergence speed, while providing rigorous group-theoretic guarantees on representation transformation properties under rotation.
7. Summary and Outlook
Rotational equivariance in CoMIRs is realized via a combination of architectural (equivariant convolution, volumetric/Zernike methods, moment kernels) and training (group-constrained contrastive loss) strategies. These approaches guarantee that learned representations transform predictably under rotation, enabling robust multimodal registration and analysis even with challenging, arbitrarily oriented inputs. Established by works such as (Pielawski et al., 2020, Ramasinghe et al., 2019, Chidester et al., 2018), and (Schlamowitz et al., 27 May 2025), these methods represent the current foundation for symmetry-respecting representation learning in biomedical imaging and related fields. Further extensions may incorporate larger rotation groups, reflection symmetry, or finer-grained tensor fields, but the core mathematical principles remain as described above.