Efficient methodology to capture intrinsic 3D context in CT without 2D-slice simplification

Develop a self-supervised learning methodology for three-dimensional computed tomography (CT) volumes that captures intrinsic 3D contextual information both effectively and efficiently, preserving axial coherence and full 3D structural context rather than reducing the data to independent 2D slices.

Background

The paper reviews prior masked autoencoder-based self-supervised learning approaches adapted to medical imaging and highlights that many methods, including MedMAE, process 3D CT data as independent 2D slices. This simplification discards axial coherence and inter-slice relationships that are crucial for accurate CT analysis.
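
As a rough illustration of what slice-wise processing can lose, the NumPy sketch below contrasts two ways of tokenizing a toy volume: 2D patches taken from each axial slice independently, and 3D cubes that span several slices. It is not the tokenization used by MedMAE or MAESIL. The function names, the patch size, and the random toy volume are all assumptions made for this example.

```python
import numpy as np

def slice_patches(volume, p):
    """Cut every axial slice into non-overlapping p x p patches (hypothetical helper)."""
    d, h, w = volume.shape
    x = volume.reshape(d, h // p, p, w // p, p)
    x = x.transpose(0, 1, 3, 2, 4)           # (d, h/p, w/p, p, p)
    return x.reshape(-1, p * p)               # one token per 2D patch

def cube_patches(volume, p):
    """Cut the volume into non-overlapping p x p x p cubes (hypothetical helper)."""
    d, h, w = volume.shape
    x = volume.reshape(d // p, p, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)         # (d/p, h/p, w/p, p, p, p)
    return x.reshape(-1, p ** 3)              # one token per 3D cube

def as_multiset(tokens):
    """Sort token rows so two token sets can be compared regardless of order."""
    return tokens[np.lexsort(tokens.T[::-1])]

# Toy stand-in for a CT volume with axes (depth, height, width); values are random.
rng = np.random.default_rng(0)
volume = rng.random((8, 16, 16))
reversed_volume = volume[::-1]                # same slices, axial order reversed

# The unordered set of 2D tokens is identical for both volumes.
same_2d = np.array_equal(as_multiset(slice_patches(volume, 4)),
                         as_multiset(slice_patches(reversed_volume, 4)))

# The 3D tokens differ, because each cube encodes the order of its slices.
same_3d = np.array_equal(as_multiset(cube_patches(volume, 4)),
                         as_multiset(cube_patches(reversed_volume, 4)))
```

In this toy setup, reversing the axial order leaves the collection of 2D tokens unchanged but alters the 3D tokens. A model that treats slices independently, with no cross-slice positional information, therefore has no access to the inter-slice structure that the cube tokens carry.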

Domain-specific pre-training has been shown to help mitigate domain shift. The authors emphasize, however, that the field has not yet established an approach that both preserves 3D structural information and remains computationally efficient. This gap motivates their proposed MAESIL framework and frames the broader unresolved challenge.

References

Yet, a methodology that effectively and efficiently captures the intrinsic 3D contextual information of CT scans, without resorting to 2D-slice simplification, remains an open challenge in the field.

MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (2604.00514 - Kim et al., 1 Apr 2026) in Section 2 (Related Work), final paragraph