OCTCube-M: A 3D multimodal optical coherence tomography foundation model for retinal and systemic diseases with cross-cohort and cross-device validation (2408.11227v2)

Published 20 Aug 2024 in eess.IV, cs.AI, and cs.CV

Abstract: We present OCTCube-M, a 3D OCT-based multi-modal foundation model for jointly analyzing OCT and en face images. OCTCube-M first developed OCTCube, a 3D foundation model pre-trained on 26,685 3D OCT volumes encompassing 1.62 million 2D OCT images. It then exploits a novel multi-modal contrastive learning framework COEP to integrate other retinal imaging modalities, such as fundus autofluorescence and infrared retinal imaging, into OCTCube, efficiently extending it into multi-modal foundation models. OCTCube achieves best performance on predicting 8 retinal diseases, demonstrating strong generalizability on cross-cohort, cross-device and cross-modality prediction. OCTCube can also predict cross-organ nodule malignancy (CT) and low cardiac ejection fraction as well as systemic diseases, such as diabetes and hypertension, revealing its wide applicability beyond retinal diseases. We further develop OCTCube-IR using COEP with 26,685 OCT and IR image pairs. OCTCube-IR can accurately retrieve between OCT and IR images, allowing joint analysis between 3D and 2D retinal imaging modalities. Finally, we trained a tri-modal foundation model OCTCube-EF from 4 million 2D OCT images and 400K en face retinal images. OCTCube-EF attains the best performance on predicting the growth rate of geographic atrophy (GA) across datasets collected from 6 multi-center global trials conducted in 23 countries. This improvement is statistically equivalent to running a clinical trial with more than double the size of the original study. Our analysis based on another retrospective case study reveals OCTCube-EF's ability to avoid false positive Phase-III results according to its accurate treatment effect estimation on the Phase-II results. In sum, OCTCube-M is a 3D multi-modal foundation model framework that integrates OCT and other retinal imaging modalities revealing substantial diagnostic and prognostic benefits.

Summary

The paper introduces OCTCube-M, a 3D OCT foundation model that holistically models entire retinal volumes to enhance diagnostic accuracy.
The paper pre-trains 3D masked autoencoders, made tractable by FlashAttention, on 26,685 3D volumes comprising 1.62 million 2D images, improving AUPRC over 2D baselines.
The paper demonstrates robust performance across cross-dataset, cross-device, and cross-modality settings, advancing predictions for both retinal and systemic diseases.

An Analytical Distillation of OCTCube: A 3D Foundation Model for Optical Coherence Tomography

The OCTCube paper advances optical coherence tomography (OCT) image analysis by modeling OCT volumes in 3D rather than as independent 2D slices. This review covers the paper's contributions, performance metrics, and future implications, and is aimed at experienced researchers in the field.

Overview of OCTCube

OCTCube moves from traditional 2D slice-based models to a full 3D approach. Pre-trained on 26,685 3D OCT volumes comprising 1.62 million 2D OCT images, OCTCube uses 3D masked autoencoders (MAE) for pre-training. The model uses FlashAttention to reduce the GPU memory demands of 3D data. Its architecture models the entire 3D volume jointly, in contrast to the common practice of aggregating individual 2D slice predictions.
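
To make the memory pressure concrete, the following minimal sketch counts the patch tokens in one volume and draws an MAE-style random mask over them. The volume size (60 x 256 x 256), patch size (3 x 16 x 16), and 75% mask ratio are illustrative assumptions, not values taken from the paper, and `patch_grid` and `random_mask` are hypothetical helper names.

```python
import random


def patch_grid(volume_shape, patch_shape):
    """Return the number of patches along (depth, height, width)."""
    if any(v % p != 0 for v, p in zip(volume_shape, patch_shape)):
        raise ValueError("each volume dimension must be divisible by its patch size")
    return tuple(v // p for v, p in zip(volume_shape, patch_shape))


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) lists, MAE-style."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed sizes for illustration only: 60 slices of 256 x 256 pixels,
# cut into patches of 3 x 16 x 16 voxels, with 75% of tokens masked.
grid = patch_grid((60, 256, 256), (3, 16, 16))
num_tokens = grid[0] * grid[1] * grid[2]
visible, masked = random_mask(num_tokens, mask_ratio=0.75, seed=0)
print(grid, num_tokens, len(visible), len(masked))
```

Under these assumed sizes a single volume yields 5,120 tokens, of which 1,280 stay visible to the encoder. Standard self-attention memory grows quadratically in the token count, which is why a memory-efficient attention kernel matters once whole volumes rather than single slices are tokenized.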

Results and Performance

The model was validated in cross-dataset, cross-disease, cross-device, and cross-modality settings. OCTCube outperformed the 2D model RETFound in predicting eight retinal diseases, in both inductive and cross-dataset scenarios. Notably, it improved average AUPRC from 0.77 to 0.81 in the inductive setting and from 0.66 to 0.77 in cross-dataset settings. OCTCube also generalized well across devices, significantly outperforming 2D models on datasets captured with different devices.
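
For readers less familiar with the metric, the sketch below shows one common way to compute AUPRC (the step-wise average-precision sum) and to average it across diseases. It is an illustration only: the labels and scores are made up, the unweighted macro average is an assumption about how the per-disease values were combined, and tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Area under the precision-recall curve via the average-precision sum."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / total_pos


def macro_auprc(per_disease):
    """Unweighted mean of per-disease AUPRC values."""
    values = [average_precision(labels, scores) for labels, scores in per_disease]
    return sum(values) / len(values)


# Made-up predictions for two hypothetical diseases.
disease_a = ([1, 0, 1, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
disease_b = ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
ap_a = average_precision(*disease_a)
ap_b = average_precision(*disease_b)
mean_ap = macro_auprc([disease_a, disease_b])
print(round(ap_a, 4), round(ap_b, 4), round(mean_ap, 4))
```

Unlike AUROC, this metric is sensitive to class imbalance, which makes it a reasonable choice when positive cases are rare.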

In systemic disease prediction, OCTCube accurately predicted conditions such as diabetes and hypertension, further underscoring its versatility. The model's cross-modality capability was demonstrated by integrating OCT and infrared retinal (IR) images using a contrastive self-supervised learning framework named COEP. This approach aligns OCT volumes with IR en face images, supporting accurate multi-modal retinal modeling.
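
The general idea behind contrastive alignment of paired modalities can be sketched with a CLIP-style symmetric InfoNCE loss. This is an assumption for illustration: the text above does not specify the exact objective, and the embeddings, temperature value, and function names here are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(oct_emb, ir_emb, temperature=0.07):
    """Average of the OCT-to-IR and IR-to-OCT contrastive losses."""
    a = [l2_normalize(v) for v in oct_emb]
    b = [l2_normalize(v) for v in ir_emb]
    # Cosine similarity between every OCT embedding and every IR embedding.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy 2D embeddings: row i of each list is assumed to come from the same eye.
oct_emb = [[1.0, 0.0], [0.0, 1.0]]
ir_aligned = [[0.9, 0.1], [0.1, 0.9]]
ir_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = symmetric_info_nce(oct_emb, ir_aligned)
loss_swapped = symmetric_info_nce(oct_emb, ir_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each OCT embedding is closest to its own IR partner and high when the pairs are swapped, which is the property that makes retrieval between the two modalities possible after training.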

Theoretical and Practical Implications

The transition from 2D to 3D modeling in OCTCube opens up several pathways for improved disease diagnosis and prognosis. The holistic modeling of 3D structures captures continuous spatial patterns more effectively than individual 2D slices, addressing suboptimal results from slice-by-slice aggregation. This advancement is particularly relevant in conditions like Age-related Macular Degeneration (AMD) and Primary Open-Angle Glaucoma (POAG), where disease processes extend across the three-dimensional retinal structure.

Future Developments and Speculations

The presented model paves the way for a broader application and future enhancements in AI-driven retinal diagnostics. Prospective developments could include:

Integration of Multi-Modal and Temporal Data: Future iterations of OCTCube could incorporate other imaging modalities like fundus autofluorescence (FAF), color fundus photography (CFP), and fluorescein angiography (FA) in a 4D framework, encompassing temporal data.
Enhanced Interpretability: Employing advanced interpretability methods such as SHAP and RELPROP could refine the model's clinical utility by pinpointing crucial 3D regions contributing to diagnostic predictions.
Computational Efficiency: Incorporating more computationally efficient neural network architectures and optimizing GPU memory usage will be paramount as models scale up in complexity and training datasets.

Conclusion

OCTCube marks a significant advance in OCT imaging by exploiting the three-dimensional structure inherent in OCT volumes, improving diagnostic accuracy and enabling broader applications in both retinal and systemic disease prediction. The model shows a clear improvement over traditional 2D approaches and provides a foundation for future work in the field. It has substantial implications for the development of more generalized, accurate, and computationally efficient AI models in medical imaging, in ophthalmology and beyond.