ConvNeXtV2-Large: OCT Teacher Model
- The teacher model is a deep convolutional network with hierarchical stages and 196.4M parameters, achieving 92.6% accuracy in retinal OCT classification.
- The training pipeline employs advanced augmentations, stochastic weight averaging, and focal loss to ensure robust performance despite class imbalance.
- In knowledge distillation, temperature scaling and a combined loss function enable efficient transfer of predictive soft labels to compact student models.
The ConvNeXtV2-Large teacher model serves as the high-capacity backbone and performance reference in the KD-OCT framework for clinical-grade retinal OCT classification. Built around state-of-the-art convolutional design and extensive training enhancements, it establishes a robust performance benchmark for transferring knowledge to compact student models via distillation, enabling efficient deployment while retaining diagnostic accuracy (Nourbakhsh et al., 9 Dec 2025).
1. Architectural Design and Core Components
ConvNeXtV2-Large is constructed as a deep convolutional neural network employing a hierarchical architecture. The backbone is initialized from weights obtained with Fully Convolutional Masked Autoencoding (FCMAE) pretraining and subsequently fine-tuned on ImageNet-22K and ImageNet-1K. Its principal computational unit is the “ConvNeXtV2 block,” which includes the following operations (a PyTorch sketch follows the list):
- Depth-wise 7×7 convolution
- Layer Normalization
- Linear (point-wise) channel expansion by a factor of 4
- GELU nonlinearity
- Global Response Normalization (GRN)
- Linear projection to base channel count
- Residual connection with drop-path regularization
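A minimal PyTorch sketch of this block, including the Global Response Normalization unit, is given below. Module names, the channels-last handling, and the simplified drop-path are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over spatial dimensions (channels-last input)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                     # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)     # global feature aggregation
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)      # divisive channel normalization
        return self.gamma * (x * nx) + self.beta + x

class ConvNeXtV2Block(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> 4x expansion with GELU + GRN -> projection -> residual."""
    def __init__(self, dim, drop_path=0.0):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # point-wise channel expansion by 4x
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)    # projection back to base channel count
        self.drop_path = drop_path                # stochastic depth probability

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # channels-last for LayerNorm / Linear
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)
        if self.training and self.drop_path > 0:
            # simple per-sample stochastic depth (drop-path) regularization
            keep = torch.rand(x.shape[0], 1, 1, 1, device=x.device) >= self.drop_path
            x = x * keep / (1.0 - self.drop_path)
        return shortcut + x
```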
The spatial downsampling strategy is staged, employing a 2×2 strided convolution with LayerNorm at the onset of each stage. The model’s overall stage-wise structure is summarized in the following table:
| Stage | Output Size | Channels | Blocks | Expansion |
|---|---|---|---|---|
| Stem | 384×384→96×96 | 192 | – | – |
| Stage 1 | 96×96 | 192 | 3 | 4× |
| Stage 2 | 48×48 | 384 | 3 | 4× |
| Stage 3 | 24×24 | 768 | 27 | 4× |
| Stage 4 | 12×12 | 1536 | 3 | 4× |
The classification head comprises global average pooling, dropout, and a fully connected layer producing logits for three target categories: normal, drusen, and CNV (choroidal neovascularization). The model contains approximately 196.4 million parameters; FLOPs were not specified.
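For orientation, the following sketch shows how a teacher of this kind could be assembled with the timm library; the model tag, the dropout rate, and the head construction are assumptions, since the source does not provide implementation details.

```python
import timm
import torch.nn as nn

NUM_CLASSES = 3  # normal, drusen, CNV

# Hypothetical construction: ConvNeXtV2-Large pretrained with FCMAE and
# fine-tuned on ImageNet-22K/1K, with a new 3-class classification head.
backbone = timm.create_model(
    "convnextv2_large.fcmae_ft_in22k_in1k_384",  # assumed timm model tag
    pretrained=True,
    num_classes=0,             # strip the original classifier, keep pooled features
)

head = nn.Sequential(
    nn.Dropout(p=0.3),         # dropout rate is a placeholder
    nn.Linear(backbone.num_features, NUM_CLASSES),
)

teacher = nn.Sequential(backbone, head)
print(sum(p.numel() for p in teacher.parameters()) / 1e6, "M parameters")
```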
2. Training Pipeline and Regularization Strategies
To maximize generalization and address dataset-specific challenges, the training pipeline integrates extensive augmentations and advanced regularization. The augmentation sequence (sketched in code after the list) involves:
- Resize and random crop to 384×384
- RandAugment (N=2, M=9): brightness, contrast, saturation, sharpness
- Random rotation (±20°), affine shear (±15°), scale (0.85–1.15)
- Horizontal flip (p=0.5), vertical flip (p=0.3)
- Gaussian blur (p=0.2), posterize/bit-depth reduction (p=0.2)
- Random erasing (scale 0.02–0.15, p=0.25)
- ToTensor, ImageNet normalization (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225])
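A torchvision sketch of this augmentation sequence is shown below; the pre-crop resize, blur kernel size, and posterize bit depth are not stated in the source and are placeholder choices.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize(416),                        # pre-crop resize size is an assumption
    transforms.RandomCrop(384),
    # Stock torchvision RandAugment op set; the source restricts ops to
    # brightness/contrast/saturation/sharpness.
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.RandomRotation(degrees=20),
    transforms.RandomAffine(degrees=0, shear=15, scale=(0.85, 1.15)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=0.2),
    transforms.RandomPosterize(bits=4, p=0.2),     # bit depth is an assumption
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.15)),
])
```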
Stochastic Weight Averaging (SWA) is employed to smooth the final solution: the weights $\theta_i$ of $n$ selected checkpoints are averaged,

$$\theta_{\mathrm{SWA}} = \frac{1}{n} \sum_{i=1}^{n} \theta_i .$$
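A compact sketch of SWA with PyTorch's built-in utilities follows; the start epoch and per-epoch averaging cadence are assumptions, as the source only states that selected checkpoints are averaged.

```python
from torch.optim.swa_utils import AveragedModel, update_bn

def train_with_swa(model, optimizer, loss_fn, train_loader, num_epochs=150, swa_start=100):
    """Train `model` and maintain an SWA average of checkpoints after `swa_start`.

    The start epoch and per-epoch cadence are placeholders, not values from the source.
    """
    swa_model = AveragedModel(model)                # running average theta_SWA
    for epoch in range(num_epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)      # fold this checkpoint into the average
    update_bn(train_loader, swa_model)              # recompute BatchNorm stats (no-op for LayerNorm-only models)
    return swa_model                                # averaged model used for evaluation
```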
Focal loss addresses class imbalance:

$$\mathcal{L}_{\mathrm{focal}} = -\,\alpha_t \,(1 - p_t)^{\gamma} \log(p_t)$$

with $p_t$ the predicted probability for the true class, $\alpha_t$ a weighting factor reflecting class priors, and $\gamma$ the focusing parameter.
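A minimal PyTorch implementation of this loss is sketched below; the class-prior weights and the value of $\gamma$ are placeholders, since the source does not report them.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=None, gamma=2.0):
    """Multi-class focal loss.

    logits:  (N, C) raw scores; targets: (N,) class indices.
    alpha:   optional (C,) tensor of class-prior weights (alpha_t).
    gamma:   focusing parameter; the default value here is an assumption.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:
        loss = alpha[targets] * loss            # weight each sample by its class prior
    return loss.mean()

# Example: three-class OCT problem (normal, drusen, CNV)
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
print(focal_loss(logits, targets))
```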
3. Optimization Protocol and Hyperparameterization
Optimization is conducted using AdamW. Distinct learning rates are assigned to the backbone and to the classification head. Weight decay is set to 0.05. Training begins with a linear warmup over 10 epochs and transitions into a cosine-annealing schedule over 150 epochs:

$$\eta_t = \eta_{\min} + \frac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T_{\max}}\right)$$

with $\eta_{\max}$ the initial (post-warmup) learning rate, $\eta_{\min}$ the final learning rate, $t$ the current epoch, and $T_{\max}$ the schedule length. The effective batch size is 16, obtained via gradient accumulation (physical batch size 4 × 4 accumulation steps), and training uses FP16 mixed precision. Early stopping is applied with a patience of 25 epochs.
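The sketch below condenses this protocol into PyTorch; the learning-rate values, the warmup start factor, and $\eta_{\min}$ are placeholders, since only the weight decay, warmup length, epoch budget, batch-size arithmetic, and mixed-precision setting are given.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer_and_scheduler(backbone, head, backbone_lr=1e-5, head_lr=1e-4):
    """Two parameter groups (backbone vs. head), 10-epoch linear warmup, cosine annealing.
    The learning-rate values and eta_min are placeholders not given in the source."""
    optimizer = AdamW(
        [{"params": backbone.parameters(), "lr": backbone_lr},
         {"params": head.parameters(), "lr": head_lr}],
        weight_decay=0.05,
    )
    warmup = LinearLR(optimizer, start_factor=0.1, total_iters=10)     # 10-epoch warmup
    cosine = CosineAnnealingLR(optimizer, T_max=140, eta_min=1e-7)     # remaining 140 epochs
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[10])
    return optimizer, scheduler

def train_one_epoch(model, criterion, optimizer, loader, scaler, accum_steps=4):
    """FP16 mixed-precision epoch with gradient accumulation
    (physical batch 4 x accumulation 4 = effective batch 16)."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = criterion(model(images), targets) / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)        # unscale gradients and take the optimizer step
            scaler.update()
            optimizer.zero_grad()

# Usage: scaler = torch.cuda.amp.GradScaler() is created once before the training loop.
```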
4. Empirical Performance and Evaluation
On the Noor Eye Hospital (NEH) dataset, ConvNeXtV2-Large achieves:
- Accuracy: 92.6% ± 2.3
- Sensitivity: 92.9% ± 2.1
- Specificity: 98.1% ± 0.8
AUC was not reported. Inference is profiled on NVIDIA H200 GPU hardware, but the precise per-image latency was not specified. This level of performance establishes clinical-grade diagnostic reliability in three-class OCT classification. These results serve as the baseline for subsequent student model compression in the KD-OCT pipeline (Nourbakhsh et al., 9 Dec 2025).
5. Teacher Model Role in Knowledge Distillation
ConvNeXtV2-Large acts as the teacher for knowledge distillation within KD-OCT. During distillation, the teacher logits $z_i$ undergo temperature scaling with temperature $T$:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
The student’s loss combines cross-entropy on hard labels with KL divergence on these soft (temperature-scaled) teacher predictions:

$$\mathcal{L}_{\mathrm{student}} = \lambda_{\mathrm{CE}}\, \mathcal{L}_{\mathrm{CE}}\big(y, \sigma(z^{s})\big) + \lambda_{\mathrm{KD}}\, T^{2}\, \mathrm{KL}\big(\sigma(z^{t}/T)\,\|\,\sigma(z^{s}/T)\big)$$

where $\sigma$ denotes the softmax, $z^{t}$ and $z^{s}$ are teacher and student logits, and $y$ is the ground-truth label, with relative weights $\lambda_{\mathrm{CE}}$ (hard target) and $\lambda_{\mathrm{KD}}$ (distillation term). This balance is configured to transfer both the informative soft class boundaries and the ground-truth supervision.
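A minimal PyTorch sketch of this combined objective follows; the temperature and the relative weights are placeholders, since their numeric values are not reported here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets,
            temperature=4.0, lambda_ce=0.5, lambda_kd=0.5):
    """Hard-label cross-entropy plus KL distillation on temperature-scaled teacher outputs.

    temperature, lambda_ce, and lambda_kd are assumptions; only the structure of the
    loss is stated in the source.
    """
    ce = F.cross_entropy(student_logits, targets)                       # hard targets
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)      # temperature-scaled p_i
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    return lambda_ce * ce + lambda_kd * (temperature ** 2) * kl

# Example with random logits for the three OCT classes
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
print(kd_loss(student_logits, teacher_logits, targets))
```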
6. Contextual Significance and Deployment Implications
ConvNeXtV2-Large, as configured in this framework, demonstrates the upper bound of achievable performance for three-class OCT diagnosis using convolutional architectures with advanced augmentations and regularization. While the computational resources required preclude its routine clinical deployment, its role as a distillation teacher allows for the training of lightweight student models (EfficientNet-B2 in KD-OCT) that achieve comparable accuracy with significantly reduced memory and inference time. This paradigm facilitates practical deployment of deep-learning-based OCT triage in edge and point-of-care environments, thus enabling broad clinical access without substantial loss of diagnostic utility (Nourbakhsh et al., 9 Dec 2025).