ConvNeXtV2-Large: OCT Teacher Model

Updated 16 December 2025
  • The teacher model is a deep convolutional network with hierarchical stages and 196.4M parameters, achieving 92.6% accuracy in retinal OCT classification.
  • The training pipeline employs advanced augmentations, stochastic weight averaging, and focal loss to ensure robust performance despite class imbalance.
  • In knowledge distillation, temperature scaling and a combined loss function enable efficient transfer of predictive soft labels to compact student models.

The ConvNeXtV2-Large teacher model serves as the high-capacity backbone and performance reference in the KD-OCT framework for clinical-grade retinal OCT classification. Built around state-of-the-art convolutional design and extensive training enhancements, it establishes a robust performance benchmark for transferring knowledge to compact student models via distillation, enabling efficient deployment while retaining diagnostic accuracy (Nourbakhsh et al., 9 Dec 2025).

1. Architectural Design and Core Components

ConvNeXtV2-Large is constructed as a deep convolutional neural network with a hierarchical, staged architecture. The backbone is initialized with weights pretrained via Fully Convolutional Masked Autoencoding (FCMAE) and fine-tuned on ImageNet-22K and ImageNet-1K. Its principal computational unit is the “ConvNeXtV2 block”, which includes the following (a minimal code sketch follows the list):

  • Depth-wise 7×7 convolution
  • Layer Normalization
  • Linear (point-wise) channel expansion by a factor of 4
  • GELU nonlinearity
  • Global Response Normalization (GRN)
  • Linear projection to base channel count
  • Residual connection with drop-path regularization
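
A minimal PyTorch sketch of one such block is given below. It follows the standard ConvNeXtV2 formulation of Global Response Normalization and uses a simplified per-sample drop-path, so it illustrates the structure rather than reproducing the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (channels-last input), as in ConvNeXtV2."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                   # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)   # global L2 response per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)    # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x

class ConvNeXtV2Block(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> 4x point-wise expansion -> GELU -> GRN
    -> point-wise projection, wrapped in a residual connection with drop-path."""
    def __init__(self, dim, drop_path=0.0):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # channel expansion by a factor of 4
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)   # projection back to the base channel count
        self.drop_path = drop_path

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm / Linear
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)
        if self.training and self.drop_path > 0:
            keep = (torch.rand(x.shape[0], 1, 1, 1, device=x.device)
                    >= self.drop_path).to(x.dtype)
            x = x * keep / (1.0 - self.drop_path)            # per-sample stochastic depth
        return residual + x
```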

The spatial downsampling strategy is staged, employing a 2×2 strided convolution with LayerNorm at the onset of each stage. The model’s overall stage-wise structure is summarized in the following table:

| Stage   | Output size      | Channels | Blocks | Expansion |
|---------|------------------|----------|--------|-----------|
| Stem    | 384×384 → 48×48  | 384      | –      | –         |
| Stage 1 | 48×48            | 384      | 3      | ×4        |
| Stage 2 | 24×24            | 768      | 27     | ×4        |
| Stage 3 | 12×12            | 1536     | 3      | ×4        |
| Stage 4 | 6×6              | 3072     | 3      | ×4        |
The classification head comprises global average pooling, dropout, and a fully connected layer producing logits for three target categories: normal, drusen, and CNV. The model contains approximately 196.4 million parameters. FLOPs were not specified.
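
The backbone and head can be instantiated in a few lines with the timm library. This is a sketch under assumptions: the checkpoint name below is timm's standard FCMAE-pretrained ConvNeXtV2-Large, and the dropout and drop-path rates shown are illustrative values, not taken from the paper.

```python
import timm

# ConvNeXtV2-Large backbone with pretrained weights; 3-way head for normal / drusen / CNV.
# Checkpoint name follows timm's naming convention (assumed starting point, not stated in the paper).
model = timm.create_model(
    "convnextv2_large.fcmae_ft_in22k_in1k_384",
    pretrained=True,
    num_classes=3,
    drop_rate=0.3,        # dropout before the classifier (rate assumed)
    drop_path_rate=0.2,   # stochastic depth across blocks (rate assumed)
)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```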

2. Training Pipeline and Regularization Strategies

To maximize generalization and address dataset-specific challenges, the training pipeline integrates extensive augmentations and advanced regularization. The augmentation sequence, illustrated in the sketch after the list, involves:

  • Resize and random crop to 384×384
  • RandAugment (N=2, M=9): brightness, contrast, saturation, sharpness
  • Random rotation (±20°), affine shear (±15°), scale (0.85–1.15)
  • Horizontal flip (p=0.5), vertical flip (p=0.3)
  • Gaussian blur (p=0.2), posterize (p=0.2), bit-depth reduction
  • Random erasing (scale 0.02–0.15, p=0.25)
  • ToTensor, ImageNet normalization (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225])
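
A torchvision-based sketch of this sequence is shown below. The resize target before cropping, the blur kernel size, and the posterize bit depth are assumptions where the list does not pin them down, and torchvision's RandAugment applies its own fixed operation set rather than only the four listed transformations.

```python
from torchvision import transforms

# Training augmentation pipeline approximating the list above (PIL image in, normalized tensor out).
train_transform = transforms.Compose([
    transforms.Resize(416),                                         # resize before random crop (size assumed)
    transforms.RandomCrop(384),
    transforms.RandAugment(num_ops=2, magnitude=9),                 # RandAugment N=2, M=9
    transforms.RandomRotation(degrees=20),
    transforms.RandomAffine(degrees=0, shear=15, scale=(0.85, 1.15)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=0.2),
    transforms.RandomPosterize(bits=4, p=0.2),                      # bit-depth reduction (bits assumed)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.15)),
])
```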

Stochastic Weight Averaging (SWA) is employed to smooth the final solution: the weights of k selected checkpoints are averaged as

$$\theta_{\mathrm{SWA}} = \frac{1}{k} \sum_{i=1}^{k} \theta_i$$
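
A minimal sketch of this averaging over saved checkpoints is shown below; in practice torch.optim.swa_utils.AveragedModel maintains the same running average during training. Which k checkpoints are selected is not fixed by the formula itself and is left as a training-pipeline choice.

```python
import torch

def average_checkpoints(state_dicts):
    """Compute theta_SWA = (1/k) * sum_i theta_i over k checkpoint state_dicts."""
    k = len(state_dicts)
    avg = {name: torch.zeros_like(t, dtype=torch.float32)
           for name, t in state_dicts[0].items()}
    for sd in state_dicts:
        for name, t in sd.items():
            avg[name] += t.float() / k     # accumulate the running mean
    return avg
```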

Focal loss addresses class imbalance:

$$\mathcal{L}_{\mathrm{focal}} = \alpha_t\,(1 - p_t)^{\gamma}\,(-\log p_t)$$

with $p_t$ the predicted probability for the true class, $\alpha_t$ reflecting class priors, and focusing parameter $\gamma = 2.0$.
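
A PyTorch version of this loss is sketched below; the per-class weights $\alpha_t$ would come from the training-set class frequencies, which are not specified here, so the example values are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss: alpha_t * (1 - p_t)^gamma * (-log p_t), averaged over the batch.

    logits:  (N, C) raw class scores
    targets: (N,)   integer class labels
    alpha:   (C,)   per-class weights reflecting class priors (assumed values)
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t for the true class
    pt = log_pt.exp()
    alpha_t = alpha.to(logits.device)[targets]
    return (alpha_t * (1.0 - pt) ** gamma * (-log_pt)).mean()

# Example: three classes (normal, drusen, CNV) with assumed inverse-frequency weights.
# loss = focal_loss(model(images), labels, alpha=torch.tensor([0.2, 0.4, 0.4]))
```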

3. Optimization Protocol and Hyperparameterization

Optimization is conducted using AdamW. Distinct learning rates are assigned: $2\times 10^{-5}$ for the backbone and $1\times 10^{-4}$ for the classification head. Weight decay is set to 0.05. Training begins with a linear warmup over 10 epochs, transitioning into a cosine-annealing schedule over 150 epochs:

$$\mathrm{lr}(t) = \mathrm{lr}_{\min} + \frac{1}{2}\left(\mathrm{lr}_{\max} - \mathrm{lr}_{\min}\right)\left[1 + \cos\!\left(\pi t / T\right)\right]$$

with $\mathrm{lr}_{\max}$ the initial learning rate and $\mathrm{lr}_{\min} = 1\times 10^{-7}$. The effective batch size is 16 via gradient accumulation (physical batch size 4 × 4 accumulation steps), and training proceeds in FP16 mixed precision. Early stopping is applied with a patience of 25 epochs.
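
The optimizer setup and schedule can be expressed as below. This is a reconstruction of the stated hyperparameters, not code from the paper: the backbone/head parameter split, warmup form, and scheduler composition are assumptions, and `model` refers to the timm ConvNeXtV2-Large instance from the earlier sketch.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Two parameter groups: pretrained backbone at 2e-5, freshly initialized head at 1e-4.
head_params = list(model.get_classifier().parameters())   # timm models expose the classifier head this way
head_ids = {id(p) for p in head_params}
backbone_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 2e-5},
     {"params": head_params, "lr": 1e-4}],
    weight_decay=0.05,
)

# 10-epoch linear warmup followed by cosine annealing toward lr_min = 1e-7 (split of the
# 150 epochs between warmup and annealing is assumed).
scheduler = SequentialLR(
    optimizer,
    schedulers=[LinearLR(optimizer, start_factor=0.01, total_iters=10),
                CosineAnnealingLR(optimizer, T_max=140, eta_min=1e-7)],
    milestones=[10],
)
```

Gradient accumulation over four physical batches of four, wrapped in torch.cuda.amp autocast for FP16, would then reach the stated effective batch size of 16.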

4. Empirical Performance and Evaluation

On the Noor Eye Hospital (NEH) dataset, ConvNeXtV2-Large achieves:

  • Accuracy: 92.6 ± 2.3%
  • Sensitivity: 92.9 ± 2.1%
  • Specificity: 98.1 ± 0.8%

AUC was not reported. Inference is profiled on NVIDIA H200 GPU hardware, but the precise per-image latency was not specified. This level of performance establishes clinical-grade diagnostic reliability in three-class OCT classification. These results serve as the baseline for subsequent student model compression in the KD-OCT pipeline (Nourbakhsh et al., 9 Dec 2025).

5. Teacher Model Role in Knowledge Distillation

ConvNeXtV2-Large acts as the teacher for knowledge distillation within KD-OCT. During distillation, teacher logits $z_i$ undergo temperature scaling with $T = 4.0$:

$$p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

The student’s loss combines cross-entropy on hard labels and KL divergence on these soft (temperature-adjusted) teacher predictions:

$$\mathcal{L} = \beta\,\mathcal{L}_{\mathrm{CE}}(y, p_S) + \alpha\,T^{2}\,\mathrm{KL}\!\left(p_T^{(T)} \,\middle\|\, p_S^{(T)}\right)$$

with relative weights $\beta = 0.3$ (hard-label term) and $\alpha = 0.7$ (distillation term). This balance is configured to transfer both the informative soft class boundaries and the ground-truth supervision.
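
A sketch of this combined objective with $T = 4.0$, $\alpha = 0.7$, and $\beta = 0.3$ is shown below; it implements the formula above, with the reduction and detaching of teacher logits as assumed implementation details.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7, beta=0.3):
    """beta * CE(hard labels) + alpha * T^2 * KL(teacher_soft || student_soft)."""
    ce = F.cross_entropy(student_logits, targets)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)   # teacher provides soft targets only
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return beta * ce + alpha * (T ** 2) * kl
```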

6. Contextual Significance and Deployment Implications

ConvNeXtV2-Large, as configured in this framework, demonstrates the upper bound of achievable performance for three-class OCT diagnosis using convolutional architectures with advanced augmentations and regularization. While the computational resources required preclude its routine clinical deployment, its role as a distillation teacher allows for the training of lightweight student models (EfficientNet-B2 in KD-OCT) that achieve comparable accuracy with significantly reduced memory and inference time. This paradigm facilitates practical deployment of deep-learning-based OCT triage in edge and point-of-care environments, thus enabling broad clinical access without substantial loss of diagnostic utility (Nourbakhsh et al., 9 Dec 2025).

References (1)
