Vocal Cord Ultrasound (VCUS)
- VCUS is a non-invasive ultrasound modality that assesses the structure and function of the vocal cords, enhancing patient safety and comfort.
- Recent studies integrate deep learning algorithms, such as YOLOv8m, to automate segmentation and classification, significantly boosting accuracy over manual methods.
- The complete diagnostic pipeline processes frames in real time (~30–40 ms/frame), demonstrating high precision for potential clinical deployment despite reliance on synthetic simulations.
Vocal cord ultrasound (VCUS) is a non-invasive diagnostic modality for imaging the laryngeal region to assess vocal cord structure and function. VCUS offers increased patient comfort and safety compared to laryngoscopy, but has historically been limited by substantial operator dependence and inter-observer variability. Recent advances apply deep learning algorithms to segment vocal cord anatomy and automatically classify pathologies such as vocal cord paralysis (VCP), aiming to enhance reliability and throughput in both clinical and research applications (Sebelik-Lassiter et al., 29 Dec 2025).
1. VCUS Data Acquisition and Annotation
VCUS imaging in contemporary research employs high-resolution ultrasound systems (e.g., GE Logiq S7 Pro, linear probe at 8.5 MHz, 5 cm depth). Project VIPR collected 30–60 s videos in standardized planes during rest, phonation, throat-clearing, and laughter from 30 healthy volunteers, with the cohort balanced by gender (detailed demographics reported in Table 1 of the study).
Frame extraction sampled every 20th frame, producing 2,168 raw PNG frames per protocol. Images were standardized to 256 × 256 px using Lanczos interpolation. Manual region-of-interest (ROI) labeling by three annotators explicitly defined the upper (anterior commissure), lower (arytenoid reverberation), and medial (cord boundaries) box edges. Images without discernible cords were excluded, yielding 1,088 annotated images and corresponding bounding boxes for segmentation.
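A minimal sketch of this preprocessing step, assuming OpenCV, grayscale conversion, and the stated sampling interval and Lanczos resampling (the function name and output handling are illustrative, not the study's code):

```python
import cv2

def extract_frames(video_path: str, step: int = 20, size: tuple = (256, 256)):
    """Sample every `step`-th frame from a VCUS video and resize with Lanczos interpolation."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # ultrasound frames are effectively single-channel
            frames.append(cv2.resize(gray, size, interpolation=cv2.INTER_LANCZOS4))
        idx += 1
    cap.release()
    return frames
```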
For classification, absence of real pathological VCP frames necessitated synthetic simulation based on geometric manipulation (detailed in Section 3). Data augmentation—rotations (±10°), horizontal flipping, and affine shears—expanded the training set to 34,816 images, balanced 1:1 between healthy and simulated VCP for supervised learning (Sebelik-Lassiter et al., 29 Dec 2025).
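For illustration, the described augmentations map onto standard torchvision transforms as below; the shear magnitude is an assumption, since the study specifies only "affine shears":

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline mirroring the described transforms.
augment = T.Compose([
    T.RandomRotation(degrees=10),         # rotations within ±10°
    T.RandomHorizontalFlip(p=0.5),        # horizontal flipping
    T.RandomAffine(degrees=0, shear=10),  # affine shear; the 10° range is an assumed value
])
```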
2. Segmentation of Vocal Cords Using Deep Learning
Automated ROI detection leveraged the Ultralytics YOLOv8m architecture, a multi-layer CSPDarknet backbone coupled with a decoupled detection head. The network ingested 640 × 640 px inputs with on-the-fly data augmentation (mosaic, mixup, random scaling, HSV shift). Training employed SGD, learning rate 0.01, momentum 0.937, and weight decay 5 × 10⁻⁴ for 4 epochs.
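These hyperparameters correspond directly to arguments of the Ultralytics training API; the sketch below assumes a hypothetical dataset YAML (`vcus_roi.yaml`) describing the annotated frames and boxes:

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")      # pretrained YOLOv8m detection weights
model.train(
    data="vcus_roi.yaml",       # hypothetical dataset config with train/val splits
    epochs=4,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=5e-4,
    # mosaic, HSV, and scaling augmentations are applied on the fly by the trainer;
    # mixup is opt-in via the mixup= argument.
)
```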
YOLOv8m loss formulation included Complete Intersection-over-Union (CIoU) for box regression, Binary Cross-Entropy (BCE) for class logits, and Distribution Focal Loss (DFL) for localization. Evaluation used precision, recall, F1, mAP@0.5, and mAP@0.5:0.95 as defined in COCO benchmarks.
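For reference, the CIoU term follows its standard published definition (this formula is general background, not notation taken from the study):

$$
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \alpha v,
\qquad
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v},
$$

where $b$ and $b^{gt}$ are the predicted and ground-truth box centers, $\rho$ their Euclidean distance, $c$ the diagonal of the smallest enclosing box, and $w, h$ the box width and height.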
Validation metrics (threshold τ = 0.701, maximizing F1):
| Metric | Value |
|---|---|
| Precision | 0.84 |
| Recall | 0.80 |
| mAP@0.5 | 0.78 |
| mAP@0.5:0.95 | 0.40 |
| Correct detection (%) | 96 |
| Miss (%) | 4 |
Box, class, and DFL losses declined consistently across epochs (2.07→1.43, 2.94→1.43, 2.18→1.68, respectively), reflecting convergence and negligible overfitting.
This performance markedly exceeds the 41–86% human accuracy range for VCUS-based cord localization, with automated segmentation providing highly standardized bounding box outputs for downstream classification (Sebelik-Lassiter et al., 29 Dec 2025).
3. Machine Learning-Based Classification of Vocal Cord Paralysis
Simulated VCP images were generated by vertically compressing either the left or right cord in the ROI by a factor of 0.75, aligning the top boundaries, infilling the vacated gap by sampling pixels from outside the ROI, and interpolating across a 12-px seam. This process produced four image classes of 1,088 images each: healthy, healthy2 (seam only, no compression), rightpar, and leftpar. Data augmentation expanded the set prior to model training.
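A simplified reconstruction of this procedure is sketched below, assuming grayscale ROI crops; the gap-filling source (the bottom ROI row stands in for "pixels outside the ROI") and the seam-blending method are assumptions, not the authors' implementation:

```python
import numpy as np
import cv2

def simulate_vcp(roi: np.ndarray, side: str = "left", factor: float = 0.75, seam_px: int = 12, rng=None):
    """Toy simulation of unilateral cord paralysis on a cropped grayscale ROI (H x W)."""
    rng = rng or np.random.default_rng()
    h, w = roi.shape
    half = slice(0, w // 2) if side == "left" else slice(w // 2, w)
    cord = roi[:, half]

    # Vertically compress the selected cord and keep it aligned to the top boundary.
    new_h = int(h * factor)
    squeezed = cv2.resize(cord, (cord.shape[1], new_h), interpolation=cv2.INTER_LANCZOS4)

    # Fill the vacated rows by sampling background pixels (stand-in for "outside the ROI").
    fill = rng.choice(roi[-1, :], size=(h - new_h, cord.shape[1]))
    out = roi.copy()
    out[:, half] = np.vstack([squeezed, fill]).astype(roi.dtype)

    # Soften the artificial junction over a seam_px-wide horizontal band.
    band = slice(max(new_h - seam_px // 2, 0), min(new_h + seam_px // 2, h))
    out[band, half] = cv2.GaussianBlur(out[band, half], (1, seam_px | 1), 0)
    return out
```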
Two principal classifier architectures were evaluated:
YOLOv8n-cls:
Pretrained on ImageNet and fine-tuned on the VCUS dataset, the classifier head was trained for 20 epochs (BCE loss) with a held-out validation split (≈10–20%). Validation achieved 92.3% top-1 accuracy and 100% top-5 accuracy on the binary task. Confusion analysis showed 87% of healthy frames classified correctly (13% flagged as false positives) and 97% of VCP frames detected (3% false negatives). The precision–recall curve maintained precision >0.8 at recall ~0.9 (Sebelik-Lassiter et al., 29 Dec 2025).
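Under the Ultralytics classification workflow, such a fine-tune reduces to a short training call; the dataset directory layout (class-named subfolders under train/ and val/) and its path are assumptions:

```python
from ultralytics import YOLO

cls_model = YOLO("yolov8n-cls.pt")                            # ImageNet-pretrained classifier
cls_model.train(data="vcus_cls_dataset/", epochs=20, imgsz=256)
metrics = cls_model.val()                                     # reports top-1 / top-5 accuracy
```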
VIPRnet (Custom CNN):
Input: single-channel 256 × 256 ROI. Architecture:
- Convolutional layers: Conv2D [1→32], MaxPool2D → Conv2D [32→64], MaxPool2D → Conv2D [64→128], MaxPool2D
- Fully-connected: flatten → Linear [131,072→128], ReLU, Dropout(0.5) → Linear [128→1], Sigmoid

The network was trained for 50 epochs with batch size 64 using BCE loss and achieved ≈99.5% validation accuracy; training loss curves indicated a smooth decline with minimal overfitting.
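The layer listing above translates directly into a PyTorch module; kernel sizes, padding, and per-convolution ReLU activations are assumptions not stated in the text, chosen so the flattened feature size matches the reported 131,072 (128 × 32 × 32):

```python
import torch.nn as nn

class VIPRNet(nn.Module):
    """Sketch of the described VIPRnet; only the layer widths follow the text."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 256 -> 128
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                   # 128 * 32 * 32 = 131,072 features
            nn.Linear(128 * 32 * 32, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 1), nn.Sigmoid(),                # probability of VCP
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

Paired with `nn.BCELoss()` and a batch size of 64, this mirrors the described 50-epoch training setup.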
These classification systems, operating on cropped and segmented ROI images provided by YOLOv8m, demonstrated near-perfect discrimination between healthy and simulated VCP, significantly surpassing non-automated interpretation (Sebelik-Lassiter et al., 29 Dec 2025).
4. End-to-End Diagnostic Pipeline and Computational Performance
The Project VIPR workflow encompasses the following steps (a code sketch follows the list):
- Video ingestion;
- Frame extraction;
- Resizing to 640 × 640 for YOLOv8m segmentation;
- ROI cropping and resizing to 256 × 256;
- Inference by either YOLOv8n-cls or VIPRnet for final healthy vs. VCP decision.
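A minimal sketch of this per-frame path using the Ultralytics API; the weight filenames are hypothetical, and reusing τ = 0.701 as the detection confidence threshold is an assumption carried over from the segmentation validation:

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8m_vcus.pt")        # hypothetical fine-tuned segmentation weights
classifier = YOLO("yolov8n-cls_vcus.pt")  # hypothetical fine-tuned classifier weights

def classify_frame(frame):
    """Detect the cord ROI in one frame, crop it, and return a (label, confidence) pair."""
    det = detector.predict(frame, imgsz=640, conf=0.701, verbose=False)[0]
    if len(det.boxes) == 0:
        return None                       # no cords detected; frame is skipped
    x1, y1, x2, y2 = map(int, det.boxes.xyxy[0].tolist())
    roi = cv2.resize(frame[y1:y2, x1:x2], (256, 256), interpolation=cv2.INTER_LANCZOS4)
    cls = classifier.predict(roi, imgsz=256, verbose=False)[0]
    return cls.names[cls.probs.top1], float(cls.probs.top1conf)
```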
The complete pipeline achieves ≈30–40 ms per frame processing on an NVIDIA GPU typical of the ROSIE supercomputer, corresponding to roughly 1–1.5 s for a 30-frame video segment. Component model timings: YOLOv8m detection at ≈20–30 ms/frame, VIPRnet classification at ≈5 ms/frame (Sebelik-Lassiter et al., 29 Dec 2025).
A summary of pipeline steps and timings is provided below:
| Step | Model | Timing (ms/frame) |
|---|---|---|
| Segmentation | YOLOv8m | 20–30 |
| ROI Crop & Resize | - | <5 |
| Classification | VIPRnet | ~5 |
| Total Pipeline | - | 30–40 |
This throughput enables integration into real-time or near-real-time VCUS diagnostic workflows.
5. Limitations, Challenges, and Future Directions
Although machine learning enables robust, standardized VCUS interpretation, several limitations remain in the Project VIPR study:
- Genuine VCP images were unavailable, necessitating reliance on synthetic compression artifacts for classifier development. This introduces potential distributional mismatch between training and real clinical scenarios.
- The model may exhibit bias toward synthetic features, though the inclusion of seam-only (healthy2) classes aimed to mitigate this effect.
- The subject pool comprised 30 healthy volunteers, with demographic representation limited to a university population.
- External validation on clinical datasets remains necessary to confirm generalizability and to retrain models as needed.
Future work will focus on acquiring authentic clinical VCP VCUS datasets, leveraging automatic YOLO-based segmentation for large-scale video annotation, incorporating geometric measurements (e.g., anterior glottic angle) as additional classifier features, and exploring ultrasound scanner integration for real-time augmentation and decision support in routine practice (Sebelik-Lassiter et al., 29 Dec 2025).
6. Clinical and Research Significance
VCUS coupled with machine learning substantially mitigates operator dependence, offering reproducible, fast, and accurate detection of vocal cord paralysis compared with current manual assessment. The automated segmentation and classification pipeline demonstrated 96% correct ROI detection and ≈99% classification accuracy on synthetic data, benchmarks notably higher than reported human rates (Sebelik-Lassiter et al., 29 Dec 2025).
These results suggest immediate utility for standardizing VCUS interpretation across clinical settings. Incorporation of real pathological data, expansion of anatomical feature modeling, and hardware integration remain significant future milestones toward full deployment in otolaryngology and allied fields.