YOLO-NAS: NAS-Optimized Object Detector
- YOLO-NAS is a family of object detection models designed using neural architecture search that integrates RepVGG-style blocks with specialized quantization modules.
- It employs evolutionary NAS methods to optimize latency and accuracy, achieving competitive performance with fewer parameters and reduced FLOPs.
- Advanced training strategies—including large-scale pretraining, self-distillation, and aggressive augmentation—ensure robustness under quantization constraints.
YOLO-NAS (You Only Look Once – Neural Architecture Search) is a family of object detection models designed through hardware-aware neural architecture search, with a focus on both real-time inference efficiency and high accuracy under quantization constraints. Emerging from the YOLO architectural lineage, YOLO-NAS combines evolutionary NAS methodologies, specialized quantization-aware building blocks, and advanced training protocols including large-scale pretraining and self-distillation. Its deployments span standard visual detection pipelines and specialized low-latency domains such as edge computing and indoor assistive navigation.
1. Architecture and Core Building Blocks
YOLO-NAS models are defined by their search-optimized integration of RepVGG-style convolutional blocks and custom quantization modules:
- Backbone: A sequence of RepVGG blocks, augmented post-search with two quantization-aware modules:
- QSP (Quantization-aware Split-Parallel block): Splits activations into high/low bit paths, applies parallel 3×3 convolutions, and re-merges outputs, supporting 8-bit inference with less than a 0.5% accuracy drop.
- QCI (Quantization-aware Convolution-Integer block): Folds quantization into integer arithmetic, inserted before final conv layers.
- Neck: Feature pyramid network stage using the same backbone blocks, with interleaved QSP/QCI modules for quantization fidelity during multiscale feature aggregation; structurally reminiscent of PANet.
- Head: Three-scale head for COCO-style detection (13×13, 26×26, 52×52 outputs), employing further RepVGG blocks and targeted QSP/QCI deployment upstream of the terminal 1×1 detection convolutions.
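The three detection scales follow the standard YOLO stride pattern: the 13×13/26×26/52×52 grids quoted above imply a 416×416 input with strides 32/16/8, as a quick sketch confirms:

```python
# Derive the detection grid sizes from input resolution and strides.
# A 13x13 / 26x26 / 52x52 head implies a 416x416 input with strides 32/16/8.
def grid_sizes(input_size: int, strides=(8, 16, 32)) -> list:
    assert all(input_size % s == 0 for s in strides), "input must be divisible by each stride"
    return [input_size // s for s in strides]

print(grid_sizes(416))  # -> [52, 26, 13]
```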
The architectural blueprint (YOLO-NAS-L) specifies per-stage depth {2,4,6,8} blocks, width choices {64,128,256,512}, and binary quantization module placements, preserving three output resolutions for instance segmentation and detection (Terven et al., 2023).
The “Small” variant (YOLO NAS Small), discovered via AutoNAC and specialized for improved small-object sensitivity, refines this approach. Major differences include a lighter three-stage backbone, an FPN-like neck with two feature scales (higher spatial resolution for small-object detection), SPP integration in the deepest stage, and quantization-aware RepVGG blocks throughout (BN et al., 2024).
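The RepVGG blocks referenced throughout train with parallel 3×3, 1×1, and identity branches that fold into a single 3×3 convolution for inference. A minimal single-channel NumPy sketch of that kernel fusion (function names are illustrative, not from any YOLO-NAS codebase):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 'same' convolution (cross-correlation) with a 3x3 kernel."""
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

def fuse_repvgg(k3, k1, with_identity=True):
    """Fold parallel 3x3 + 1x1 + identity branches into one 3x3 kernel.

    Convolution is linear in the kernel, so summing the branch outputs equals
    a single convolution with the padded-and-summed kernel.
    """
    fused = k3.copy()
    fused[1, 1] += k1          # a 1x1 kernel sits at the centre of a 3x3
    if with_identity:
        fused[1, 1] += 1.0     # identity branch = delta kernel at the centre
    return fused

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3 = rng.standard_normal((3, 3))
k1 = rng.standard_normal(())

branch_sum = conv2d_same(x, k3) + k1 * x + x        # training-time three branches
fused_out = conv2d_same(x, fuse_repvgg(k3, k1))     # inference-time single conv
assert np.allclose(branch_sum, fused_out)
```

Because convolution is linear in the kernel, the fused model is mathematically identical to the three-branch model, which is what makes RepVGG-style blocks attractive under latency and quantization constraints.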
2. Neural Architecture Search Methodologies
YOLO-NAS utilizes evolutionary search methods engineered to balance accuracy and latency – explicitly hardware-aware and quantization-constrained:
- Objective formulation: $\max_{a \in \mathcal{A}} \; \mathrm{mAP}(a) - \lambda \cdot \mathrm{Lat}(a)$, where $\mathrm{mAP}(a)$ is COCO mAP, $\mathrm{Lat}(a)$ is device-measured latency, and $\lambda$ controls the speed-accuracy trade-off (Terven et al., 2023).
- Search space: Includes block types, block depth per stage, channel count, quant module position, and output scales.
- Algorithm (AutoNAC):
- Population-based evolutionary engine.
- Candidates are compiled with quantization, latency profiled on target hardware, and mAP evaluated on proxy datasets.
- Tournament selection, stage-wise crossover, mutation of quant module positions or block depths.
- No gradient-based updates; candidates are ranked purely by the fitness score, which penalizes measured latency.
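The evolutionary loop above can be illustrated with a toy population search. The search space mirrors the stated depth/width/quant-placement choices, but the latency model, mAP proxy, and score weight below are invented stand-ins, not the AutoNAC internals:

```python
import random

# Toy search space mirroring the blueprint: per-stage depth, width,
# and a binary quant-module placement flag for each of four stages.
DEPTHS, WIDTHS = (2, 4, 6, 8), (64, 128, 256, 512)

def random_candidate(rng):
    return {
        "depth": [rng.choice(DEPTHS) for _ in range(4)],
        "width": [rng.choice(WIDTHS) for _ in range(4)],
        "quant": [rng.random() < 0.5 for _ in range(4)],
    }

def latency(c):    # stand-in latency model: deeper/wider -> slower
    return sum(d * w for d, w in zip(c["depth"], c["width"])) / 1000.0

def map_proxy(c):  # stand-in accuracy proxy: capacity helps, saturating
    return 0.5 * (1 - 1 / (1 + latency(c)))

def score(c, lam=0.05):  # accuracy minus a latency penalty
    return map_proxy(c) - lam * latency(c)

def mutate(c, rng):
    """Mutate one stage's depth and flip its quant-module placement."""
    child = {k: list(v) for k, v in c.items()}
    i = rng.randrange(4)
    child["depth"][i] = rng.choice(DEPTHS)
    child["quant"][i] = not child["quant"][i]
    return child

def evolve(generations=30, pop_size=16, seed=0):
    rng = random.Random(seed)
    pop = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)                  # tournament of two
        winner = max((a, b), key=score)
        pop.remove(min((a, b), key=score))         # loser leaves the population
        pop.append(mutate(winner, rng))            # replaced by a mutated winner
    return max(pop, key=score)

best = evolve()
```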
For YOLO NAS Small, the NAS is further tailored to prioritize small-object AP, deploying architectural patterns that sustain high-resolution neck features and quantization-friendly blocks (BN et al., 2024).
Recent research also explores neural architecture search (NAS) for optimizing activation function placement (“ActNAS”), formalized as a 0–1 integer linear program over layer-wise activation choices, leveraging zero-cost proxies (NWOT) for rapid global evaluation. Mixed-activation designs exploit device-specific latency characteristics, with up to 1.67× speedup and 64% RAM savings on NPUs (with <1% mAP drop), suggesting a distinct trajectory for future YOLO-NAS variants (Sah et al., 2024).
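The 0–1 integer program over layer-wise activation choices can be sketched by brute-force enumeration on a tiny three-layer network; all latency and proxy numbers here are hypothetical placeholders (the paper uses measured device latencies and NWOT scores):

```python
from itertools import product

# Hypothetical per-layer cost tables: latency (ms) and a zero-cost proxy
# contribution for each activation choice on each of three layers.
ACTS = ("relu", "silu", "hardswish")
LATENCY = {  # LATENCY[layer][act], invented numbers
    0: {"relu": 1.0, "silu": 2.5, "hardswish": 1.6},
    1: {"relu": 0.8, "silu": 2.0, "hardswish": 1.2},
    2: {"relu": 0.5, "silu": 1.4, "hardswish": 0.9},
}
PROXY = {  # PROXY[layer][act], invented numbers (higher = better)
    0: {"relu": 0.90, "silu": 1.00, "hardswish": 0.97},
    1: {"relu": 0.92, "silu": 1.00, "hardswish": 0.98},
    2: {"relu": 0.95, "silu": 1.00, "hardswish": 0.99},
}

def best_mixed_assignment(min_proxy=2.93):
    """Minimise total latency s.t. the summed proxy score meets a floor.

    This is the 0-1 program from the text, solved by enumeration since
    each layer picks exactly one activation.
    """
    best = None
    for assign in product(ACTS, repeat=3):
        lat = sum(LATENCY[i][a] for i, a in enumerate(assign))
        prox = sum(PROXY[i][a] for i, a in enumerate(assign))
        if prox >= min_proxy and (best is None or lat < best[0]):
            best = (lat, assign)
    return best

print(best_mixed_assignment())
```

On real hardware the enumeration is replaced by an ILP solver, but the structure is the same: each layer picks exactly one activation, minimizing latency subject to a proxy-accuracy floor.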
3. Training Paradigms and Quantization Strategies
Training protocols for YOLO-NAS leverage staged large-data pretraining, label propagation, and self-distillation:
- Multi-task loss: $\mathcal{L} = \sum_{i,j} \mathbb{1}_{ij}^{\mathrm{obj}} \left( \mathcal{L}_{\mathrm{box}} + \mathcal{L}_{\mathrm{cls}} \right) + \mathcal{L}_{\mathrm{obj}}$, where $\mathbb{1}_{ij}^{\mathrm{obj}}$ marks the best anchor per ground-truth box (Terven et al., 2023).
- Quantization-aware modules:
- QSP and QCI enable near-lossless 8-bit integer inference across the backbone, neck, and head, with explicit placement discovered by the search.
- Training sequence:
- Pretraining on Objects365 (2M images, 365 categories)
- Pseudo-labeling on COCO, ensemble predictions
- Feature-map self-distillation (teacher-student matching)
- SGD with momentum 0.937, cosine LR decay from an initial rate of 0.01, batch size 64 on 8×A100 GPUs
- Intensive data augmentation (Mosaic, MixUp, HSV jitter, random scaling/translating)
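The cosine decay mentioned above follows the standard annealing formula; the initial rate 0.01 comes from the recipe, while the floor `lr_min` below is an assumed placeholder since the final value is not stated:

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 0.01, lr_min: float = 1e-4) -> float:
    """Cosine-annealed learning rate; lr_min is an assumed floor, not from the recipe."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

# Decays smoothly from lr_max at step 0 to lr_min at the final step.
schedule = [cosine_lr(s, total_steps=1000) for s in range(1001)]
```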
YOLO NAS Small employs Super Gradients recipes (Adam, EMA, cosine LR annealing, mixed FP16 precision) with PPYoloELoss, cross-entropy, CIoU/DFL components, and rigorous data augmentation on YCB-COCO and Roboflow scenes (BN et al., 2024).
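The near-lossless 8-bit inference claim rests on keeping the round-trip quantization error within half a quantization step. A minimal symmetric per-tensor fake-quantization sketch (this is the generic mechanism quantization-aware training simulates, not the QSP/QCI internals):

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantize -> dequantize round trip.

    Quantization-aware training runs exactly this round trip in the
    forward pass so the network learns weights robust to the rounding.
    """
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
err = np.abs(fake_quant_int8(w) - w).max()
assert err <= np.abs(w).max() / 127.0 / 2 + 1e-6  # error bounded by half a step
```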
4. Performance Benchmarks and Comparative Analysis
YOLO-NAS demonstrates competitive accuracy and resource efficiency relative to prior YOLO models:
- YOLO-NAS-L (COCO, FP16):
- mAP@0.5: 62.3%
- mAP@[0.5:0.95]: 52.2%
- Inference: 160 FPS on NVIDIA A100 (TensorRT FP16)
- Model: 24M parameters, 45 GFLOPs (Terven et al., 2023)
| Model | mAP@[0.5:0.95] | FPS | Params (M) | GFLOPs |
|---|---|---|---|---|
| YOLO-NAS-L | 52.2% | 160 | 24 | 45 |
| YOLOv8x | 53.9% | 280 | 50 | — |
| YOLOv5x | 55.8% | 200 | — | 90 |
YOLO-NAS achieves comparable or superior accuracy with 2× fewer parameters and up to 10% lower FLOPs, trading modest peak FPS for computational efficiency. Ablations reveal that NAS provides a +2.4% mAP boost over hand-designed backbones, QSP/QCI outperform uniform 8-bit quantization (+1.2% mAP), and staged training recoups quantization losses (+1.5% mAP) (Terven et al., 2023).
YOLO NAS Small achieves:
- mAP@0.5: 0.96
- Precision@0.5: 0.64
- Recall@0.5: 0.98
- mAP@[0.5:0.95]: 0.38
- 19.02M parameters, 25 GFLOPs, ~40 FPS on RTX 3070, 18MB model size (BN et al., 2024).
| Model | mAP@0.5 | Precision@0.5 | Recall@0.5 |
|---|---|---|---|
| YOLOv5s | 0.94 | 0.62 | 0.95 |
| YOLOv7-tiny | 0.95 | 0.63 | 0.96 |
| YOLOv8n | 0.95 | 0.65 | 0.97 |
| YOLO NAS Small | 0.96 | 0.64 | 0.98 |
The architecture is particularly adept at small-object recall due to higher neck resolution and context aggregation via SPP (BN et al., 2024).
ActNAS-based YOLO variants reveal hardware gains, e.g., ActNAS5n.1 offers a 22.28% speedup and 58.2% RAM reduction vs. all-SiLU baseline on NPU1 with <1% mAP drop (Sah et al., 2024).
5. Application Domains and Latency Optimizations
YOLO-NAS models are extensively deployed in latency-sensitive domains due to their quantization fidelity and hardware-aware search properties:
- Edge inference: Low-latency deployment on ARM, Jetson, and NPU-class devices, sustained by quantization-friendly design and mixed-precision training (Terven et al., 2023, BN et al., 2024, Sah et al., 2024).
- Assistive technologies: YOLO NAS Small enables real-time indoor guidance for the visually impaired, accurately detecting small items in household scenes with low computational overhead (BN et al., 2024).
- Object detection infrastructure: Robotics, autonomous driving, and video monitoring pipelines benefit from YOLO-NAS’ integration of task-specific architectural variants, dynamic inference, and ultra-low bandwidth deployment (Terven et al., 2023).
Efficiency and accuracy optimizations include:
- High-resolution neck features for small objects
- SPP for enlarged receptive fields
- QSP/QCI blocks for 8-bit quantization with minimal accuracy loss
- Self-distillation to recoup the quantization penalty
- Mixed activations (ActNAS) to maximize hardware throughput on NPUs (Sah et al., 2024)
6. Insights, Limitations, and Prospects
Key insights from the literature:
- Non-intuitive quantization module placements: Hardware-aware NAS discovers quantization integration points missed by manual design (Terven et al., 2023).
- Evolutionary, latency-in-the-loop NAS: Real device profiling outperforms proxy signals for detection network search (Terven et al., 2023).
- Activation NAS: Layer-wise mixed activations enable drastic latency/memory reduction with negligible accuracy drop (Sah et al., 2024).
- Importance of scale and data: Large-scale pretraining and feature distillation are crucial for maintaining accuracy in quantized models (Terven et al., 2023).
Open challenges and future directions:
- Extending NAS to per-layer mixed precision (4/8 bit)
- Integrating NMS or attention into NAS
- Accelerating architecture search via one-shot or gradient-based methods
- Task-specific NAS for segmentation, detection, and compression (Terven et al., 2023)
- Depthwise separable convolutions or attention for additional latency reduction in small-object models (BN et al., 2024).
Limitations primarily concern precision trade-offs (e.g., YOLO NAS Small's moderate precision at high recall), the need for hardware-specific benchmarking, and search spaces currently limited to quantization-module and activation placement.
7. Comparative Summary of NAS-Optimized YOLO Families
| Model Variant | Architect. Style | Specialization | Parameters (M) | mAP@0.5 | Latency/Throughput | Notes |
|---|---|---|---|---|---|---|
| YOLO-NAS-L | RepVGG + QSP/QCI | General COCO | 24 | 62.3% | 160 FPS (A100, FP16) | Full NAS, quant-aware |
| YOLO NAS Small | QA-RepVGG, SPP | Small objects, Edge | 19.02 | 0.96 | ~25 ms/image (RTX 3070) | Highest recall among compared models |
| ActNAS5n.1 | Mixed Activations | NPU1 (Edge) | – | 0.3280 | –22.28% latency (vs. SiLU) | Hardware-aware, memory |
| YOLOv8x | CSPDarknet | General COCO | 50 | 53.9% | 280 FPS | Comparison baseline |
| YOLOv5x | CSPDarknet | General COCO | – | 55.8% | 200 FPS | Comparison baseline |
YOLO-NAS architectures, including small-object and activation-optimized variants, establish a new paradigm in real-time object detection, leveraging NAS to systematically balance accuracy, latency, and resource efficiency across diverse application and hardware domains (Terven et al., 2023, BN et al., 2024, Sah et al., 2024).