SODA: Site Object Detection Dataset

Updated 22 June 2026

SODA is a curated dataset comprising annotated images of diverse site environments, including construction, urban, and industrial settings.
It enables comprehensive training and benchmarking of object detection models by providing high-resolution imagery and detailed annotations.
The dataset’s standardized metrics and varied conditions support scalable performance evaluation and drive innovations in object detection research.

YOLOv4 is a one-stage real-time object detection architecture that advances the speed-accuracy trade-off in large-scale vision tasks. Designed as a unified system with a modular backbone, neck, and dense multi-scale head, YOLOv4 integrates extensive architectural innovations with targeted regularization and augmentation strategies. On the COCO benchmark, it achieves 43.5% average precision (AP@[.5:.95]) at 62–65 FPS with 608×608 inputs on a single NVIDIA V100 GPU, outperforming prior YOLO variants and most contemporary detectors in its throughput/accuracy regime (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Geetha, 6 Feb 2025, Ramos et al., 24 Apr 2025, Terven et al., 2023).

1. Architectural Overview

The YOLOv4 network employs a three-part design composed of:

Backbone: CSPDarknet-53, which augments the original Darknet-53 by introducing Cross-Stage Partial (CSP) connections at each residual stage. Each CSP block splits the feature map into two channel partitions: one is processed through a sequence of residual units, and the other bypasses them before both are concatenated. This configuration reduces parameter redundancy, improves gradient flow, and lowers the overall memory footprint (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Ramos et al., 24 Apr 2025).
Neck: A combination of Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PANet). SPP is applied atop the deepest feature map with max pooling at multiple kernel sizes ({5×5, 9×9, 13×13}), expanding the receptive field and promoting context aggregation without additional downsampling. PANet integrates a top-down upsampling path with a bottom-up aggregation route, concatenating features at multiple scales to enhance multi-level fusion, particularly for small object localization (Kotthapalli et al., 4 Aug 2025, Terven et al., 2023, Ramos et al., 24 Apr 2025).
Detection Head: Three parallel, anchor-based detection heads operate at different spatial resolutions derived from PANet outputs. Each head predicts localization offsets (Δx, Δy, Δw, Δh), an objectness score, and C class scores per anchor per cell. Mish activation is used in the backbone and neck, with Leaky ReLU in the head (Kotthapalli et al., 4 Aug 2025).
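The SPP stage described above can be illustrated with a minimal pure-Python sketch on a single-channel feature map. It assumes stride-1 max pooling with implicit padding (so the spatial size is preserved) and assumes that the pooled maps are concatenated with the unpooled input, as is common in public implementations; the function names `max_pool_same` and `spp` are hypothetical.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over a k x k window; output keeps the input H x W."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Clip the window at the borders (equivalent to -inf padding).
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernel_sizes=(5, 9, 13)):
    """Return the input plus one pooled map per kernel size, as a channel list."""
    # Assumption: the identity branch is concatenated with the pooled branches.
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]


# Tiny example: a single activation spreads further with larger kernels.
fmap = [[0] * 4 for _ in range(4)]
fmap[0][0] = 1
channels = spp(fmap)
```

Because every branch keeps the same height and width, the receptive field grows with the kernel size while no resolution is lost, which is the property the text attributes to SPP.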

The overall forward path is structured as:

Input → CSPDarknet-53 → SPP (on deepest feature map) → PANet (top-down + bottom-up) → Three detection heads → Aggregated bounding-box predictions
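To make the head outputs concrete, the following sketch computes the grid size and channel count of each detection head. The strides (8, 16, 32) and three anchors per scale are assumptions taken from common YOLOv4 configurations rather than from the text above, and `head_output_shapes` is a hypothetical helper.

```python
def head_output_shapes(input_size, num_classes,
                       strides=(8, 16, 32), anchors_per_scale=3):
    """Return (grid_h, grid_w, channels) for each detection head."""
    # Per anchor: 4 box offsets + 1 objectness score + C class scores.
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


def total_predictions(shapes, anchors_per_scale=3):
    """Number of candidate boxes produced before any filtering."""
    return sum(h * w * anchors_per_scale for h, w, _ in shapes)


shapes = head_output_shapes(608, 80)   # COCO has 80 classes
num_boxes = total_predictions(shapes)
```

Under these assumptions a 608×608 input yields grids of 76×76, 38×38, and 19×19 cells, each with 255 channels.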

2. Loss Functions

YOLOv4 employs a multi-part loss defined at the cell and anchor box level:

Localization Loss: Complete IoU (CIoU), which combines standard Intersection-over-Union with penalties for center point distance and aspect ratio divergence.

$L_{\text{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha \nu$

where $b$ and $b^{gt}$ are the centers of the predicted and ground-truth boxes, $\rho^2$ is the squared distance between those centers, $c$ is the diagonal length of the smallest box enclosing both, $\nu$ measures the aspect-ratio inconsistency between the two boxes, and $\alpha$ weights the aspect-ratio penalty (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Ramos et al., 24 Apr 2025).
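The CIoU expression can be evaluated directly for a pair of boxes. The sketch below assumes corner-format boxes (x1, y1, x2, y2) and uses the standard CIoU definitions $\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{\nu}{(1 - \mathrm{IoU}) + \nu}$, which the text above does not spell out; the small `eps` terms are an implementation convenience, not part of the formula.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared center distance (rho^2) and squared enclosing-box diagonal (c^2).
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term nu and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps))
                              - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


perfect = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))    # identical boxes
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))   # no overlap, same shape
```

Unlike plain IoU, the disjoint pair still produces a distance-dependent value (here 1 + 8/18), so the loss carries a gradient signal even when boxes do not overlap.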

Objectness and Classification Loss: Binary cross-entropy (BCE) for both objectness (confidence) and multiclass probabilities.
Total Loss: Weighted sum of the above, with background objectness loss downweighted via $\lambda_{\text{noobj}}$.

Formally:

$L_{\text{YOLOv4}} = \lambda_{\text{coord}} L_{\text{coord}} + \lambda_{\text{obj}} L_{\text{obj}} + \lambda_{\text{noobj}} L_{\text{noobj}} + \lambda_{\text{class}} L_{\text{class}}$

This formulation stabilizes learning and enables improved bounding box regression over earlier mean-square-error (MSE) approaches (Bochkovskiy et al., 2020, Geetha, 6 Feb 2025).
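As a rough illustration of how the terms combine, the sketch below implements binary cross-entropy for a single prediction and the weighted sum from the formula above. The default weights are illustrative placeholders, not values reported by the cited sources, and the function names are hypothetical.

```python
import math


def bce(p, t, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and target t."""
    p = min(max(p, eps), 1 - eps)   # clamp to avoid log(0)
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


def total_loss(l_coord, l_obj, l_noobj, l_class,
               lam_coord=1.0, lam_obj=1.0, lam_noobj=0.5, lam_class=1.0):
    """Weighted sum of the four loss terms (weights are placeholders)."""
    return (lam_coord * l_coord + lam_obj * l_obj
            + lam_noobj * l_noobj + lam_class * l_class)


uncertain = bce(0.5, 1)               # equals ln(2)
combined = total_loss(1.0, 1.0, 1.0, 1.0)
```

Setting the background weight below 1 keeps the many empty cells from dominating the objectness gradient.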

3. Innovations and Key Techniques

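YOLOv4’s performance reflects core architectural advances combined with two groups of techniques: a “bag-of-freebies” (BoF), meaning training-time methods that add no inference cost, and a “bag-of-specials” (BoS), meaning modules that add a small inference cost in exchange for accuracy (Kotthapalli et al., 4 Aug 2025, Terven et al., 2023, Ramos et al., 24 Apr 2025):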

CSP Connections: Reduce computation by 20–30% and improve mean average precision (mAP) by 1–2 points relative to standard residual blocks.
SPP Module: Enlarges effective receptive field, increasing AP by ~1.0%.
PANet Neck: Strong multi-scale feature fusion, contributing ~1.0% AP improvement.
Mish Activation: Smoother gradients than ReLU or Leaky ReLU, providing a ~0.5% AP gain.
DropBlock Regularization: Structured dropout that removes contiguous regions of a feature map rather than individual activations, improving robustness (+0.5% AP).
Mosaic Data Augmentation: Four-image composition in training batches, yielding +1.0–1.5% AP.
Self-Adversarial Training (SAT): Gradient-based perturbation added to the input during training, leading to +1.0% AP.
Additional Freebies: CutMix, class-label smoothing, and random HSV augmentations.

Each contributes incrementally, with cumulative effects boosting AP from ≈33% (YOLOv3) to 43.5%, while raising V100 throughput from 30–45 FPS (YOLOv3) up to 62–65 FPS (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Terven et al., 2023).
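The Mish activation listed above is defined as $x \cdot \tanh(\mathrm{softplus}(x))$. A minimal scalar sketch follows, with Leaky ReLU alongside for comparison; the slope of 0.1 is an assumption based on common Darknet configurations.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the 0.1 slope is an assumed default."""
    return x if x > 0 else slope * x


samples = [(x, mish(x), leaky_relu(x)) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
```

Mish is smooth everywhere and slightly negative for small negative inputs, whereas Leaky ReLU has a kink at zero.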

4. Training Strategies and Schedule

YOLOv4 uses a suite of regularization and augmentation methods to drive generalization:

Data Augmentation: Mosaic (always activated), CutMix (probabilistic), random HSV shifts, horizontal flips, and multi-scale training (dynamic input resizing in {320, 352, ..., 608}) (Geetha, 6 Feb 2025, Ramos et al., 24 Apr 2025, Terven et al., 2023).
Normalization: Cross mini-Batch Normalization (CmBN) collects statistics across the mini-batches within a single batch, stabilizing feature standardization when the per-step batch is small.
Regularization: DropBlock is deployed in large feature maps, and class-label smoothing is used for soft classification targets.
Optimizer: Stochastic Gradient Descent (SGD) with high momentum (0.949), weight decay, and explicit learning-rate warmup. Schedules include step decay or cosine annealing, sometimes with a genetic search over hyperparameters in the early epochs (Bochkovskiy et al., 2020, Geetha, 6 Feb 2025, Ramos et al., 24 Apr 2025).
Anchor Boxes: K-means clustering on COCO data to optimize anchor shapes for the detection head.
Self-Adversarial Training (SAT): For each batch, a copy of images is adversarially perturbed based on the model’s input gradient with respect to objectness loss, then incorporated as additional training samples. This increases model robustness to input perturbations (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020).
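The anchor-box step in the list above can be sketched as k-means over ground-truth box sizes. The text does not state the distance measure; this sketch assumes the common choice of assigning each box to the center with the highest width/height IoU, and it uses a deterministic initialization for reproducibility. The function names and the toy data are hypothetical.

```python
def iou_wh(a, b):
    """IoU of two boxes given only as (width, height), aligned at a corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (w, h) pairs into k anchors using IoU as the similarity."""
    # Deterministic init: evenly spaced picks from boxes sorted by area.
    ordered = sorted(boxes, key=lambda wh: wh[0] * wh[1])
    step = len(ordered) / k
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: iou_wh(box, centers[j]))
            clusters[best].append(box)

        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[j])   # keep an empty cluster's center

        if new_centers == centers:               # converged
            break
        centers = new_centers

    return sorted(centers, key=lambda wh: wh[0] * wh[1])


# Toy data: three small and three large boxes.
boxes = [(10, 10), (12, 12), (11, 11), (100, 100), (110, 110), (105, 105)]
anchors = kmeans_anchors(boxes, 2)
```

In practice the same procedure would be run on all training-set box sizes with k equal to the total number of anchors, then split across the detection scales by area.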

5. Empirical Performance and Benchmarking

The following table summarizes YOLOv4’s benchmark results relative to related detectors (Kotthapalli et al., 4 Aug 2025, Ramos et al., 24 Apr 2025, Terven et al., 2023):

Model	Backbone	Feature Fusion	AP@[.5:.95]	FPS (V100)
YOLOv3	Darknet-53	3-scale FPN	33.0%	30–45
YOLOv4	CSPDarknet-53	PANet + SPP	43.5%	62–65
Faster R-CNN	ResNet-FPN	FPN	42.1%	<10

Detailed per-size metrics on COCO (Geetha, 6 Feb 2025):

Input Size	AP@[.5:.95]	AP50	FPS
416×416	41.2%	62.8%	96
512×512	43.0%	64.9%	83
608×608	43.5%	65.7%	62

YOLOv4 consistently occupies the Pareto front on accuracy vs. throughput among one-stage detectors (Kotthapalli et al., 4 Aug 2025, Geetha, 6 Feb 2025).
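The Pareto-front notion can be checked mechanically on the per-size figures in the table above. The sketch below keeps a configuration only if no other configuration is at least as good on both AP and FPS and strictly better on one; it also converts FPS to per-frame latency. `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return names of points not dominated on (AP, FPS); higher is better."""
    front = []
    for name, ap, fps in points:
        dominated = any(
            ap2 >= ap and fps2 >= fps and (ap2 > ap or fps2 > fps)
            for _, ap2, fps2 in points
        )
        if not dominated:
            front.append(name)
    return front


# (input size, AP@[.5:.95] in %, FPS) from the per-size table.
results = [("416x416", 41.2, 96), ("512x512", 43.0, 83), ("608x608", 43.5, 62)]

front = pareto_front(results)
latency_ms = {name: 1000 / fps for name, _, fps in results}
```

All three input sizes survive, since each trades accuracy for speed without being beaten on both axes.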

6. Deployment and Practical Considerations

Model Size: Approximately 64M parameters for the full version; at 4 bytes per parameter this corresponds to roughly 256 MB of weights in FP32, about half that in FP16, and about a quarter in quantized INT8 (Kotthapalli et al., 4 Aug 2025, Geetha, 6 Feb 2025).
Inference Speed: 62–65 FPS (608×608) on V100; over 30 FPS on high-end consumer GPUs (e.g., RTX 2080 Ti) (Ramos et al., 24 Apr 2025).
Edge Applications: Quantization and pruning can reduce model size and latency for embedded accelerators (e.g., Jetson TX2, Xavier) or mobile ASICs; smaller “YOLOv4-Tiny” variants are available for extreme resource constraints (Kotthapalli et al., 4 Aug 2025, Ramos et al., 24 Apr 2025).
Optimization for Deployment: TensorRT or OpenVINO pipelines can be used for batch-norm folding, conv+activation fusing, INT8 conversion (with quantization-aware training), or structured pruning, as applicable (Kotthapalli et al., 4 Aug 2025).
Hyperparameters: Recommended settings include SGD (momentum=0.949), weight decay (5e-4), batch size 64, learning rate warmup, DropBlock (block size 7), and complete use of Mosaic augmentation during training (Geetha, 6 Feb 2025).
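Batch-norm folding, mentioned among the deployment optimizations above, merges a batch-normalization layer into the preceding convolution so that inference needs one operation instead of two. The sketch below shows the algebra for a single output channel with a scalar weight, which is a simplification chosen for readability; in a real convolution the same scale factor multiplies every weight of that output channel. The function names are hypothetical and no TensorRT or OpenVINO API is used.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference path: linear op followed by batch normalization (inference mode)."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w_folded, b_folded) so that w_folded * x + b_folded matches conv + BN."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


# Arbitrary illustrative parameters.
params = dict(w=0.7, b=0.1, gamma=1.3, beta=-0.2, mean=0.05, var=0.9)
w_f, b_f = fold_bn(**params)

max_error = max(
    abs(conv_then_bn(x, **params) - (w_f * x + b_f))
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
```

The folded layer reproduces the original output up to floating-point rounding, which is why this transformation is safe to apply after training.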

7. Impact and Legacy

YOLOv4 represents a synthesis of architectural efficiency (CSPDarknet-53, PANet, SPP), algorithmic regularization (DropBlock, SAT), advanced loss design (CIoU), and highly effective augmentation (Mosaic) (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Terven et al., 2023, Ramos et al., 24 Apr 2025). Its configuration enabled state-of-the-art object detection at real-time throughput on a single GPU, facilitating wide adoption in both research and industry. Moreover, YOLOv4 established methodological templates for subsequent YOLO and one-stage detector architectures, particularly in backbone design, neck construction, loss function definition, and training pipelines (Kotthapalli et al., 4 Aug 2025, Ramos et al., 24 Apr 2025).

Further developments—including fully anchor-free models, transformer backbones, and end-to-end trainable deployment pipelines—extend principles first consolidated and empirically validated in YOLOv4. Its modularity and scalability continue to inform model design in fast-evolving object detection research (Kotthapalli et al., 4 Aug 2025, Terven et al., 2023).