
Darknet-53: Efficient 53-Layer CNN

Updated 26 December 2025
  • Darknet-53 is a 53-layer convolutional neural network that combines small-kernel stacks with residual connections for robust feature extraction.
  • It employs a modular design using repeated 3x3 and 1x1 convolutions with batch normalization and leaky ReLU to ensure efficient data processing.
  • As the backbone for YOLOv3, it delivers competitive ImageNet accuracy and supports real-time object detection through multi-scale feature maps.

Darknet-53 is a convolutional neural network (CNN) architecture introduced as the principal feature extractor for YOLOv3. It integrates the sequential small-kernel convolution structure of Darknet-19 with identity-mapping residual connections drawn from ResNet. The architecture is 53 convolutional layers deep and is designed for ImageNet-scale classification as well as efficient backbone feature extraction for real-time object detection. Darknet-53 is characterized by the repeated use of 3×3 and 1×1 convolutions, consistent application of batch normalization and leaky ReLU activations, and a modular design supporting multi-scale feature output for detection heads (Redmon et al., 2018).

1. Architectural Overview

The Darknet-53 architecture alternates strided convolutions, which serve as spatial downsamplers, with residual blocks composed of paired 1×1 and 3×3 convolutions. Each convolutional layer (except the final classification/detection heads) is followed immediately by per-channel batch normalization and a leaky ReLU nonlinearity (negative slope 0.1). Convolutional layers use no bias terms, since batch normalization renders them redundant. The network contains no pooling operations, relying exclusively on strided convolutions for resolution reduction and feature transformation.

For an input feature map of spatial size H × W × C, the flow is as follows:

  1. Conv: 3×3, 32 filters, stride 1 → H × W × 32
  2. Conv: 3×3, 64 filters, stride 2 → (H/2) × (W/2) × 64, then 1 residual block:
    • 1×1, 32 filters
    • 3×3, 64 filters
  3. Conv: 3×3, 128 filters, stride 2 → (H/4) × (W/4) × 128, then 2 residual blocks:
    • 1×1, 64 filters
    • 3×3, 128 filters
  4. Conv: 3×3, 256 filters, stride 2 → (H/8) × (W/8) × 256, then 8 residual blocks:
    • 1×1, 128 filters
    • 3×3, 256 filters
  5. Conv: 3×3, 512 filters, stride 2 → (H/16) × (W/16) × 512, then 8 residual blocks:
    • 1×1, 256 filters
    • 3×3, 512 filters
  6. Conv: 3×3, 1024 filters, stride 2 → (H/32) × (W/32) × 1024, then 4 residual blocks:
    • 1×1, 512 filters
    • 3×3, 1024 filters
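
The staged flow above can be sketched as a simple shape calculation. This is an illustrative, framework-free sketch (the `STAGES` table and function name are my own, not part of the reference implementation):

```python
# Darknet-53 stage layout as (filters, residual blocks), following the list above.
STAGES = [(32, 0), (64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]

def stage_shapes(h, w):
    """Return the (height, width, channels) output of each of the six stages."""
    shapes = []
    for i, (filters, _blocks) in enumerate(STAGES):
        if i > 0:                    # every stage after the stem downsamples by 2
            h, w = h // 2, w // 2
        shapes.append((h, w, filters))   # residual blocks preserve the shape
    return shapes

print(stage_shapes(256, 256))
# final entry is (8, 8, 1024): a 256x256 input leaves the extractor
# at 1/32 resolution with 1024 channels
```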

As a detector backbone for YOLOv3, intermediate outputs from the ends of stages 4, 5, and 6 feed the respective 256-, 512-, and 1024-channel multi-scale detection heads.
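
As a concrete illustration of those taps: for a 416×416 input (a common YOLOv3 configuration), the three backbone outputs sit at strides 8, 16, and 32. A minimal sketch, with a hypothetical function name:

```python
def detection_grid_sizes(input_size):
    """Spatial size and stride of the three feature maps YOLOv3 taps from
    Darknet-53 (256, 512, and 1024 channels respectively)."""
    return [(input_size // stride, stride) for stride in (8, 16, 32)]

print(detection_grid_sizes(416))  # [(52, 8), (26, 16), (13, 32)]
```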

2. Residual Block Formulation

Each residual block within Darknet-53 implements an identity mapping:

y = x + F(x)

where F(x), for a block whose 2D-channel input is reduced to a D-channel bottleneck and expanded back, is defined as:

F(x) = σ(BN(Conv3×3(σ(BN(Conv1×1(x))))))

or, in compact form:

u = σ(BN(W₁ * x))

v = σ(BN(W₂ * u))

y = x + v

Here σ(·) is the leaky ReLU, W₁ denotes the 1×1 convolution (channel reduction), W₂ the 3×3 convolution (channel expansion), and * denotes convolution.

Within each residual block, all convolutions have stride 1, guaranteeing spatial alignment of the feature maps and enabling a straightforward elementwise addition for the identity mapping.
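
The block equations can be mirrored in a toy, framework-free sketch. The scalar "convolutions" below are placeholders standing in for W₁ and W₂, and batch normalization is omitted for brevity; this illustrates only the operation order and the identity shortcut:

```python
def leaky_relu(x, slope=0.1):
    # Darknet-53's activation: identity for x >= 0, slope * x otherwise
    return x if x >= 0 else slope * x

def residual_block(x, conv1, conv2):
    """Toy residual block mirroring the compact equations above.
    conv1/conv2 stand in for the 1x1 and 3x3 convolutions (here: plain
    scalar functions); batch normalization is omitted."""
    u = leaky_relu(conv1(x))   # u = sigma(W1 * x)
    v = leaky_relu(conv2(u))   # v = sigma(W2 * u)
    return x + v               # identity shortcut: y = x + v

# toy "convolutions": scale by 0.5, then by 2.0
y = residual_block(3.0, lambda x: 0.5 * x, lambda x: 2.0 * x)
print(y)  # 3.0 + 2.0 * (0.5 * 3.0) = 6.0
```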

3. Layer Composition and Parameterization

The Darknet-53 feature extractor, excluding the classification or detection heads, contains:

  • 6 standalone 3×3 convolutional layers: 1 stem convolution (stride 1, 32 filters) and 5 downsampling convolutions (stride 2) with 64, 128, 256, 512, and 1024 filters.
  • Residual stages with a total of (1 + 2 + 8 + 8 + 4) = 23 blocks, each containing 2 convolutional layers.
  • Total convolutional layers within the extractor: 6 + (23 × 2) = 52.
  • Including the final 1×1 convolution for 1000-way classification, the cumulative depth is 53 convolutional layers.

The parameter count comprises approximately 41.6 million convolutional weights and approximately 84,000 batch normalization parameters, for a total of ≈41.7M parameters. For a 256 × 256 input, the network executes approximately 18.7 billion floating-point operations (Redmon et al., 2018).
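
The layer and convolutional-weight counts above can be reproduced with a short, framework-free calculation (the `STAGES` table and function name are my own; batch normalization parameters are not included here):

```python
# Stage layout as (filters, residual blocks): each stage opens with a 3x3 conv
# (the stem, then stride-2 downsamplers) and each block holds a 1x1 then a 3x3.
STAGES = [(32, 0), (64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]

def conv_layers_and_weights(num_classes=1000):
    layers, weights, in_ch = 0, 0, 3     # RGB input
    for filters, blocks in STAGES:
        layers += 1
        weights += 3 * 3 * in_ch * filters            # stem / downsampling conv
        for _ in range(blocks):                       # block: 1x1 reduce, 3x3 expand
            layers += 2
            weights += 1 * 1 * filters * (filters // 2)
            weights += 3 * 3 * (filters // 2) * filters
        in_ch = filters
    layers += 1                                       # final 1x1 classifier
    weights += 1 * 1 * in_ch * num_classes
    return layers, weights

layers, weights = conv_layers_and_weights()
print(layers, round(weights / 1e6, 1))  # 53 41.6
```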

4. Layer-wise Topology and Feature Map Organization

The feature map topology across stages is summarized as follows:

| Stage | Output Size | Conv Layer + Residual Blocks | Channels |
|-------|-------------|------------------------------|----------|
| 1 | H × W | 3×3, stride 1 | 32 |
| 2 | H/2 × W/2 | 3×3, stride 2 + 1 block | 64 |
| 3 | H/4 × W/4 | 3×3, stride 2 + 2 blocks | 128 |
| 4 | H/8 × W/8 | 3×3, stride 2 + 8 blocks | 256 |
| 5 | H/16 × W/16 | 3×3, stride 2 + 8 blocks | 512 |
| 6 | H/32 × W/32 | 3×3, stride 2 + 4 blocks | 1024 |

At each block, the identity mapping y = x + F(x) is maintained, with F constructed as above.

5. Comparative Performance Analysis

When benchmarked on ImageNet classification (256×256 input, single crop), Darknet-53 attains 77.2% Top-1 and 93.8% Top-5 accuracy. On a Titan X it runs at 78 FPS, executing 18.7 billion operations per forward pass at a measured 1457 BFLOP/s. Comparative results against prior architectures:

| Backbone | Top-1 | Top-5 | Ops (Bn) | BFLOP/s | FPS |
|----------|-------|-------|----------|---------|-----|
| Darknet-19 | 74.1% | 91.8% | 7.29 | 1246 | 171 |
| ResNet-101 | 77.1% | 93.7% | 19.7 | 1039 | 53 |
| ResNet-152 | 77.6% | 93.8% | 29.4 | 1090 | 37 |
| Darknet-53 | 77.2% | 93.8% | 18.7 | 1457 | 78 |

Darknet-53 matches or exceeds ResNet-101 on both Top-1 and Top-5 accuracy and matches ResNet-152 on Top-5. It is roughly 1.5× faster than ResNet-101 and over 2× faster than ResNet-152. While it uses approximately 2.6× more FLOPs than Darknet-19, it delivers 3.1 percentage points higher Top-1 accuracy. It is also more FLOP-efficient than ResNet-101 and achieves higher GPU utilization.
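
These ratios follow directly from the benchmark table; a quick arithmetic check (the dictionary literals simply transcribe the table values):

```python
# Throughput (FPS) and operation counts (billions) from the comparison table.
fps = {"Darknet-19": 171, "ResNet-101": 53, "ResNet-152": 37, "Darknet-53": 78}
ops = {"Darknet-19": 7.29, "ResNet-101": 19.7, "ResNet-152": 29.4, "Darknet-53": 18.7}

print(round(fps["Darknet-53"] / fps["ResNet-101"], 2))  # 1.47x faster than ResNet-101
print(round(fps["Darknet-53"] / fps["ResNet-152"], 2))  # 2.11x faster than ResNet-152
print(round(ops["Darknet-53"] / ops["Darknet-19"], 2))  # 2.57x the FLOPs of Darknet-19
```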

6. Role in Detection and Empirical Results

As the backbone for YOLOv3, Darknet-53 provides intermediate feature maps for multi-scale detection heads, enabling robust detection performance. On COCO 2017 (input 608×608), YOLOv3 with Darknet-53 achieves:

  • AP (IoU = .5:.95): 33.0
  • AP50: 57.9
  • AP75: 34.4
  • APS: 18.3, APM: 35.4, APL: 41.9

This performance places YOLOv3 (with Darknet-53) ahead of one-stage SSD variants on AP50 and competitive with RetinaNet at a 3–4× speed advantage (Redmon et al., 2018).

7. Summary and Synthesis

Darknet-53 realizes a hybrid design that combines the efficient small-kernel stacking of Darknet-19 with the optimization benefits of ResNet-style identity-mapping residual blocks. At 53 convolutional layers and ≈41.7M parameters, the architecture delivers competitive accuracy at high throughput, serving effectively as a versatile backbone for both classification and detection. Its explicit modularity, computational profile, and empirical benchmarks support its adoption as a baseline for real-time detection systems.

References

  1. Redmon, J., & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
