Darknet-53: Efficient 53-Layer CNN
- Darknet-53 is a 53-layer convolutional neural network that combines small-kernel stacks with residual connections for robust feature extraction.
- It employs a modular design of repeated 3x3 and 1x1 convolutions, each followed by batch normalization and a leaky ReLU activation, for stable and efficient training.
- As the backbone for YOLOv3, it delivers competitive ImageNet accuracy and supports real-time object detection through multi-scale feature maps.
Darknet-53 is a convolutional neural network (CNN) architecture introduced as the principal feature extractor for YOLOv3. It integrates the sequential small-kernel convolution structure of Darknet-19 with identity-mapping residual connections drawn from ResNet (He et al., 2016). The architecture is 53 convolutional layers deep and is designed for ImageNet-scale classification as well as efficient backbone feature extraction for real-time object detection tasks. Darknet-53 is characterized by the repeated use of 3x3 and 1x1 convolutions, consistent application of batch normalization and leaky ReLU activations, and a modular design supporting multi-scale feature output for detection heads (Redmon et al., 2018).
1. Architectural Overview
The Darknet-53 architecture alternates strided convolutions, which serve as spatial downsamplers, with residual blocks composed of paired 1x1 and 3x3 convolutions. Each convolutional layer (except the final classification/detection heads) is followed immediately by per-channel batch normalization and a leaky ReLU nonlinearity (negative slope 0.1). No biases are used in convolutional layers since batch normalization renders them unnecessary. The global structure omits pooling operations, relying exclusively on convolutional layers for resolution reduction and feature transformation.
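The bias redundancy is easy to verify numerically: batch normalization subtracts the per-channel mean, so any constant per-channel bias added before BN cancels out exactly. A minimal NumPy sketch (the helper name is illustrative, not from the Darknet source):

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Simplified per-channel batch norm over spatial dims (gamma=1, beta=0)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))        # (channels, H, W) pre-BN activations
bias = rng.normal(size=(4, 1, 1))     # a hypothetical per-channel conv bias

# The shift is absorbed by the mean subtraction, so the outputs coincide.
assert np.allclose(batchnorm(x), batchnorm(x + bias))
```

Because the bias would be undone anyway, omitting it saves parameters without changing the function the layer can represent.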
For a 256x256 input, the flow is as follows:
- Conv: 3x3, 32 filters, stride 1
- Conv: 3x3, 64 filters, stride 2, then 1 residual block:
  - 1x1, 32 filters
  - 3x3, 64 filters
- Conv: 3x3, 128 filters, stride 2, then 2 residual blocks:
  - 1x1, 64 filters
  - 3x3, 128 filters
- Conv: 3x3, 256 filters, stride 2, then 8 residual blocks:
  - 1x1, 128 filters
  - 3x3, 256 filters
- Conv: 3x3, 512 filters, stride 2, then 8 residual blocks:
  - 1x1, 256 filters
  - 3x3, 512 filters
- Conv: 3x3, 1024 filters, stride 2, then 4 residual blocks:
  - 1x1, 512 filters
  - 3x3, 1024 filters
As a detector backbone for YOLOv3, intermediate outputs from the ends of stages 4, 5, and 6 (the 256-, 512-, and 1024-channel feature maps) feed the respective multi-scale detection heads.
2. Residual Block Formulation
Each residual block within Darknet-53 implements an identity mapping:

$$\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x})$$

where $\mathcal{F}$ is defined, for a block whose input has $C$ channels, as:

$$\mathcal{F}(\mathbf{x}) = \phi\left(\mathrm{BN}\left(W_{3\times 3} * \phi\left(\mathrm{BN}\left(W_{1\times 1} * \mathbf{x}\right)\right)\right)\right)$$

Here $\phi$ is the leaky ReLU, $W_{1\times 1}$ denotes the $1\times 1$ convolution (channel reduction from $C$ to $C/2$), and $W_{3\times 3}$ the $3\times 3$ convolution (channel expansion back to $C$).
Within each residual block, all convolutions have stride 1, guaranteeing spatial alignment of the feature maps and enabling a straightforward elementwise addition for the identity mapping.
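A numerically explicit sketch of one such block in NumPy, using a naive convolution and the simplified batch norm above (gamma=1, beta=0); all function names and the random weights are illustrative, not taken from the Darknet source:

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive 2-D convolution, no bias. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h = (xp.shape[1] - k) // stride + 1
    wd = (xp.shape[2] - k) // stride + 1
    y = np.empty((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            y[:, i, j] = np.tensordot(w, patch, axes=3)  # contract C_in, k, k
    return y

def bn(x, eps=1e-5):
    """Simplified per-channel batch norm (gamma=1, beta=0)."""
    m = x.mean(axis=(1, 2), keepdims=True)
    v = x.var(axis=(1, 2), keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def leaky(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def residual_block(x, w1, w3):
    """y = x + F(x): 1x1 channel reduction, then 3x3 expansion, both stride 1."""
    t = leaky(bn(conv2d(x, w1)))           # 1x1: C -> C/2
    t = leaky(bn(conv2d(t, w3, pad=1)))    # 3x3: C/2 -> C, padding preserves H, W
    return x + t                           # shapes match, so addition is direct

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8, 8))            # a 64-channel feature map
w1 = rng.normal(size=(32, 64, 1, 1)) * 0.1
w3 = rng.normal(size=(64, 32, 3, 3)) * 0.1
y = residual_block(x, w1, w3)
assert y.shape == x.shape                  # the identity mapping is well-defined
```

The stride-1 constraint and the padding on the 3x3 convolution are exactly what makes the elementwise addition legal without any projection shortcut.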
3. Layer Composition and Parameterization
The Darknet-53 feature extractor, excluding the classification or detection heads, contains:
- 1 stem convolution (3x3, 32 filters, stride 1) and 5 downsampling convolutions (3x3, stride 2) with 64, 128, 256, 512, and 1024 filters, respectively.
- Residual stages with a total of 1 + 2 + 8 + 8 + 4 = 23 blocks, each employing 2 convolutional layers (one 1x1, one 3x3).
- Total convolutional layers within the extractor: 6 + 2 × 23 = 52.
- Including the final 1000-way classification layer, the cumulative depth is 53 layers.
Parameter count comprises approximately 41.6 million convolutional weights and approximately 84,000 batch normalization parameters, for a total of roughly 41.7M parameters. For a 256x256 input, the floating-point operation count is 18.7 billion (Redmon et al., 2018).
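These totals can be cross-checked with a short script that walks the stage schedule and accumulates kernel weights and per-layer operations. A sketch, assuming 2 FLOPs per multiply-accumulate and counting the final 1000-way layer as a 1024x1000 matrix plus biases (the function and layout are illustrative):

```python
# (out_channels, stride, num_residual_blocks) per stage, as listed in Section 1
STAGES = [(32, 1, 0), (64, 2, 1), (128, 2, 2), (256, 2, 8), (512, 2, 8), (1024, 2, 4)]

def count_weights_and_flops(size=256, in_ch=3):
    weights, flops = 0, 0

    def conv(k, c_in, c_out, out_size):
        nonlocal weights, flops
        p = k * k * c_in * c_out               # kernel weights (no bias, see Sec. 1)
        weights += p
        flops += 2 * p * out_size * out_size   # 2 FLOPs per multiply-accumulate

    for c_out, stride, blocks in STAGES:
        size //= stride
        conv(3, in_ch, c_out, size)            # stem / downsampling conv
        for _ in range(blocks):
            conv(1, c_out, c_out // 2, size)   # 1x1 channel reduction
            conv(3, c_out // 2, c_out, size)   # 3x3 expansion
        in_ch = c_out

    weights += 1024 * 1000 + 1000              # final 1000-way classification layer
    return weights, flops

w, f = count_weights_and_flops()
print(f"weights ~= {w/1e6:.1f}M, FLOPs ~= {f/1e9:.1f}B")
# weights ~= 41.6M, FLOPs ~= 18.6B
```

The small residual gap to the cited 41.7M / 18.7 billion figures is consistent with batch-norm parameters and differing counting conventions for the classifier head.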
4. Layer-wise Topology and Feature Map Organization
The feature map topology across stages, for a 256x256 input, is summarized as follows:
| Stage | Output Size | Conv Layers + Residual Blocks | Channels |
|---|---|---|---|
| 1 | 256x256 | 3x3, stride 1 | 32 |
| 2 | 128x128 | 3x3, stride 2, then 1 block | 64 |
| 3 | 64x64 | 3x3, stride 2, then 2 blocks | 128 |
| 4 | 32x32 | 3x3, stride 2, then 8 blocks | 256 |
| 5 | 16x16 | 3x3, stride 2, then 8 blocks | 512 |
| 6 | 8x8 | 3x3, stride 2, then 4 blocks | 1024 |
At each residual block, the identity mapping $\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x})$ is maintained, with $\mathcal{F}$ constructed as in Section 2.
5. Comparative Performance Analysis
When benchmarked for ImageNet classification (256x256, single crop), Darknet-53 yields a Top-1 accuracy of 77.2% and a Top-5 accuracy of 93.8%. On a Titan X it achieves 78 FPS, sustaining 1457 BFLOP/s across its 18.7 billion floating-point operations per image. Comparative results against prior architectures:
| Backbone | Top-1 | Top-5 | Ops (G) | BFLOP/s | FPS |
|---|---|---|---|---|---|
| Darknet-19 | 74.1% | 91.8% | 7.29 | 1246 | 171 |
| ResNet-101 | 77.1% | 93.7% | 19.7 | 1039 | 53 |
| ResNet-152 | 77.6% | 93.8% | 29.4 | 1090 | 37 |
| Darknet-53 | 77.2% | 93.8% | 18.7 | 1457 | 78 |
Darknet-53 matches or exceeds ResNet-101 on both Top-1 and Top-5 accuracy and matches ResNet-152 on Top-5. It is roughly 1.5× faster than ResNet-101 and 2× faster than ResNet-152. While it uses approximately 2.6× the FLOPs of Darknet-19, it delivers a Top-1 accuracy about 3 percentage points higher (77.2% vs. 74.1%). It is also more FLOP-efficient than ResNet-101 and achieves greater GPU utilization, as reflected in its higher BFLOP/s figure.
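The speed and efficiency claims follow directly from the table; a quick arithmetic check:

```python
# Figures copied from the comparison table above
fps = {"Darknet-19": 171, "ResNet-101": 53, "ResNet-152": 37, "Darknet-53": 78}
ops = {"Darknet-19": 7.29, "ResNet-101": 19.7, "ResNet-152": 29.4, "Darknet-53": 18.7}
top1 = {"Darknet-19": 74.1, "Darknet-53": 77.2}

print(f"vs ResNet-101: {fps['Darknet-53'] / fps['ResNet-101']:.1f}x faster")  # ~1.5x
print(f"vs ResNet-152: {fps['Darknet-53'] / fps['ResNet-152']:.1f}x faster")  # ~2.1x
print(f"vs Darknet-19: {ops['Darknet-53'] / ops['Darknet-19']:.1f}x the FLOPs, "
      f"+{top1['Darknet-53'] - top1['Darknet-19']:.1f} points Top-1")         # ~2.6x, +3.1
```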
6. Role in Detection and Empirical Results
As the backbone for YOLOv3, Darknet-53 provides intermediate feature maps for the multi-scale detection heads, enabling robust detection performance. On COCO 2017 (608x608 input), YOLOv3 with Darknet-53 achieves:
- AP (IoU=.5:.95): 33.0
- AP50: 57.9
- AP75: 34.4
- APS: 18.3, APM: 35.4, APL: 41.9
This performance places YOLOv3 (with Darknet-53) ahead of one-stage SSD variants on AP and competitive with RetinaNet at a 3–4× speed advantage (Redmon et al., 2018).
7. Summary and Synthesis
Darknet-53 realizes a hybrid design that synergizes the efficient small-kernel stacking of Darknet-19 with the optimization benefits of ResNet's identity-mapping residual blocks. At a total of 53 convolutional layers and 41.7M parameters, the architecture delivers competitive accuracy at high throughput, serving effectively as a versatile backbone for both classification and detection settings. Its explicit modularity, computational profile, and empirical benchmarks support its adoption as a baseline for real-time detection systems.