Deep Residual Learning for Image Recognition
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Explain it Like I'm 14
Deep Residual Learning for Image Recognition — Explained Simply
1) What is this paper about?
This paper introduces a new way to build very deep neural networks so computers can recognize images better. The idea is called “residual learning,” and the networks built with it are called “ResNets.” The key trick is to give the network “shortcut paths” so each set of layers learns small fixes instead of trying to learn everything from scratch.
2) What questions did the researchers ask?
They set out to answer simple but important questions:
- Why do deeper neural networks sometimes get worse instead of better?
- Can we design networks where adding more layers actually helps?
- Would it be easier for a network to learn “the changes needed” (the residual) rather than the whole answer?
- How deep can we go if we build networks this new way?
3) How did they do it? (Methods in everyday language)
Normally, a deep network is like stacking many Lego blocks, where each block tries to transform the input into a better representation. The authors noticed that when they simply stacked more blocks, training often got harder and results got worse.
Their fix: give each block a shortcut. Instead of a block learning the entire transformation, it learns only the “difference” (the residual) between the input and the desired output, and then adds it back. In simple math: if the input is x and the block learns the residual function F(x), the block outputs F(x) + x.
- Analogy: Imagine writing an essay. Instead of rewriting it from scratch every time (hard!), you keep the original and just add small edits. Those edits are the “residual.”
- Another analogy: Climbing stairs with a handrail. The handrail (the shortcut) helps you move up more safely and steadily, especially as the staircase (the network) gets taller.
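The idea above can be sketched in a few lines of NumPy. This is a toy, untrained block; the function names and sizes are illustrative, not taken from the paper:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: the standard nonlinearity used in ResNets.
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # The weighted layers learn only the "fix" F(x); the shortcut then
    # adds the original input back: output = relu(F(x) + x).
    fx = relu(x @ w1) @ w2
    return relu(fx + x)

# With all-zero weights, F(x) is zero and the block simply passes the
# (rectified) input through. Representing "change almost nothing" is
# trivially easy here, which is the intuition for why adding more
# residual blocks does not hurt training.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
zeros = np.zeros((8, 8))
out = residual_block(x, zeros, zeros)
```

By contrast, a plain block without the shortcut would have to learn the full identity mapping through its weights just to leave the input unchanged, which turns out to be surprisingly hard for optimizers.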
How they tested it:
- They built “plain” networks (no shortcuts) and matching “ResNets” (with shortcuts) on the large ImageNet dataset: 18- and 34-layer versions for a direct comparison, plus deeper ResNets with 50, 101, and 152 layers.
- They also tried very deep versions (110 and even 1202 layers) on a smaller dataset called CIFAR-10.
- They used standard training tools (stochastic gradient descent, batch normalization) and compared errors to see which approach worked better.
- For very deep models, they used a “bottleneck” block: tiny layers that reduce, process, then restore information (using 1×1 and 3×3 filters). Think of it as squeezing information through a narrow hallway to make the whole building more efficient.
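The efficiency of the bottleneck can be checked with back-of-the-envelope parameter counting. The channel sizes 256 → 64 → 64 → 256 follow the paper's bottleneck design; bias terms are ignored for simplicity, and the "plain" alternative shown is just an illustrative comparison:

```python
def conv_params(k, c_in, c_out):
    # Weights in a k x k convolution mapping c_in channels to c_out.
    return k * k * c_in * c_out

# Bottleneck: a 1x1 conv reduces 256 -> 64 channels, a 3x3 conv
# processes at 64 channels, and a 1x1 conv restores 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

# The "wide hallway" alternative: two 3x3 convolutions kept at 256 channels.
plain = 2 * conv_params(3, 256, 256)

print(bottleneck, plain)
```

Squeezing through the narrow 64-channel hallway makes the block roughly 17x cheaper in weights than processing everything at full width, which is what makes 100+ layer networks affordable.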
4) What did they find, and why does it matter?
Here are the main results:
- Deeper plain networks got worse: When they added more layers without shortcuts, training became harder and accuracy dropped.
- ResNets got better with depth: With shortcuts and residual learning, deeper models actually trained more easily and performed better.
- Record-setting performance: A 152-layer ResNet achieved top results on ImageNet. Combining several ResNets (an ensemble) reached a very low 3.57% top-5 error, winning the ImageNet 2015 classification competition.
- Works beyond classification: Using ResNets improved object detection and segmentation too. On the COCO detection benchmark, they saw about a 28% relative improvement in a key metric—showing that the features learned by ResNets are broadly useful.
- Extremely deep is possible: They trained networks with over 1000 layers on CIFAR-10. These trained successfully, though the very deepest model overfit a bit on that small dataset (great training accuracy, but not as good on test data), which is a common sign the model is larger than needed.
- Small “fixes” in practice: They measured the size of the residual signals and found they’re generally small. This supports the idea that layers mostly make small improvements when shortcuts are present, which is easier to learn.
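The kind of measurement involved can be sketched on a toy block. The weights below are random and small rather than trained, so the numbers are purely illustrative; only the measurement itself, comparing the size of the residual branch F(x) to the size of the signal x it corrects, mirrors the idea:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
x = rng.standard_normal(dim)

ratios = []
for _ in range(10):
    # Small random weights stand in for a trained block whose
    # residual branch makes only modest corrections.
    w1 = 0.05 * rng.standard_normal((dim, dim))
    w2 = 0.05 * rng.standard_normal((dim, dim))
    fx = np.maximum(x @ w1, 0.0) @ w2        # residual branch F(x)
    ratios.append(np.linalg.norm(fx) / np.linalg.norm(x))
    x = x + fx                               # shortcut adds the small fix

# Each ratio stays well below 1: the layer nudges the signal
# rather than replacing it.
print(max(ratios))
```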
Why it matters:
- It solved a core training problem: Before this, simply stacking more layers often made things worse. Residual learning flipped that—now depth usually helps.
- It was more efficient: Their 152-layer ResNet was deeper yet used fewer computations than older popular models (like VGG), making it both powerful and practical.
5) What’s the impact and what could this lead to?
- Foundation for modern AI vision: ResNets quickly became a standard building block in computer vision, influencing many later architectures.
- Better performance on many tasks: From classifying images to finding and segmenting objects, residual learning improved accuracy across the board.
- A general idea: “Learn the change, not the whole thing.” This mindset can help in many areas beyond images—any problem where making small corrections step-by-step is easier than starting from scratch.
- Enables very deep models: With residual shortcuts, researchers can safely explore much deeper networks, unlocking new performance gains and ideas.
In short: By letting layers learn small fixes and adding shortcut connections, the authors made deep networks both easier to train and more accurate. This simple idea reshaped how we build neural networks and led to major advances in computer vision.