- The paper introduces HarDNet, which minimizes DRAM memory traffic and lowers inference latency by up to 45% while preserving accuracy.
- It employs Harmonic Densely Connected Blocks that reduce concatenation overhead and optimize channel ratios for improved efficiency on edge devices.
- Empirical results show a 30%-50% CIO reduction and competitive accuracy on benchmarks like ImageNet and CamVid, highlighting its practical benefits.
HarDNet: A Low Memory Traffic Network
This paper presents the Harmonic Densely Connected Network (HarDNet), a neural network architecture designed for both low memory traffic and high computational efficiency. It addresses a key challenge in deploying neural networks for tasks like real-time object detection and semantic segmentation on edge devices: inference latency is often dominated by the DRAM traffic incurred when accessing intermediate feature maps, not by multiply-accumulate operations (MACs) or model size alone.
Key Contributions and Methodology
The authors introduce a new metric for evaluating CNN architectures: Convolutional Input/Output (CIO). CIO approximates the DRAM traffic required for feature-map access, based on the observation that inference latency correlates more strongly with DRAM traffic than with the raw count of computational operations. HarDNet optimizes this metric through sparsified layer connections: it draws inspiration from DenseNet but prunes most of the dense shortcuts, reducing the concatenation overhead that typically drives memory traffic up.
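As a rough illustration of the metric, CIO for a convolutional layer sums the sizes of its input and output feature maps; a network's CIO is the sum over its layers. The layer shapes below are hypothetical, chosen only to show the computation:

```python
def conv_cio(c_in, c_out, h, w):
    """CIO of one convolution: elements in the input feature map
    plus elements in the output feature map (a proxy for the DRAM
    reads and writes the layer incurs)."""
    return c_in * h * w + c_out * h * w

# Hypothetical layer shapes (c_in, c_out, h, w), for illustration only.
layers = [(3, 64, 224, 224), (64, 128, 112, 112), (128, 256, 56, 56)]
total_cio = sum(conv_cio(*shape) for shape in layers)
```

A dense block inflates the `c_in` term of later layers (each layer reads all preceding outputs), which is why pruning connections lowers CIO even when MACs barely change.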
Architectural Insights
The HarDNet architecture employs a structured connection scheme called Harmonic Densely Connected Blocks (HDBs). This design sparsifies the layer connections of a traditional DenseNet, cutting the concatenation cost. Additionally, by balancing each layer's input/output channel ratio, the architecture avoids layers with inefficiently low computational density. Together these choices make the architecture well suited to edge devices, where DRAM bandwidth limitations significantly affect performance.
Empirical Results
The paper reports substantial gains in inference time. HarDNet achieves a 35%-45% reduction in inference time relative to state-of-the-art architectures such as FC-DenseNet-103, DenseNet-264, and ResNet-152, and a 30%-50% reduction in CIO compared to DenseNet and ResNet. Notably, these improvements do not come at the cost of accuracy: HarDNet sustains competitive results on standard benchmarks such as ImageNet and CamVid.
Practical and Theoretical Implications
These results have significant implications for real-time image processing on resource-constrained devices. By minimizing memory traffic rather than just the number of calculations (MACs) or the model size, HarDNet provides a more holistic optimization strategy for CNNs, especially in applications requiring high throughput and low latency.
The authors open pathways for further research into architectures optimized for memory traffic efficiency. As hardware evolves, particularly with emerging architectures that may support fused computations or decreased reliance on traditional DRAM, such an approach could lead to even greater reductions in inference time and energy consumption.
Future Developments
This work suggests promising directions for future developments in optimizing neural network architectures. One potential area of exploration is the integration of HarDNet's concepts into other types of neural networks, such as those used in Natural Language Processing or Reinforcement Learning, where memory bottlenecks also present significant challenges. Furthermore, exploring adaptive methods to adjust the connection densities dynamically based on data characteristics could yield additional performance gains.
In conclusion, HarDNet presents a compelling case for incorporating memory traffic considerations into the design of CNNs, highlighting a sophisticated understanding of the trade-offs in architecture design that go beyond traditional metrics. This paper underscores the potential for such optimized networks to enable more efficient edge computing applications, signifying a step forward in deploying deep learning solutions in resource-constrained environments.