- The paper introduces a voting-based detection framework that integrates near and long-range visual evidence through a log-polar vote field.
- It achieves competitive performance on COCO with an AP of 46.4 and improves detection of small and occluded objects.
- The versatile voting module boosts performance in image-to-image translation when integrated with GAN frameworks like CycleGAN and Pix2Pix.
HoughNet: Integrating Near and Long-Range Evidence for Bottom-Up Object Detection
The paper "HoughNet: Integrating near and long-range evidence for bottom-up object detection" introduces HoughNet, an innovative one-stage, anchor-free, voting-based bottom-up object detection framework. This approach is inspired by the Generalized Hough Transform (GHT) and seeks to harness both near and long-range visual evidence through a unique voting mechanism, enhancing traditional object detection methodologies that predominantly rely on localized evidence.
Core Contributions and Approach
HoughNet distinguishes itself by employing a voting-based strategy in which the presence of an object at a location is established by the votes accumulated at that location. This diverges from traditional top-down object detectors and most existing anchor-free models, which focus predominantly on local visual cues. The voting mechanism in HoughNet relies on a log-polar "vote field," a novel adaptation that captures evidence over varying spatial ranges, akin to foveated vision systems.
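To make the vote-field idea concrete, the sketch below builds a toy log-polar region map: offsets close to the center form a fine central cell, while more distant offsets are grouped into coarser ring-and-sector regions. The function name, window size, and number of rings and sectors are illustrative assumptions and do not reproduce the paper's exact vote-field configuration.

```python
import numpy as np

def build_log_polar_vote_field(field_size=33, num_rings=5, num_angles=6, r_min=2.0):
    """Assign each offset in a field_size x field_size window to a log-polar region.

    Region 0 is a small central cell; the remaining regions are log-spaced
    radial rings split into angular sectors, so spatial precision decreases
    with distance from the center. All parameters are illustrative.
    """
    half = field_size // 2
    # Log-spaced ring boundaries from r_min out to the window border.
    ring_edges = np.geomspace(r_min, half + 1, num_rings + 1)
    region_map = np.zeros((field_size, field_size), dtype=np.int64)

    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            r = np.hypot(dx, dy)
            if r < r_min:
                region_map[dy + half, dx + half] = 0  # central region
                continue
            ring = min(np.searchsorted(ring_edges, r, side="right") - 1, num_rings - 1)
            angle = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)   # normalized to [0, 1]
            sector = min(int(angle * num_angles), num_angles - 1)
            region_map[dy + half, dx + half] = 1 + ring * num_angles + sector
    return region_map  # integer values in 0 .. num_rings * num_angles
```

With these default (assumed) parameters the map covers a 33x33 window with 31 regions in total: one central cell plus 30 ring-sector cells that grow coarser with distance.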
Votes in HoughNet's framework are cast through spatial regions defined by a log-polar grid whose precision decreases with distance, and are accumulated at candidate object locations. This design choice is pivotal: it integrates both local and contextual visual evidence, improving the robustness of object detection. A convolutional neural network (CNN) processes the input image into class-conditional visual evidence maps, which then vote across the predetermined vote-field regions.
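Under the same assumptions, the following sketch illustrates one way the vote accumulation could be realized in PyTorch: each class's per-region evidence is spread over the offsets belonging to that region using fixed, non-learned kernels and summed into a per-class vote map. The tensor layout, normalization, and plain loop are simplifications for readability, not the paper's optimized voting layer.

```python
import torch
import torch.nn.functional as F

def accumulate_votes(evidence, region_map, num_regions):
    """Aggregate class-conditional evidence into per-class vote maps.

    evidence:   (B, C * num_regions, H, W) tensor; channel (c, r) holds the
                evidence that class c casts through log-polar region r.
    region_map: (K, K) integer map, e.g. from build_log_polar_vote_field.
    Returns a (B, C, H, W) vote map per class (unoptimized illustrative sketch).
    """
    B, CR, H, W = evidence.shape
    C = CR // num_regions
    K = region_map.shape[0]

    # One fixed kernel per region: indicator of the region's cells, area-normalized.
    kernels = torch.zeros(num_regions, 1, K, K)
    for r in range(num_regions):
        mask = (torch.as_tensor(region_map) == r).float()
        kernels[r, 0] = mask / mask.sum().clamp(min=1.0)

    votes = evidence.new_zeros(B, C, H, W)
    for c in range(C):
        per_region = evidence[:, c * num_regions:(c + 1) * num_regions]  # (B, R, H, W)
        # Each location gathers the evidence lying in its log-polar regions.
        votes[:, c] = F.conv2d(per_region, kernels.to(evidence),
                               padding=K // 2, groups=num_regions).sum(dim=1)
    return votes
```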
Experimental Evaluation
The empirical results are strong. On the COCO dataset, HoughNet achieves an average precision (AP) of 46.4 (65.1 AP at an IoU threshold of 0.5), competitive with state-of-the-art bottom-up detectors such as CenterNet, and it outperforms the majority of one-stage and two-stage methods. By integrating contextual cues, including evidence from surrounding objects and background regions, HoughNet improves the detection of small and occluded objects.
The paper also applies HoughNet's voting module beyond object detection, to image-to-image translation. Integrated with CycleGAN and Pix2Pix, two popular GAN models, the voting module delivers clear gains, as evidenced by improved precision and class-IoU metrics on the Cityscapes dataset. This extension underscores the module's versatility and potential for broader applications.
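The exact wiring inside the generators is not spelled out in this summary, so the block below is only a hypothetical illustration (the class name and the fusion scheme are assumptions) of how a voting block, reusing accumulate_votes from the earlier sketch, might be dropped into a translation generator to expose long-range context.

```python
import torch
import torch.nn as nn

class VotingBlock(nn.Module):
    """Hypothetical drop-in block for a translation generator (names are illustrative).

    Projects features to per-region evidence, aggregates them with the fixed
    log-polar kernels via accumulate_votes (defined in the earlier sketch),
    and fuses the result back into the feature map.
    """
    def __init__(self, channels, num_regions, region_map):
        super().__init__()
        self.num_regions = num_regions
        self.region_map = region_map
        self.to_evidence = nn.Conv2d(channels, channels * num_regions, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        evidence = self.to_evidence(x)                           # (B, C*R, H, W)
        votes = accumulate_votes(evidence, self.region_map, self.num_regions)
        return self.fuse(torch.cat([x, votes], dim=1))           # fuse context back in
```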
Implications and Future Directions
This work pushes forward the boundary of what can be achieved with bottom-up detection models by stressing the importance of leveraging both short-range and contextually rich long-range evidence. The introduction of the log-polar vote field within a deep learning framework adds a compelling dimension to the voting-based detection paradigm, offering new avenues for research and practical applications.
Future research could extend HoughNet-style architectures to real-time applications, where understanding scene context enhances detection accuracy. Incorporating temporal information from video sequences, by aggregating contextual evidence across frames, is another promising direction.
HoughNet's demonstration of its voting framework in tasks beyond object detection hints at cross-domain applicability. Future studies may generalize the voting paradigm to other complex computer-vision tasks, and potentially to other domains that require robust decisions from spatially distributed evidence.
In conclusion, HoughNet represents a marked advancement in object detection methodologies, emphasizing the utility of versatile, context-aware visual evidence accumulation, and is poised to inspire future work in bottom-up recognition and beyond.