Learning To Count Everything (2104.08391v1)

Published 16 Apr 2021 in cs.CV

Abstract: Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is to count objects from any category given only a few annotated instances from that category. To this end, we pose counting as a few-shot regression task. To tackle this task, we present a novel method that takes a query image together with a few exemplar objects from the query image and predicts a density map for the presence of all objects of interest in the query image. We also present a novel adaptation strategy to adapt our network to any novel visual category at test time, using only a few exemplar objects from the novel category. We also introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. The images are annotated with two types of annotation, dots and bounding boxes, and they can be used for developing few-shot counting models. Experiments on this dataset show that our method outperforms several state-of-the-art object detectors and few-shot counting approaches. Our code and dataset can be found at https://github.com/cvlab-stonybrook/LearningToCountEverything.

Authors (4)
  1. Viresh Ranjan (10 papers)
  2. Udbhav Sharma (1 paper)
  3. Thu Nguyen (27 papers)
  4. Minh Hoai (48 papers)
Citations (115)

Summary

The paper "Learning To Count Everything" (Ranjan et al., 2021 ) introduces the task of few-shot visual counting. Unlike traditional counting methods that focus on a single object category (like crowds or cells) and require vast amounts of labeled data for that specific category, this work aims to count objects from any category using only a few examples from the target image itself. This addresses two major challenges in scaling counting to a large number of visual categories: the high cost of data annotation for each category and the lack of diverse counting datasets.

To tackle this, the authors propose a novel approach called Few Shot Adaptation and Matching Network (FamNet). FamNet is designed as a few-shot regression model that takes a query image and a few exemplar bounding boxes of the objects to be counted from that image as input. The output is a density map representing the presence of the objects, where the sum of the density values yields the estimated count.

The FamNet architecture consists of two main components:

  1. Feature Extraction Module: This uses the first four blocks of a pre-trained ResNet-50 backbone (with frozen parameters). It extracts multi-scale convolutional features from both the query image and the exemplar objects (using ROI pooling for the exemplars).
  2. Density Prediction Module: This module is designed to be category-agnostic. Instead of directly processing raw image features, it operates on correlation maps computed between the image features and the exemplar features. To handle objects at different scales, exemplar features are scaled (e.g., by 0.9x, 1.1x) before correlation, and the resulting correlation maps are concatenated. This concatenated representation is then fed into a series of convolution and upsampling layers to predict the final density map at the resolution of the input image. A minimal code sketch of both components follows this list.

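Below is a minimal PyTorch-style sketch of the two components described above. The exact ResNet-50 cut point, the head's layer sizes, the use of ROI-Align in place of ROI pooling, and the averaging over exemplars are illustrative assumptions; the authors' actual configuration is in their repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class FamNetSketch(nn.Module):
    """Sketch of FamNet's two components; layer sizes are illustrative."""

    def __init__(self):
        super().__init__()
        # Feature extraction: frozen, pre-trained ResNet-50 trunk.
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,   # stride-16 features
        )
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Category-agnostic density prediction head over correlation maps.
        self.head = nn.Sequential(
            nn.Conv2d(3, 196, 7, padding=3), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(196, 128, 5, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4), nn.Conv2d(64, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, image, boxes):
        # image: (1, 3, H, W); boxes: (K, 4) exemplar boxes as (x1, y1, x2, y2) tensor.
        feat = self.backbone(image)                               # (1, C, H/16, W/16)
        stride = image.shape[-1] / feat.shape[-1]
        # Exemplar features via ROI-Align (stand-in for the ROI pooling in the paper).
        roi = torchvision.ops.roi_align(feat, [boxes], output_size=(7, 7),
                                        spatial_scale=1.0 / stride)

        corr_maps = []
        for scale in (0.9, 1.0, 1.1):                             # multi-scale exemplar kernels
            k = F.interpolate(roi, scale_factor=scale, mode="bilinear",
                              align_corners=False)
            pad = k.shape[-1] // 2
            c = F.conv2d(F.pad(feat, (pad, pad, pad, pad)), k)    # correlate with image features
            c = F.interpolate(c, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
            corr_maps.append(c.mean(dim=1, keepdim=True))         # average over the K exemplars

        corr = torch.cat(corr_maps, dim=1)                        # (1, 3, H/16, W/16)
        density = F.relu(self.head(corr))                         # ~input resolution
        return density, density.sum()                             # count = sum of the density map
```
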
The network is trained on a dataset where exemplar objects have bounding box annotations and other objects have dot annotations. A key aspect of training is generating the ground truth density map. Since object sizes vary greatly across different categories in their dataset, the authors use an adaptive Gaussian smoothing approach. For each image, they estimate the average distance between nearest neighboring dot annotations. This average distance is used as the size of the Gaussian kernel window (with standard deviation set to a quarter of the window size) to generate a smoothed density map from the dot annotations. The network is trained to minimize the mean squared error (MSE) between its predicted density map and this adaptive ground truth density map. Training uses the Adam optimizer with a learning rate of $10^{-5}$ and a batch size of 1. Images are resized to a fixed height of 384 while preserving aspect ratio.

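A small sketch of this adaptive ground-truth generation is shown below (the use of SciPy's gaussian_filter with a truncated kernel and the single-dot fallback value are assumptions made for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree


def adaptive_density_map(dots, height, width):
    """Ground-truth density map from dot annotations for one image.

    dots: (N, 2) array of (x, y) point annotations.
    The Gaussian window size is the mean nearest-neighbour distance between
    dots, and sigma is a quarter of that window, as described above.
    """
    dot_map = np.zeros((height, width), dtype=np.float32)
    xs = np.clip(dots[:, 0].astype(int), 0, width - 1)
    ys = np.clip(dots[:, 1].astype(int), 0, height - 1)
    dot_map[ys, xs] = 1.0

    # Average distance from each dot to its nearest neighbouring dot.
    if len(dots) > 1:
        dists, _ = cKDTree(dots).query(dots, k=2)  # column 0 is the dot itself
        window = dists[:, 1].mean()
    else:
        window = 8.0  # fallback for a single annotated object (assumption)

    sigma = window / 4.0
    # truncate=2.0 keeps the kernel radius at ~window/2, i.e. a window-sized kernel.
    density = gaussian_filter(dot_map, sigma=sigma, truncate=2.0)
    return density  # density.sum() is approximately the number of dots
```

The network's predicted density map is then compared against this target with the MSE loss during training.
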
A novel contribution is the test-time adaptation scheme. While the pre-trained FamNet can generalize to novel categories, its performance is further boosted by adapting it using the provided exemplar bounding boxes at test time. This adaptation leverages the precise locations of the exemplars through two specific loss functions applied to the predicted density map:

  • Min-Count Loss ($L_{MinCount}$): Encourages the sum of predicted density values within each exemplar bounding box to be at least one. This ensures that the model predicts the presence of the known objects. $L_{MinCount} = \sum_{b \in B} \max(0, 1 - \|Z_b\|_1)$, where $B$ is the set of exemplar bounding boxes, $Z$ is the predicted density map, and $Z_b$ is the crop of $Z$ corresponding to bounding box $b$.
  • Perturbation Loss ($L_{Per}$): Inspired by correlation filter tracking, this loss encourages the density values around each exemplar location to resemble a 2D Gaussian distribution. $L_{Per} = \sum_{b \in B} \|Z_b - G_{h \times w}\|_2^2$, where $G_{h \times w}$ is a 2D Gaussian window of the same size as $Z_b$.

These two losses are combined into a total adaptation loss $L_{Adapt} = \lambda_1 L_{MinCount} + \lambda_2 L_{Per}$. At test time, a few gradient descent steps are performed on the density prediction module parameters using $L_{Adapt}$ with a small learning rate ($10^{-7}$). This process fine-tunes the network specifically to the visual characteristics and locations of the exemplars in the current test image.

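The adaptation procedure can be sketched as below, reusing the FamNetSketch model from the earlier snippet. The loss weights, the Gaussian window shape, the optimizer, and the number of gradient steps are placeholders rather than the authors' exact settings:

```python
import torch
import torch.nn.functional as F


def gaussian_window(h, w, device):
    """Un-normalized 2D Gaussian window of size h x w (sigma choice is an assumption)."""
    ys = torch.arange(h, device=device, dtype=torch.float32) - (h - 1) / 2.0
    xs = torch.arange(w, device=device, dtype=torch.float32) - (w - 1) / 2.0
    return torch.exp(-(ys[:, None] ** 2 / (2 * (h / 4.0) ** 2)
                       + xs[None, :] ** 2 / (2 * (w / 4.0) ** 2)))


def adaptation_loss(density, boxes, lam1=1e-9, lam2=1e-4):
    """L_Adapt = lam1 * L_MinCount + lam2 * L_Per over the exemplar boxes.

    density: (1, 1, H, W) predicted density map;
    boxes: list of integer (x1, y1, x2, y2) exemplar boxes in image coordinates.
    Lambda defaults are placeholders, not the paper's values.
    """
    min_count, perturb = 0.0, 0.0
    for (x1, y1, x2, y2) in boxes:
        z_b = density[0, 0, y1:y2, x1:x2]                 # crop of Z for this box
        min_count = min_count + F.relu(1.0 - z_b.sum())   # max(0, 1 - ||Z_b||_1), density >= 0
        g = gaussian_window(*z_b.shape, density.device)
        perturb = perturb + ((z_b - g) ** 2).sum()        # ||Z_b - G_{h x w}||_2^2
    return lam1 * min_count + lam2 * perturb


def adapt_and_count(model, image, boxes, steps=100, lr=1e-7):
    """A few gradient steps on the density prediction head only; the backbone stays frozen.
    The step count and choice of plain SGD here are assumptions."""
    box_tensor = torch.tensor(boxes, dtype=torch.float32)
    opt = torch.optim.SGD(model.head.parameters(), lr=lr)
    for _ in range(steps):
        density, _ = model(image, box_tensor)
        loss = adaptation_loss(density, boxes)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, count = model(image, box_tensor)
    return count.item()
```
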
To enable research in few-shot counting, the authors introduce the FSC-147 dataset. This dataset comprises 6135 images across 147 diverse object categories. Each image contains dot annotations for all object instances and bounding box annotations for three randomly selected exemplars. The number of objects per image varies widely, from 7 to over 3700. The dataset is split into disjoint training, validation, and test sets based on object categories to ensure novelty at test time.

Experiments demonstrate that FamNet significantly outperforms several baselines, including simple mean/median counting, adapted few-shot detectors (Feature Reweighting, FSOD), Generic Matching Network (GMN), and MAML, on the FSC-147 validation and test sets (Table 2 in the paper). Notably, FamNet also achieves lower MAE and RMSE than object detectors (Faster R-CNN, RetinaNet, Mask R-CNN) pre-trained on the large COCO dataset, even when evaluated on object categories covered by COCO (Table 3). Ablation studies confirm that both multi-scale features and test-time adaptation contribute positively to FamNet's performance, and the model's accuracy improves with more exemplars (Table 4, Table 5).

For practical implementation, the model relies on standard deep learning frameworks (e.g., PyTorch, TensorFlow) supporting ResNet backbones, ROI pooling, convolutional layers, and custom loss functions. The test-time adaptation adds computational overhead per image compared to a standard forward pass, but the small number of gradient steps makes it feasible. The data preprocessing includes resizing and adaptive Gaussian kernel generation, which needs to be implemented carefully. The provided dataset and code repository (https://github.com/cvlab-stonybrook/LearningToCountEverything) are valuable resources for practitioners.

In summary, "Learning To Count Everything" presents few-shot counting as a practical problem, provides a novel dataset (FSC-147) to benchmark it, and proposes FamNet, an effective architecture with a unique test-time adaptation strategy, demonstrating strong performance across diverse categories with minimal examples.
