FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Published 15 Mar 2024 in cs.CV, cs.AI, cs.IR, and cs.LG | (2403.10516v2)

Abstract: Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.

Summary

The paper demonstrates that FeatUp significantly improves the spatial fidelity of deep features without compromising semantic integrity.
It introduces a model-agnostic framework that leverages learned downsampler and upsampler architectures, including attention mechanisms and joint bilateral upsampling.
Experimental validation across segmentation, depth prediction, and CAM generation shows that FeatUp outperforms baselines without requiring model re-training.

Enhancing Spatial Resolution of Deep Features with FeatUp: A Model-Agnostic Framework

Overview

Deep features underpin modern computer vision: they encode rich semantic information that supports downstream applications from classification to segmentation. For dense prediction, however, their usefulness is limited by a sharp loss of spatial resolution, a byproduct of the aggressive pooling in the models that produce them. To counteract this, the paper introduces FeatUp, a framework that restores the spatial resolution of deep features from any backbone while preserving their semantics. Because FeatUp is task- and model-agnostic, it can improve feature utility across a broad range of tasks.

Architecture

FeatUp builds on multi-view consistency, an idea with close analogies to NeRF: a high-resolution signal is reconstructed from multiple low-resolution observations obtained by applying slight transformations to the input. The framework has two core components: a learned downsampler and a learned upsampler.
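The multi-view consistency idea can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the "views" are circular pixel shifts, the downsampler is fixed average pooling rather than a learned one, and the low-resolution observations are simulated from a known high-resolution map instead of coming from a backbone. The names `avg_pool`, `shift`, and `multiview_loss` are hypothetical.

```python
import numpy as np

def avg_pool(feats, factor):
    """Downsample a (C, H, W) array by averaging factor x factor blocks."""
    c, h, w = feats.shape
    return feats.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def shift(feats, dy, dx):
    """Toy view transform (assumption): circular shift along H and W."""
    return np.roll(feats, shift=(dy, dx), axis=(1, 2))

def multiview_loss(hr_feats, lr_views, shifts, factor):
    """Average squared error between each low-res observation and the
    transformed-then-downsampled high-res candidate."""
    total = 0.0
    for (dy, dx), lr in zip(shifts, lr_views):
        pred = avg_pool(shift(hr_feats, dy, dx), factor)
        total += float(np.mean((pred - lr) ** 2))
    return total / len(shifts)

# Simulate low-res observations of a known high-res map under four shifts.
rng = np.random.default_rng(0)
true_hr = rng.normal(size=(4, 8, 8))
shifts = [(0, 0), (0, 1), (1, 0), (1, 1)]
factor = 2
lr_views = [avg_pool(shift(true_hr, dy, dx), factor) for dy, dx in shifts]

# The true map is consistent with every view; a naive nearest-neighbour
# upsample of a single view is not.
loss_true = multiview_loss(true_hr, lr_views, shifts, factor)
naive_hr = np.repeat(np.repeat(lr_views[0], factor, axis=1), factor, axis=2)
loss_naive = multiview_loss(naive_hr, lr_views, shifts, factor)
print(loss_true, loss_naive)
```

The point of the sketch is that a single low-resolution view underdetermines the high-resolution map, whereas several slightly different views together penalize candidates that only explain one of them.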

The learned downsampler rests on the principle that a latent high-resolution feature map, once downsampled, should closely match the low-resolution features observed from each view. Two variants are proposed: a simple learned blur kernel and a more sophisticated attention-augmented downsampler. The latter can capture varying receptive fields and salience, giving a more adaptive, content-aware downsampling scheme.
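As a rough illustration of the simpler variant, the sketch below applies one blur kernel to every channel with a stride. It assumes the kernel is parameterized by unconstrained logits and passed through a softmax so that its weights are non-negative and sum to 1; that parameterization, the function name `blur_downsample`, and the fixed (not trained) logits are choices made for this example only.

```python
import numpy as np

def blur_downsample(hr_feats, kernel_logits, stride):
    """Blur a (C, H, W) array with one shared k x k kernel and subsample.

    kernel_logits: (k, k) unconstrained parameters; a softmax turns them
    into non-negative weights that sum to 1 (assumption for this sketch).
    """
    k = kernel_logits.shape[0]
    weights = np.exp(kernel_logits - kernel_logits.max())
    weights /= weights.sum()

    c, h, w = hr_feats.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((c, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = hr_feats[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # Same kernel for every channel: a depthwise blur.
            out[:, i, j] = (patch * weights).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
hr = rng.normal(size=(3, 8, 8))

# With all-zero logits the kernel is uniform, so a 2x2 kernel with stride 2
# reduces to plain average pooling.
uniform = blur_downsample(hr, np.zeros((2, 2)), stride=2)

# Because the weights sum to 1, a constant feature map stays constant.
constant = blur_downsample(np.full((3, 8, 8), 2.5), rng.normal(size=(3, 3)), stride=2)
```

In a trainable version the logits would be optimized jointly with the upsampler; here they are fixed so the behaviour is easy to check by hand.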

For upsampling, FeatUp introduces two architectures: one based on Joint Bilateral Upsampling (JBU) and one based on an implicit network representation. The JBU-based architecture is designed for fast, feedforward upsampling and uses an efficient CUDA kernel for JBU that runs substantially faster than conventional implementations. The implicit architecture, by contrast, reconstructs high-resolution features at arbitrary resolutions by overfitting to a single image, much as NeRF fits a network to a single 3D scene.
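For readers unfamiliar with JBU, the following is a minimal NumPy sketch of the classic operation with fixed Gaussian spatial and range weights. It is not FeatUp's upsampler or its CUDA kernel. The function name, the grayscale guide, the block-averaged low-resolution guide, and the `sigma` defaults are all assumptions made for illustration.

```python
import numpy as np

def joint_bilateral_upsample(lr_feats, guide, radius=1,
                             sigma_spatial=1.0, sigma_range=0.1):
    """Upsample (C, h, w) features to the size of a (H, W) grayscale guide.

    Each output pixel is a normalized weighted average of nearby low-res
    features. Weights combine spatial closeness with similarity between
    the guide at the output pixel and the guide at the low-res neighbour.
    """
    c, h, w = lr_feats.shape
    H, W = guide.shape
    factor = H // h
    # Low-res guide via block averaging (a choice made for this sketch).
    guide_lr = guide.reshape(h, factor, w, factor).mean(axis=(1, 3))

    out = np.zeros((c, H, W))
    for y in range(H):
        for x in range(W):
            # Position of this output pixel in low-res coordinates.
            cy = (y + 0.5) / factor - 0.5
            cx = (x + 0.5) / factor - 0.5
            qy0, qx0 = int(round(cy)), int(round(cx))
            acc, norm = np.zeros(c), 0.0
            for qy in range(qy0 - radius, qy0 + radius + 1):
                for qx in range(qx0 - radius, qx0 + radius + 1):
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue
                    d2 = (qy - cy) ** 2 + (qx - cx) ** 2
                    r2 = (guide[y, x] - guide_lr[qy, qx]) ** 2
                    wgt = np.exp(-d2 / (2 * sigma_spatial ** 2)) \
                        * np.exp(-r2 / (2 * sigma_range ** 2))
                    acc += wgt * lr_feats[:, qy, qx]
                    norm += wgt
            out[:, y, x] = acc / norm
    return out

# A guide with a sharp vertical edge, and low-res features that follow it.
guide = np.zeros((8, 8))
guide[:, 4:] = 1.0
lr = np.zeros((1, 4, 4))
lr[0, :, 2:] = 1.0
up = joint_bilateral_upsample(lr, guide)
```

In this example the range term suppresses neighbours that lie across the edge in the guide, so the upsampled features keep the edge sharp where plain bilinear interpolation would blur it.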

Experimental Validation

The evaluation covers class activation map (CAM) generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation. Across these tasks, FeatUp outperforms the feature upsampling and image super-resolution baselines it is compared against. It improves the spatial fidelity of features without compromising their original semantics, as shown by the gains obtained when FeatUp-upsampled features are swapped into downstream tasks without re-training the model.

Implications and Future Directions

FeatUp opens new avenues for using deep features in dense prediction tasks: it improves output quality, and its JBU-based variant does so in a single forward pass. Its model-agnostic design broadens its applicability, promising gains in spatial resolution regardless of the backbone architecture. Future work could explore optimizing FeatUp's components for larger upsampling factors, integrating it with real-time systems, and extending it to data modalities beyond images.

In summary, FeatUp addresses the common problem of reduced spatial resolution in deep feature representations, enabling more precise dense prediction in computer vision.
