- The paper introduces a Hierarchical Perceiver (HiP) model that integrates local attention mechanisms to scale processing of high-resolution and multimodal data.
- It leverages self-supervised learning to derive dense low-dimensional positional embeddings, eliminating the need for handcrafted Fourier features.
- The HiP model demonstrates robust performance across diverse datasets, including ImageNet, AudioSet, and Kinetics, with no modality-specific preprocessing.
An Examination of the Hierarchical Perceiver (HiP) Model
The paper "HiP: Hierarchical Perceiver" introduces an advancement in machine learning architectures known as the Hierarchical Perceiver (HiP). Building on the foundational Perceiver and Perceiver IO models, the authors propose enhancements aimed at scaling these architectures to handle the vast input spaces required for processing raw high-resolution data such as images and video. The central innovation lies in the integration of locality into the Perceiver's attention mechanism, thereby improving efficiency while maintaining the model's versatility across modalities.
Key Contributions
- Incorporation of Locality in Attention Mechanisms: Traditional Perceiver models employ global attention, which impedes efficient scaling to very large input sizes. HiP reintroduces a form of local attention without sacrificing generalization capabilities, making it feasible to scale to high-resolution inputs (see the sketch after this list).
- Self-Supervised Learning of Positional Embeddings: HiP leverages self-supervised masked auto-encoding to learn dense, low-dimensional positional embeddings. This bypasses hand-engineered Fourier features, which are computationally expensive and do not scale well with the number of inputs (a second sketch below illustrates the masking objective).
- Demonstration of Robust Performance Across Modalities: HiP shows competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40, and Kinetics. The same architecture, unchanged and without specialized preprocessing, handles inputs as diverse as images, video, audio, and point clouds.
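To make the grouping concrete, here is a minimal PyTorch sketch of a grouped-attention block in the spirit of the first bullet. The dimensions, the single cross-/self-attention pair, and the name `GroupedPerceiverBlock` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GroupedPerceiverBlock(nn.Module):
    """Split flattened tokens into groups; attention stays within each group."""

    def __init__(self, num_groups, latents_per_group, in_dim, latent_dim, num_heads=4):
        super().__init__()
        # Each group compresses its slice of the input into a few learned latents.
        self.latents = nn.Parameter(torch.randn(num_groups, latents_per_group, latent_dim))
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=in_dim, vdim=in_dim, batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (batch, n_tokens, in_dim)
        b, n, _ = x.shape
        g, k, d = self.latents.shape
        assert n % g == 0, "token count must divide evenly into groups"
        # Locality: fold the group axis into the batch axis so attention
        # never crosses group boundaries.
        x = x.reshape(b * g, n // g, -1)
        q = self.latents.unsqueeze(0).expand(b, -1, -1, -1).reshape(b * g, k, d)
        z, _ = self.cross_attn(q, x, x)          # latents read their group's tokens
        z, _ = self.self_attn(z, z, z)           # latents mix within the group
        return z.reshape(b, g * k, d)            # merged latents feed the next block

block = GroupedPerceiverBlock(num_groups=16, latents_per_group=32, in_dim=64, latent_dim=128)
tokens = torch.randn(2, 4096, 64)                # e.g. a flattened 64x64 feature map
print(block(tokens).shape)                       # torch.Size([2, 512, 128])
```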
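The second bullet can be sketched in the same spirit: a table of dense, low-dimensional embeddings is learned end to end through a masked reconstruction objective. The `reconstructor` argument, the 75% mask ratio, and the zero-filling of masked tokens are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LearnedPositions(nn.Module):
    """Dense, low-dimensional positional embeddings learned end to end,
    in place of hand-engineered Fourier features."""

    def __init__(self, n_tokens, pos_dim):
        super().__init__()
        self.pos = nn.Parameter(0.02 * torch.randn(n_tokens, pos_dim))

    def forward(self, tokens):                   # tokens: (batch, n_tokens, in_dim)
        pos = self.pos.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        return torch.cat([tokens, pos], dim=-1)  # (batch, n_tokens, in_dim + pos_dim)

def masked_reconstruction_loss(reconstructor, embed, tokens, mask_ratio=0.75):
    """Hide most tokens and score reconstruction of the hidden ones; the
    embeddings must capture spatial structure for reconstruction to succeed."""
    mask = torch.rand(tokens.shape[:2]).unsqueeze(-1) < mask_ratio
    visible = tokens.masked_fill(mask, 0.0)      # zero out the hidden positions
    recon = reconstructor(embed(visible))        # must return the same shape as tokens
    return ((recon - tokens) ** 2)[mask.expand_as(tokens)].mean()
```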
Architectural Innovations
The hierarchical structure of the HiP model is central to its performance improvements. By organizing computation into blocks that each partition their input into groups, the architecture exploits the locality that survives data flattening: attention operates only within each group, and the groups' outputs are merged before the next block. This imposes no modality-specific design constraints, preserving the general applicability of the network, while the progressive merging still permits global computation at the bottleneck.
Additionally, because deeper blocks operate on fewer, more abstract latents, HiP can afford wider representations there, combining efficient processing with the capacity to encode expansive input spaces, as sketched below.
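As a rough illustration of this hierarchy, the sketch below stacks the `GroupedPerceiverBlock` defined earlier with a shrinking group count and a growing channel width, ending in a single-group stage where attention becomes global. The three-stage schedule is invented for illustration and does not match the paper's configurations.

```python
class HiPEncoderSketch(nn.Module):
    """Stages get coarser (fewer groups) and wider (more channels) with depth."""

    def __init__(self, in_dim=64):
        super().__init__()
        self.stages = nn.ModuleList([
            # (num_groups, latents_per_group, in_dim, latent_dim)
            GroupedPerceiverBlock(16, 32, in_dim, 128),  # fine-grained, local
            GroupedPerceiverBlock(4, 32, 128, 256),      # coarser, wider
            GroupedPerceiverBlock(1, 32, 256, 512),      # one group: global bottleneck
        ])

    def forward(self, x):                                # x: (batch, n_tokens, in_dim)
        for stage in self.stages:
            x = stage(x)                                 # groups merge between stages
        return x                                         # (batch, 32, 512) global summary

encoder = HiPEncoderSketch()
print(encoder(torch.randn(2, 4096, 64)).shape)           # torch.Size([2, 32, 512])
```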
Implications and Future Directions
The HiP model sets a precedent for designing perception models that can process multiple modalities at different resolutions without explicit data serialization. The self-supervised learning of positional embeddings represents a significant departure from fixed or handcrafted embeddings, suggesting a shift towards more adaptable architectures suitable for varied and large-scale data environments.
The proposed architecture's ability to dispense with domain-specific preprocessing such as convolutions or patching opens intriguing opportunities for truly any-modal learning and AutoML applications. Moreover, HiP raises questions about the efficacy of its self-supervised approach in other domains and tasks, potentially paving the way for combination with contrastive learning and application to dense labeling challenges.
Looking ahead, the HiP architecture exemplifies an evolution of Perceiver-based models toward efficient, flexible learning structures capable of handling increasingly large and complex datasets with minimal manual preprocessing. As AI systems are tasked with more intricate and diverse data, architectures like HiP may guide the development of robust, scalable, and versatile AI tools.