
Perceiver Encoder: Scalable, Modality-Agnostic Model

Updated 13 April 2026
  • The Perceiver encoder is a neural module that compresses large, diverse inputs into a fixed-size latent space using a cross-attention bottleneck followed by self-attention layers.
  • It decouples input size from model depth, offering computational efficiency and scalability for high-resolution images, audio, multimodal data, and relational graphs.
  • It enables uniform processing across various modalities, facilitating zero-, few-, and multi-task learning through a shared, expressive latent representation.

A Perceiver encoder is a neural module that projects arbitrarily large, modality-agnostic input arrays into a fixed-size latent representation using a combination of cross-attention and self-attention operations. This design creates a computational bottleneck that decouples input size from model depth, which lets expressive attention-based architectures scale to previously intractable domains such as raw images, audio, multimodal data, relational graphs, and high-resolution signals (Jaegle et al., 2021, Jaegle et al., 2021, Meyer et al., 22 Oct 2025, Carreira et al., 2022, Zhu et al., 2021).

1. Core Architecture and Principles

The Perceiver encoder operates in two primary stages:

  1. Cross-attention bottleneck: A learnable latent array $Z \in \mathbb{R}^{N \times D}$, with $N \ll M$ ($M$ = input token count), attends to the input array $X \in \mathbb{R}^{M \times C}$. This drastically reduces the effective input dimensionality by mapping the entire input into a compact latent space.
  2. Latent self-attention stack: The latent array of $N$ vectors is refined by $L$ layers of standard Transformer blocks (self-attention plus MLP), repeatedly updating the latents without requiring further input access.

After encoding, the resulting $N \times D$ latent array provides a summary embedding of the input that can be directly consumed by task decoders (Jaegle et al., 2021).
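The key property of the first stage, that the output size depends on the number of latents and not on the number of input tokens, can be checked with a minimal NumPy sketch. This is an illustration under stated assumptions, not the reference implementation: the function names (`softmax`, `cross_attend`), the single attention head, the random weights, and all sizes are hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(Z, X, W_q, W_k, W_v):
    """Latents Z (N, D) query the inputs X (M, C); the result has N rows."""
    Q = Z @ W_q                                # (N, d) queries from latents
    K = X @ W_k                                # (M, d) keys from inputs
    V = X @ W_v                                # (M, d) values from inputs
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N, M) attention logits
    return softmax(scores, axis=-1) @ V        # (N, d)

rng = np.random.default_rng(0)
C, N, D, d = 16, 8, 32, 32                     # illustrative sizes (assumed)
Z = 0.02 * rng.normal(size=(N, D))             # stand-in for learned latents
W_q = rng.normal(size=(D, d)) / np.sqrt(D)
W_k = rng.normal(size=(C, d)) / np.sqrt(C)
W_v = rng.normal(size=(C, d)) / np.sqrt(C)

# Two inputs with very different token counts map to the same latent shape.
shapes = {}
for M in (100, 5000):
    X = rng.normal(size=(M, C))
    shapes[M] = cross_attend(Z, X, W_q, W_k, W_v).shape
print(shapes)
```

Both entries of `shapes` are `(N, d)`, so everything downstream of the bottleneck sees a fixed-size array regardless of `M`.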

2. Mathematical Formulation

Cross-attention step:

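Let $X \in \mathbb{R}^{M \times C}$, $Z \in \mathbb{R}^{N \times D}$, projections $W_Q \in \mathbb{R}^{D \times d}$, $W_K \in \mathbb{R}^{C \times d}$, $W_V \in \mathbb{R}^{C \times d}$.

$$Q = Z W_Q, \quad K = X W_K, \quad V = X W_V, \qquad \mathrm{CrossAttn}(Z, X) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

Latent self-attention block (for each of $L$ layers):

$$Z \leftarrow Z + \mathrm{SelfAttn}(\mathrm{LayerNorm}(Z)), \qquad Z \leftarrow Z + \mathrm{MLP}(\mathrm{LayerNorm}(Z))$$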

Latents are randomly initialized parameters, e.g., drawn from a truncated normal distribution $\mathcal{N}(0, \sigma^2)$ with small standard deviation $\sigma$ (Jaegle et al., 2021, Jaegle et al., 2021, Meyer et al., 22 Oct 2025).
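A rough sketch of the latent initialization and the latent self-attention stack follows. Several details are assumptions made for the example and are not taken from the text above: the standard deviation of 0.02 and truncation at two standard deviations, the pre-norm residual layout, single-head attention, a ReLU MLP, and every function and parameter name.

```python
import numpy as np

def truncated_normal(rng, shape, std=0.02, bound=2.0):
    """Sample N(0, std^2) truncated at +/- bound standard deviations (assumed values)."""
    x = rng.normal(size=shape)
    bad = np.abs(x) > bound
    while bad.any():                       # resample out-of-range entries
        x[bad] = rng.normal(size=int(bad.sum()))
        bad = np.abs(x) > bound
    return std * x

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalization without learned scale/shift, for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_block(rng, D, hidden):
    s = 1.0 / np.sqrt(D)
    return {
        "W_q": s * rng.normal(size=(D, D)),
        "W_k": s * rng.normal(size=(D, D)),
        "W_v": s * rng.normal(size=(D, D)),
        "W_o": s * rng.normal(size=(D, D)),
        "W1": s * rng.normal(size=(D, hidden)),
        "W2": rng.normal(size=(hidden, D)) / np.sqrt(hidden),
    }

def latent_block(Z, p):
    """One pre-norm Transformer block acting only on the (N, D) latents."""
    H = layer_norm(Z)
    Q, K, V = H @ p["W_q"], H @ p["W_k"], H @ p["W_v"]
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))      # (N, N) latent attention
    Z = Z + (A @ V) @ p["W_o"]                       # residual self-attention
    H = layer_norm(Z)
    Z = Z + np.maximum(H @ p["W1"], 0.0) @ p["W2"]   # residual MLP (ReLU)
    return Z

rng = np.random.default_rng(0)
N, D, L = 8, 32, 4                                   # illustrative sizes
Z0 = truncated_normal(rng, (N, D))                   # initial latent array
blocks = [init_block(rng, D, 4 * D) for _ in range(L)]
Z = Z0
for p in blocks:
    Z = latent_block(Z, p)                           # shape stays (N, D)
```

No step in the loop touches the input array, which is why the cost of the stack depends on `N` and `L` only.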

3. Computational Complexity and Scaling

Let $D$ be the feature/channel size, $M$ the number of input tokens, $N$ the number of latents, and $L$ the number of latent self-attention layers:

  • Cross-attention cost: $O(MND)$ (dominant for large $M$)
  • Self-attention over latents: $O(LN^2D)$

For image-level settings ($M \gg N$) this yields a total cost of $O(MND + LN^2D)$, which scales linearly in the number of input tokens. This decoupling contrasts with standard Transformer architectures, where self-attention over the inputs has $O(M^2)$ cost per layer (Jaegle et al., 2021).
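To make the scaling concrete, the operation counts can be compared directly. The helper names and the example sizes below are hypothetical, and constant factors are ignored, so the numbers are only order-of-magnitude estimates of attention cost.

```python
def perceiver_cost(M, N, D, L):
    """Return (cross-attention, latent self-attention) operation counts."""
    cross = M * N * D          # latents attend once to all M inputs
    latent = L * N * N * D     # L layers of self-attention over N latents
    return cross, latent

def transformer_cost(M, D, L):
    """Self-attention directly over M input tokens for L layers."""
    return L * M * M * D

# Illustrative sizes (assumed): one token per pixel of a 224 x 224 image.
M, N, D, L = 224 * 224, 512, 1024, 8
cross, latent = perceiver_cost(M, N, D, L)
full = transformer_cost(M, D, L)

print(f"cross-attention:  {cross:.3e}")
print(f"latent stack:     {latent:.3e}")
print(f"full transformer: {full:.3e}")
print(f"ratio:            {full / (cross + latent):.1f}x")
```

Doubling `M` doubles the cross-attention term and leaves the latent term unchanged, while the full-Transformer count quadruples.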

4. Extensions and Architectural Variants

Uni-Perceiver

The Uni-Perceiver encoder applies the Perceiver-style design to unified multimodal and multi-task representation. All inputs and candidate outputs from arbitrary modalities (text, images, video) are tokenized through lightweight, modality-specific tokenizers and processed by a single Transformer encoder, mapping them to a unified representation space. All tasks are converted to joint input-target likelihood maximization via cosine similarity in that space, enabling flexible zero-shot, few-shot, and fully supervised settings without task-specific heads (Zhu et al., 2021).
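The cosine-similarity formulation can be sketched as scoring an encoded input against a set of encoded candidate targets. In this sketch the random vectors stand in for encoder outputs, and the function name, the temperature value, and the softmax normalization over candidates are assumptions made for illustration.

```python
import numpy as np

def cosine_scores(f_x, f_targets, temperature=0.05):
    """Probabilities over candidates from cosine similarity in a shared space.

    f_x:        (D,) representation of the input
    f_targets:  (K, D) representations of K candidate targets
    """
    f_x = f_x / np.linalg.norm(f_x)
    f_t = f_targets / np.linalg.norm(f_targets, axis=-1, keepdims=True)
    logits = (f_t @ f_x) / temperature     # scaled cosine similarities
    logits = logits - logits.max()         # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
D, K = 64, 5
f_x = rng.normal(size=D)                   # stand-in for an encoded input
f_targets = rng.normal(size=(K, D))        # stand-ins for encoded candidates
f_targets[2] = 3.0 * f_x                   # candidate 2 points the same way

probs = cosine_scores(f_x, f_targets)
best = int(np.argmax(probs))               # index of the best-matching target
```

Because every task reduces to ranking candidates in one representation space, the same scoring function serves classification, retrieval, and similar tasks without a task-specific head.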

RELATE

RELATE demonstrates schema-agnostic Perceiver encoders in graph-structured relational data. Per-node sets of multimodal, heterogeneously-typed column embeddings are compressed through a cross-attention bottleneck (latents to feature set), followed by self-attention over shared latents, to yield fixed-size, permutation-invariant node embeddings. RELATE maintains high efficiency and parameter sharing across varying data schemas and modalities (Meyer et al., 22 Oct 2025).

Hierarchical Perceiver

The Hierarchical Perceiver (HiP) inserts fine-grained locality and hierarchy: input tokens are grouped, with each group attended by local latent arrays, followed by hierarchy-wise reduction (merging groups, increasing channel size, reducing token counts), permitting scaling to multimillion-token settings. Masked auto-encoding enables training of locally meaningful positional embeddings for high-resolution domains (Carreira et al., 2022).

5. Input Representation and Embedding

Inputs must be embedded into vectors suitable for cross-attention. Strategies include:

  • Spatial/temporal data: Concatenation or addition of 1D/2D/3D Fourier feature positional encodings (with a fixed number of frequency bands per axis), often with learned modulation for different modalities (Jaegle et al., 2021).
  • Categorical/numerical/text/temporal (RELATE): Modality-specific encoders map each feature to a shared embedding space of fixed dimension, with conditioning on column-level metadata via MLP or gating (Meyer et al., 22 Oct 2025).
  • Multi-modal/cross-task: Learned modality tag vectors appended or added to enable attention mechanisms to distinguish and integrate modality-specific context (Zhu et al., 2021).

After embedding, the token array $X \in \mathbb{R}^{M \times C}$ feeds directly into the Perceiver encoder's cross-attention step.
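As an illustration of the first strategy, the sketch below builds 2D Fourier position features for a small image and concatenates them with the pixel values to form the token array. The frequency spacing (linear from 1 up to half of `max_freq`), the inclusion of the raw coordinate, the function name, and all sizes are assumptions of this example, not a specification.

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    """Encode coordinates in [-1, 1] with sin/cos at num_bands frequencies.

    pos: (M, n_axes) array. Returns (M, n_axes * (2 * num_bands + 1)).
    """
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)   # (K,)
    angles = np.pi * pos[..., None] * freqs               # (M, n_axes, K)
    feats = np.concatenate(
        [pos[..., None], np.sin(angles), np.cos(angles)], axis=-1
    )                                                     # (M, n_axes, 2K + 1)
    return feats.reshape(pos.shape[0], -1)

H, W, K = 8, 8, 4                                         # tiny image, 4 bands
ys, xs = np.meshgrid(
    np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij"
)
pos = np.stack([ys, xs], axis=-1).reshape(-1, 2)          # (H*W, 2) coordinates
pixels = np.random.default_rng(0).random((H * W, 3))      # stand-in RGB values

# Token array: one row per pixel, channels = RGB + position features.
X = np.concatenate([pixels, fourier_features(pos, K, max_freq=H)], axis=-1)
print(X.shape)
```

Here each token has 3 colour channels plus 2 × (2 × 4 + 1) = 18 position channels, and the resulting array is what the cross-attention step consumes.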

6. Applications and Empirical Evaluation

The Perceiver encoder and its variants demonstrate competitive or superior performance in:

| Domain | Example Tasks | Performance/Notes |
|---|---|---|
| Vision | ImageNet classification, segmentation | Comparable to or better than ResNet-50, ViT, DeiT, Perceiver IO: 79–81% top-1 at 224² (Jaegle et al., 2021, Carreira et al., 2022) |
| Multimodal | AudioSet, raw video + audio | Flexible, no task-specific adaptation required. AudioSet: 41.3–43.8 mAP (Carreira et al., 2022) |
| Large-scale dense signals | Hi-res images, video | HiP-16 runs ∼3–4× faster than Perceiver IO at high resolution, feasible up to 1024² pixels (Carreira et al., 2022) |
| Relational Graphs | Node encoding for GNNs | RELATE achieves within 3% of schema-specific encoders, up to 5× smaller (Meyer et al., 22 Oct 2025) |
| Unified Perception | Multi-task zero/few-shot | Uni-Perceiver delivers strong zero-shot and prompt-tuned performance across new tasks and modalities (Zhu et al., 2021) |

7. Significance and Theoretical Properties

The Perceiver encoder enables:

  • Decoupling token count and depth: Because of the cross-attention bottleneck, large input arrays, whether highly structured or unstructured, are summarized at a cost linear in input size, while the depth of the expressive Transformer processing over the latents is chosen independently of input size (Jaegle et al., 2021, Jaegle et al., 2021).
  • Modality-agnostic processing: All input modalities—images, text, tabular, graphs—can be handled with uniform encoder logic and parameter sharing, facilitating efficient multi-task and multi-modal learning (Zhu et al., 2021, Meyer et al., 22 Oct 2025).
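  • Permutation-invariance: The encoder is invariant to reordering of the input tokens in the cross-attended set, so any order information must be carried by the token features themselves (e.g., positional encodings). This property is critical for applications like relational graphs and tabular data (Meyer et al., 22 Oct 2025).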
  • Parameter efficiency and scalability: By compressing large, variable-length inputs to a fixed-length latent array, parameter and compute requirements are controlled, and the architecture proves extensible to multi-million-token regimes (Carreira et al., 2022).
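The permutation-invariance property listed above can be verified numerically: shuffling the rows of the input array leaves the cross-attention output unchanged up to floating-point rounding. The single-head attention function, random weights, and sizes below are hypothetical and serve only to demonstrate the property.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(Z, X, W_q, W_k, W_v):
    """Latents Z (N, D) attend to the set of input tokens X (M, C)."""
    Q, K, V = Z @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (N, M)
    return weights @ V                                           # sum over tokens

rng = np.random.default_rng(0)
M, C, N, D = 1000, 16, 8, 32                    # illustrative sizes
X = rng.normal(size=(M, C))
Z = 0.02 * rng.normal(size=(N, D))
W_q = rng.normal(size=(D, D)) / np.sqrt(D)
W_k = rng.normal(size=(C, D)) / np.sqrt(C)
W_v = rng.normal(size=(C, D)) / np.sqrt(C)

perm = rng.permutation(M)                       # random reordering of tokens
out = cross_attend(Z, X, W_q, W_k, W_v)
out_shuffled = cross_attend(Z, X[perm], W_q, W_k, W_v)

same = np.allclose(out, out_shuffled)           # equal up to rounding error
print(same)
```

Each latent computes a weighted sum over all tokens, and a sum does not depend on the order of its terms, which is the source of the invariance.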

A plausible implication is that the Perceiver encoder architecture is a general mechanism for structured attention-based “bottlenecking” suitable for broad classes of perceptual, relational, and multi-modal problems previously inaccessible to standard Transformer architectures (Jaegle et al., 2021, Jaegle et al., 2021, Meyer et al., 22 Oct 2025, Carreira et al., 2022, Zhu et al., 2021).
