
Perception Encoder: The best visual embeddings are not at the output of the network (2504.13181v2)

Published 17 Apr 2025 in cs.CV

Abstract: We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models

Authors (18)
  1. Daniel Bolya (14 papers)
  2. Po-Yao Huang (31 papers)
  3. Peize Sun (33 papers)
  4. Jang Hyun Cho (9 papers)
  5. Andrea Madotto (65 papers)
  6. Chen Wei (72 papers)
  7. Tengyu Ma (117 papers)
  8. Jiale Zhi (4 papers)
  9. Jathushan Rajasegaran (26 papers)
  10. Hanoona Rasheed (13 papers)
  11. Junke Wang (18 papers)
  12. Marco Monteiro (3 papers)
  13. Hu Xu (87 papers)
  14. Shiyu Dong (2 papers)
  15. Nikhila Ravi (15 papers)
  16. Daniel Li (42 papers)
  17. Piotr Dollár (49 papers)
  18. Christoph Feichtenhofer (52 papers)

Summary

  • The paper demonstrates that key visual embeddings are hidden in intermediate layers rather than at the network’s output, challenging conventional design.
  • It introduces a robust contrastive pretraining approach using large-scale image-text and recaptioned video data that significantly enhances zero-shot classification and retrieval.
  • Language and spatial alignment tuning successfully extract these hidden features for specialized tasks, achieving state-of-the-art results across multiple benchmarks.

The paper "Perception Encoder: The best visual embeddings are not at the output of the network" (Bolya et al., 17 Apr 2025 ) introduces the Perception Encoder (PE) family of large-scale vision encoder models. The core finding is that while traditional vision encoders often expose task-specific features at their output layer based on their pretraining objective (e.g., classification, captioning, self-supervised learning), a strong, general visual encoder can be built solely through contrastive vision-language learning. Surprisingly, the most useful embeddings for various downstream tasks are often hidden within the intermediate layers of such a contrastively trained network, not necessarily at the final output. The paper proposes alignment tuning methods to draw out these general features, creating specialized versions of the encoder for language (\PElang{}) and spatial (\PEspat{}) tasks, while the core contrastive model (\PEcore{}) excels at zero-shot classification and retrieval.

The development of the Perception Encoder involves several key steps:

  1. Perception Encoder: Core (PEcore). The foundation is a state-of-the-art contrastive vision-language model for images and videos. Building this model required addressing scalability and data efficiency.
    • Robust Image Pretraining: The authors developed a refined image-only contrastive pretraining recipe on a large dataset (5.4B image-text pairs). Key elements included a long training schedule enabled by regularization, progressive resolution scaling (e.g., 98px to 448px), larger batch sizes (up to 131K), use of the LAMB optimizer with a high learning rate, 2D Rotary Position Embedding (RoPE), attention pooling, tuned data augmentation (heavy random cropping, color jitter, horizontal flip), and Mask Regularization (aligning masked tokens to unmasked counterparts). These modifications significantly improved zero-shot performance, particularly robustness metrics (ObjectNet, ImageNet Adversarial, etc.), indicating better generalization beyond standard ImageNet classification. The recipe demonstrated strong scaling behavior with model size (up to 1.9B parameters for the vision tower in PEcore G) and training steps. A minimal sketch of the contrastive objective and resolution schedule appears after this list.
    • Bootstrapping a Video Data Engine: To train a unified image-video encoder with limited video-text data, a video data engine was created. This engine uses an early PE-based multimodal LLM (Perception Language Model, PLM) to generate video captions. The process is enhanced by incorporating image-level frame captions and video metadata (titles, descriptions), which are then summarized by a large text-only LLM (Llama 3 70B) into concise, aligned captions suitable for contrastive training. This yields 22M recaptioned videos. A schematic of the captioning pipeline appears after this list.
    • Training with Recaptioned Videos: The image-pretrained PE encoder is then finetuned on this synthetically captioned video data. A simple frame-based approach is used: 8 uniformly sampled frames are encoded by the image encoder, averaged to create a video embedding, and aligned with the video caption using the contrastive objective. This short finetuning step significantly boosted both image (especially robustness and retrieval) and video (classification and retrieval) zero-shot performance. A sketch of the frame-averaging step appears after this list.
    • PE Video Dataset (PVD): As part of this effort, the authors release the PE Video Dataset, comprising 1M diverse videos with tags and descriptions, including a 120K subset with human-refined synthetic captions. 15K of these are designated as the PVD Benchmark for fine-grained video retrieval.
    • A Unified Encoder: Scaling the robust image pretraining and video finetuning to 2B parameters resulted in PEcore G. Smaller B- and L-scale models are trained using distillation from PEcore G. PEcore models achieve state-of-the-art zero-shot image classification (outperforming models trained on proprietary datasets like JFT-3B and WebLI), image-text retrieval, and fine-grained image classification. They also set the state of the art for video classification and are competitive in video retrieval, despite using a simpler frame-averaging approach compared to native video models. Frozen-encoder probing results also show state-of-the-art performance on ImageNet for KNN, linear, and attention probing.
  2. General Features in a Contrastive Disguise. Analyzing the intermediate layers of PEcore G revealed that strong features for tasks beyond standard zero-shot classification and retrieval exist deep within the network.
    • Layerwise Feature Analysis: By evaluating frozen features from different layers of PEcore G on tasks like visual Q&A, captioning, grounding, detection, depth estimation, and tracking, the authors show that certain intermediate layers perform as well as or better than state-of-the-art models specifically pretrained for those domains (e.g., matching AIMv2 on language tasks and DINOv2 on spatial tasks). A sketch of this layerwise probing appears after this list.
    • An Alignment Problem: The key issue is that these strong, general features diminish towards the final output layer for many tasks. This suggests that the final layers, while useful for the global contrastive objective, obfuscate the task-specific general features learned internally. This phenomenon is less pronounced for tasks closer to the original pretraining objective (zero-shot classification). The robust pretraining recipe played a crucial role in developing these strong internal features and enabling their scaling, unlike previous findings where CLIP models did not scale well for downstream tasks.
  3. Perception Encoder: Language Alignment (PElang). To expose the strong language-understanding features within PEcore, the authors developed a language alignment tuning process.
    • Method: The alignment is achieved by adapting the PEcore encoder to a pretrained decoder-only LLM (Llama 3) via a vision projector (a 2-layer MLP). Training involves a warmup stage (projector only) followed by finetuning all parameters (projector and LLM) using an autoregressive next-token prediction loss on a diverse dataset (70M samples, including images, documents, charts, and videos). Ablations showed that using a larger LLM (3B vs. 1B), an MLP projector, aligning to an intermediate layer (PEcore G layer 47), and applying regularization (LayerScale, DropPath) yielded the best performance. A sketch of the projector and next-token objective appears after this list.
    • Effects: This alignment successfully lifts the best-performing layer for language tasks to the last layer of the resulting PElang model. Notably, performance on tasks like grounding, even without explicit grounding data in the alignment tuning mix, was significantly improved, demonstrating that the tuning surfaces features already present internally in PEcore.
    • Results: PElang models, evaluated by connecting them to LLMs like Llama 3.1 8B or Qwen2.5 7B and finetuning on MLLM tasks (OCR/Chart/Doc QA, Visual QA, Captioning, Video Understanding, Grounding), significantly outperform other popular vision encoders across all categories and different resolutions (native or tiled). A system-level comparison (PLM-8B, using PElang G with Llama 3.1 8B) also demonstrates state-of-the-art performance among open-access MLLMs.
  4. Perception Encoder: Spatial Alignment (PEspatial). To expose the strong spatial features within PEcore, a novel spatial alignment method was developed, addressing the dichotomy where features optimal for dense prediction tasks (like detection/depth, layer ~40) differed from those for pure spatial correspondence (like tracking, layer ~30).
    • Method: The alignment combines two objectives: (1) retaining strong semantic information (including global tokens) by self-distilling from PEcore G's frozen layer-41 features via a cosine similarity loss, and (2) encouraging locality by distilling spatial correspondence from Segment Anything Model 2.1 (SAM 2.1) mask logits. Instead of using SAM's internal features (which also carry global tokens), the authors leverage the dense H × W × 1024 mask-logit output for a grid of 1024 query points, which inherently captures local spatial coherence without global-token artifacts. The student model's pairwise token cosine-similarity map (HW × HW) is aligned via an MSE loss to the corresponding map derived from the SAM mask logits. Heavy regularization (MaskFeat, DropPath, LayerScale) is applied to the student during this finetuning atop PEcore G. A sketch of the combined loss appears after this list.
    • Effects: This dual alignment successfully lifts the best-layer performance for spatial tasks (detection, segmentation, depth, tracking) to the last layer of PEspatial G. Visualizations confirm that PEspatial's last-layer features retain the semantics of PEcore while achieving high-quality spatial coherence, outperforming models aligned only to internal PE features or only to SAM mask logits.
    • Results: PEspatial G achieves state-of-the-art performance among frozen features on dense prediction tasks (tracking, segmentation, depth). In end-to-end finetuning for object detection and segmentation (Mask R-CNN with a ViTDet backbone on COCO and LVIS), PEspatial G is state-of-the-art among vision backbones in controlled settings. At the system level (using the DETA detector on COCO with Objects365 pre-finetuning), PEspatial G matches the absolute state of the art while using a simpler decoder architecture, making it the first contrastively pretrained model to achieve this without dedicated detection data for alignment.
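
The sketches below illustrate, in simplified form, the mechanisms referenced above; all identifiers, shapes, and schedules are assumptions rather than the released implementation. First, a minimal PyTorch sketch of a CLIP-style contrastive objective together with a progressive-resolution schedule of the kind used in the robust image pretraining recipe (the stage boundaries are placeholders).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def resolution_for_step(step, schedule=((0, 98), (100_000, 224), (250_000, 336), (400_000, 448))):
    """Progressive resolution: low resolution early in training, larger later (98px -> 448px)."""
    res = schedule[0][1]
    for boundary, r in schedule:
        if step >= boundary:
            res = r
    return res
```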
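
Next, a schematic of the video data engine's captioning flow. Every callable here (frame_captioner, video_captioner, summarizer_llm) is a hypothetical stand-in for the models named above, not the released pipeline.

```python
from dataclasses import dataclass

@dataclass
class VideoRecord:
    frames: list        # decoded frames, e.g. PIL images
    title: str
    description: str

def build_training_caption(video: VideoRecord, frame_captioner, video_captioner, summarizer_llm) -> str:
    # 1. Image-level captions for individual frames.
    frame_captions = [frame_captioner(f) for f in video.frames]
    # 2. A holistic video caption from an early PE-based multimodal model (PLM in the paper).
    video_caption = video_captioner(video.frames)
    # 3. A text-only LLM (Llama 3 70B in the paper) condenses captions and metadata
    #    into one concise caption suitable for contrastive training.
    prompt = (
        "Summarize the following into one concise video caption.\n"
        f"Title: {video.title}\nDescription: {video.description}\n"
        f"Frame captions: {frame_captions}\nVideo caption: {video_caption}"
    )
    return summarizer_llm(prompt)
```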
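
A sketch of the frame-averaged video embedding used during video finetuning, assuming an image_encoder that maps a batch of frames to one embedding per frame; the resulting video embedding is trained against the caption with the same contrastive loss as above.

```python
import torch

def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> torch.Tensor:
    """Indices of `num_samples` frames spaced uniformly across the clip."""
    return torch.linspace(0, total_frames - 1, num_samples).long()

def video_embedding(frames: torch.Tensor, image_encoder) -> torch.Tensor:
    """frames: [T, 3, H, W] -> [D]. Encode 8 uniformly sampled frames and average."""
    idx = uniform_frame_indices(frames.shape[0], 8)
    frame_emb = image_encoder(frames[idx])       # [8, D], one embedding per frame
    return frame_emb.mean(dim=0)                 # [D] video embedding
```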
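
A sketch of the layerwise analysis: hook every transformer block of the frozen encoder, cache its output, and train a lightweight probe per layer on a downstream task. It assumes each block returns a plain tensor of tokens.

```python
import torch
import torch.nn as nn

def collect_layer_features(encoder: nn.Module, blocks, images: torch.Tensor):
    """Run the frozen encoder once and return {layer_index: token features}."""
    feats, hooks = {}, []
    for i, blk in enumerate(blocks):
        hooks.append(blk.register_forward_hook(
            lambda module, inputs, output, i=i: feats.__setitem__(i, output.detach())))
    with torch.no_grad():
        encoder(images)
    for h in hooks:
        h.remove()
    return feats                                  # {layer: [B, tokens, D]}

def linear_probe(dim: int, num_classes: int) -> nn.Module:
    """Per-layer probe trained on pooled features; the best-scoring layer shows where the task-relevant embedding lives."""
    return nn.Linear(dim, num_classes)
```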
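
A sketch of the language alignment setup: tokens from an intermediate PE layer pass through a 2-layer MLP projector and are prepended to the text embeddings of a decoder-only LLM, which is trained with next-token prediction. The HuggingFace-style llm(inputs_embeds=..., labels=...) call and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP mapping vision tokens into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vision_dim, llm_dim),
                                 nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, vision_tokens):             # [B, N, vision_dim]
        return self.mlp(vision_tokens)            # [B, N, llm_dim]

def language_alignment_loss(vision_tokens, text_ids, projector, llm, text_embedder):
    vis_embeds = projector(vision_tokens)                       # [B, N, D_llm]
    txt_embeds = text_embedder(text_ids)                        # [B, T, D_llm]
    inputs = torch.cat([vis_embeds, txt_embeds], dim=1)
    # Only text positions carry next-token targets; vision positions are ignored (-100).
    ignore = torch.full(vis_embeds.shape[:2], -100,
                        dtype=text_ids.dtype, device=text_ids.device)
    labels = torch.cat([ignore, text_ids], dim=1)
    return llm(inputs_embeds=inputs, labels=labels).loss        # autoregressive NLL
```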
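
Finally, a sketch of the two-part spatial alignment objective: cosine alignment of student tokens to frozen intermediate PEcore features, plus an MSE loss matching the student's pairwise token-similarity map to one derived from SAM 2.1 mask logits over a grid of query points. The shapes, the way the SAM-side similarity target is formed, and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_alignment_loss(student_tokens,   # [B, HW, D] last-layer student tokens
                           teacher_tokens,   # [B, HW, D] frozen PEcore intermediate-layer tokens
                           sam_mask_logits,  # [B, HW, Q] SAM mask logits for Q query points
                           w_corr: float = 1.0):
    # (1) Semantic self-distillation: pull student tokens toward the frozen teacher.
    distill = 1.0 - F.cosine_similarity(student_tokens, teacher_tokens, dim=-1).mean()

    # (2) Spatial correspondence: match pairwise cosine-similarity maps (HW x HW),
    #     student tokens vs. a target computed from the SAM mask logits.
    s = F.normalize(student_tokens, dim=-1)
    m = F.normalize(sam_mask_logits, dim=-1)
    student_sim = s @ s.transpose(1, 2)       # [B, HW, HW]
    sam_sim = m @ m.transpose(1, 2)           # [B, HW, HW]
    correspondence = F.mse_loss(student_sim, sam_sim)

    return distill + w_corr * correspondence
```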

In conclusion, the Perception Encoder family demonstrates that simple contrastive vision-language pretraining, when scaled and refined with robust techniques and a powerful video data engine, can learn highly general visual embeddings. The key to unlocking the full potential of these embeddings for diverse downstream tasks is the introduction of dedicated language and spatial alignment tuning stages, which effectively surface strong, task-relevant features from intermediate layers to the network's output. The authors release models, code, and the PE Video Dataset to foster further research.