- The paper introduces a resolution-agnostic upsampler that uses cross-attention to transform low-resolution encoder features into any target resolution, leveraging high-res image guidance.
- It employs a self-supervised, task-agnostic training strategy without high-resolution labels, achieving effective generalization from low- to high-resolution inference.
- JAFAR outperforms existing upsampling methods on dense prediction tasks such as semantic segmentation, depth estimation, and open-vocabulary segmentation, demonstrating its practical value.
Foundation vision encoders are powerful tools for various vision tasks, but they typically output feature maps at a significantly lower resolution than the input image (e.g., 1/14th or 1/16th of the spatial dimensions). This spatial compression is necessary for handling high-resolution inputs efficiently, especially with attention mechanisms, but it poses a challenge for dense prediction tasks like semantic segmentation and depth estimation that require pixel-level detail. JAFAR (2506.11136) is introduced as a lightweight and flexible feature upsampler designed to address this bottleneck. It can take low-resolution features from any foundation vision encoder and upsample them to an arbitrary target resolution, guided by the original high-resolution input image.
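To make the resolution gap concrete, here is a minimal sketch; the 448×448 input, patch size 14, and 768-dimensional features are assumed ViT-style values chosen for illustration, not settings prescribed by the paper:

```python
import torch

H = W = 448        # assumed input resolution (illustrative)
patch = 14         # assumed ViT patch size, i.e. the encoder's spatial stride
C = 768            # assumed feature dimension

h_lr, w_lr = H // patch, W // patch          # 32 x 32 feature grid
feats_lr = torch.randn(1, C, h_lr, w_lr)     # shape of the frozen encoder's output

# Each feature vector summarizes a 14x14 pixel block, so pixel-level tasks
# need some form of upsampling back to (H, W).
print(feats_lr.shape)                        # torch.Size([1, 768, 32, 32])
```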
Existing methods to tackle this include:
- Training-free interpolation (e.g., bilinear): computationally cheap, but produces blurry outputs because it ignores the high-resolution image content (see the snippet after this list).
- Processing high-resolution inputs directly: Feeding a larger image into the encoder increases feature resolution but is computationally expensive due to attention's quadratic complexity and can introduce artifacts.
- Learned upsamplers:
- Task-dependent: Trained with task-specific labels (e.g., segmentation masks). Lightweight but lack generalization across tasks (e.g., CARAFE (Wang et al., 2019), DySample (Liu et al., 2023), SAPA (Lu et al., 2022), ReSFU).
- Task-agnostic: Trained independently of downstream tasks. LiFT (Suri et al., 2024) uses a CNN for fixed 2x upsampling. FeatUp (Fu et al., 2024) supports higher ratios, but its implicit variant is slow (per-image optimization) and its JBU variant can be blurry.
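For reference, the training-free baseline mentioned above amounts to a one-line resize of the feature map; a minimal sketch on a toy tensor:

```python
import torch
import torch.nn.functional as F

feats_lr = torch.randn(1, 768, 32, 32)   # toy low-resolution encoder features

# Training-free baseline: bilinear resize back to pixel resolution.
# Cheap, but it never looks at the high-resolution image, so object
# boundaries in the upsampled features stay blurry.
feats_up = F.interpolate(feats_lr, size=(448, 448),
                         mode="bilinear", align_corners=False)
print(feats_up.shape)                    # torch.Size([1, 768, 448, 448])
```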
JAFAR's key contributions and practical implications are:
- Arbitrary Resolution Upsampling: Unlike fixed-scale methods, JAFAR uses a cross-attention mechanism formulated as a global interpolation, allowing it to upsample to any desired output resolution. This is achieved by designing the architecture and training process to be resolution-agnostic.
- Leverages High-Resolution Guidance: It uses the input image as a high-resolution guide to inform the upsampling process, leading to sharper, boundary-aligned features compared to methods relying solely on low-resolution features.
- Task-Agnostic Training: Trained without any high-resolution ground truth or task-specific labels, using a simple objective based on multi-resolution views of the same image. This makes it broadly applicable as a drop-in module (a usage sketch follows this list).
- Generalization from Low-Resolution Training: A notable finding is that training JAFAR on low upsampling factors at low resolutions (e.g., 8x8 to 32x32 features) generalizes effectively to significantly higher output scales (e.g., 32x32 to 448x448) during inference, reducing training memory requirements.
- Improved Downstream Performance: JAFAR consistently outperforms existing feature upsamplers when used as a drop-in module for various dense prediction tasks, including semantic segmentation, depth estimation, CAM evaluation, and bird's-eye-view segmentation.
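As a rough illustration of the drop-in usage, the sketch below shows where an upsampler like JAFAR would sit between a frozen encoder and a per-pixel head; `encoder`, `jafar`, and `linear_probe` are hypothetical placeholders, not the paper's API:

```python
import torch

# Hypothetical components, named for illustration only.
# encoder:      frozen foundation vision encoder, image -> (B, C, h_k, w_k)
# jafar:        pretrained upsampler, (image, low-res feats) -> (B, C, H, W)
# linear_probe: lightweight per-pixel head trained on the upsampled features

def dense_predict(image, encoder, jafar, linear_probe):
    with torch.no_grad():
        feats_lr = encoder(image)            # e.g. (B, C, 32, 32)
        feats_hr = jafar(image, feats_lr)    # e.g. (B, C, 448, 448), image-guided
    return linear_probe(feats_hr)            # e.g. per-pixel class logits
```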
Implementation Details & Architecture
JAFAR's architecture takes a high-resolution image $I \in \mathbb{R}^{3 \times H \times W}$ and low-resolution features $F_{lr} \in \mathbb{R}^{C \times h_k \times w_k}$ from a frozen vision encoder $f$. The core mechanism involves generating high-resolution queries and low-resolution keys from a shared intermediate image representation.
- Image Encoding: The input image $I$ is processed by a lightweight encoder $E_\theta$ to produce an intermediate representation $I_E \in \mathbb{R}^{d \times H \times W}$. RoPE positional embeddings (Su et al., 2021) are added to $I_E$.
- Query Branch: Query features $Q \in \mathbb{R}^{d \times h_q \times w_q}$ are derived from $I_E$ via a query encoder (producing $I_Q$) and then downsampled with adaptive average pooling to the target output resolution $(h_q \times w_q)$. This pooling is only applied during training, to keep queries semantically aligned with the keys. At inference, queries can be generated at the desired high resolution.
- Key Branch: Preliminary key features $\tilde{K} \in \mathbb{R}^{d \times h_k \times w_k}$ are also derived from $I_E$ via a key encoder (producing $I_K$) and downsampled to match the spatial resolution of $F_{lr}$. These preliminary keys $\tilde{K}$ are then modulated by the low-resolution encoder features $F_{lr}$ using a Spatial Feature Transform (SFT) (Wang et al., 2018) to create the final keys $K$:
$K = \gamma_F \odot \tilde{K} + \beta_F$
where $\gamma_F, \beta_F \in \mathbb{R}^{d \times h_k \times w_k}$ are modulation maps predicted from $F_{lr}$ via linear projections and applied element-wise. This modulation injects high-level semantic context into the keys.
- Similarity-Based Upsampling: A cross-attention mechanism computes an attention map $A$ between queries $Q$ and keys $K$:
$A = \text{Softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d}}\right)$
Both $Q$ and $K$ are enriched with RoPE. Multiple attention heads are used, and their softmax scores are averaged. The low-resolution features $F_{lr}$ are then interpolated with the attention map $A$ via a matrix product to produce the upsampled features $\hat{F}_{HR}$:
$\hat{F}_{HR} = A \cdot F_{lr}$
Crucially, there is no learned value projection, which preserves the original feature content and enables the resolution-agnostic design. A minimal sketch of this step follows below.
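To make the interpolation view concrete, here is a minimal single-head sketch on toy tensors (shapes chosen arbitrarily; with multiple heads, the softmax maps would be averaged before the final product). Each row of $A$ sums to one, so every upsampled feature is a convex combination of the original low-resolution features:

```python
import torch

d, C = 64, 768         # assumed key/query width and feature dimension
hq = wq = 128          # target (query) resolution
hk = wk = 32           # low-resolution (key) grid

Q = torch.randn(hq * wq, d)        # flattened high-res queries (image branch)
K = torch.randn(hk * wk, d)        # flattened SFT-modulated keys
F_lr = torch.randn(hk * wk, C)     # flattened frozen-encoder features

A = torch.softmax(Q @ K.T / d**0.5, dim=-1)   # (hq*wq, hk*wk); each row sums to 1
F_hr = A @ F_lr                               # no value projection: a pure re-weighting of F_lr
print(F_hr.view(hq, wq, C).shape)             # torch.Size([128, 128, 768])
```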
A simplified architectural flow:
```
Input:  high-res image I, low-res features F_lr (from the frozen encoder f)

1. Image encoding:
   I_E = LightweightEncoder(I) + RoPE

2. Query branch:
   I_Q = QueryEncoder(I_E)
   Q   = AdaptiveAvgPool(I_Q)        # pool to target resolution (h_q, w_q); training only

3. Key branch:
   I_K = KeyEncoder(I_E)
   K_tilde = Downsample(I_K)         # match F_lr spatial resolution (h_k, w_k)
   gamma_F, beta_F = LinearProjections(F_lr)
   K = gamma_F * K_tilde + beta_F    # SFT modulation

4. Similarity-based upsampling:
   A = Softmax((Q @ K.T) / sqrt(d))
   F_HR_hat = A @ F_lr

Output: upsampled high-res features F_HR_hat
```
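A hedged PyTorch sketch of this flow is shown below. The layer widths, the plain convolutions used for the lightweight encoders, and the bilinear downsampling in the key branch are assumptions for illustration; RoPE and multi-head averaging are omitted for brevity, and this is not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JAFARLikeUpsampler(nn.Module):
    """Illustrative sketch of the JAFAR-style flow, not the official code."""
    def __init__(self, feat_dim: int, d: int = 64):
        super().__init__()
        self.image_encoder = nn.Sequential(        # E_theta: lightweight image encoder
            nn.Conv2d(3, d, 3, padding=1), nn.GELU(),
            nn.Conv2d(d, d, 3, padding=1),
        )
        self.query_head = nn.Conv2d(d, d, 1)       # query encoder
        self.key_head = nn.Conv2d(d, d, 1)         # key encoder
        self.to_gamma = nn.Conv2d(feat_dim, d, 1)  # SFT scale predicted from F_lr
        self.to_beta = nn.Conv2d(feat_dim, d, 1)   # SFT shift predicted from F_lr
        self.d = d

    def forward(self, image, feats_lr, out_size):
        B, C, hk, wk = feats_lr.shape
        I_E = self.image_encoder(image)                            # (B, d, H, W)

        # Query branch: queries at the requested output size
        # (small during training, up to image resolution at inference).
        Q = F.adaptive_avg_pool2d(self.query_head(I_E), out_size)  # (B, d, hq, wq)

        # Key branch: downsample to the feature grid, then SFT-modulate with F_lr.
        K_tilde = F.interpolate(self.key_head(I_E), size=(hk, wk),
                                mode="bilinear", align_corners=False)
        K = self.to_gamma(feats_lr) * K_tilde + self.to_beta(feats_lr)

        # Similarity-based upsampling: softmax(QK^T / sqrt(d)) @ F_lr, no value projection.
        q = Q.flatten(2).transpose(1, 2)                 # (B, hq*wq, d)
        k = K.flatten(2).transpose(1, 2)                 # (B, hk*wk, d)
        v = feats_lr.flatten(2).transpose(1, 2)          # (B, hk*wk, C)
        A = torch.softmax(q @ k.transpose(1, 2) / self.d**0.5, dim=-1)
        return (A @ v).transpose(1, 2).reshape(B, C, *out_size)
```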
Training Pipeline
JAFAR is trained without high-resolution ground truth. It uses a self-supervised approach based on multi-resolution views. Given a high-resolution image $I_{HR}$, a downsampled version $I_{LR}$ is created with a random factor $\delta \in [2, 4]$. Features are extracted from both using the frozen encoder: $F_{hr} = f(I_{HR})$ and $F_{lr} = f(I_{LR})$. JAFAR takes $I_{HR}$ and $F_{lr}$ as input to predict $\hat{F}_{hr}$. The training objective is an alignment loss between $\hat{F}_{hr}$ and $F_{hr}$:
$\mathcal{L}(\hat{F}_{hr}, F_{hr}) = 1 - \cos(\hat{F}_{hr}, F_{hr}) + \|\hat{F}_{hr} - F_{hr}\|_2$
This setup teaches JAFAR to predict higher-resolution features ($F_{hr}$) from lower-resolution ones ($F_{lr}$), guided by the original image ($I_{HR}$), without ever running the encoder on inputs larger than the original training images to generate targets.
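A minimal sketch of one training step under this objective; `encoder` and `upsampler` are placeholders (a real ViT encoder may additionally require input sizes divisible by its patch size), and the exact reduction of the L2 term is an assumption:

```python
import random
import torch
import torch.nn.functional as F

def training_step(image_hr, encoder, upsampler, optimizer):
    delta = random.uniform(2.0, 4.0)                       # random downsampling factor
    image_lr = F.interpolate(image_hr, scale_factor=1.0 / delta,
                             mode="bilinear", align_corners=False)

    with torch.no_grad():                                  # the encoder stays frozen
        F_hr = encoder(image_hr)                           # higher-res target features
        F_lr = encoder(image_lr)                           # lower-res input features

    F_hr_hat = upsampler(image_hr, F_lr, out_size=F_hr.shape[-2:])

    # Alignment loss: cosine term plus an L2 term (reduction assumed: per-pixel mean).
    cos = F.cosine_similarity(F_hr_hat, F_hr, dim=1).mean()
    loss = (1.0 - cos) + (F_hr_hat - F_hr).norm(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```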
Experimental Results and Applications
JAFAR was evaluated by pre-training the upsampler, freezing it, and then training a lightweight linear probe on the upsampled features for each task.
- Semantic Segmentation (Table 1): JAFAR significantly outperforms all baselines (training-free, task-dependent, task-agnostic) across COCO-Stuff, Pascal VOC, ADE20K, and Cityscapes datasets, achieving the highest mIoU and accuracy. This demonstrates its ability to recover fine-grained spatial details critical for dense prediction.
- Depth Estimation (Table 1): Trained on pseudo-labels from Depth Anything V2 (Yang et al., 2024), JAFAR achieves competitive results, ranking second among all methods despite not being specifically optimized for depth estimation.
- Class Activation Maps (Table 2): Integrating JAFAR into CAM analysis (Grad-CAM) leads to sharper, more faithful explanations. JAFAR scores highest on the aggregate ADCC metric (Poppi et al., 2021), indicating better coherency, lower complexity, and greater confidence preservation.
- Zero-Shot Open-Vocabulary Segmentation (Table 3): Using MaskCLIP (Zhou et al., 2022) with a CLIP backbone, JAFAR improves performance across VOC, ADE20K, and Cityscapes compared to baselines, showing that its upsampled features preserve semantic consistency for zero-shot transfer.
- Bird's-Eye-View Segmentation (Table 4): When integrated into complex BeV architectures (SimpleBeV (Harley et al., 2022), PointBeV (Chambon et al., 2024), BeVFormer (Li et al., 2022)), JAFAR consistently improves vehicle IoU, showing its utility even within multi-view pipelines.
Implementation Considerations
- Computational Cost: JAFAR is lightweight (0.7M parameters) compared to processing larger images directly or modifying encoder strides (Table 6). Inference time scales with output resolution due to the attention mechanism.
- Memory Usage: Memory scales significantly with output resolution during both forward and backward passes (Table 7), especially the backward pass required for training (see the back-of-the-envelope estimate after this list). Training at lower target resolutions mitigates this.
- Trade-offs: While JAFAR is faster than methods requiring per-image optimization (such as FeatUp's implicit variant), its inference time is higher than simple bilinear interpolation or lightweight CNNs like LiFT, especially at very high resolutions, because the attention matrix grows with the output resolution. Its superior feature quality generally justifies this cost.
- Scalability: The design allows scaling to arbitrary resolutions. The training strategy using low upsampling factors is key for practical training.
- Limitations & Future Work: JAFAR currently requires training a separate upsampler for each foundation backbone. Future work includes making it backbone-independent and further reducing feature-level artifacts for even sharper outputs. The ablation studies (Table 5) highlight the importance of SFT modulation for key generation and tuning the number of attention heads.
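As a back-of-the-envelope illustration of why memory grows with output resolution, the attention map alone already dominates; the numbers below assume fp32, a 32×32 key grid, and a single head, and ignore activations stored for the backward pass:

```python
# Attention-map size: (h_q * w_q) x (h_k * w_k) entries.
def attention_map_bytes(out_res, key_res=32, bytes_per_el=4):
    return (out_res * out_res) * (key_res * key_res) * bytes_per_el

for out_res in (112, 224, 448):
    gib = attention_map_bytes(out_res) / 2**30
    print(f"{out_res}x{out_res} output -> ~{gib:.2f} GiB attention map")
# 112x112 -> ~0.05 GiB, 224x224 -> ~0.19 GiB, 448x448 -> ~0.77 GiB (per head, per image)
```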
In summary, JAFAR provides a practical, high-performing, and flexible solution for the critical problem of feature upsampling in modern vision pipelines relying on foundation encoders. Its ability to generalize from low-resolution training to high-resolution inference without task-specific supervision makes it a valuable component for a wide range of dense vision applications.