- The paper introduces a resolution-agnostic upsampler that uses cross-attention to transform low-resolution encoder features into any target resolution, leveraging high-res image guidance.
- It employs a self-supervised, task-agnostic training strategy without high-resolution labels, achieving effective generalization from low- to high-resolution inference.
- JAFAR outperforms existing upsampling methods on dense prediction tasks such as semantic segmentation, depth estimation, and open-vocabulary segmentation, demonstrating its practical value.
Foundation vision encoders are powerful tools for various vision tasks, but they typically output feature maps at a significantly lower resolution than the input image (e.g., 1/14th or 1/16th of the spatial dimensions). This spatial compression is necessary for handling high-resolution inputs efficiently, especially with attention mechanisms, but it poses a challenge for dense prediction tasks like semantic segmentation and depth estimation that require pixel-level detail. JAFAR (2506.11136) is introduced as a lightweight and flexible feature upsampler designed to address this bottleneck. It can take low-resolution features from any foundation vision encoder and upsample them to an arbitrary target resolution, guided by the original high-resolution input image.
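To make the resolution gap concrete, here is a minimal sketch; the 448×448 input, patch size 14, and 768-dimensional features are assumed ViT-style values chosen for illustration, not settings prescribed by the paper:

```python
import torch

H = W = 448        # assumed input resolution (illustrative)
patch = 14         # assumed ViT patch size, i.e. the encoder's spatial stride
C = 768            # assumed feature dimension

h_lr, w_lr = H // patch, W // patch          # 32 x 32 feature grid
feats_lr = torch.randn(1, C, h_lr, w_lr)     # shape of the frozen encoder's output

# Each feature vector summarizes a 14x14 pixel block, so pixel-level tasks
# need some form of upsampling back to (H, W).
print(feats_lr.shape)                        # torch.Size([1, 768, 32, 32])
```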
Existing methods to tackle this include:
- Training-free interpolation (e.g., bilinear): computationally cheap, but produces blurry outputs because it ignores the high-resolution image content (see the snippet after this list).
- Processing high-resolution inputs directly: Feeding a larger image into the encoder increases feature resolution but is computationally expensive due to attention's quadratic complexity and can introduce artifacts.
- Learned upsamplers:
- Task-dependent: Trained with task-specific labels (e.g., segmentation masks). Lightweight but lack generalization across tasks (e.g., CARAFE (Wang et al., 2019), DySample (Liu et al., 2023), SAPA (Lu et al., 2022), ReSFU).
- Task-agnostic: Trained independently of downstream tasks. LiFT (Suri et al., 2024) uses a CNN for fixed 2x upsampling. FeatUp (Fu et al., 2024) supports higher ratios, but its implicit variant is slow (per-image optimization) and its JBU variant can be blurry.
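For reference, the training-free baseline mentioned above amounts to a one-line resize of the feature map; a minimal sketch on a toy tensor:

```python
import torch
import torch.nn.functional as F

feats_lr = torch.randn(1, 768, 32, 32)   # toy low-resolution encoder features

# Training-free baseline: bilinear resize back to pixel resolution.
# Cheap, but it never looks at the high-resolution image, so object
# boundaries in the upsampled features stay blurry.
feats_up = F.interpolate(feats_lr, size=(448, 448),
                         mode="bilinear", align_corners=False)
print(feats_up.shape)                    # torch.Size([1, 768, 448, 448])
```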
JAFAR's key contributions and practical implications are:
- Arbitrary Resolution Upsampling: Unlike fixed-scale methods, JAFAR uses a cross-attention mechanism formulated as a global interpolation, allowing it to upsample to any desired output resolution. This is achieved by designing the architecture and training process to be resolution-agnostic.
- Leverages High-Resolution Guidance: It uses the input image as a high-resolution guide to inform the upsampling process, leading to sharper, boundary-aligned features compared to methods relying solely on low-resolution features.
- Task-Agnostic Training: Trained without any high-resolution ground truth or task-specific labels, using a simple objective based on multi-resolution views of the same image. This makes it broadly applicable as a drop-in module (a usage sketch follows this list).
- Generalization from Low-Resolution Training: A notable finding is that training JAFAR on low upsampling factors at low resolutions (e.g., 8x8 to 32x32 features) generalizes effectively to significantly higher output scales (e.g., 32x32 to 448x448) during inference, reducing training memory requirements.
- Improved Downstream Performance: JAFAR consistently outperforms existing feature upsamplers when used as a drop-in module for various dense prediction tasks, including semantic segmentation, depth estimation, CAM evaluation, and bird's-eye-view segmentation.
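As a rough illustration of the drop-in usage, the sketch below shows where an upsampler like JAFAR would sit between a frozen encoder and a per-pixel head; `encoder`, `jafar`, and `linear_probe` are hypothetical placeholders, not the paper's API:

```python
import torch

# Hypothetical components, named for illustration only.
# encoder:      frozen foundation vision encoder, image -> (B, C, h_k, w_k)
# jafar:        pretrained upsampler, (image, low-res feats) -> (B, C, H, W)
# linear_probe: lightweight per-pixel head trained on the upsampled features

def dense_predict(image, encoder, jafar, linear_probe):
    with torch.no_grad():
        feats_lr = encoder(image)            # e.g. (B, C, 32, 32)
        feats_hr = jafar(image, feats_lr)    # e.g. (B, C, 448, 448), image-guided
    return linear_probe(feats_hr)            # e.g. per-pixel class logits
```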
Implementation Details & Architecture
JAFAR's architecture takes a high-resolution image $I \in \mathbb{R}^{3 \times H \times W}$ and low-resolution features $F_{lr} \in \mathbb{R}^{C \times h_k \times w_k}$ from a frozen vision encoder $f$. The core mechanism involves generating high-resolution queries and low-resolution keys from a shared intermediate image representation.
- Image Encoding: The input image $I$ is processed by a lightweight encoder $E_\theta$ to produce an intermediate representation $I_E \in \mathbb{R}^{d \times H \times W}$. RoPE positional embeddings (Su et al., 2021) are added to $I_E$.
- Query Branch: Query features $Q \in \mathbb{R}^{d \times h_q \times w_q}$ are derived from $I_E$ via a query encoder (producing $I_Q$) and then downsampled with adaptive average pooling to the target output resolution $(h_q \times w_q)$. This pooling is only applied during training, to keep queries semantically aligned with the keys. At inference, queries can be generated at the desired high resolution.
- Key Branch: Preliminary key features $\tilde{K} \in \mathbb{R}^{d \times h_k \times w_k}$ are also derived from $I_E$ via a key encoder (producing $I_K$) and downsampled to match the spatial resolution of $F_{lr}$. These preliminary keys $\tilde{K}$ are then modulated by the low-resolution encoder features $F_{lr}$ using a Spatial Feature Transform (SFT) (Wang et al., 2018) to create the final keys $K$:
$K = \gamma_F \odot \tilde{K} + \beta_F$
where $\gamma_F, \beta_F \in \mathbb{R}^{d \times h_k \times w_k}$ are modulation maps predicted from $F_{lr}$ via linear projections and applied element-wise. This modulation injects high-level semantic context into the keys.
- Similarity-Based Upsampling: A cross-attention mechanism computes an attention map $A$ between queries $Q$ and keys $K$:
$A = \text{Softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d}}\right)$
Both $Q$ and $K$ are enriched with RoPE. Multiple attention heads are used, and their softmax scores are averaged. The low-resolution features $F_{lr}$ are then interpolated with the attention map $A$ via a matrix product to produce the upsampled features $\hat{F}_{HR}$:
$\hat{F}_{HR} = A \cdot F_{lr}$
Crucially, there is no learned value projection, which preserves the original feature content and enables the resolution-agnostic design. A minimal sketch of this step follows below.
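To make the interpolation view concrete, here is a minimal single-head sketch on toy tensors (shapes chosen arbitrarily; with multiple heads, the softmax maps would be averaged before the final product). Each row of $A$ sums to one, so every upsampled feature is a convex combination of the original low-resolution features:

```python
import torch

d, C = 64, 768         # assumed key/query width and feature dimension
hq = wq = 128          # target (query) resolution
hk = wk = 32           # low-resolution (key) grid

Q = torch.randn(hq * wq, d)        # flattened high-res queries (image branch)
K = torch.randn(hk * wk, d)        # flattened SFT-modulated keys
F_lr = torch.randn(hk * wk, C)     # flattened frozen-encoder features

A = torch.softmax(Q @ K.T / d**0.5, dim=-1)   # (hq*wq, hk*wk); each row sums to 1
F_hr = A @ F_lr                               # no value projection: a pure re-weighting of F_lr
print(F_hr.view(hq, wq, C).shape)             # torch.Size([128, 128, 768])
```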
A simplified architectural flow:
```
Input:  high-res image I, low-res features F_lr (from the frozen encoder f)

1. Image encoding:
   I_E = LightweightEncoder(I) + RoPE

2. Query branch:
   I_Q = QueryEncoder(I_E)
   Q   = AdaptiveAvgPool(I_Q)        # pool to target resolution (h_q, w_q); training only

3. Key branch:
   I_K = KeyEncoder(I_E)
   K_tilde = Downsample(I_K)         # match F_lr spatial resolution (h_k, w_k)
   gamma_F, beta_F = LinearProjections(F_lr)
   K = gamma_F * K_tilde + beta_F    # SFT modulation

4. Similarity-based upsampling:
   A = Softmax((Q @ K.T) / sqrt(d))
   F_HR_hat = A @ F_lr

Output: upsampled high-res features F_HR_hat
```
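A hedged PyTorch sketch of this flow is shown below. The layer widths, the plain convolutions used for the lightweight encoders, and the bilinear downsampling in the key branch are assumptions for illustration; RoPE and multi-head averaging are omitted for brevity, and this is not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JAFARLikeUpsampler(nn.Module):
    """Illustrative sketch of the JAFAR-style flow, not the official code."""
    def __init__(self, feat_dim: int, d: int = 64):
        super().__init__()
        self.image_encoder = nn.Sequential(        # E_theta: lightweight image encoder
            nn.Conv2d(3, d, 3, padding=1), nn.GELU(),
            nn.Conv2d(d, d, 3, padding=1),
        )
        self.query_head = nn.Conv2d(d, d, 1)       # query encoder
        self.key_head = nn.Conv2d(d, d, 1)         # key encoder
        self.to_gamma = nn.Conv2d(feat_dim, d, 1)  # SFT scale predicted from F_lr
        self.to_beta = nn.Conv2d(feat_dim, d, 1)   # SFT shift predicted from F_lr
        self.d = d

    def forward(self, image, feats_lr, out_size):
        B, C, hk, wk = feats_lr.shape
        I_E = self.image_encoder(image)                            # (B, d, H, W)

        # Query branch: queries at the requested output size
        # (small during training, up to image resolution at inference).
        Q = F.adaptive_avg_pool2d(self.query_head(I_E), out_size)  # (B, d, hq, wq)

        # Key branch: downsample to the feature grid, then SFT-modulate with F_lr.
        K_tilde = F.interpolate(self.key_head(I_E), size=(hk, wk),
                                mode="bilinear", align_corners=False)
        K = self.to_gamma(feats_lr) * K_tilde + self.to_beta(feats_lr)

        # Similarity-based upsampling: softmax(QK^T / sqrt(d)) @ F_lr, no value projection.
        q = Q.flatten(2).transpose(1, 2)                 # (B, hq*wq, d)
        k = K.flatten(2).transpose(1, 2)                 # (B, hk*wk, d)
        v = feats_lr.flatten(2).transpose(1, 2)          # (B, hk*wk, C)
        A = torch.softmax(q @ k.transpose(1, 2) / self.d**0.5, dim=-1)
        return (A @ v).transpose(1, 2).reshape(B, C, *out_size)
```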
Training Pipeline
JAFAR is trained without high-resolution ground truth. It uses a self-supervised approach based on multi-resolution views. Given a high-resolution image $I_{HR}$, a downsampled version $I_{LR}$ is created with a random factor $\delta \in [2, 4]$. Features are extracted from both using the frozen encoder: $F_{hr} = f(I_{HR})$ and $F_{lr} = f(I_{LR})$. JAFAR takes $I_{HR}$ and $F_{lr}$ as input to predict $\hat{F}_{hr}$. The training objective is an alignment loss between $\hat{F}_{hr}$ and $F_{hr}$:
$\mathcal{L}(\hat{F}_{hr}, F_{hr}) = 1 - \cos(\hat{F}_{hr}, F_{hr}) + \|\hat{F}_{hr} - F_{hr}\|_2$
This setup teaches JAFAR to predict higher-resolution features ($F_{hr}$) from lower-resolution ones ($F_{lr}$), guided by the original image ($I_{HR}$), without ever running the encoder on inputs larger than the original training images to generate targets.
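A minimal sketch of one training step under this objective; `encoder` and `upsampler` are placeholders (a real ViT encoder may additionally require input sizes divisible by its patch size), and the exact reduction of the L2 term is an assumption:

```python
import random
import torch
import torch.nn.functional as F

def training_step(image_hr, encoder, upsampler, optimizer):
    delta = random.uniform(2.0, 4.0)                       # random downsampling factor
    image_lr = F.interpolate(image_hr, scale_factor=1.0 / delta,
                             mode="bilinear", align_corners=False)

    with torch.no_grad():                                  # the encoder stays frozen
        F_hr = encoder(image_hr)                           # higher-res target features
        F_lr = encoder(image_lr)                           # lower-res input features

    F_hr_hat = upsampler(image_hr, F_lr, out_size=F_hr.shape[-2:])

    # Alignment loss: cosine term plus an L2 term (reduction assumed: per-pixel mean).
    cos = F.cosine_similarity(F_hr_hat, F_hr, dim=1).mean()
    loss = (1.0 - cos) + (F_hr_hat - F_hr).norm(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```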
Experimental Results and Applications
JAFAR was evaluated by pre-training the upsampler, freezing it, and then training a lightweight linear probe on the upsampled features for each task.
- Semantic Segmentation (Table 1): JAFAR significantly outperforms all baselines (training-free, task-dependent, task-agnostic) across COCO-Stuff, Pascal VOC, ADE20K, and Cityscapes datasets, achieving the highest mIoU and accuracy. This demonstrates its ability to recover fine-grained spatial details critical for dense prediction.
- Depth Estimation (Table 1): Trained on pseudo-labels from Depth Anything V2 (Yang et al., 2024), JAFAR achieves competitive results, ranking second among all methods despite not being specifically optimized for depth estimation.
- Class Activation Maps (Table 2): Integrating JAFAR into CAM analysis (Grad-CAM) leads to sharper, more faithful explanations. JAFAR scores highest on the aggregate ADCC metric (Poppi et al., 2021), indicating better coherency, lower complexity, and greater confidence preservation.
- Zero-Shot Open-Vocabulary Segmentation (Table 3): Using MaskCLIP (Zhou et al., 2022) with a CLIP backbone, JAFAR improves performance across VOC, ADE20K, and Cityscapes compared to baselines, showing that its upsampled features preserve semantic consistency for zero-shot transfer.
- Bird's-Eye-View Segmentation (Table 4): When integrated into complex BeV architectures (SimpleBeV (Harley et al., 2022), PointBeV (Chambon et al., 2024), BeVFormer (Li et al., 2022)), JAFAR consistently improves vehicle IoU, showing its utility even within multi-view pipelines.
Implementation Considerations
- Computational Cost: JAFAR is lightweight (0.7M parameters) compared to processing larger images directly or modifying encoder strides (Table 6). Inference time scales with output resolution due to the attention mechanism.
- Memory Usage: Memory scales significantly with output resolution during both forward and backward passes (Table 7), especially the backward pass required for training (see the back-of-the-envelope estimate after this list). Training at lower target resolutions mitigates this.
- Trade-offs: While JAFAR is faster than methods requiring per-image optimization (such as FeatUp's implicit variant), its inference time is higher than simple bilinear interpolation or lightweight CNNs like LiFT, especially at very high resolutions, because the attention matrix grows with the output resolution. Its superior feature quality generally justifies this cost.
- Scalability: The design allows scaling to arbitrary resolutions. The training strategy using low upsampling factors is key for practical training.
- Limitations & Future Work: JAFAR currently requires training a separate upsampler for each foundation backbone. Future work includes making it backbone-independent and further reducing feature-level artifacts for even sharper outputs. The ablation studies (Table 5) highlight the importance of SFT modulation for key generation and tuning the number of attention heads.
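As a back-of-the-envelope illustration of why memory grows with output resolution, the attention map alone already dominates; the numbers below assume fp32, a 32×32 key grid, and a single head, and ignore activations stored for the backward pass:

```python
# Attention-map size: (h_q * w_q) x (h_k * w_k) entries.
def attention_map_bytes(out_res, key_res=32, bytes_per_el=4):
    return (out_res * out_res) * (key_res * key_res) * bytes_per_el

for out_res in (112, 224, 448):
    gib = attention_map_bytes(out_res) / 2**30
    print(f"{out_res}x{out_res} output -> ~{gib:.2f} GiB attention map")
# 112x112 -> ~0.05 GiB, 224x224 -> ~0.19 GiB, 448x448 -> ~0.77 GiB (per head, per image)
```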
In summary, JAFAR provides a practical, high-performing, and flexible solution for the critical problem of feature upsampling in modern vision pipelines relying on foundation encoders. Its ability to generalize from low-resolution training to high-resolution inference without task-specific supervision makes it a valuable component for a wide range of dense vision applications.