NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering (2511.18452v1)
Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
Explain it Like I'm 14
Overview
This paper introduces a new method called NAF (Neighborhood Attention Filtering) that makes blurry, low-detail “feature maps” from big vision models sharp and detailed again. It does this without needing to retrain for each model, and it works fast—even on very large models. The goal is to help tasks that need pixel-level precision, like segmentation (labeling each pixel) or estimating depth (how far things are), by improving the quality of the features those models produce.
Key Objectives and Research Questions
The authors wanted to answer simple but important questions:
- Can we upsample (enlarge and sharpen) the features from any vision model without retraining for each one?
- Can we do this both accurately and efficiently?
- Can a single, general method beat specialized methods that are tailor-made for particular models?
- Will this approach also help other tasks, like cleaning up noisy images?
Methods and Approach
To understand the approach, here are the core ideas in everyday language.
What are “features” and why upsample them?
- Vision Foundation Models (VFMs) are powerful models that analyze images and produce “feature maps,” which are like compressed versions of the image that capture important information (edges, textures, objects).
- These feature maps are smaller than the original image because that makes them faster to compute. But small feature maps make it harder to do precise pixel-level tasks.
- Upsampling is the process of making those feature maps larger and more detailed, so we can make accurate pixel-level predictions.
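To make the size gap concrete, here is a tiny PyTorch sketch with illustrative numbers (patch size 14 and 768 channels are typical of DINOv2-style ViT-B backbones; they are examples, not figures quoted from this paper):

```python
import torch

# A 518x518 image through a ViT with patch size 14 yields only a 37x37
# feature grid (518 / 14 = 37): each feature summarizes a 14x14 pixel block.
image = torch.randn(1, 3, 518, 518)
patch_size = 14
h = image.shape[-2] // patch_size  # 37
w = image.shape[-1] // patch_size  # 37
features = torch.randn(1, 768, h, w)  # 768 channels, ViT-B scale
print(features.shape)  # torch.Size([1, 768, 37, 37])
# A pixel-level task needs a 518x518 prediction, so this map must be
# upsampled roughly 14x in each spatial dimension.
```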
What is NAF?
NAF is a smart upsampler that:
- Looks at the original high-resolution image to guide how it should enlarge the feature map.
- Focuses its attention on small neighborhoods (nearby pixels) across different scales to find the most relevant information.
- Uses a trick called Rotary Position Embeddings (RoPE) to understand where pixels are relative to each other, so it doesn’t get confused when matching areas at different resolutions (a short generic sketch of RoPE follows at the end of this subsection).
- Is “zero-shot” and “VFM-agnostic,” meaning once trained, it can be applied to features from any vision model without retraining.
Think of NAF like a super-smart “magnifying glass” that compares what the original image looks like in different nearby regions and decides how to mix and enhance the features to match the actual details.
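The RoPE idea mentioned in the list above can be sketched in its standard, generic 2D form (this is the textbook rotary-embedding construction, not code from the NAF repository): channels are rotated by angles proportional to a pixel’s coordinates, so the query–key dot product depends only on the relative offset between two positions, regardless of the absolute resolution.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    # Rotate channel pairs of x by angles proportional to pos.
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=x.dtype) / d)  # per-pair frequency
    ang = pos[..., None] * freqs                           # (..., d)
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                      x1 * ang.sin() + x2 * ang.cos()], dim=-1)

def rope_2d(x, y_pos, x_pos):
    # First half of the channels encodes the y coordinate, second half x.
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], y_pos),
                      rope_1d(x[..., half:], x_pos)], dim=-1)

q = torch.randn(2, 64)                              # two 64-dim query vectors
q_rot = rope_2d(q, torch.tensor([3.0, 10.0]),       # their y coordinates
                torch.tensor([5.0, 5.0]))           # their x coordinates
```

Applying the same rotation to queries at high-resolution pixel coordinates and to keys at low-resolution cell centers (mapped into the same coordinate frame) is what lets the two grids be compared consistently.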
How does the attention work?
- The method uses “queries” and “keys” (common terms in attention-based models) derived only from the original high-res image.
- For each high-res pixel, NAF compares it to nearby low-res areas and assigns weights (how much each neighbor should contribute).
- It then blends feature values based on those weights, producing a detailed high-res feature map.
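Putting these steps together, here is a minimal, unoptimized sketch of cross-scale neighborhood attention under simplifying assumptions (RoPE and the paper’s dual-branch guidance encoder are omitted; `guide_hr_emb` stands in for whatever embedding the guidance encoder produces):

```python
import torch
import torch.nn.functional as F

def neighborhood_upsample(feat_lr, guide_hr_emb, k=3):
    # feat_lr:      (B, C, h, w) low-res VFM features (the attention values)
    # guide_hr_emb: (B, D, H, W) embeddings of the high-res guidance image
    # k:            odd neighborhood size on the low-res grid
    B, C, h, w = feat_lr.shape
    _, D, H, W = guide_hr_emb.shape
    # Queries: one per high-res pixel, derived from the guidance image only.
    q = guide_hr_emb.permute(0, 2, 3, 1).reshape(B, H * W, 1, D)
    # Keys: the same embeddings average-pooled to the low-res grid, keeping
    # them geometrically aligned with feat_lr.
    keys_lr = F.adaptive_avg_pool2d(guide_hr_emb, (h, w))
    # Extract a k x k neighborhood of keys/values around every low-res cell.
    pad = k // 2
    k_nb = F.unfold(keys_lr, k, padding=pad).reshape(B, D, k * k, h * w)
    v_nb = F.unfold(feat_lr, k, padding=pad).reshape(B, C, k * k, h * w)
    # Map each high-res pixel to its corresponding low-res cell.
    yy, xx = torch.meshgrid(torch.arange(H) * h // H,
                            torch.arange(W) * w // W, indexing="ij")
    idx = (yy * w + xx).reshape(-1).to(feat_lr.device)     # (H*W,)
    k_nb = k_nb[..., idx].permute(0, 3, 2, 1)              # (B, H*W, k*k, D)
    v_nb = v_nb[..., idx].permute(0, 3, 2, 1)              # (B, H*W, k*k, C)
    # Dot-product attention: each high-res query weighs its k*k neighbors,
    # then blends their low-res feature values into a high-res output.
    attn = ((q * k_nb).sum(-1) / D ** 0.5).softmax(-1)     # (B, H*W, k*k)
    out = (attn.unsqueeze(-1) * v_nb).sum(2)               # (B, H*W, C)
    return out.permute(0, 2, 1).reshape(B, C, H, W)
```

A production implementation would use fused neighborhood-attention kernels rather than `unfold`, which materializes every neighborhood explicitly and wastes memory at scale.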
A helpful analogy
Imagine trying to redraw a blurry picture. You look at the original photo (the guidance image) to understand where edges and textures are. Then you carefully fill in details in a smaller, blurry version by borrowing clues from the sharp photo, especially from nearby regions that look similar. That’s what NAF does—guided by the real image, it smartly refines and enlarges the model’s features.
Training in simple terms
- During training, the model sees pairs of high-res and artificially downsampled (low-res) images.
- It learns to upsample the low-res features back to the high-res features using the original image as a guide.
- The only loss is “make the upsampled features look like the real high-res features”—no extra bells and whistles.
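In code, the recipe above amounts to roughly the following sketch (`vfm`, `naf`, `dataloader`, and `optimizer` are placeholders for objects defined elsewhere; the factor-2 downsampling matches the training setup discussed in the Knowledge Gaps section below):

```python
import torch
import torch.nn.functional as F

for images_hr in dataloader:                         # high-res training images
    images_lr = F.interpolate(images_hr, scale_factor=0.5,
                              mode="bilinear", align_corners=False)
    with torch.no_grad():                            # the VFM stays frozen
        feats_hr = vfm(images_hr)                    # target: HR-image features
        feats_lr = vfm(images_lr)                    # input: LR-image features
    pred_hr = naf(images_hr, feats_lr)               # upsample, guided by HR image
    loss = F.mse_loss(pred_hr, feats_hr)             # the single L2 reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```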
Why is NAF efficient?
- It only attends to local neighborhoods, not the entire image, making it much faster and lighter on memory.
- It uses a simple dual-branch encoder that captures both per-pixel detail and local context.
- It doesn’t depend on the feature format of any specific VFM, so there’s no retraining or heavy customization.
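A back-of-envelope count shows why locality matters so much. Assuming a square attention window of side 9 (the upsampling window size mentioned later in this summary), each high-res query scores 81 neighbors instead of every position:

```python
# Multiply-accumulate counts per attention-score pass for a 512x512 output grid.
H = W = 512
k = 9                            # assumed window side (9x9 neighborhood)
local_cost = H * W * k * k       # each query scores its k*k neighbors
global_cost = (H * W) ** 2       # each query scores every position
print(f"local: {local_cost:.2e}, global: {global_cost:.2e}, "
      f"ratio: {global_cost / local_cost:.0f}x")
# local: 2.12e+07, global: 6.87e+10, ratio: 3236x
```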
Main Findings and Results
The paper reports strong results across different tasks and models. In short:
- NAF beats both classic upsampling (like bilinear or bicubic) and modern, model-specific upsamplers on key tasks.
- It works across many VFMs (like DINOv2, RADIOv2.5, Franca, DINOv3)—even huge 7-billion-parameter models—without retraining.
- It improves:
- Semantic segmentation: higher accuracy in labeling each pixel.
- Depth estimation: better precision in predicting how far objects are.
- Open-vocabulary segmentation and video segmentation: better transfer and consistency across frames.
- It’s fast and scales well: up to 2K resolutions with reconstruction at about 18 frames per second, and supports big upscaling ratios.
- Beyond upsampling, it also works impressively on image denoising (cleaning noisy images), despite having far fewer parameters than some specialized restoration models.
These results matter because they show you can have one simple, general tool that works better than specialized ones in many situations.
Implications and Potential Impact
- For developers and researchers: You can plug NAF into different vision systems without retraining for each model, saving time and compute.
- For real-world applications: Better pixel-level performance in areas like medical imaging, autonomous driving, mapping, and video analysis—especially when handling large models or high resolutions.
- For the field: NAF bridges a gap between old-school filters (fast but rigid) and modern learned methods (accurate but model-specific), offering a fast, flexible, and powerful alternative.
- Future directions: Its success in denoising hints that the same neighborhood-attention idea could help in other image restoration tasks like super-resolution, deblurring, or enhancing low-light images.
In short, NAF is a practical, general-purpose way to make visual features precise and useful for detailed tasks—quickly and across many different models.
Knowledge Gaps
The paper leaves several aspects unresolved; below is a list of concrete gaps that future work could address.
- Training regimen transparency and VFM dependence: It is unclear on which VFM(s), datasets, and scales NAF is trained “once,” and whether the reported zero-shot generalization stems from training on one VFM, multiple VFMs, or a VFM-agnostic objective. Provide a clear description and test generalization to truly unseen VFMs (different architectures, patch sizes, tokenization, and normalization) not used during training.
- Scale generalization mismatch: NAF is trained with downsampling factor 2 but evaluated at upsampling ratios as high as ×16–×72. Quantify performance as a function of scale, evaluate multi-scale training, and analyze failure modes at extreme ratios (e.g., aliasing, ringing, boundary artifacts).
- Sensitivity to guidance quality: The upsampler relies solely on the input image for guidance. Assess robustness to real-world degradations (noise, blur, compression, lighting changes, occlusions) and domain shifts (e.g., medical, aerial, thermal, multispectral), including cases where guidance is partially mismatched to the LR features (e.g., rolling shutter, motion).
- Alignment assumptions between image and VFM features: The key construction via average pooling assumes perfect geometric alignment between the HR image pixels and the LR feature grid. Investigate misalignment due to VFM preprocessing (resizing/cropping), patch boundary biases, non-integer strides, or non-grid tokens, and develop alignment correction (offset estimation, anti-aliasing/pre-filtering).
- Neighborhood size and topology: The locality window is fixed (e.g., 9 for upsampling, 15 for restoration). Systematically study the trade-offs of neighborhood radius, shape (square vs. circular), dilation, and adaptive windows per pixel, including learned or content-adaptive neighborhood selection.
- Need for long-range context: Local attention may be insufficient for repetitive textures, large structures, or remote dependencies. Evaluate hierarchical/multi-scale neighborhoods or sparse global connections and quantify gains vs. compute overhead.
- Positional encoding choice and stability: RoPE is favored in ablation, but its behavior under very large relative offsets (high upsampling ratios) and different image sizes/aspect ratios is not analyzed. Compare RoPE to alternative 2D relative encodings (ALiBi, learned relative bias) for stability, accuracy, and numerical conditioning.
- Frequency-domain claim (IDFT interpretation): The paper asserts NAF implicitly predicts spectral coefficients to reconstruct spatial kernels but provides no main-text derivation or empirical verification. Rigorously prove the IDFT formulation, visualize learned kernels, measure frequency responses, and test whether spectral regularization improves performance.
- Channel-wise and multi-head design: Current aggregation applies scalar attention weights to all feature channels. Explore per-channel or per-head weighting, cross-channel mixing strategies, and multi-head neighborhood attention to handle heterogeneous feature dimensions and semantics.
- Loss design beyond L2: Training uses only an L2 reconstruction loss. Evaluate edge-aware, spectral, perceptual, or task-driven losses (e.g., boundary-preserving terms) and multi-task supervision to reduce oversmoothing and enhance fine structures.
- Efficiency on diverse hardware: Claims report GFLOPs/FPS on A100 GPUs. Profile memory bandwidth, latency, and energy on varied hardware (consumer GPUs, CPU, mobile/edge), assess mixed precision/quantization, and provide deployment guidelines.
- Quality vs. maximum upsampling ratio: The paper's table reports the maximum ratios that fit in GPU memory but not accuracy at those ratios. Provide accuracy–ratio curves, artifact analysis, and guidelines for safe operating ranges per VFM and task.
- End-to-end training integration: NAF is evaluated mostly as a frozen module with linear probes or downstream pipelines that replace bilinear upsampling. Study end-to-end training (e.g., segmentation networks) with NAF in the loop to assess gradient stability, convergence, and task-specific gains.
- Baseline coverage completeness: Task-specific upsamplers (CARAFE, DySample, SAPA) are discussed but not compared in the VFM-agnostic setting. Include these baselines—adapted appropriately—to validate NAF’s advantages beyond classical filters and VFM-specific methods.
- Dataset and task breadth: Evaluations focus on segmentation, depth, open-vocabulary segmentation, and video propagation. Extend to instance segmentation, optical flow, stereo, surface normals, boundary detection, tracking, and retrieval to test generality of spatial detail recovery.
- Guidance modalities and fusion: NAF uses only RGB guidance. Investigate fusion with auxiliary guidance (edges, gradients, saliency, depth priors), multi-modal inputs (IR, LiDAR), or self-supervised guidance to improve edge preservation and robustness.
- Boundary handling and padding: The paper does not detail treatment near image borders. Analyze border effects, padding strategies, and their impact on edge fidelity and artifacts.
- Scaling laws and hyperparameter selection: Ablations cover C and L but do not provide scaling laws or principled selection strategies. Offer performance–compute trade-off curves and automatic hyperparameter tuning for diverse VFMs and resolutions.
- Pooling strategy nuances: Although average pooling outperforms max and bilinear for keys, the channel-alignment rationale is hypothesized rather than tested. Investigate key-side channel alignment, learned pooling, and content-aware aggregation for keys.
- Restoration extension breadth: Image restoration experiments target denoising only. Evaluate deblurring, super-resolution, deraining, dehazing, compression artifact removal, and mixed degradations to validate generality of neighborhood attention filtering.
- Large-model evaluations beyond linear probes: For 7B VFMs, experiments show gains in linear probing and two downstream tasks. Assess broader tasks (e.g., detection, panoptic segmentation, retrieval, in-context segmentation) and measure scalability limits in memory/time for end-to-end pipelines.
- Adversarial sensitivity: Guidance-driven weights may be vulnerable to adversarial perturbations of the input image. Quantify adversarial robustness and explore defenses (e.g., robust encoders, certified bounds on attention weights).
- VFM preprocessing interaction: Different VFMs use different normalization/resizing pipelines, yet NAF’s guidance encoder operates on the raw image. Study whether matching the VFM’s preprocessing in NAF improves alignment, and provide a principled recipe for cross-VFM compatibility.
- Multi-stage feature integration: NAF upscales final-stage features; effects on intermediate layers or multi-scale fusion are not explored. Evaluate applying NAF to earlier stages or aggregating multiple VFM scales to improve fine-grained detail.
Practical Applications
Immediate Applications
Below is a concise set of concrete, deployable use cases that leverage NAF’s zero-shot, VFM-agnostic feature upsampling and image-restoration capabilities.
- Vision pipelines upgrade without retraining
- Sectors: software, robotics, automotive, healthcare, geospatial, retail
- Use case: Replace bilinear/nearest upsampling with NAF in existing segmentation and depth-estimation pipelines (e.g., DINOv2/DINOv3/RADIO/Franca backbones) to immediately improve pixel-level accuracy in semantic segmentation, depth, open-vocabulary segmentation, and video object segmentation.
- Tools/workflows: “NAF UpSampler” as a drop‑in PyTorch/TensorRT module in inference graphs (a hypothetical usage sketch follows this use case); swap-in for ProxyCLIP or video mask propagation tools.
- Assumptions/dependencies: Access to HR input images and LR VFM features; GPU resources for 18 FPS at target resolutions; compatibility with current preprocessing of chosen VFMs.
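As an illustration of such a swap, the snippet below shows what a drop-in replacement could look like; `load_naf`, `vfm`, and `seg_head` are hypothetical placeholder names, not the actual interface of the released code at https://github.com/valeoai/NAF:

```python
import torch

naf = load_naf("naf_checkpoint.pth").eval()       # hypothetical loader

image = torch.randn(1, 3, 1024, 1024)             # high-res input image
with torch.no_grad():
    feats_lr = vfm(image)                         # (1, C, h, w) from any frozen VFM
    # before: feats_hr = F.interpolate(feats_lr, size=(1024, 1024), mode="bilinear")
    feats_hr = naf(image, feats_lr)               # zero-shot, no retraining
masks = seg_head(feats_hr)                        # downstream head unchanged
```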
- Autonomous driving and robotics perception refinement
- Sectors: robotics, automotive
- Use case: Improve road-scene segmentation and depth estimation using NAF, enhancing drivable-area delineation, lane boundaries, pedestrians/vehicles, and short-range geometry for planning.
- Tools/workflows: ROS/Autoware module that upsamples backbone features before heads; tested on Cityscapes-style scenes.
- Assumptions/dependencies: Real-time constraints (hardware budget, memory); validation under domain shifts (night, weather); integration with current perception stacks.
- Medical image analysis with VFM-agnostic upsampling
- Sectors: healthcare
- Use case: Apply NAF with RADIOv2.5 or similar VFMs to sharpen organ/tumor boundaries for segmentation in radiology or endoscopy without retraining the upsampler per backbone.
- Tools/workflows: PACS-integrated inference where NAF precedes segmentation heads; research-grade pipelines for linear probing and downstream heads.
- Assumptions/dependencies: Clinical validation and regulatory review; image modality differences (grayscale, multi-channel); GPU availability.
- Remote sensing and mapping
- Sectors: geospatial, energy, public works
- Use case: Use NAF to improve feature fidelity for land-cover segmentation and depth/height proxies from aerial/satellite imagery, aiding urban planning and infrastructure monitoring.
- Tools/workflows: Geospatial tiling pipelines (e.g., Raster Vision, TorchGeo) with NAF as the upsampling stage; post-processing for vectorization and GIS ingestion.
- Assumptions/dependencies: Handling very large images via tiling; alignment with georeferencing workflows; compute/memory sizing.
- Industrial inspection and manufacturing QA
- Sectors: manufacturing
- Use case: Enhance fine defect detection (scratches, micro-cracks) by upsampling VFM features for pixel-precise segmentation/classification on production lines.
- Tools/workflows: NAF embedded in vision nodes on edge PCs; integration with MES for defect logging.
- Assumptions/dependencies: Throughput requirements; robustness to lighting/reflectivity variations; camera calibration.
- Video editing and surveillance mask propagation
- Sectors: media, public safety
- Use case: Improve temporal consistency of segmentation masks in video object tracking/propagation tasks.
- Tools/workflows: Plugins for After Effects/Nuke or VOS toolchains where NAF replaces bilinear upsampling; batch processing pipelines for surveillance footage.
- Assumptions/dependencies: Frame rate and latency constraints; dataset-specific tuning of neighborhood size.
- Annotation tooling for dataset creation
- Sectors: software, education, research
- Use case: Sharpen boundaries in interactive annotation tools (e.g., Label Studio) by upsampling features before mask proposals, reducing manual corrections.
- Tools/workflows: “NAF-assisted pre-annotations” in labeling UIs; export masks with better edges.
- Assumptions/dependencies: Integration with tool APIs; human-in-the-loop verification.
- Lightweight image denoising in consumer and enterprise apps
- Sectors: mobile, photography, healthcare
- Use case: Use NAF’s restoration mode for Gaussian and salt‑and‑pepper denoising in camera pipelines, archival image cleanup, or medical image preprocessing.
- Tools/workflows: On-device denoising library with neighborhood attention; server-side cleanup in DAM/CMS systems.
- Assumptions/dependencies: Trained restoration weights for target noise regimes; runtime on mobile NPUs/GPUs; user privacy and image retention policies.
- MLOps resilience to backbone changes
- Sectors: software, finance, e-commerce
- Use case: Standardize an upsampling layer that remains VFM-agnostic, enabling rapid backbone upgrades (e.g., swapping DINOv2→DINOv3) without retraining the upsampler.
- Tools/workflows: CI/CD with NAF in model graphs; A/B testing of backbones with consistent upsampling.
- Assumptions/dependencies: Versioning and compatibility with framework runtimes; reproducible preprocessing across VFMs.
- Academic probing and interpretability
- Sectors: academia
- Use case: Use NAF for feature probing (segmentation/depth linear heads) and for teaching the bridge between classical filtering (JBU, bilateral) and attention via RoPE/IDFT interpretation.
- Tools/workflows: Course labs and benchmarks; ablation suites leveraging NAF’s minimal training setup.
- Assumptions/dependencies: Access to open-source code/checkpoints; datasets and compute.
Long-Term Applications
The following applications are plausible but require further research, scaling, engineering, or validation.
- Real-time HD/4K perception with large VFMs
- Sectors: automotive, robotics, AR/VR
- Use case: End-to-end 4K semantic/depth inference with 7B VFMs and NAF at high FPS for autonomy and AR overlays.
- Tools/workflows: Hardware-accelerated neighborhood attention kernels; mixed-precision inference; model sharding.
- Assumptions/dependencies: Specialized kernels, memory optimization, thermal/power constraints, extensive field testing.
- General image restoration suite (deblur, dehaze, derain, super-resolution)
- Sectors: mobile, media, healthcare
- Use case: Extend NAF’s restoration mode beyond denoising to a unified, interpretable filter-attention framework for multiple degradations.
- Tools/workflows: Modular “NAF-Restore” library with task-specific training heads; camera ISP integration.
- Assumptions/dependencies: Training on diverse degradation distributions; benchmarking vs. SOTA restorers; handling domain shifts.
- Multi-modal sensor fusion via neighborhood attention
- Sectors: robotics, defense, smart cities
- Use case: Use image-guided filtering to fuse LiDAR/RGB, thermal/RGB, radar/RGB features for robust perception under adverse conditions.
- Tools/workflows: Multi-sensor encoders feeding NAF; cross-scale alignment layers; calibration-aware attention.
- Assumptions/dependencies: Accurate sensor calibration; synchronization; robustness to missing modalities.
- Geospatial “NAF-Tiler” for continent-scale inference
- Sectors: geospatial, climate, agriculture
- Use case: Distributed pipelines that run NAF across massive tiled imagery for land-use mapping, crop monitoring, and disaster assessment.
- Tools/workflows: Cloud-native microservices; streaming inference; Dask/Ray orchestration; vector export to GIS.
- Assumptions/dependencies: Efficient tile stitching and halo handling; cost controls; data governance.
- Medical-grade, interpretable upsampling modules
- Sectors: healthcare, policy
- Use case: Regulatory-compliant upsamplers with filter-based interpretability (via IDFT analogy) embedded in clinical decision systems.
- Tools/workflows: Model cards with interpretability reports; audit logs for spatial kernels; clinical trials.
- Assumptions/dependencies: Regulatory pathways (FDA/CE); bias and safety assessments; secure deployment.
- Graphics and game engines: content-aware upsampling
- Sectors: graphics, gaming
- Use case: Integrate NAF to improve texture/detail upsampling and occlusion-aware AR compositing without heavy re-training for different vision backbones.
- Tools/workflows: Engine plugins (Unreal/Unity) with runtime NAF passes; shader-language implementations.
- Assumptions/dependencies: Real-time budgets; GPU-friendly kernels; visual artifact testing.
- Edge AI acceleration and silicon support
- Sectors: semiconductors, mobile, IoT
- Use case: Design NAF-friendly hardware blocks (local attention + RoPE) for efficient feature upsampling/restoration on NPUs.
- Tools/workflows: Compiler support (TVM, XLA), kernel fusion, quantization-aware training.
- Assumptions/dependencies: Vendor adoption; performance/area trade-offs; toolchain maturity.
- Standardization in CV frameworks
- Sectors: software ecosystem
- Use case: Inclusion of NAF as a standard VFM-agnostic upsampler in OpenMMLab/Detectron2/TorchVision, enabling consistent benchmarking and deployment.
- Tools/workflows: API design, ops registration, tutorials, model zoo integration.
- Assumptions/dependencies: Community governance; maintenance and backward compatibility.
- Document understanding and OCR layout refinement
- Sectors: finance, legal, public sector
- Use case: Use NAF to improve fine-grained segmentation of page elements (tables, stamps, signatures) for compliance and automation workflows.
- Tools/workflows: NAF-enhanced vision backbones in OCR stacks (LayoutLMv3 variants + CV VFMs).
- Assumptions/dependencies: Domain adaptation to scans/photos; throughput on enterprise servers.
- Policy and procurement guidance for interoperable AI modules
- Sectors: policy, public procurement
- Use case: Recommend VFM-agnostic, interoperable components (like NAF) to reduce lock-in and compute costs when upgrading models in public-sector AI systems.
- Tools/workflows: Interoperability checklists; cost-efficiency benchmarks; best-practice guides.
- Assumptions/dependencies: Stakeholder coordination; measurable ROI; alignment with privacy and audit requirements.
Glossary
- Average pooling: A pooling operation that computes the average value within a region, often used to downscale or align features. "Attention keys are obtained by average pooling the same features to the low-resolution grid, ensuring geometric alignment with the low-resolution features:"
- Bicubic interpolation: A classical image resizing method using cubic polynomials, often used as a strong baseline for upsampling. "surprisingly, recent upsamplers such as JAFAR \citep{jafar}, FeatUp \citep{featup}, and AnyUp \citep{wimmer2025anyup} fail to outperform the bicubic baseline"
- Bilateral filter: An edge-preserving filter that weights neighbors by spatial and intensity similarity. "Classic examples include the bilateral filter~\citep{tomasi1998bilateralfilter}"
- Bilinear interpolation: A simple image resizing method that uses linear interpolation in two dimensions. "We evaluate NAF against both VFM-specific upsamplers \citep{featup,jafar} and VFM-agnostic approaches such as bilinear interpolation"
- CARAFE: A content-aware reassembly module that predicts upsampling kernels conditioned on features. "CARAFE \citep{carafe} and DySample \citep{dysample} predict content-aware kernels or sampling points"
- Channel-wise salt-and-pepper noise: A corruption that randomly flips or nulls individual color channels with some probability. "channel-wise salt-and-pepper noise, meaning we randomly activate or deactivate some channels with corruption probability ."
- Cross-Scale Neighborhood Attention: An attention mechanism that restricts interactions to local neighborhoods across different resolutions. "NAF employs a Cross-Scale Neighborhood Attention mechanism where each high-resolution query attends only to a compact neighborhood around its corresponding low-resolution location"
- Denoising: The process of recovering a clean image from a noisy observation. "We focus on image denoising, where the goal is to recover a clean RGB image of size from its corrupted version."
- Depth estimation: Predicting per-pixel scene depth from visual features or images. "A linear layer is then trained on top of the upsampled features to predict per-pixel semantic labels or depth values, depending on the task."
- Delta-1 (δ1): A depth accuracy metric measuring the percentage of predictions within a relative threshold of ground truth. "Depth (NYUv2)"
- Dot-product attention: An attention scoring mechanism using the dot product between queries and keys. "The dot-product attention weights between high-resolution queries and low-resolution keys define an adaptive content kernel"
- Encoder–decoder architecture: A neural design with an encoder that compresses inputs and a decoder that reconstructs spatial details. "LiFT uses a CNN-based encoder-decoder architecture."
- Feature map: A multi-channel spatial representation produced by a neural network. "a vision foundation model (VFM) produces a low-resolution feature map $\mathbf{F}^{\mathrm{LR}}$"
- GFLOPs: Giga floating-point operations; a measure of computational cost. "achieving about 40\% fewer GFLOPs compared to JAFAR ~\cite{jafar}"
- Guidance image: A high-resolution signal used to steer filtering or upsampling of features. "NAF then upsamples $\mathbf{F}^{\mathrm{LR}}$ into $\widehat{\mathbf{F}}^{\mathrm{HR}} := \operatorname{NAF}(\mathbf{I}^{\mathrm{HR}}, \mathbf{F}^{\mathrm{LR}})$, using $\mathbf{I}^{\mathrm{HR}}$ as the guidance image."
- Image restoration: A family of tasks (e.g., denoising) that recover high-quality images from degraded inputs. "Beyond feature upsampling, NAF demonstrates strong performance on image restoration"
- Inverse Discrete Fourier Transform (IDFT): A transform that reconstructs a spatial signal from frequency-domain coefficients. "We show that this design implicitly learns an Inverse Discrete Fourier Transform (IDFT) of the aggregation"
- Joint Bilateral Filtering: A bilateral filter that uses a separate guidance image to compute weights. "Our architecture parallels classical joint filtering methods, such as Joint Bilateral Filtering and its upsampling variant (JBU)"
- Joint Bilateral Upsampling (JBU): An upsampling method that transfers edge-aware structure from a high-resolution guidance image. "The Joint Bilateral Upsampling (JBU) module \citep{featup} is inherently VFM-agnostic"
- Linear probing: Evaluating feature quality by training a simple linear layer on top of frozen features. "we conduct linear probing experiments on semantic segmentation and depth estimation."
- mIoU (mean Intersection over Union): A common metric for semantic segmentation accuracy. "Semantic segmentation (mIoU)"
- Neighborhood Attention Filtering (NAF): The proposed method that performs zero-shot, VFM-agnostic feature upsampling via local attention guided by the image. "Neighborhood Attention Filtering (NAF) as a Zero-Shot Feature Upsampler: train once, apply efficiently to any Vision Foundation Model"
- Open-vocabulary segmentation: Segmenting categories specified by text prompts rather than a fixed label set. "the transferability of upsampled features to open-vocabulary segmentation"
- Pixel unshuffling: A re-arrangement operation used in up/downsampling pipelines to change spatial resolution. "transposed convolutions or pixel unshuffling \citep{li2018pyramid,ronneberger2015unet,shi2016pixelunshuffling,zhao2017pspnet,long2015fcn}"
- Positional embeddings: Representations added to encode location information for attention mechanisms. "To enable spatial relationship reasoning in the attention formulation, NAF uses positional embeddings on its guidance features."
- ProxyCLIP: A model that evaluates open-vocabulary segmentation using CLIP-like proxies. "We use ProxyCLIP \citep{lan2024proxyclip} to evaluate upsampled representations"
- PSNR (Peak Signal-to-Noise Ratio): A fidelity metric measuring reconstruction quality relative to noise. "Gaussian Denoising (PSNR / SSIM (%))"
- Queries and keys: Components of attention; queries retrieve relevant keys to compute weighted combinations. "where $\mathbf{Q}$, $\mathbf{K}$ denote queries and keys"
- Receptive fields: The spatial extent of input that influences a feature's response. "aligning receptive fields across resolutions while keeping attention localized and efficient."
- Restormer: A transformer-based image restoration model used as a strong baseline. "a state-of-the-art transformer-based model, Restormer~\citep{Zamir2021Restormer}"
- Rotary Position Embeddings (RoPE): A positional encoding method that rotates features to encode relative positions. "To encode relative positional information, we apply 2D Rotary Positional Embeddings (RoPE)"
- Scale invariance: The property of producing consistent outputs across different input scales. "only few VFMs exhibit scale invariance and handle non-standard resolutions effectively"
- Semantic segmentation: Assigning a semantic label to each pixel in an image. "we conduct linear probing experiments on semantic segmentation and depth estimation."
- Spectral coefficients: Frequency-domain parameters used to reconstruct spatial filters via inverse transforms. "NAF predicts spectral coefficients that reconstruct an adaptive, data-dependent upsampling filter."
- SSIM (Structural Similarity Index): A perceptual metric for assessing image similarity and quality. "Gaussian Denoising (PSNR / SSIM (%))"
- Transposed convolutions: Learnable upsampling layers that invert the spatial effect of convolutions. "Early techniques rely on standard methods like transposed convolutions or pixel unshuffling"
- Upsampling factor: The scale by which a feature map’s spatial resolution is increased. "for an upsampling factor ."
- VFM-agnostic: Independent of the specific Vision Foundation Model used to produce features. "We introduce Neighborhood Attention Filtering (NAF), a VFM-agnostic upsampling module"
- Vision Foundation Models (VFMs): Large pre-trained vision backbones used across diverse tasks. "Vision Foundation Models (VFMs) extract spatially downsampled representations"
- Video object segmentation: Tracking and segmenting objects across frames in a video. "Video Object Segmentation"
- Zero-shot: Performing a task without task-specific retraining or supervision on the target model/data. "NAF operates zero-shot: it upsamples features from any VFM without retraining"