FilterViT: Efficient Vision Transformer
- FilterViT is a Vision Transformer variant that incorporates a Filter Block for dynamic token screening, significantly reducing computational cost.
- It uses a lightweight per-pixel saliency scoring mechanism and top-K selection to transform dense attention into a sparse, efficient process.
- Empirical evaluations show that FilterViT achieves high accuracy with fewer parameters and built-in saliency maps for improved interpretability.
FilterViT is a Vision Transformer (ViT) variant designed to address the computational bottleneck of standard attention in high-resolution feature maps. It selectively filters the most salient image positions for attention through a lightweight scoring mechanism, resulting in a marked reduction in the number of tokens processed at each attention block. This sparse attention formulation maintains high accuracy, yields interpretable saliency masks, and significantly improves parameter and computational efficiency compared to fully-dense ViT models and lightweight convolutional baselines (Sun, 2024).
1. Architectural Foundations
FilterViT is constructed by interposing a “Filter Block” immediately preceding its attention (QKV) stages. The typical ViT attends densely over all $N$ tokens, where $N$ is the number of spatial positions in a high-resolution feature map. This results in a computational and memory burden quadratic in $N$. FilterViT addresses this by evaluating the importance of each pixel (token), sorting these scores, and retaining only the top-$K$ tokens for attention, where $K \ll N$ and $K$ may vary by layer.
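A back-of-the-envelope sketch makes the quadratic saving concrete. The 56×56 resolution and the 25% retention ratio below are illustrative assumptions, not figures from the paper:

```python
# Query-key pair counts for dense vs. filtered attention at a 56x56 feature map
N = 56 * 56          # 3136 spatial positions (illustrative resolution)
K = N // 4           # hypothetical top-K retention of 25%

dense_pairs = N * N  # pairs scored by full self-attention
sparse_pairs = K * K # pairs scored after top-K filtering

print(dense_pairs // sparse_pairs)  # -> 16, i.e. a 16x reduction in pair count
```

Since attention cost scales with the number of query-key pairs, a retention ratio of $\rho$ cuts the per-layer attention cost by roughly a factor of $1/\rho^2$.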
The pipeline can be abstracted as:
- Input image → CNN feature extractor → Filter Block (scoring and selection) → Transformer attention on top-$K$ tokens → Sparse outputs merged into the full map → Downstream heads.
Compared to complete token filtering strategies that precede the encoder entirely (Naruko et al., 2 Jun 2025) and to convolutional filter hybrids (Shi et al., 2024), FilterViT integrates the filtering operation tightly with QKV attention—enabling dynamic relevance estimation and attentional interpretability within the transformer backbone.
2. The Filter Block: Scoring and Top-K Selection
Let $X \in \mathbb{R}^{N \times C}$ denote a feature map flattened over its $N = H \times W$ spatial positions (omitting batch size for clarity), where $C$ is the channel dimension. The filter mechanism operates as follows:
- Per-pixel saliency scoring: Each token $x_i \in \mathbb{R}^C$ is assigned a scalar score via a lightweight parametric function
$$s_i = \sigma\!\left(w^{\top} x_i + b\right), \qquad i = 1, \dots, N,$$
where $\sigma$ denotes the sigmoid activation and $w \in \mathbb{R}^{C}$, $b \in \mathbb{R}$ are learned parameters. Each $s_i$ reflects the predicted saliency of pixel $i$.
- Sorting and selecting: The scores $\{s_i\}$ are sorted, and the indices of the top-$K_\ell$ entries (layer-dependent) are retained. Let $\rho_\ell \in (0, 1]$ be a predefined retention ratio, so that
$$K_\ell = \left\lceil \rho_\ell \, N_\ell \right\rceil,$$
where $N_\ell$ is the spatial size at layer $\ell$. Typically, $\rho_\ell$ decreases in deeper layers as higher-level semantics become more sparse.
- Feature masking and gathering: A binary mask $m \in \{0, 1\}^{N}$ is constructed such that $m_i = 1$ if $i$ is among the top-$K$ indices, else $m_i = 0$. The retained features are gathered and augmented with their respective positional embeddings.
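The scoring, selection, and masking steps above can be sketched in NumPy. The sizes, the random feature map, and the random scoring vector `w` are stand-ins for learned quantities; this is a minimal illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N spatial positions, C channels, retention ratio rho
N, C, rho = 196, 64, 0.25
K = int(np.ceil(rho * N))                 # K = ceil(rho * N) = 49

X = rng.standard_normal((N, C))           # flattened feature map (random stand-in)
w, b = rng.standard_normal(C), 0.0        # scoring parameters (learned in practice)

# Per-pixel saliency: s_i = sigmoid(w . x_i + b)
s = 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Sort scores and keep the indices of the K largest
topk_idx = np.argsort(s)[-K:]

# Binary mask: m_i = 1 iff position i is retained
m = np.zeros(N, dtype=int)
m[topk_idx] = 1

X_k = X[topk_idx]                         # gathered tokens passed on to attention
print(X_k.shape, m.sum())                 # (49, 64) 49
```

In a real model the positional embeddings of the selected indices would be added to `X_k` before attention, as described above.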
3. Computational Complexity and Attention Modification
Attention is then computed exclusively over the filtered sequence $X_K \in \mathbb{R}^{K \times C}$. Standard QKV projections are performed:
$$Q = X_K W_Q, \qquad K' = X_K W_K, \qquad V = X_K W_V,$$
with $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ and $Q, K', V \in \mathbb{R}^{K \times d}$.
Sparse self-attention is then given by
$$\mathrm{Attention}(Q, K', V) = \mathrm{softmax}\!\left(\frac{Q K'^{\top}}{\sqrt{d}}\right) V.$$
The computational cost per layer shifts from $O(N^2 d)$ for full attention to $O(K^2 d)$, realizing a substantial reduction, especially at higher resolutions. The total attention cost across layers is $O\!\left(\sum_\ell K_\ell^2 d\right)$, as opposed to $O\!\left(\sum_\ell N_\ell^2 d\right)$ in a dense ViT.
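The filtered attention step can be sketched as follows. Random matrices stand in for the learned projections, and the sizes ($K = 49$ retained tokens, $C = 64$, $d = 32$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

K_tok, C, d = 49, 64, 32                   # retained tokens, channel dim, head dim
X_k = rng.standard_normal((K_tok, C))      # filtered token sequence (stand-in)

# Projections W_Q, W_K, W_V in R^{C x d} (random stand-ins for learned weights)
W_q, W_k, W_v = (rng.standard_normal((C, d)) for _ in range(3))
Q, Kmat, V = X_k @ W_q, X_k @ W_k, X_k @ W_v

# softmax(Q K'^T / sqrt(d)) V, computed over the K retained tokens only
logits = Q @ Kmat.T / np.sqrt(d)           # (K, K) instead of (N, N)
A = np.exp(logits - logits.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
out = A @ V

print(out.shape)  # (49, 32)
```

The attention matrix `A` is $K \times K$ rather than $N \times N$, which is exactly where the $O(K^2 d)$ cost comes from.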
4. Integration within Vision Transformer Pipelines
FilterViT is commonly instantiated in a MobileViT-style architecture, which interleaves convolutional feature processing and transformer encoding.
A canonical dataflow is:
- Extract feature map $X \in \mathbb{R}^{H \times W \times C}$.
- Compute saliency scores $s \in \mathbb{R}^{HW}$.
- For each input, select the top-$K$ indices.
- Gather selected tokens (add positional embeddings).
- Process via transformer encoder.
- Scatter transformed tokens back to their original positions in the feature map.
- Continue downstream.
This process preserves the spatial topology of features for downstream dense prediction or classification heads, and enables token selection to adapt dynamically per sample.
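The gather/scatter bookkeeping in this dataflow can be sketched as follows. The saliency scores are random and the "transformer encoder" is a trivial placeholder; only the index handling is the point:

```python
import numpy as np

rng = np.random.default_rng(2)

H, W, C, K = 14, 14, 64, 49
X = rng.standard_normal((H, W, C))
tokens = X.reshape(H * W, C)                 # flatten spatial grid to tokens

# Hypothetical saliency scores and top-K index selection
s = rng.random(H * W)
idx = np.argsort(s)[-K:]

selected = tokens[idx]                       # gather the K salient tokens
transformed = selected * 2.0                 # placeholder for the transformer encoder

# Scatter transformed tokens back to their original positions;
# unselected positions keep their original features
out = tokens.copy()
out[idx] = transformed
out = out.reshape(H, W, C)                   # spatial topology preserved

print(out.shape)  # (14, 14, 64)
```

Because only the selected rows are overwritten, the output retains a full $H \times W \times C$ layout that downstream dense-prediction or classification heads can consume unchanged.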
5. Interpretability Attributes via Saliency Masks
The saliency map or its binarized top- mask can be upsampled to the input resolution and visualized as a heatmap. Early-layer masks typically highlight fundamental image structures (edges, simple shapes), mid-layer masks concentrate on principal objects or regions of interest, and deeper layers emphasize context or scene background.
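Upsampling a binary top-$K$ mask for visualization is straightforward; a nearest-neighbor sketch (toy mask, assumed 14×14 feature grid and 224×224 input) is:

```python
import numpy as np

# Binary top-K mask at feature resolution (14x14), upsampled to the input
# resolution by nearest-neighbor repetition for display as a heatmap overlay
mask = np.zeros((14, 14))
mask[4:9, 5:10] = 1.0          # toy mask highlighting one region

scale = 224 // 14              # 16x upsampling to a 224x224 input
heatmap = np.kron(mask, np.ones((scale, scale)))

print(heatmap.shape)           # (224, 224)
```

A soft saliency map $s$ can be upsampled the same way and rendered as a continuous heatmap instead of a binary overlay.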
The top-$K$ mask serves as a direct visual explanation of model focus, obviating the need for post-hoc interpretability methods such as Grad-CAM or attention rollout. These built-in masks have been shown qualitatively to coincide with interpretable and semantically meaningful regions, enhancing transparency of the inference process (Sun, 2024).
6. Empirical Evaluation and Comparative Performance
FilterViT achieves a strong accuracy-efficiency tradeoff. On a subset of ImageNet-1K (img-100), FilterViT attains 86.1% top-1 accuracy with only 1.89M parameters, outperforming MobileNetV2 (83.3% at 2.35M) and MobileViT-S (85.1% at 5.00M) in accuracy per parameter. GPU throughput is competitive at 410 FPS (CUDA), while CPU throughput (5.6 FPS) trails the convolutional baselines.
An ablation variant, DropoutViT, randomly selects $K$ pixels per step instead of using learned scores. This leads to slightly faster initial convergence but lower final accuracy, indicating that learned masking is essential for optimal performance.
| Model | #Params (M) | FPS (CPU) | FPS (CUDA) | Top-1 Acc (%) |
|---|---|---|---|---|
| MobileNetV2 | 2.35 | 11.29 | 848 | 83.3 |
| MobileViT-S | 5.00 | 6.14 | 485 | 85.1 |
| Tiny-ViT | 10.59 | 6.19 | 524 | 87.6 |
| FilterViT | 1.89 | 5.60 | 410 | 86.1 |
7. Comparative Context: Token Filtering in Transformers
The core principle underlying FilterViT—namely, accelerating attention via dynamic or static token filtering—has been independently explored in related research. Attention-aware Token Filtering (ATF) (Naruko et al., 2 Jun 2025) interposes a frozen token filter outside the encoder, based on statically computed attention maps and/or dynamic object detectors, yielding a speed-up with negligible retrieval-accuracy loss in zero-shot image retrieval. ATF, however, does not integrate the filter with trainable per-input saliency and retains the original transformer graph unmodified.
Meanwhile, FViT (Shi et al., 2024) achieves complexity reduction by replacing self-attention with fully-convolutional blocks inspired by Gabor filters and neuroscience-motivated feed-forward designs, differing from FilterViT’s approach of sparse but explicit QKV attention.
A plausible implication is that the interplay between learnable filter blocks and attention modules, as instantiated by FilterViT, offers greater accuracy/runtime trade-off flexibility than static or hand-crafted token filtering. Deeper integration of filtering and transformer attention appears essential for maintaining context sensitivity and interpretability across vision tasks.