FilterViT: Efficient Vision Transformer
- FilterViT is a Vision Transformer variant that incorporates a Filter Block for dynamic token screening, significantly reducing computational cost.
- It uses a lightweight per-pixel saliency scoring mechanism and top-K selection to transform dense attention into a sparse, efficient process.
- Empirical evaluations show that FilterViT achieves high accuracy with fewer parameters and built-in saliency maps for improved interpretability.
FilterViT is a Vision Transformer (ViT) variant designed to address the computational bottleneck of standard attention in high-resolution feature maps. It selectively filters the most salient image positions for attention through a lightweight scoring mechanism, resulting in a marked reduction in the number of tokens processed at each attention block. This sparse attention formulation maintains high accuracy, yields interpretable saliency masks, and significantly improves parameter and computational efficiency compared to fully-dense ViT models and lightweight convolutional baselines (Sun, 2024).
1. Architectural Foundations
FilterViT is constructed by interposing a “Filter Block” immediately preceding its attention (QKV) stages. The typical ViT attends densely over all $N$ tokens, where $N$ is the number of spatial positions in a high-resolution feature map. This results in a computational and memory burden quadratic in $N$. FilterViT addresses this by evaluating the importance of each pixel (token), sorting these scores, and retaining only the top-$K$ tokens for attention, where $K \ll N$ and $K$ may vary by layer.
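A back-of-the-envelope sketch makes the quadratic saving concrete. The 56×56 resolution and the 25% retention ratio below are illustrative assumptions, not figures from the paper:

```python
# Query-key pair counts for dense vs. filtered attention at a 56x56 feature map
N = 56 * 56          # 3136 spatial positions (illustrative resolution)
K = N // 4           # hypothetical top-K retention of 25%

dense_pairs = N * N  # pairs scored by full self-attention
sparse_pairs = K * K # pairs scored after top-K filtering

print(dense_pairs // sparse_pairs)  # -> 16, i.e. a 16x reduction in pair count
```

Since attention cost scales with the number of query-key pairs, a retention ratio of $\rho$ cuts the per-layer attention cost by roughly a factor of $1/\rho^2$.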
The pipeline can be abstracted as:
- Input image → CNN feature extractor → Filter Block (scoring and selection) → Transformer attention on top-$K$ tokens → Sparse outputs merged into the full map → Downstream heads.
Compared to complete token filtering strategies that precede the encoder entirely (Naruko et al., 2 Jun 2025) and to convolutional filter hybrids (Shi et al., 2024), FilterViT integrates the filtering operation tightly with QKV attention—enabling dynamic relevance estimation and attentional interpretability within the transformer backbone.
2. The Filter Block: Scoring and Top-K Selection
Let $X \in \mathbb{R}^{N \times C}$ denote a feature map flattened over its $N = H \times W$ spatial positions (omitting batch size for clarity), where $C$ is the channel dimension. The filter mechanism operates as follows:
- Per-pixel saliency scoring: Each token $x_i \in \mathbb{R}^C$ is assigned a scalar score via a lightweight parametric function
$$s_i = \sigma\!\left(w^{\top} x_i + b\right), \qquad i = 1, \dots, N,$$
where $\sigma$ denotes the sigmoid activation and $w \in \mathbb{R}^{C}$, $b \in \mathbb{R}$ are learned parameters. Each $s_i$ reflects the predicted saliency of pixel $i$.
- Sorting and selecting: The scores $\{s_i\}$ are sorted, and the indices of the top-$K_\ell$ entries (layer-dependent) are retained. Let $\rho_\ell \in (0, 1]$ be a predefined retention ratio, so that
$$K_\ell = \left\lceil \rho_\ell \, N_\ell \right\rceil,$$
where $N_\ell$ is the spatial size at layer $\ell$. Typically, $\rho_\ell$ decreases in deeper layers as higher-level semantics become more sparse.
- Feature masking and gathering: A binary mask $m \in \{0, 1\}^{N}$ is constructed such that $m_i = 1$ if $i$ is among the top-$K$ indices, else $m_i = 0$. The retained features are gathered and augmented with their respective positional embeddings.
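The scoring, selection, and masking steps above can be sketched in NumPy. The sizes, the random feature map, and the random scoring vector `w` are stand-ins for learned quantities; this is a minimal illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N spatial positions, C channels, retention ratio rho
N, C, rho = 196, 64, 0.25
K = int(np.ceil(rho * N))                 # K = ceil(rho * N) = 49

X = rng.standard_normal((N, C))           # flattened feature map (random stand-in)
w, b = rng.standard_normal(C), 0.0        # scoring parameters (learned in practice)

# Per-pixel saliency: s_i = sigmoid(w . x_i + b)
s = 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Sort scores and keep the indices of the K largest
topk_idx = np.argsort(s)[-K:]

# Binary mask: m_i = 1 iff position i is retained
m = np.zeros(N, dtype=int)
m[topk_idx] = 1

X_k = X[topk_idx]                         # gathered tokens passed on to attention
print(X_k.shape, m.sum())                 # (49, 64) 49
```

In a real model the positional embeddings of the selected indices would be added to `X_k` before attention, as described above.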
3. Computational Complexity and Attention Modification
Attention is then computed exclusively over the filtered sequence $X_K \in \mathbb{R}^{K \times C}$. Standard QKV projections are performed:
$$Q = X_K W_Q, \qquad K' = X_K W_K, \qquad V = X_K W_V,$$
with $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ and $Q, K', V \in \mathbb{R}^{K \times d}$.
Sparse self-attention is then given by
$$\mathrm{Attention}(Q, K', V) = \mathrm{softmax}\!\left(\frac{Q K'^{\top}}{\sqrt{d}}\right) V.$$
The computational cost per layer shifts from $O(N^2 d)$ for full attention to $O(K^2 d)$, realizing a substantial reduction, especially at higher resolutions. The total attention cost across layers is $O\!\left(\sum_\ell K_\ell^2 d\right)$, as opposed to $O\!\left(\sum_\ell N_\ell^2 d\right)$ in a dense ViT.
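The filtered attention step can be sketched as follows. Random matrices stand in for the learned projections, and the sizes ($K = 49$ retained tokens, $C = 64$, $d = 32$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

K_tok, C, d = 49, 64, 32                   # retained tokens, channel dim, head dim
X_k = rng.standard_normal((K_tok, C))      # filtered token sequence (stand-in)

# Projections W_Q, W_K, W_V in R^{C x d} (random stand-ins for learned weights)
W_q, W_k, W_v = (rng.standard_normal((C, d)) for _ in range(3))
Q, Kmat, V = X_k @ W_q, X_k @ W_k, X_k @ W_v

# softmax(Q K'^T / sqrt(d)) V, computed over the K retained tokens only
logits = Q @ Kmat.T / np.sqrt(d)           # (K, K) instead of (N, N)
A = np.exp(logits - logits.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
out = A @ V

print(out.shape)  # (49, 32)
```

The attention matrix `A` is $K \times K$ rather than $N \times N$, which is exactly where the $O(K^2 d)$ cost comes from.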
4. Integration within Vision Transformer Pipelines
FilterViT is commonly instantiated in a MobileViT-style architecture, which interleaves convolutional feature processing and transformer encoding.
A canonical dataflow is:
- Extract feature map $X \in \mathbb{R}^{H \times W \times C}$.
- Compute saliency scores $s \in \mathbb{R}^{HW}$.
- For each input, select the top-$K$ indices.
- Gather selected tokens (add positional embeddings).
- Process via transformer encoder.
- Scatter transformed tokens back to their original positions in the feature map.
- Continue downstream.
This process preserves the spatial topology of features for downstream dense prediction or classification heads, and enables token selection to adapt dynamically per sample.
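The gather/scatter bookkeeping in this dataflow can be sketched as follows. The saliency scores are random and the "transformer encoder" is a trivial placeholder; only the index handling is the point:

```python
import numpy as np

rng = np.random.default_rng(2)

H, W, C, K = 14, 14, 64, 49
X = rng.standard_normal((H, W, C))
tokens = X.reshape(H * W, C)                 # flatten spatial grid to tokens

# Hypothetical saliency scores and top-K index selection
s = rng.random(H * W)
idx = np.argsort(s)[-K:]

selected = tokens[idx]                       # gather the K salient tokens
transformed = selected * 2.0                 # placeholder for the transformer encoder

# Scatter transformed tokens back to their original positions;
# unselected positions keep their original features
out = tokens.copy()
out[idx] = transformed
out = out.reshape(H, W, C)                   # spatial topology preserved

print(out.shape)  # (14, 14, 64)
```

Because only the selected rows are overwritten, the output retains a full $H \times W \times C$ layout that downstream dense-prediction or classification heads can consume unchanged.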
5. Interpretability Attributes via Saliency Masks
The saliency map or its binarized top- mask can be upsampled to the input resolution and visualized as a heatmap. Early-layer masks typically highlight fundamental image structures (edges, simple shapes), mid-layer masks concentrate on principal objects or regions of interest, and deeper layers emphasize context or scene background.
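Upsampling a binary top-$K$ mask for visualization is straightforward; a nearest-neighbor sketch (toy mask, assumed 14×14 feature grid and 224×224 input) is:

```python
import numpy as np

# Binary top-K mask at feature resolution (14x14), upsampled to the input
# resolution by nearest-neighbor repetition for display as a heatmap overlay
mask = np.zeros((14, 14))
mask[4:9, 5:10] = 1.0          # toy mask highlighting one region

scale = 224 // 14              # 16x upsampling to a 224x224 input
heatmap = np.kron(mask, np.ones((scale, scale)))

print(heatmap.shape)           # (224, 224)
```

A soft saliency map $s$ can be upsampled the same way and rendered as a continuous heatmap instead of a binary overlay.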
The top-$K$ mask serves as a direct visual explanation of model focus, obviating the need for post-hoc interpretability methods such as Grad-CAM or attention rollout. These built-in masks have been shown qualitatively to coincide with interpretable and semantically meaningful regions, enhancing transparency of the inference process (Sun, 2024).
6. Empirical Evaluation and Comparative Performance
FilterViT achieves a strong accuracy-efficiency tradeoff. On a subset of ImageNet-1K (img-100), FilterViT attains 86.1% top-1 accuracy with only 1.89M parameters, outperforming MobileNetV2 (83.3% at 2.35M) and MobileViT-S (85.1% at 5.00M) in accuracy per parameter. GPU throughput is competitive at 410 FPS (CUDA), while CPU throughput (5.6 FPS) trails the convolutional baselines.
An ablation variant, DropoutViT, randomly selects $K$ pixels per step instead of using learned scores. This leads to slightly faster initial convergence but lower final accuracy, indicating that learned masking is essential for optimal performance.
| Model | #Params (M) | FPS (CPU) | FPS (CUDA) | Top-1 Acc (%) |
|---|---|---|---|---|
| MobileNetV2 | 2.35 | 11.29 | 848 | 83.3 |
| MobileViT-S | 5.00 | 6.14 | 485 | 85.1 |
| Tiny-ViT | 10.59 | 6.19 | 524 | 87.6 |
| FilterViT | 1.89 | 5.60 | 410 | 86.1 |
7. Comparative Context: Token Filtering in Transformers
The core principle underlying FilterViT—namely, accelerating attention via dynamic or static token filtering—has been independently explored in related research. Attention-aware Token Filtering (ATF) (Naruko et al., 2 Jun 2025) interposes a frozen token filter outside the encoder, based on statically computed attention maps and/or dynamic object detectors, yielding a speed-up with negligible retrieval-accuracy loss in zero-shot image retrieval. ATF, however, does not integrate the filter with trainable per-input saliency and retains the original transformer graph unmodified.
Meanwhile, FViT (Shi et al., 2024) achieves complexity reduction by replacing self-attention with fully-convolutional blocks inspired by Gabor filters and neuroscience-motivated feed-forward designs, differing from FilterViT’s approach of sparse but explicit QKV attention.
A plausible implication is that the interplay between learnable filter blocks and attention modules, as instantiated by FilterViT, offers greater accuracy/runtime trade-off flexibility than static or hand-crafted token filtering. Deeper integration of filtering and transformer attention appears essential for maintaining context sensitivity and interpretability across vision tasks.