
Co-Me: Confidence-Guided Token Merging

Updated 20 November 2025
  • The paper introduces a lightweight confidence predictor distilled from a frozen ViT encoder to rank patch embeddings and merge low-confidence tokens.
  • The token merging scheme groups tokens spatially and averages low-confidence groups before each attention or MLP block, reducing computational complexity.
  • Empirical results demonstrate speedups up to 11.3× with minimal accuracy loss, enabling practical real-time high-resolution 3D perception.

Confidence-Guided Token Merging (Co-Me) is an acceleration mechanism for visual geometric transformers built on Vision Transformer (ViT) backbones that reduces inference latency by selectively merging low-confidence tokens, without requiring any retraining or finetuning of the base model. Co-Me employs a distilled confidence predictor to identify which tokens are least critical to downstream geometric prediction, merging these before each attention or MLP block and restoring original shapes for dense prediction. This strategy enables substantial reductions in both computational complexity and runtime, making high-resolution, long-sequence 3D perception practical in real-time systems while preserving accuracy in depth, pose, and reconstruction tasks (Chen et al., 18 Nov 2025).

1. Motivation in Visual Geometric Transformers

Modern single-pass 3D reconstruction frameworks such as VGGT [Wang et al. 2024] and MapAnything [Keetha et al. 2025] are based on Vision Transformer (ViT) architectures. These ingest sequences of image tokens (patch embeddings) originating from multi-view or video-frame stacks, and produce dense outputs including depth maps, point clouds, and camera pose estimates. The self-attention mechanism in ViTs incurs quadratic computational complexity, $\mathcal{O}(N^2 d)$, where $N$ is the number of tokens and $d$ is the feature dimension. As $N$ grows with higher spatial resolutions or longer frame sequences, inference latency rapidly becomes impractical for real-time or edge deployments.

Efforts to reduce $N$, the sequence length processed by attention layers, are thus central to enabling fast 3D perception. Prior approaches, token pruning (e.g., DynamicViT, A-ViT) and similarity-based merging (e.g., ToMe, FastVGGT), have limitations: pruning methods typically incur severe spatial information loss and require retraining, while similarity-based merging only modestly reduces compute and does not align with the regions of geometric interest highlighted by transformer attention. Notably, similarity heuristics fail to capture task-relevant coverage except at extreme sequence lengths.

2. Confidence Predictor Architecture and Training

Co-Me introduces a lightweight “confidence predictor,” $f'$, distilled from the frozen ViT encoder. The overall model is denoted $\mathcal{F} = f_2 \circ f_1$, with the predictor inserted after layer 15 of the encoder stack. The architecture of $f'$ is as follows (a minimal sketch follows the list):

  • A single-layer MLP projects patch embeddings to a low-dimensional latent space.
  • A single-head self-attention module aggregates spatial context across the latent tokens.
  • A compact Conv2D “head” produces per-patch confidences $\mathcal{C}' \in \mathbb{R}^{H\times W}$. This module introduces approximately 0.2% overhead relative to full ViT runtime.
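
A minimal PyTorch sketch of such a predictor is given below. The single-layer MLP, single-head self-attention, and Conv2D head follow the description above, but the embedding dimension, latent dimension, kernel size, and patch-grid shape are illustrative assumptions, not the paper's reported hyperparameters.

import torch
import torch.nn as nn

class ConfidencePredictor(nn.Module):
    """Lightweight confidence head distilled from a frozen ViT encoder.

    Dimensions below are illustrative assumptions for this sketch.
    """
    def __init__(self, embed_dim=1024, latent_dim=64, grid_hw=(32, 32)):
        super().__init__()
        self.grid_hw = grid_hw                                # patch grid (H, W), N = H * W
        self.proj = nn.Linear(embed_dim, latent_dim)          # single-layer MLP projection
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=1, batch_first=True)
        self.head = nn.Conv2d(latent_dim, 1, kernel_size=3, padding=1)  # compact Conv2D head

    def forward(self, z):                                     # z: (B, N, embed_dim) patch embeddings
        h = self.proj(z)                                      # (B, N, latent_dim)
        h, _ = self.attn(h, h, h)                             # single-head self-attention over tokens
        H, W = self.grid_hw
        h = h.transpose(1, 2).reshape(z.size(0), -1, H, W)    # fold tokens back onto the patch grid
        return self.head(h).squeeze(1)                        # (B, H, W) per-patch confidences C'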

Training the predictor avoids any retraining of $\mathcal{F}$. Instead, $f'$ is supervised to reproduce the ranking of confidences from the original model’s output. For each patch $i$, the average downstream confidence from $\mathcal{C} = f_2(f_1(x))$ is $\bar{c}_i$; for the predictor, $c_i' = \mathrm{Avg}(\mathcal{C}')_i$. The set of ordered pairs is

$$\mathcal{P} = \{(i, j) \mid \bar{c}_i > \bar{c}_j\}.$$

The training objective is a logistic ranking loss:

$$\mathcal{L}_{\text{conf}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j)\in\mathcal{P}} \log\bigl(1+\exp(c_j' - c_i')\bigr).$$

This loss emphasizes correct ordering rather than magnitude, aligning predictor outputs with the subset of patches least relied upon by the transformer’s attention. Empirically, high-confidence patch predictions localize to image regions rich in geometric information (texture, stable cues), whereas low-confidence outputs correspond to background or occlusion areas.
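
A minimal sketch of this loss, assuming targets $\bar{c}$ from the frozen model and predictions $c'$ from $f'$, both flattened to (B, N) tensors; pairs are enumerated densely here for clarity, whereas a practical implementation might subsample them.

import torch
import torch.nn.functional as F

def confidence_ranking_loss(c_pred, c_target):
    """Pairwise logistic ranking loss over ordered pairs with c_target[i] > c_target[j].

    c_pred, c_target: (B, N) per-patch confidences from the predictor / frozen model.
    """
    # All pairwise differences within each sample: diff[b, i, j] = c[b, i] - c[b, j]
    diff_target = c_target.unsqueeze(2) - c_target.unsqueeze(1)
    diff_pred = c_pred.unsqueeze(2) - c_pred.unsqueeze(1)
    pair_mask = diff_target > 0                   # the ordered pair set P
    # log(1 + exp(c'_j - c'_i)) == softplus(-(c'_i - c'_j))
    pair_loss = F.softplus(-diff_pred)
    return pair_loss[pair_mask].mean()            # average over |P|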

3. Token Merging Scheme

Tokens are grouped into $N/n$ contiguous spatial groups $G_1,\dots,G_{N/n}$ of size $n$ (typically $n=4$). For each group, the group-average confidence is computed:

$$\tilde{c}_i = \frac{1}{n}\sum_{k\in G_i} c_k'.$$

Given a merge ratio $p \in (0,1)$, the $p$-fraction of groups with the lowest $\tilde{c}_i$ are marked for merging, i.e., those with $\tilde{c}_i < \tau$, where $\tau$ is the $p$-th percentile of group confidences.

Prior to each attention or MLP block in $f_2$, tokens within each group are merged as follows:

  • For groups marked to be merged ($m_i=1$): replace $G_i$ with its mean, $G_i \gets \{\tfrac{1}{n}\sum_{x\in G_i} x\}$.
  • For unmerged groups ($m_i=0$): preserve the original tokens.

The resulting tokens are concatenated across groups and passed through the block. Afterwards, the sequence is split back to the original length: each merged token is replicated $n$ times, while unmerged groups pass through unchanged.

This “merge and split” paradigm preserves full spatial coverage, ensuring all downstream heads operate on the original token count. The process is entirely differentiable and does not change the shape of the downstream dense prediction outputs.

Inference Pseudocode

# Stage 1: run the first encoder stage and score patches with the distilled predictor
z = f1(x)                                        # patch embeddings (N tokens)
c_pred = f_conf(z)                               # per-patch confidences (f' in the text)

# Stage 2: group tokens spatially and mark the lowest-confidence groups for merging
group_conf = [mean(c_pred[i*n:(i+1)*n]) for i in range(N // n)]
tau = percentile(group_conf, p * 100)            # p-th percentile threshold
m = [conf < tau for conf in group_conf]          # per-group merge mask

# Stage 3: merge before every block of f2, then split back to the full length
h = z
for block in f2.blocks:
    # Merge: collapse each marked group to its mean token
    h_merged = []
    for i, m_i in enumerate(m):
        G_i = h[i*n:(i+1)*n]
        if m_i:
            h_merged.append(mean(G_i, axis=0))
        else:
            h_merged.extend(G_i)
    h_block = block(stack(h_merged))             # attention / MLP on the shorter sequence

    # Split: restore the original token count for the next block
    h, idx = [], 0
    for m_i in m:
        if m_i:
            h.extend([h_block[idx]] * n)         # replicate the merged token n times
            idx += 1
        else:
            h.extend(h_block[idx:idx + n])       # unmerged tokens pass through unchanged
            idx += n

y = decode(h)                                    # dense heads see the original token count
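
As a concrete, hypothetical illustration of the merge-and-split step with actual tensor operations, the PyTorch helpers below implement the same logic for a single block input of shape (N, d); the names and shapes are assumptions for this sketch, not the paper's released implementation.

import torch

def merge_tokens(h, m, n):
    """Collapse each group marked in m to its mean token; keep other groups intact.

    h: (N, d) token tensor; m: (N//n,) boolean merge mask; n: spatial group size.
    """
    groups = h.view(-1, n, h.size(-1))                      # (N//n, n, d)
    pieces = [g.mean(dim=0, keepdim=True) if do_merge else g
              for g, do_merge in zip(groups, m)]
    return torch.cat(pieces, dim=0)                         # shorter sequence for the block

def split_tokens(h_block, m, n):
    """Restore the original sequence length after the block."""
    out, idx = [], 0
    for merged in m:
        if merged:
            out.append(h_block[idx:idx + 1].expand(n, -1))  # replicate merged token n times
            idx += 1
        else:
            out.append(h_block[idx:idx + n])                # unmerged group passes through
            idx += n
    return torch.cat(out, dim=0)                            # back to (N, d)

# Example: N = 16 tokens, d = 8, groups of n = 4, two groups marked for merging
h = torch.randn(16, 8)
m = [True, False, True, False]
h_short = merge_tokens(h, m, 4)       # shape (10, 8): 2 merged + 2*4 unmerged tokens
h_full = split_tokens(h_short, m, 4)  # shape (16, 8): original length restored

In the full pipeline these helpers would wrap every attention and MLP block of $f_2$, as in the pseudocode above.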

4. Computational Gains and Theoretical Speedup

The effective token count after merging is

$$N' = (1-p)N + p\,\frac{N}{n} = N\left(1-p+\frac{p}{n}\right).$$

Attention cost is quadratic, so theoretical speedup over standard attention is

$$S_{\text{attn}} = \frac{N^2}{(N')^2} = \frac{1}{\left[1-p+\frac{p}{n}\right]^2}.$$

MLP cost is linear:

$$S_{\text{mlp}} = \frac{N}{N'} = \frac{1}{1-p+\frac{p}{n}}.$$

Overall speedup for the whole model lies between these bounds, with negligible predictor overhead. As $p \to 1$, $S_{\text{attn}} \approx n^2$ and $S_{\text{mlp}} \approx n$. Empirical results indicate optimal trade-offs occur at $p = 0.5$–$0.7$ and $n = 4$.
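
For a quick numerical check of these bounds, the following minimal sketch evaluates the two formulas above at a few merge ratios; it uses only the stated expressions and does not reproduce the paper's measured end-to-end latencies.

def co_me_speedup(p, n):
    """Theoretical attention and MLP speedups for merge ratio p and group size n."""
    retained = 1 - p + p / n                 # N' / N, fraction of tokens kept
    return 1 / retained**2, 1 / retained     # (S_attn, S_mlp)

# Operating points around the reported sweet spot p = 0.5–0.7 with n = 4:
for p in (0.5, 0.7, 0.9):
    s_attn, s_mlp = co_me_speedup(p, n=4)
    print(f"p={p}: S_attn ~ {s_attn:.2f}x, S_mlp ~ {s_mlp:.2f}x")
# p=0.5: S_attn ~ 2.56x, S_mlp ~ 1.60x
# p=0.7: S_attn ~ 4.43x, S_mlp ~ 2.11x
# p=0.9: S_attn ~ 9.47x, S_mlp ~ 3.08x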

5. Evaluation Protocols and Empirical Results

Co-Me was validated on a comprehensive suite of vision benchmarks, including:

  • Monocular depth (NYUv2, ETH3D), multi-view depth (DTU-MVS, KITTI)
  • Pose estimation (DTU, RealEstate-10K)
  • Point-cloud reconstruction (DTU, ETH3D)
  • Online streaming (StreamVGGT)

Key baselines included VGGT with FlexAttention, MapAnything, FastVGGT, and a “Merge by Sim” bottom-$p$ cosine-similarity approach.

Notable results:

  • DTU multi-view depth (32 frames): latency reduced from 8.8 s (VGGT) to 3.15 s (Co-Me), a 2.79× speedup, with $\Delta$L1 ≈ +0.044 cm and $\Delta\delta_{1.25} < 0.001$.
  • Pose estimation (DTU, 32 frames): 8.56 s → 3.00 s, a 2.85× speedup, with negligible accuracy drop.
  • Point cloud (DTU): 8.74 s → 3.12 s (2.79×), no meaningful change in Chamfer distance.
  • Large-scale (VGGT, 512 frames): up to 11.3× speedup at $p=0.5$, and 26.65× at $p=0.9$.
  • Edge deployment (NVIDIA Jetson Thor): MapAnything with Co-Me runs 1.5× faster (3.5 FPS) with negligible quality loss.

Ablation studies demonstrated:

  • Best predictor placement after encoder layer 15.
  • Logistic ranking loss yields 2–3× higher merge-mask IoU than MSE.
  • Optimal group size $n=4$ for the speed/accuracy trade-off.
  • Merge (averaging) is superior to pick-one or drop-all.
  • Co-Me consistently outperforms similarity-based merging across the full speed/error Pareto curve.

| Task/Setting | VGGT (s) | Co-Me (s) | Speedup (×) | Accuracy drop |
|---|---|---|---|---|
| Multi-view depth (DTU, 32f) | 8.8 | 3.15 | 2.79 | ΔL1 ≈ 0.044 cm |
| Pose (DTU, 32f) | 8.56 | 3.00 | 2.85 | ΔAUC < 0.003 |
| Point cloud (DTU) | 8.74 | 3.12 | 2.79 | No drop |
| Large-scale (512f) | — | — | 11.3 at p = 0.5 | — |

6. Integration, Limitations, and Future Prospects

Co-Me can be incorporated into existing visual geometric transformer pipelines with no retraining or architectural modifications to the ViT backbone. The mechanism is orthogonal to other acceleration strategies such as efficient attention kernels and quantization, and is compatible with streaming and multi-view inference paradigms.

Limitations include a tendency to oversmooth thin, low-confidence structures (such as poles and fine leaves) due to group merging, a constraint of requiring a fixed batch-wide merge ratio pp, and the focus on spatial rather than temporal grouping. Future directions include per-sample adaptive merge ratios, time-dimension merging, and integration into the training loop for large-scale models.

Co-Me leverages a minimal-overhead distilled confidence predictor to rank redundancy in feature tokens, merging only those least influential to geometric predictions. Merging with averaging, followed by shape restoration, allows aggressive reduction in both quadratic-attention and linear-MLP computation, achieving up to an order-of-magnitude speedup while sustaining nearly unchanged accuracy, without ViT retraining or architectural change (Chen et al., 18 Nov 2025).
