
Co-Me: Confidence-Guided Token Merging

Updated 20 November 2025
  • The paper introduces a lightweight confidence predictor distilled from a frozen ViT encoder to rank patch embeddings and merge low-confidence tokens.
  • The token merging scheme groups tokens spatially and averages low-confidence groups before each attention or MLP block, reducing computational complexity.
  • Empirical results demonstrate speedups up to 11.3× with minimal accuracy loss, enabling practical real-time high-resolution 3D perception.

Confidence-Guided Token Merging (Co-Me) is an acceleration mechanism for visual geometric transformers built on Vision Transformer (ViT) backbones that reduces inference latency by selectively merging low-confidence tokens, without requiring any retraining or finetuning of the base model. Co-Me employs a distilled confidence predictor to identify which tokens are least critical to downstream geometric prediction, merging these before each attention or MLP block and restoring original shapes for dense prediction. This strategy enables substantial reductions in both computational complexity and runtime, making high-resolution, long-sequence 3D perception practical in real-time systems while preserving accuracy in depth, pose, and reconstruction tasks (Chen et al., 18 Nov 2025).

1. Motivation in Visual Geometric Transformers

Modern single-pass 3D reconstruction frameworks such as VGGT [Wang et al. 2024] and MapAnything [Keetha et al. 2025] are based on Vision Transformer (ViT) architectures. These ingest sequences of image tokens (patch embeddings) originating from multi-view or video-frame stacks, and produce dense outputs including depth maps, point clouds, and camera pose estimates. The self-attention mechanism in ViTs incurs quadratic computational complexity, $\mathcal{O}(N^2 d)$, where $N$ is the number of tokens and $d$ is the feature dimension. As $N$ grows with higher spatial resolutions or longer frame sequences, inference latency rapidly becomes impractical for real-time or edge deployments.

Efforts to reduce $N$, the sequence length processed by attention layers, are thus central to enabling fast 3D perception. Prior approaches, token pruning (e.g., DynamicViT, A-ViT) and similarity-based merging (e.g., ToMe, FastVGGT), have limitations: pruning methods typically incur severe spatial information loss and require retraining, while similarity-based merging only modestly reduces compute and does not align with the regions of geometric interest highlighted by transformer attention. Notably, similarity heuristics fail to capture task-relevant coverage except at extreme sequence lengths.

2. Confidence Predictor Architecture and Training

Co-Me introduces a lightweight “confidence predictor,” $f'$, distilled from the frozen ViT encoder. The overall model is denoted $\mathcal{F} = f_2 \circ f_1$, with the predictor inserted after layer 15 of the encoder stack. The architecture of $f'$ is as follows (a minimal sketch follows the list):

  • A single-layer MLP projects patch embeddings to a low-dimensional latent space.
  • A single-head self-attention module aggregates spatial context across the latent tokens.
  • A compact Conv2D “head” produces per-patch confidences $\mathcal{C}' \in \mathbb{R}^{H\times W}$. This module introduces approximately 0.2% overhead relative to full ViT runtime.
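
A minimal PyTorch sketch of such a predictor is given below. The single-layer MLP, single-head self-attention, and Conv2D head follow the description above, but the embedding dimension, latent dimension, kernel size, and patch-grid shape are illustrative assumptions, not the paper's reported hyperparameters.

import torch
import torch.nn as nn

class ConfidencePredictor(nn.Module):
    """Lightweight confidence head distilled from a frozen ViT encoder.

    Dimensions below are illustrative assumptions for this sketch.
    """
    def __init__(self, embed_dim=1024, latent_dim=64, grid_hw=(32, 32)):
        super().__init__()
        self.grid_hw = grid_hw                                # patch grid (H, W), N = H * W
        self.proj = nn.Linear(embed_dim, latent_dim)          # single-layer MLP projection
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=1, batch_first=True)
        self.head = nn.Conv2d(latent_dim, 1, kernel_size=3, padding=1)  # compact Conv2D head

    def forward(self, z):                                     # z: (B, N, embed_dim) patch embeddings
        h = self.proj(z)                                      # (B, N, latent_dim)
        h, _ = self.attn(h, h, h)                             # single-head self-attention over tokens
        H, W = self.grid_hw
        h = h.transpose(1, 2).reshape(z.size(0), -1, H, W)    # fold tokens back onto the patch grid
        return self.head(h).squeeze(1)                        # (B, H, W) per-patch confidences C'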

Training the predictor avoids any retraining of $\mathcal{F}$. Instead, $f'$ is supervised to reproduce the ranking of confidences from the original model’s output. For each patch $i$, the average downstream confidence from $\mathcal{C} = f_2(f_1(x))$ is $\bar{c}_i$; for the predictor, $c_i' = \mathrm{Avg}(\mathcal{C}')_i$. The set of ordered pairs is

$$\mathcal{P} = \{(i, j) \mid \bar{c}_i > \bar{c}_j\}.$$

The training objective is a logistic ranking loss:

$$\mathcal{L}_{\text{conf}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j)\in\mathcal{P}} \log\bigl(1+\exp(c_j' - c_i')\bigr).$$

This loss emphasizes correct ordering rather than magnitude, aligning predictor outputs with the subset of patches least relied upon by the transformer’s attention. Empirically, high-confidence patch predictions localize to image regions rich in geometric information (texture, stable cues), whereas low-confidence outputs correspond to background or occlusion areas.
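
A minimal sketch of this loss, assuming targets $\bar{c}$ from the frozen model and predictions $c'$ from $f'$, both flattened to (B, N) tensors; pairs are enumerated densely here for clarity, whereas a practical implementation might subsample them.

import torch
import torch.nn.functional as F

def confidence_ranking_loss(c_pred, c_target):
    """Pairwise logistic ranking loss over ordered pairs with c_target[i] > c_target[j].

    c_pred, c_target: (B, N) per-patch confidences from the predictor / frozen model.
    """
    # All pairwise differences within each sample: diff[b, i, j] = c[b, i] - c[b, j]
    diff_target = c_target.unsqueeze(2) - c_target.unsqueeze(1)
    diff_pred = c_pred.unsqueeze(2) - c_pred.unsqueeze(1)
    pair_mask = diff_target > 0                   # the ordered pair set P
    # log(1 + exp(c'_j - c'_i)) == softplus(-(c'_i - c'_j))
    pair_loss = F.softplus(-diff_pred)
    return pair_loss[pair_mask].mean()            # average over |P|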

3. Token Merging Scheme

Tokens are grouped into $N/n$ contiguous spatial groups $G_1,\dots,G_{N/n}$ of size $n$ (typically $n=4$). For each group, the group-average confidence is computed:

$$\tilde{c}_i = \frac{1}{n}\sum_{k\in G_i} c_k'.$$

Given a merge ratio $p \in (0,1)$, the $p$-fraction of groups with the lowest $\tilde{c}_i$ are marked for merging, i.e., those with $\tilde{c}_i < \tau$, where $\tau$ is the $p$-th percentile of group confidences.

Prior to each attention or MLP block in $f_2$, tokens within each group are merged as follows:

  • For groups marked to be merged ($m_i=1$): replace $G_i$ with its mean, $G_i \gets \{\tfrac{1}{n}\sum_{x\in G_i} x\}$.
  • For unmerged groups ($m_i=0$): preserve the original tokens.

The resulting tokens are concatenated across groups and passed through the block. Afterwards, the sequence is split back to the original length: each merged token is replicated $n$ times, while unmerged groups pass through unchanged.

This “merge and split” paradigm preserves full spatial coverage, ensuring all downstream heads operate on the original token count. The process is entirely differentiable and does not change the shape of the downstream dense prediction outputs.

Inference Pseudocode

# Stage 1: run the first encoder stage and score patches with the distilled predictor
z = f1(x)                                        # patch embeddings (N tokens)
c_pred = f_conf(z)                               # per-patch confidences (f' in the text)

# Stage 2: group tokens spatially and mark the lowest-confidence groups for merging
group_conf = [mean(c_pred[i*n:(i+1)*n]) for i in range(N // n)]
tau = percentile(group_conf, p * 100)            # p-th percentile threshold
m = [conf < tau for conf in group_conf]          # per-group merge mask

# Stage 3: merge before every block of f2, then split back to the full length
h = z
for block in f2.blocks:
    # Merge: collapse each marked group to its mean token
    h_merged = []
    for i, m_i in enumerate(m):
        G_i = h[i*n:(i+1)*n]
        if m_i:
            h_merged.append(mean(G_i, axis=0))
        else:
            h_merged.extend(G_i)
    h_block = block(stack(h_merged))             # attention / MLP on the shorter sequence

    # Split: restore the original token count for the next block
    h, idx = [], 0
    for m_i in m:
        if m_i:
            h.extend([h_block[idx]] * n)         # replicate the merged token n times
            idx += 1
        else:
            h.extend(h_block[idx:idx + n])       # unmerged tokens pass through unchanged
            idx += n

y = decode(h)                                    # dense heads see the original token count
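
As a concrete, hypothetical illustration of the merge-and-split step with actual tensor operations, the PyTorch helpers below implement the same logic for a single block input of shape (N, d); the names and shapes are assumptions for this sketch, not the paper's released implementation.

import torch

def merge_tokens(h, m, n):
    """Collapse each group marked in m to its mean token; keep other groups intact.

    h: (N, d) token tensor; m: (N//n,) boolean merge mask; n: spatial group size.
    """
    groups = h.view(-1, n, h.size(-1))                      # (N//n, n, d)
    pieces = [g.mean(dim=0, keepdim=True) if do_merge else g
              for g, do_merge in zip(groups, m)]
    return torch.cat(pieces, dim=0)                         # shorter sequence for the block

def split_tokens(h_block, m, n):
    """Restore the original sequence length after the block."""
    out, idx = [], 0
    for merged in m:
        if merged:
            out.append(h_block[idx:idx + 1].expand(n, -1))  # replicate merged token n times
            idx += 1
        else:
            out.append(h_block[idx:idx + n])                # unmerged group passes through
            idx += n
    return torch.cat(out, dim=0)                            # back to (N, d)

# Example: N = 16 tokens, d = 8, groups of n = 4, two groups marked for merging
h = torch.randn(16, 8)
m = [True, False, True, False]
h_short = merge_tokens(h, m, 4)       # shape (10, 8): 2 merged + 2*4 unmerged tokens
h_full = split_tokens(h_short, m, 4)  # shape (16, 8): original length restored

In the full pipeline these helpers would wrap every attention and MLP block of $f_2$, as in the pseudocode above.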

4. Computational Gains and Theoretical Speedup

The effective token count after merging is

$$N' = (1-p)N + p\,\frac{N}{n} = N\left(1-p+\frac{p}{n}\right).$$

Attention cost is quadratic, so theoretical speedup over standard attention is

$$S_{\text{attn}} = \frac{N^2}{(N')^2} = \frac{1}{\left[1-p+\frac{p}{n}\right]^2}.$$

MLP cost is linear:

$$S_{\text{mlp}} = \frac{N}{N'} = \frac{1}{1-p+\frac{p}{n}}.$$

Overall speedup for the whole model lies between these bounds, with negligible predictor overhead. As $p \to 1$, $S_{\text{attn}} \approx n^2$ and $S_{\text{mlp}} \approx n$. Empirical results indicate optimal trade-offs occur at $p = 0.5$–$0.7$ and $n = 4$.
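
For a quick numerical check of these bounds, the following minimal sketch evaluates the two formulas above at a few merge ratios; it uses only the stated expressions and does not reproduce the paper's measured end-to-end latencies.

def co_me_speedup(p, n):
    """Theoretical attention and MLP speedups for merge ratio p and group size n."""
    retained = 1 - p + p / n                 # N' / N, fraction of tokens kept
    return 1 / retained**2, 1 / retained     # (S_attn, S_mlp)

# Operating points around the reported sweet spot p = 0.5–0.7 with n = 4:
for p in (0.5, 0.7, 0.9):
    s_attn, s_mlp = co_me_speedup(p, n=4)
    print(f"p={p}: S_attn ~ {s_attn:.2f}x, S_mlp ~ {s_mlp:.2f}x")
# p=0.5: S_attn ~ 2.56x, S_mlp ~ 1.60x
# p=0.7: S_attn ~ 4.43x, S_mlp ~ 2.11x
# p=0.9: S_attn ~ 9.47x, S_mlp ~ 3.08x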

5. Evaluation Protocols and Empirical Results

Co-Me was validated on a comprehensive suite of vision benchmarks, including:

  • Monocular depth (NYUv2, ETH3D), multi-view depth (DTU-MVS, KITTI)
  • Pose estimation (DTU, RealEstate-10K)
  • Point-cloud reconstruction (DTU, ETH3D)
  • Online streaming (StreamVGGT)

Key baselines included VGGT with FlexAttention, MapAnything, FastVGGT, and a “Merge by Sim” bottom-$p$ cosine-similarity approach.

Notable results:

  • DTU multi-view depth (32 frames): latency reduced from 8.8 s (VGGT) to 3.15 s (Co-Me), a 2.79× speedup, with $\Delta$L1 ≈ +0.044 cm and $\Delta\delta_{1.25} < 0.001$.
  • Pose estimation (DTU, 32 frames): 8.56 s → 3.00 s, a 2.85× speedup, with negligible accuracy drop.
  • Point cloud (DTU): 8.74 s → 3.12 s (2.79×), no meaningful change in Chamfer distance.
  • Large-scale (VGGT, 512 frames): up to 11.3× speedup at $p=0.5$, and 26.65× at $p=0.9$.
  • Edge deployment (NVIDIA Jetson Thor): MapAnything with Co-Me runs 1.5× faster (3.5 FPS) with negligible quality loss.

Ablation studies demonstrated:

  • Best predictor placement after encoder layer 15.
  • Logistic ranking loss yields 2–3× higher merge-mask IoU than MSE.
  • Optimal group size $n=4$ for the speed/accuracy trade-off.
  • Merge (averaging) is superior to pick-one or drop-all.
  • Co-Me consistently outperforms similarity-based merging across the full speed/error Pareto curve.

| Task/Setting | VGGT (s) | Co-Me (s) | Speedup (×) | Accuracy drop |
|---|---|---|---|---|
| Multi-view depth (DTU, 32f) | 8.8 | 3.15 | 2.79 | ΔL1 ≈ 0.044 cm |
| Pose (DTU, 32f) | 8.56 | 3.00 | 2.85 | ΔAUC < 0.003 |
| Point cloud (DTU) | 8.74 | 3.12 | 2.79 | No drop |
| Large-scale (512f) | — | — | 11.3 at p = 0.5 | — |

6. Integration, Limitations, and Future Prospects

Co-Me can be incorporated into existing visual geometric transformer pipelines with no retraining or architectural modifications to the ViT backbone. The mechanism is orthogonal to other acceleration strategies such as efficient attention kernels and quantization, and is compatible with streaming and multi-view inference paradigms.

Limitations include a tendency to oversmooth thin, low-confidence structures (such as poles and fine leaves) due to group merging, a constraint of requiring a fixed batch-wide merge ratio pp, and the focus on spatial rather than temporal grouping. Future directions include per-sample adaptive merge ratios, time-dimension merging, and integration into the training loop for large-scale models.

Co-Me leverages a minimal-overhead distilled confidence predictor to rank redundancy in feature tokens, merging only those least influential to geometric predictions. Merging with averaging, followed by shape restoration, allows aggressive reduction in both quadratic-attention and linear-MLP computation, achieving up to an order-of-magnitude speedup while sustaining nearly unchanged accuracy, without ViT retraining or architectural change (Chen et al., 18 Nov 2025).
