Chain-of-Sight Sequence in Vision-Language Models
- Chain-of-Sight (CoS) is a structured paradigm that orders visual tokens to integrate 2D grounding with 3D attribute extraction.
- It employs explicit serializations, combining 2D boxes and detailed 3D cues in an autoregressive chain, to improve spatial reasoning and make efficient use of model context.
- CoS also accelerates MLLM training through multi-scale resampling, yielding significant time savings while matching or exceeding downstream accuracy.
Chain-of-Sight (CoS) sequences define a paradigm for vision-LLM interfaces that leverages structured token ordering and multi-scale feature reasoning. In both 3D detection and multimodal LLM (MLLM) acceleration, CoS introduces explicit serializations that blend 2D grounding, spatial inference, and context compression. The approach frames prediction as a disciplined next-token problem, imposing token orderings and factorization routines that match both computational efficiency and human visual reasoning patterns. The following sections detail the architectural design, token ordering mechanisms, mathematical foundations, operational pipelines, quantitative effects, and the rationale for CoS's stability and generalizability.
1. CoS Sequence Design and Architectural Placement
The core principle of CoS is to enforce a sequential, interpretable process when reasoning about objects or visual details. In LocateAnything3D, a vision-language decoder is taught to perform multi-object open-vocabulary 3D detection by emitting a short, explicit CoS sequence per object: first a 2D box, then a lifted set of 3D attributes (distance, size, pose). Each object's information flows through an autoregressive chain, formalized by the token sequence $\langle \text{box}_{2\mathrm{D}},\, \text{center}_{3\mathrm{D}},\, \text{dims},\, \text{rot} \rangle$. This modularity lets the decoder operate without explicit, specialized 3D heads; instead, all predictions emerge from unified token streams (Man et al., 25 Nov 2025).
For MLLM training acceleration, the CoS bridge module sits between a visual encoder (e.g., CLIP-ViT-L/14) and an LLM (e.g., Vicuna). Instead of forwarding a flat grid of visual tokens, CoS compresses and orders visual features using multi-scale Perceiver-style resamplers. Each resampler extracts global and local features through cross-attention, yielding a chain of tokens that, when concatenated, presents the LLM with increasingly detailed cues, mirroring the coarse-to-fine progression of psychophysical vision models (Huang et al., 22 Jul 2024).
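As an illustration, the following minimal PyTorch sketch shows one Perceiver-style resampler stage: learnable queries cross-attend to non-overlapping windows of the ViT feature grid, and stages at several window sizes are concatenated coarse-to-fine. The module structure, dimensions, and single attention layer are simplifying assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class WindowResampler(nn.Module):
    """Cross-attend learnable queries to each spatial window of a ViT grid."""
    def __init__(self, dim, queries_per_window, num_heads=8):
        super().__init__()
        self.q = nn.Parameter(torch.randn(queries_per_window, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, window):
        # feats: (B, H, W, C) -> non-overlapping window x window cells
        B, H, W, C = feats.shape
        gh, gw = H // window, W // window
        x = feats.reshape(B, gh, window, gw, window, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B * gh * gw, window * window, C)
        q = self.q.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)              # condense each window into Q tokens
        return out.reshape(B, gh * gw * self.q.shape[0], C)

# Chain coarse-to-fine: a global stage then a local stage, concatenated.
feats = torch.randn(2, 16, 16, 1024)             # hypothetical ViT-L feature grid
stages = [(16, WindowResampler(1024, 16)),       # 1 global window  -> 16 tokens
          (4,  WindowResampler(1024, 4))]        # 16 local windows -> 64 tokens
tokens = torch.cat([r(feats, w) for w, r in stages], dim=1)
print(tokens.shape)                              # torch.Size([2, 80, 1024])
```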
2. Token Ordering, Numeric Factorization, and Compound Scaling
In 3D detection, the CoS sequence imposes two ordering constraints:
- Inter-object order: Objects are serialized by near-to-far distance (increasing depth $z_i$), focusing model context on high-confidence, utility-rich regions first.
- Intra-object order: Attribute prediction proceeds as a block sequence—2D box, 3D center, dimensions, rotation—matching relative ease and learnability.
Explicitly, for each object $i$:
| Step | Tokens | Domain |
|---|---|---|
| 2D box | $(x_1, y_1, x_2, y_2)$ | Integer, [0, 1000] |
| 3D center | $(x, y, z)$ | Float, meters, 2-decimal |
| 3D dims | $(w, h, l)$ | Float, meters |
| Rotation | $\theta$ (or SO(3)) | Quantized angle |
Every quantity is discretized so that it can be emitted as ordinary vocabulary tokens with controlled entropy, and each token in the autoregressive chain is optimized with cross-entropy loss (Man et al., 25 Nov 2025).
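To make this serialization concrete, here is a minimal sketch assuming a simple textual token format: objects are sorted near-to-far, 2D boxes are quantized to integer bins in [0, 1000], and metric values are rounded to two decimals. The tag names (`<obj>`, `<box>`, ...) and field layout are illustrative assumptions, not the paper's literal vocabulary.

```python
def quantize_box(box, img_w, img_h, bins=1000):
    """Map a pixel-space box (x1, y1, x2, y2) to integer bins in [0, bins]."""
    x1, y1, x2, y2 = box
    return (round(x1 / img_w * bins), round(y1 / img_h * bins),
            round(x2 / img_w * bins), round(y2 / img_h * bins))

def serialize_cos(objects, img_w, img_h):
    """Emit a CoS-style string: near-to-far objects, each serialized as
    2D box -> 3D center -> 3D dims -> rotation (illustrative format)."""
    parts = []
    for obj in sorted(objects, key=lambda o: o["center"][2]):  # near-to-far by depth
        bx = quantize_box(obj["box2d"], img_w, img_h)
        cx, cy, cz = obj["center"]
        w, h, l = obj["dims"]
        parts.append(f"<obj> {obj['name']} <box> {bx[0]} {bx[1]} {bx[2]} {bx[3]} "
                     f"<ctr> {cx:.2f} {cy:.2f} {cz:.2f} "
                     f"<dim> {w:.2f} {h:.2f} {l:.2f} <rot> {obj['yaw']:.2f}")
    return " ".join(parts)

# Example: a car at 8.5 m is serialized before a pedestrian at 12.1 m.
objs = [
    {"name": "pedestrian", "box2d": (400, 180, 430, 260),
     "center": (1.2, 0.9, 12.1), "dims": (0.6, 1.7, 0.5), "yaw": 0.10},
    {"name": "car", "box2d": (120, 200, 380, 330),
     "center": (-2.0, 1.1, 8.5), "dims": (1.8, 1.5, 4.3), "yaw": 1.57},
]
print(serialize_cos(objs, img_w=1242, img_h=375))
```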
For MLLM pre-training, CoS utilizes a compound token scaling scheme:
- During pre-training, visual tokens are minimized via large resampler windows and coarse partitioning.
- Upon fine-tuning, input resolution and the granularity of resampler windows are increased, expanding the visual token count by roughly $16\times$ (from 80 tokens during pre-training to as many as 1,296).
- The total number of tokens generated is governed by $N = \sum_{s=1}^{S} Q_s$, and post-scaling, $N' = (r \cdot m)^2 \sum_{s=1}^{S} Q_s$, where $r$ is the resolution scaling factor, $m$ is the window scaling factor, and $Q_s$ is the number of queries per scale (Huang et al., 22 Jul 2024); the sketch after this list makes the arithmetic concrete.
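A small arithmetic helper illustrates the compound scaling; the per-scale query counts and scaling factors are illustrative placeholders, not the paper's exact configuration.

```python
def cos_tokens(queries_per_scale):
    """Total visual tokens: sum of condensed query tokens across scales."""
    return sum(queries_per_scale)

def scaled_tokens(queries_per_scale, r, m):
    """Compound scaling: resolution factor r and window factor m each multiply
    the number of windows per side, so the token count grows by (r * m) ** 2."""
    return (r * m) ** 2 * cos_tokens(queries_per_scale)

pre = cos_tokens([16, 64])                 # hypothetical: 16 global + 64 local = 80
post = scaled_tokens([16, 64], r=2, m=2)   # 16x growth after the scale-up
print(pre, post)                           # 80 1280 (published counts differ slightly)
```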
3. Mathematical Foundations of CoS
A central tenet of CoS for 3D detection is the use of pinhole camera projection to convert 3D coordinates into quantized 2D image positions:
- For camera intrinsic matrix $K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$, the image-plane coordinates of a 3D point $(X, Y, Z)$ are $u = f_x X / Z + c_x$ and $v = f_y Y / Z + c_y$.
- 2D bounding box tokens are formed by quantizing the projected cuboid corners.
- Depth is directly predicted as part of the token stream, not inferred from ancillary depth maps. Training rounds depth values to two decimals, and all numeric tokens are optimized via cross-entropy (Man et al., 25 Nov 2025). The sketch after this list illustrates the projection-and-quantization step.
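A minimal NumPy sketch, assuming placeholder intrinsics and the 1,000-bin quantization grid from the table in Section 2:

```python
import numpy as np

def project_and_quantize(points_3d, K, img_w, img_h, bins=1000):
    """Pinhole-project N x 3 camera-frame points and quantize to [0, bins]."""
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = K[0, 0] * X / Z + K[0, 2]            # u = f_x * X / Z + c_x
    v = K[1, 1] * Y / Z + K[1, 2]            # v = f_y * Y / Z + c_y
    ub = np.clip(np.round(u / img_w * bins), 0, bins).astype(int)
    vb = np.clip(np.round(v / img_h * bins), 0, bins).astype(int)
    return np.stack([ub, vb], axis=1)

K = np.array([[721.5, 0.0, 609.6],           # placeholder KITTI-like intrinsics
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
corners = np.array([[-0.9, 0.3, 8.5],        # two cuboid corners, in meters
                    [0.9, -1.2, 8.5]])
print(project_and_quantize(corners, K, img_w=1242, img_h=375))
```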
In MLLM contexts, CoS leverages hierarchical resamplers operating on ViT features at varying window sizes $w_s$, each contributing a set of condensed tokens via cross-attention with learnable queries. The coarse-to-fine ordering of the concatenated tokens is empirically validated as the best-performing arrangement (Huang et al., 22 Jul 2024).
4. Pre-training and Fine-tuning Pipelines
The operational pipeline for CoS-empowered MLLMs includes the following regimes:
- Pre-training (coarse, fast):
- Input: low-resolution image and its compact ViT feature map
- Windows: large (coarse) resampler windows with a small per-window query budget
- Token output: 80 tokens
- Visual and text tokens are passed to the LLM for multi-task objective optimization
- Fine-tuning (rich, slow):
- Input: higher-resolution image and a correspondingly larger ViT feature map
- Windows: finer (and optionally additional) window partitions
- Token output: up to 1,296 tokens
- Token initialization: pre-trained resampler weights are "inflated" and queries expanded by nearest-neighbor copying, reusing the original CoS parameters (sketched below).
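A minimal sketch of this initialization idea, assuming the query bank is simply tiled (nearest-neighbor copying) while the cross-attention weights transfer unchanged; the function name and expansion factor are hypothetical.

```python
import torch

def inflate_queries(q_pre, factor):
    """Nearest-neighbor query expansion: (Q, C) -> (Q * factor, C).
    Each pre-trained query is repeated so that new, finer windows start
    from the semantics of their coarse parent."""
    return q_pre.repeat_interleave(factor, dim=0).clone()

q_pre = torch.randn(4, 1024)        # queries per window at pre-training
q_ft = inflate_queries(q_pre, 4)    # expanded query bank for fine-tuning
assert q_ft.shape == (16, 1024)
# Cross-attention weight matrices are agnostic to the query count, so they
# are reused verbatim; only the learnable query bank needs expansion.
```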
This schema delivers substantial pre-training time savings and overall wall-clock savings on large pre-training datasets, along with 2.5× larger batch sizes and 30% faster training steps (Huang et al., 22 Jul 2024). Importantly, performance on downstream benchmarks matches or exceeds full-token baselines, with no accuracy trade-off.
5. Empirical Impact and Performance Characteristics
LocateAnything3D achieves state-of-the-art results on the Omni3D benchmark, registering $49.89$ AP and exceeding previous methods by a sizable margin, even when those baselines are given ground-truth 2D boxes. Zero-shot generalization to held-out categories remains robust (Man et al., 25 Nov 2025).
In MLLM settings:
- CoS-32 matches a full-token baseline (resampler-336) across 12 multimodal tasks.
- The CoS-80→CoS-336 configuration outperforms standard pipelines on captioning (CIDEr), on text-oriented tasks, and on grounding.
- Further scaling to 528 or 1,296 tokens yields incremental accuracy improvements.
- Multi-scale resampling outstrips single-scale compression, and local tokens drive most downstream accuracy gains (Huang et al., 22 Jul 2024).
6. Rationale and Theoretical Justification
CoS sequences leverage structured token ordering for enhanced learnability and contextual stability:
- Token Structure: By alternating easy (2D detection) and hard (3D estimation) tokens, CoS aligns decoder attention with visually grounded regions, reducing hallucination and smoothing the likelihood surface. Empirically, alternative orderings (all-2D then all-3D, or randomized) degrade AP by $15$–$20$ points (Man et al., 25 Nov 2025).
- Promptability and Open Vocabulary: CoS operates entirely within the decoder's token space, retaining the ability to invoke open-vocabulary queries ("red chair", "my toy") without 3D-specialized components (Man et al., 25 Nov 2025).
- Human Reasoning Alignment: Psychophysical evidence supports a coarse-to-fine, 2D-first process in visual cognition (Marr, Helmholtz), which CoS mimics via explicit token sequencing. Errors under CoS yield graceful degradation, such as limited rotation misestimation, rather than gross spatial mislocalizations (Man et al., 25 Nov 2025).
- Multi-scale Contextualization: CoS’s multi-stage resamplers for MLLM bridging exploit both global and local contexts, and ablative studies confirm the necessity of both for maximal performance.
- Initialization and Transfer: Fine-tuning adapts pre-trained weights through inflation strategies, ensuring continuity and immediate effectiveness of upscaled visual token sets (Huang et al., 22 Jul 2024).
A plausible implication is that CoS enables unified autoregressive reasoning for complex multimodal tasks, tightly coupling visual evidence with structured attribute inference and scalable contextualization.
7. Limitations and Implementation Notes
Noted limitations and optimization notes include:
- In CoS resampling, global scaling alone yields minimal accuracy improvement; local scaling dominates downstream performance gains (Huang et al., 22 Jul 2024).
- Multi-level aggregation contributes modest discrimination gains by fusing early and late backbone features.
- Initialization for fine-tuning relies on inflating pre-trained weights and nearest-neighbor query expansion to preserve semantic continuity in expanded token sets.
- Matching coarse-to-fine progression in both object serialization and token detail order is essential; reversing the chain is detrimental to both speed and accuracy.
Together, these properties establish Chain-of-Sight as a foundational design pattern for both high-performance vision-language 3D detection (Man et al., 25 Nov 2025) and accelerated MLLM pre-training with flexible, extensible token schemes (Huang et al., 22 Jul 2024).