3D Token Concatenation in Transformer Models

Updated 3 July 2026

3D token concatenation is a method for linearizing and fusing 3D data into ordered or permutation-invariant token streams for transformer processing.
It encompasses techniques such as literal sequence appending, voxel flattening, and hierarchical fusion to integrate modality-specific and geometric information.
Design choices in token ordering, fusion strategy, and compression critically affect model efficiency, occlusion robustness, and 3D reconstruction performance.

3D token concatenation denotes a family of sequence-formation and fusion procedures that make 3D information compatible with transformer-style token processing. In recent arXiv literature, the phrase does not identify a single standardized operator. It can mean literal sequence concatenation of heterogeneous tokens, as in appending pose tokens to image tokens or source-3D tokens to target-3D tokens; it can also refer to flattening sparse voxel, triplane, or object-structured 3D representations into ordered token streams; and several papers explicitly argue against order-sensitive concatenation by replacing it with permutation-invariant token sets or hierarchical fusion mechanisms (Yang et al., 2024, Lu et al., 17 Sep 2025, Asim et al., 21 Feb 2026, Cai et al., 21 Nov 2025).

1. Conceptual scope and recurrent formulations

Across current systems, three recurrent meanings dominate. First, “concatenation” can be literal sequence-axis appending, where modality-specific token blocks are placed into one transformer input. PostoMETRO is explicit: the encoder input is the camera token, the image-token sequence, and the pose-token sequence, i.e. the implied fused input is $[T_C;T_I;T_P]$ (Yang et al., 2024). Second, the term can describe serialization of 3D structure into a token list, even when no separate modalities are appended afterward. AToken’s closest equivalent to “3D token concatenation” is the flattening of active voxel features into a sparse list of feature-position pairs $\{(\mathbf z_i,\mathbf p_i)\}_{i=1}^L$, while VAR-3D serializes triplane codebook indices into a multi-scale sequence (Lu et al., 17 Sep 2025, Han et al., 14 Feb 2026). Third, some papers use the design space as a foil: SceneTok defines the scene representation as a permutation-invariant unstructured set, and FASTer describes itself as replacing naïve long-sequence concatenation with adaptive token selection and hierarchical fusion (Asim et al., 21 Feb 2026, Dang et al., 28 Feb 2025).

This suggests that “3D token concatenation” is best treated as an umbrella label for how 3D evidence is linearized, fused, or explicitly not linearized for transformer processing.

System	Mechanism closest to “3D token concatenation”	Ordering status
AToken	Flatten active voxel features into $\{(\mathbf z_i,\mathbf p_i)\}_{i=1}^L$ with $[0,x,y,z]$ coordinates	Exact lexical order not given
SceneTok	Produce $\mathcal Z=\{\mathbf z_i\}_{i=1}^K$ as an unstructured scene-token set	Deliberately order invariant
PostoMETRO	Concatenate camera, image, and pose tokens for the encoder	Ordered sequence
Kyvo	Serialize scenes as object records with structural markers and optional 512-token shapes	Ordered sequence

2. Serialization of native 3D content into token streams

Several works treat 3D token concatenation primarily as the problem of converting geometry into a transformer-consumable sequence. AToken does so by adapting the Trellis-SLAT pipeline: a 3D asset is rendered into multi-view RGB images from spherically sampled cameras, patchified with the same $4\times16\times16$ space-time patchifier used for images and videos, back-projected into a $64^3$ voxel grid, restricted to active or surface voxels, and flattened into the shared latent representation

$\mathbf z=\{(\mathbf z_i,\mathbf p_i)\}_{i=1}^{L}, \qquad \mathbf p_i=[t,x,y,z].$

For static 3D assets, $t=0$, so 3D content occupies the $(x,y,z)$ subspace of the shared 4D lattice. The same encoder, sparse transformer representation, patch embedding logic, and 4D positional treatment are used for images, videos, and 3D; only the final decoder head differs by modality (Lu et al., 17 Sep 2025).

AToken is unusually explicit about what is and is not concatenated. It does not concatenate raw multiview image tokens as the 3D representation. Instead, multiview patch evidence is unified first by voxel-level aggregation, and the resulting sequence is a list of sparse voxel tokens rather than a sequence of raw view tokens. The paper also records an implementation ambiguity: the main text says each voxel gathers and averages patch features from relevant views, whereas Figure 1 says aggregation uses the nearest viewpoint rather than averaging across all views. The exact lexical order of the flattened sparse sequence is not specified (Lu et al., 17 Sep 2025).
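The flattening step described above can be illustrated with a minimal sketch. It assumes a hypothetical dictionary that maps active voxel coordinates to per-view patch features, and it uses the averaging rule from the paper's main text. Because the paper does not state the lexical order of the flattened sequence, the sorted $(x,y,z)$ order used here is an assumption made only so that the output is deterministic.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
Token = Tuple[List[float], List[int]]  # (feature z_i, position p_i = [t, x, y, z])


def average_views(view_features: List[List[float]]) -> List[float]:
    """Aggregate the per-view patch features of one voxel by averaging
    (the rule given in the main text; Figure 1 describes nearest-view instead)."""
    n = len(view_features)
    return [sum(column) / n for column in zip(*view_features)]


def flatten_active_voxels(features: Dict[Coord, List[float]], t: int = 0) -> List[Token]:
    """Turn a sparse voxel map into a list of (z_i, p_i) pairs.

    `features` holds only active (surface) voxels. Sorting by coordinate is
    an assumption: the source does not specify the token order.
    """
    tokens = []
    for (x, y, z) in sorted(features):
        tokens.append((features[(x, y, z)], [t, x, y, z]))
    return tokens


# Two active voxels of a 64^3 grid, each seen from two hypothetical views.
voxels = {
    (3, 0, 1): average_views([[1.0, 0.0], [3.0, 2.0]]),
    (0, 5, 2): average_views([[0.5, 0.5], [0.5, 1.5]]),
}
sequence = flatten_active_voxels(voxels)  # static asset, so t = 0 everywhere
```

The sequence length equals the number of active voxels rather than the number of views times patches, which is the practical consequence of aggregating before flattening.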

VAR-3D adopts a different native representation. Its tokenizer begins from six rendered RGB-D views with per-pixel Plücker coordinates, encodes them with a view-aware VQ-VAE, and produces latent triplane features. These are interpolated to multiple resolutions, quantized with a shared codebook, and serialized into a multi-scale sequence. The paper is unusually precise about the serialization rule: within each scale, the indices of each plane are arranged in raster-scan order, and the indices corresponding to the same spatial location in the three planes are placed consecutively. Generation is then next-scale rather than token-by-token: the tokens at each scale are predicted in parallel while conditioning on previous scales (Han et al., 14 Feb 2026).
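One reading of this serialization rule can be sketched as follows. The plane names, the toy resolutions, and the coarse-to-fine concatenation without separator tokens are assumptions for illustration; the source only fixes raster order within a plane and adjacency of the three indices that share a spatial location.

```python
from typing import List, Tuple

Plane = List[List[int]]  # H x W grid of codebook indices


def serialize_scale(xy: Plane, xz: Plane, yz: Plane) -> List[int]:
    """Raster-scan one scale; at each location emit the three plane
    indices consecutively."""
    seq = []
    for r in range(len(xy)):
        for c in range(len(xy[0])):
            seq.extend([xy[r][c], xz[r][c], yz[r][c]])
    return seq


def serialize_multiscale(scales: List[Tuple[Plane, Plane, Plane]]) -> List[int]:
    """Concatenate scales from coarse to fine. No separator tokens are
    inserted because the source does not specify any."""
    seq = []
    for xy, xz, yz in scales:
        seq.extend(serialize_scale(xy, xz, yz))
    return seq


coarse = ([[1]], [[2]], [[3]])                      # 1x1 per plane
fine = ([[10, 11], [12, 13]],                       # 2x2 per plane
        [[20, 21], [22, 23]],
        [[30, 31], [32, 33]])
tokens = serialize_multiscale([coarse, fine])
```

With this layout, a block of three consecutive tokens always describes one triplane location, so cross-plane locality is preserved in the flat sequence.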

VAT addresses a closely related issue from a different angle: 3D data lacks an inherent raster order, so the model first compresses unordered 3D features into a short latent token sequence using an in-context transformer applied to the learned 1D point-cloud feature set and a set of learnable latent tokens. Multi-scale token organization is then achieved not by simple concatenation across scales but by residual multi-scale quantization. The paper’s direct message is that cross-scale 3D token organization is governed more by residual coupling and additive recomposition than by literal concatenation (Zhang et al., 2024).
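The difference between residual coupling and literal cross-scale concatenation can be made concrete with a toy sketch. It is not VAT's implementation: the scalar codebook, the scale sizes `(1, 2, 4, 8)`, average pooling, and nearest-neighbour upsampling are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def pool(x: Sequence[float], n: int) -> List[float]:
    """Average-pool a 1D signal to n entries (len(x) must be divisible by n)."""
    step = len(x) // n
    return [sum(x[i * step:(i + 1) * step]) / step for i in range(n)]


def upsample(x: Sequence[float], n: int) -> List[float]:
    """Nearest-neighbour upsample to n entries by repetition."""
    rep = n // len(x)
    return [v for v in x for _ in range(rep)]


def quantize(x: Sequence[float], codebook: Sequence[float]) -> List[int]:
    """Index of the nearest codebook entry for each value."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - v)) for v in x]


def residual_multiscale_quantize(latent: Sequence[float], scales: Sequence[int],
                                 codebook: Sequence[float]) -> Tuple[List[List[int]], List[float]]:
    """Each scale quantizes what the coarser scales left unexplained;
    the reconstruction is the sum of all upsampled scale outputs."""
    residual = list(latent)
    recon = [0.0] * len(latent)
    indices = []
    for s in scales:
        idx = quantize(pool(residual, s), codebook)
        up = upsample([codebook[k] for k in idx], len(latent))
        residual = [r - u for r, u in zip(residual, up)]
        recon = [a + u for a, u in zip(recon, up)]
        indices.append(idx)
    return indices, recon


CODEBOOK = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]   # hypothetical scalar codebook
latent = [0.9, 1.1, 0.4, 0.6, -0.2, -0.3, 0.1, -0.2]
indices, recon = residual_multiscale_quantize(latent, (1, 2, 4, 8), CODEBOOK)
```

The token lists in `indices` are only meaningful together: dropping a coarse scale changes what every finer scale encodes, which is the coupling that plain concatenation of independent scales would not have.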

Kyvo moves from native 3D latents to explicitly symbolic serialization. It represents a scene as a list of objects, each enclosed by [OBJECT-START] and [OBJECT-END], with attributes such as [SHAPE], [LOCATION], [POSE], [CATEGORY], or [DIMENSIONS]. For complex geometry it adopts Trellis slats and trains a 3D VQ-VAE with a learned codebook, so that each object’s shape is represented by 512 discrete tokens. Numbers are discretized into dedicated numerical tokens at a fixed granularity (Sahoo et al., 9 Jun 2025).
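A minimal sketch of such an object-record serializer is given below. Only the bracketed structural markers come from the description above; the attribute order, the `[NUM:...]` and `[SHAPE:...]` token spellings, and the granularity value `0.05` are hypothetical.

```python
from typing import Dict, List

GRANULARITY = 0.05  # hypothetical step size; not taken from the paper


def number_token(value: float, step: float = GRANULARITY) -> str:
    """Snap a number to the grid and emit one dedicated numeric token."""
    snapped = round(value / step) * step
    return f"[NUM:{snapped:.2f}]"


def serialize_object(obj: Dict) -> List[str]:
    """Wrap one object's attributes between the structural markers."""
    tokens = ["[OBJECT-START]"]
    tokens += ["[CATEGORY]", obj["category"]]
    tokens += ["[LOCATION]"] + [number_token(v) for v in obj["location"]]
    if "shape_codes" in obj:  # optional block of discrete shape tokens
        tokens += ["[SHAPE]"] + [f"[SHAPE:{c}]" for c in obj["shape_codes"]]
    tokens.append("[OBJECT-END]")
    return tokens


def serialize_scene(objects: List[Dict]) -> List[str]:
    """A scene is the concatenation of its object records."""
    out = []
    for obj in objects:
        out += serialize_object(obj)
    return out


scene = serialize_scene([
    {"category": "chair", "location": [0.52, -1.23, 2.0], "shape_codes": [7, 42]},
])
```

In a real setting the shape block would carry the full set of VQ-VAE indices per object rather than the two placeholder codes used here.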

3. Literal concatenation for multimodal conditioning

In multimodal systems, 3D token concatenation often becomes a direct conditioning interface. PostoMETRO is the clearest literal example. A cropped image is turned into image tokens, while a pretrained VQ-VAE pose tokenizer compresses the 2D pose into a fixed number of discrete pose tokens. The image tokens, a learnable camera token, and the modulated pose tokens are concatenated into the fused input $[T_C;T_I;T_P]$ and passed to the encoder; the fused memory is then queried by learned joint tokens and vertex tokens in the decoder. The paper explicitly calls the fusion a “naive fusion strategy” that nevertheless improves occlusion robustness (Yang et al., 2024).
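Literal sequence-axis concatenation of this kind reduces to appending token blocks of equal width. The sketch below uses plain Python lists; the embedding width and the token counts are hypothetical and are not PostoMETRO's actual sizes.

```python
from typing import List

Vector = List[float]


def concat_tokens(*blocks: List[Vector]) -> List[Vector]:
    """Append token blocks along the sequence axis; all tokens must share one width."""
    width = len(blocks[0][0])
    fused = []
    for block in blocks:
        for tok in block:
            if len(tok) != width:
                raise ValueError("all tokens must have the same embedding width")
            fused.append(tok)
    return fused


D = 4                                    # hypothetical embedding width
T_C = [[0.0] * D]                        # one camera token
T_I = [[1.0] * D for _ in range(6)]      # hypothetical: 6 image tokens
T_P = [[2.0] * D for _ in range(3)]      # hypothetical: 3 pose tokens

encoder_input = concat_tokens(T_C, T_I, T_P)   # order follows [T_C; T_I; T_P]
```

The encoder sequence length is simply the sum of the block lengths, so every added modality increases self-attention cost quadratically in that total.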

LTM3D uses concatenation more indirectly. Its explicit claim is that raw condition tokens should not be concatenated directly to 3D shape tokens. Instead, Prefix Learning converts text or image condition tokens into learned prefix tokens by means of a fixed set of learned queries. These prefix tokens are then concatenated with, or prepended to, the partial 3D shape-token context in the MAE-based conditional model. The paper presents “directly concatenating DinoV2 image features with the input token sequence” as an ablation baseline and positions Prefix Learning as the preferred bridge between condition-token space and 3D latent-token space (Kang et al., 30 May 2025).

“Native 3D Editing with Full Attention” makes token concatenation the central conditioning mechanism. Source and noisy target 3D objects are encoded into native 3D latents, patchified and projected into two token sequences, and then concatenated into one sequence. The transformer applies full self-attention over this combined sequence, while text instructions are still injected through text cross-attention. The paper’s direct claim is that this 3D token concatenation is more parameter-efficient and performs better than conditioning on source 3D features through an additional cross-attention branch (Cai et al., 21 Nov 2025).

Kyvo extends literal concatenation to a unified autoregressive multimodal stream. Depending on the task, image tokens, structured 3D scene tokens, and text tokens are concatenated in one sequence with explicit separators such as [BOS], [IMAGE-START], [SCENE-START], [TEXT-START], [OUTPUT-SEP], and [EOS]. Recognition is formatted as image $\rightarrow$ 3D, rendering as 3D $\rightarrow$ image, and question answering as image + 3D + question $\rightarrow$ answer. The paper reports that placing the image before the 3D sequence improves both instruction-following and question answering, which indicates that modality order inside the concatenated stream materially affects performance (Sahoo et al., 9 Jun 2025).
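The task formats can be sketched with a small sequence builder. The separator names are those listed above; where exactly each marker sits relative to the output segment is an assumption, as are the placeholder token strings.

```python
from typing import List, Optional


def build_sequence(image: Optional[List[str]] = None,
                   scene: Optional[List[str]] = None,
                   text: Optional[List[str]] = None,
                   output: Optional[List[str]] = None) -> List[str]:
    """Concatenate the available input modalities, then the output segment.

    The image block is placed before the 3D block, matching the ordering
    the paper reports as more effective.
    """
    seq = ["[BOS]"]
    if image is not None:
        seq += ["[IMAGE-START]"] + image
    if scene is not None:
        seq += ["[SCENE-START]"] + scene
    if text is not None:
        seq += ["[TEXT-START]"] + text
    seq.append("[OUTPUT-SEP]")
    seq += output or []
    seq.append("[EOS]")
    return seq


img = ["<img0>", "<img1>"]                       # placeholder image tokens
objs = ["[OBJECT-START]", "[OBJECT-END]"]        # placeholder scene tokens

recognition = build_sequence(image=img, output=objs)            # image -> 3D
rendering = build_sequence(scene=objs, output=img)              # 3D -> image
qa = build_sequence(image=img, scene=objs,
                    text=["how", "many", "chairs", "?"], output=["2"])
```

Swapping the two `if` blocks for image and scene is all it takes to test the opposite modality order, which is why this ordering is cheap to ablate.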

4. Order sensitivity, permutation invariance, and non-concatenative alternatives

A significant part of the recent literature is organized around the claim that naïve ordered concatenation is often the wrong inductive bias for 3D scenes. SceneTok is the clearest statement of this position. Its encoder maps context images and their poses to a latent scene-token set

$\mathcal Z=\{\mathbf z_i\}_{i=1}^K,$

and the resulting representation is described repeatedly as “a small set of permutation-invariant tokens,” “an unstructured set of tokens,” and “disentangled from the spatial grid.” The learned scene queries attend to multiview patch tokens through a Perceiver-style latent bottleneck, and the paper deliberately uses only 2D RoPE because 3D RoPE would introduce temporal order bias and hurt order invariance. The size of the token bottleneck is specified separately for RealEstate10K/ACID and for DL3DV (Asim et al., 21 Feb 2026).
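The order invariance of a query-based bottleneck can be checked numerically. The sketch below is a single-head cross-attention with identity projections, which is a deliberate simplification and not SceneTok's encoder; the query and token values are arbitrary, and any per-token positional information is assumed to live inside the token features rather than in their list index.

```python
import math
from typing import List

Vector = List[float]


def cross_attend(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Each query takes a softmax-weighted average of the input tokens.

    Nothing here depends on a token's position in the list, so permuting
    `tokens` leaves the result unchanged up to floating-point rounding.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        m = max(logits)                                  # for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) / total
                    for d in range(len(tokens[0]))])
    return out


scene_queries = [[1.0, 0.0], [0.0, 1.0]]                 # K = 2 learned queries
patch_tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]     # multiview patch tokens

z_original = cross_attend(scene_queries, patch_tokens)
z_permuted = cross_attend(scene_queries, list(reversed(patch_tokens)))
```

The output always has one token per query, so the bottleneck size is fixed by the number of queries regardless of how many views are supplied or in which order.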

Pts3D-LLM is not anti-sequence, but it rejects direct concatenation of 2D and 3D token streams as the primary fusion primitive. Its core token equation forms each final visual token as a per-location sum of aligned 2D image features, explicit 3D point-cloud features, and a 3D positional encoding. The paper’s strongest message is that the crucial design variable is then the ordering of these already fused tokens when they are flattened for the LLM. Video-random ordering drops performance relative to video-patch ordering, and for point-based tokens an object-grouped order is best, with Point-objects reaching NS = 100.5 versus Point-default = 95.3 and Point-random = 94.3 (Thomas et al., 6 Jun 2025).
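Additive fusion followed by an explicit ordering step can be sketched as below. The feature values and the per-location object labels are hypothetical, and grouping by a stable sort on object id is only one plausible reading of an object-grouped order.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_token(feat_2d: Vector, feat_3d: Vector, pos_3d: Vector) -> Vector:
    """Per-location additive fusion: one output token, same width as the inputs."""
    return [a + b + c for a, b, c in zip(feat_2d, feat_3d, pos_3d)]


def order_by_object(tokens: List[Vector], object_ids: Sequence[int]) -> List[Vector]:
    """Group tokens by object id. sorted() is stable, so the original
    order is preserved inside each object."""
    order = sorted(range(len(tokens)), key=lambda i: object_ids[i])
    return [tokens[i] for i in order]


feat_2d = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
feat_3d = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]]
pos_3d = [[0.5, 0.5]] * 4
object_ids = [1, 0, 1, 0]    # hypothetical instance label per location

fused = [fuse_token(a, b, c) for a, b, c in zip(feat_2d, feat_3d, pos_3d)]
llm_input = order_by_object(fused, object_ids)
```

Four locations yield four tokens here, whereas concatenating the three streams would yield twelve; the ordering step then decides how those four tokens are presented to the language model.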

FASTer makes a similar argument in temporal 3D detection. The rebuttal text explicitly frames the method as an alternative to simply concatenating 3D tokens across time: focal-token acquisition keeps only a “small number of focal tokens,” adaptive scaling compresses them in a geometry-aware way, and Grouped Hierarchical Fusion progressively condenses long temporal sequences while Intra-Group Fusion “prevents our model from degenerating into a simple stacking.” Historical focal points make up only a small fraction of all points, need only a modest amount of memory in total, and permit scaling to 16 or even 64 frames (Dang et al., 28 Feb 2025).

These systems do not deny that transformers eventually consume token arrays. Their claim is narrower: meaningful 3D token organization often requires set aggregation, additive fusion, or hierarchical condensation before any flat sequence is handed to the backbone. This suggests that the critical issue is not only whether tokens are concatenated, but what structural commitments have already been imposed on them.

5. Pruning, merging, and efficiency constraints on concatenated 3D token streams

A separate line of work studies what happens after concatenation has already created an oversized sequence. SeGPruner targets the standard 3D QA pipeline in which visual tokens from many views are flattened and concatenated with question tokens before LLM inference. With 12 views resized to a fixed resolution and encoded by SigLIP inside LLaVA-OneVision-7B, the baseline contains 8,748 visual tokens per scene before pruning, i.e. 729 per view. SeGPruner inserts a training-free front-end: a Saliency-aware Token Selector keeps attention-important tokens, and a Geometry-aware Token Diversifier adds spatially diverse tokens using a fused semantic-spatial distance. The paper reports a 91% reduction in visual token budget and an 86% reduction in latency while maintaining competitive 3D reasoning performance (Li et al., 31 Mar 2026).
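The two-stage idea, saliency first and diversity second, can be sketched as follows. This is not SeGPruner's algorithm: the weighted-sum form of the fused distance, the weight `alpha`, the greedy farthest-point rule, and all numeric inputs are assumptions chosen to make the mechanism concrete.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)


def fused_distance(i: int, j: int, feats, coords, alpha: float) -> float:
    """Hypothetical fusion: weighted sum of semantic and spatial distance."""
    return alpha * cosine_distance(feats[i], feats[j]) \
        + (1.0 - alpha) * math.dist(coords[i], coords[j])


def prune(scores, feats, coords, n_salient: int, n_diverse: int,
          alpha: float = 0.5) -> List[int]:
    """Keep the top-scoring tokens, then greedily add the tokens that lie
    farthest (in fused distance) from everything already kept."""
    if n_salient < 1:
        raise ValueError("need at least one salient token as a seed")
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    kept, rest = ranked[:n_salient], ranked[n_salient:]
    for _ in range(min(n_diverse, len(rest))):
        best = max(rest, key=lambda i: min(fused_distance(i, k, feats, coords, alpha)
                                           for k in kept))
        kept.append(best)
        rest.remove(best)
    return sorted(kept)


scores = [0.9, 0.8, 0.1, 0.2, 0.05]                 # attention-based saliency
feats = [[1.0, 0.0]] * 5                            # identical semantics
coords = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0),
          (0.2, 0.0, 0.0), (1.0, 0.0, 0.0)]         # 3D positions of the tokens
selected = prune(scores, feats, coords, n_salient=2, n_diverse=1)

tokens_per_view = 8748 // 12                        # baseline budget per view
```

In the toy input the diversity stage picks the spatially distant token even though its saliency is low, which is the behaviour a purely attention-based selector would miss.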

MedPruner formulates the same issue for 3D medical VLMs that slice a volume into 2D images and concatenate the projected per-slice token blocks. The critique is explicit: direct concatenation of consecutive slices creates “massive anatomical redundancy.” Inter-slice Anchor-based Filtering removes redundant slices based on a pixel-wise mean distance, and Dynamic Information Nucleus Selection (DINS) retains only the smallest token subset whose cumulative attention mass exceeds an information threshold. On MedGemma, the reported retained-token rates are 4.87% on M3D, 4.62% on 3D-RAD, and 2.46% on AMOS-MM, with performance maintained or slightly improved in aggregate (Liu et al., 12 Mar 2026).
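Both stages can be illustrated with a short sketch under stated assumptions: the mean absolute difference stands in for the pixel-wise distance, the anchor is updated whenever a slice is kept, and the thresholds `tau` and `eta` are hypothetical values. The temperature that the paper associates with DINS is omitted.

```python
from typing import List, Sequence


def filter_slices(slices: Sequence[Sequence[float]], tau: float) -> List[int]:
    """Keep a slice only if it differs enough from the current anchor slice.

    Distance: mean absolute pixel difference (an assumed choice of metric).
    """
    kept = [0]
    anchor = slices[0]
    for i in range(1, len(slices)):
        diff = sum(abs(a - b) for a, b in zip(slices[i], anchor)) / len(anchor)
        if diff > tau:
            kept.append(i)
            anchor = slices[i]      # the kept slice becomes the new anchor
    return kept


def nucleus_select(attention: Sequence[float], eta: float) -> List[int]:
    """Smallest set of tokens whose normalized attention mass exceeds eta."""
    total = sum(attention)
    order = sorted(range(len(attention)), key=lambda i: -attention[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += attention[i] / total
        if mass > eta:
            break
    return sorted(kept)


# Four tiny "slices" (flattened pixels): 0~1 and 2~3 are near-duplicates.
volume = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.02]]
kept_slices = filter_slices(volume, tau=0.1)

kept_tokens = nucleus_select([0.5, 0.1, 0.3, 0.1], eta=0.7)
```

Because the nucleus size depends on how concentrated the attention is, the number of retained tokens varies per input instead of being a fixed budget.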

HTTM addresses concatenation-induced cost inside attention itself. In VGGT, global layers attend over tokens from all views, and the paper states that sequences can exceed 20k tokens. Existing token-merging methods apply the same merge map to all heads, which “results in identical tokens in the layers’ output.” HTTM instead performs head-wise temporal token merging, then reconstructs the full sequence before the standard head-concatenation step of multi-head attention. Its claim is that preserving head-specific token groupings before head concatenation maintains representational uniqueness; the reported speedup is up to 7x (Wang et al., 26 Nov 2025).

The MVLA pruning framework in “2D or 3D: Who Governs Salience in VLA Models?” studies yet another concatenative regime: a 2D stream of 256 image tokens and a 3D point-cloud stream of 256 point tokens are concatenated with language tokens before the LLaMA-2 7B backbone. The paper’s direct point is that modal expansion doubles the visual token count from 256 to 512, but the best pruning policy is not symmetric. Its tri-stage salience-aware method reports up to 2.55x speedup with only 5.8% overhead (Zheng et al., 10 Apr 2026).

6. Empirical behavior and unresolved details

Empirically, the literature shows that well-designed 3D token concatenation can work, but the meaning of “well-designed” varies by representation. AToken’s sparse voxel-token organization reaches 28.28 PSNR and 0.062 LPIPS on Toys4k reconstruction, with 90.9% zero-shot classification accuracy for 3D semantic understanding and 91.3% for the discrete variant. The same shared token space also plugs into an external image-to-3D diffusion pipeline by replacing Trellis-SLAT’s original 3D tokens with AToken tokens (Lu et al., 17 Sep 2025). Native 3D Editing reports FID = 91.9, FVD = 286.5, and CLIP = 0.249, and attributes its gains over prior 2D-lifting approaches to native-3D training data together with source-target 3D token concatenation (Cai et al., 21 Nov 2025). Pts3D-LLM reports that explicit 3D feature fusion improves NS(All) from 99.5 ± 0.52 to 101.1 ± 0.56, while point-based structures can match video-based ones when tokens are sampled and ordered carefully (Thomas et al., 6 Jun 2025).

At the same time, several papers leave implementation details around concatenation unresolved. AToken leaves unspecified the exact ordering of flattened voxel tokens, the exact multiview back-projection formula, the final aggregation rule for voxel features, the active-voxel thresholding procedure, and the explicit 4D RoPE equation (Lu et al., 17 Sep 2025). LTM3D does not give a full symbolic sequence layout for prefix tokens and shape tokens, and does not specify positional encoding for prefix versus shape tokens in the main model (Kang et al., 30 May 2025). VAR-3D specifies raster order and cross-plane locality but not the exact separator-token syntax between scales or the exact latent resolutions used at each scale (Han et al., 14 Feb 2026). MedPruner formalizes the concatenated slice-token baseline but omits the actual values of the sensitivity threshold, the information threshold, and the temperature used in DINS (Liu et al., 12 Mar 2026).

A broader pattern also emerges. Some systems benefit from bringing 3D into the same token schema as images and videos, as in AToken and Kyvo; some benefit from direct sequence concatenation of source and target 3D latents, as in Native 3D Editing; and others argue that the best 3D representation is precisely one whose semantics are not tied to concatenation order, as in SceneTok. This suggests that “3D token concatenation” is not a settled method but an active design axis spanning sequence construction, conditioning, ordering, compression, and invariance (Sahoo et al., 9 Jun 2025, Asim et al., 21 Feb 2026).

The resulting research picture is therefore heterogeneous but coherent. When token order must carry structure, papers increasingly specify how 3D coordinates, planes, objects, or prefixes are serialized. When order is judged harmful, they replace it with latent sets, sparsity, or hierarchical merging. In both cases, the central technical problem remains the same: how to expose 3D structure to token-based transformers without losing the geometric, semantic, or efficiency properties that make 3D distinct.