Learnable Tokens in Deep Learning Models

Updated 20 April 2026

Learnable tokens are trainable, continuous embedding vectors that serve as adaptable, task-specific representations in deep neural models.
They enable computational efficiency, modular compositionality, and flexible fusion across vision, language, time series, graphs, and multimodal contexts.
Empirical studies demonstrate that well-engineered tokens can achieve significant speedups and accuracy gains with minimal extra parameters.

Learnable tokens are trainable, continuous embedding vectors incorporated into deep neural architectures—often transformers—to provide adaptable, task- or context-specific intermediate representations. These tokens serve as explicit, model-parameterized structures that can focus model attention, enable efficient compression, support fusion or grounding across modalities, or even unlock new control mechanisms. Their design, training, and functional roles are highly diverse, spanning vision, language, time series, graphs, and multimodal contexts. Recent research demonstrates that modest sets of learnable tokens, if properly engineered, can deliver state-of-the-art efficiency and accuracy tradeoffs, modular compositionality, and new forms of model interaction or adaptation.

1. Mathematical Formulation and Initialization

A learnable token is typically a vector (or small set thereof) in the model’s hidden-dimensional space, parameterized as

$\mathbf{t}_i \in \mathbb{R}^d, \quad i = 1, ..., T$

where $d$ is the model dimension and $T$ is the number of tokens. Tokens may be global (shared across all data), class-specific, modality-specific, task-specific, or spatially assigned (e.g., per grid cell). They are usually initialized with i.i.d. draws from $\mathcal{N}(0, \sigma^2 I)$ or from a truncated normal, consistent with standard transformer practice (Jiang et al., 2024, Zhu et al., 9 Jun 2025, Cornelissen et al., 25 Mar 2026, Sun et al., 13 Apr 2026).

Tokens are trained end-to-end with the rest of the model, often through standard backpropagation. In transformer and attention models, they are incorporated into sequences for joint attention: $Z_0 = [x_1, \dots, x_N; \, \mathbf{t}_1, \dots, \mathbf{t}_T] + E_{\text{pos}} + E_{\text{seg}}$, where $x_i$ are raw input tokens, $\mathbf{t}_i$ are the learnable tokens, $E_{\text{pos}}$ is a positional embedding, and $E_{\text{seg}}$ is an (optional) segment embedding (Wang et al., 2023, Kim et al., 29 Jan 2025).
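As a minimal sketch of the initialization and concatenation described above, the following NumPy snippet draws $T$ tokens from a zero-mean Gaussian, appends them to $N$ input embeddings, and adds a positional embedding. The sizes, the value of `sigma`, and the function name `build_input_sequence` are illustrative assumptions, the optional segment embedding is omitted, and in a real model the tokens would be trainable parameters rather than fixed arrays.

```python
import numpy as np

def build_input_sequence(x, tokens, pos_emb):
    """Concatenate input embeddings with learnable tokens and add positions.

    x:       (N, d) input token embeddings
    tokens:  (T, d) learnable tokens (trainable parameters in a real model)
    pos_emb: (N + T, d) positional embedding
    """
    z0 = np.concatenate([x, tokens], axis=0)  # inputs first, then learnable tokens
    return z0 + pos_emb

rng = np.random.default_rng(0)
N, T, d, sigma = 16, 4, 32, 0.02  # illustrative sizes; sigma is an assumed value

x = rng.normal(size=(N, d))                        # stand-in for input embeddings
tokens = rng.normal(0.0, sigma, size=(T, d))       # i.i.d. N(0, sigma^2 I) init
pos_emb = rng.normal(0.0, sigma, size=(N + T, d))  # stand-in positional embedding

z0 = build_input_sequence(x, tokens, pos_emb)
print(z0.shape)  # (20, 32)
```

The resulting sequence has $N + T$ rows, so every subsequent attention layer sees the learnable tokens alongside the inputs.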

2. Architectural Roles and Token Types

Learnable tokens serve highly diverse architectural functions, including (but not limited to):

Information Bottlenecks/Pooling: Tokens replace implicit pooling (e.g., <EOS>) with fixed-capacity, explicit pooling that forces information condensation (e.g., “Bottleneck Tokens” in MLLMs (Sun et al., 13 Apr 2026), “fusion tokens” in DeepMLF (Georgiou et al., 15 Apr 2025), Le MuMo JEPA (Cornelissen et al., 25 Mar 2026)).
Multimodal Fusion: Tokens act as latent bottlenecks or explicit fusion sites for multimodal streams (vision, audio, text, depth, etc.), enabling modular and progressive cross-modal interaction (Cornelissen et al., 25 Mar 2026, Georgiou et al., 15 Apr 2025).
Prompt/Control Contexts: Small sets of trainable tokens provide control or context for generation, safety enforcement, or grounding—in lieu of unstructured text. This includes modular safety alignment (MOSAIC (Peng et al., 17 Mar 2026)), audio-visual prompts (SOUPLE (Nguyen et al., 24 Mar 2026)), or learnable graph pooling prompts for LLMs (Kim et al., 29 Jan 2025).
Sparse/Meta Representation: A tiny bank of tokens can act as meta- or expert-tokens to summarize or sparsify otherwise dense representations (LeMeViT (Jiang et al., 2024), SET (Yi et al., 2024), METransformer (Wang et al., 2023)).
Structural/Spatial Grounding: Discrete vocabularies of spatial tokens (e.g., grid tokens, offset tokens) allow MLLMs to natively reference 2D locations, bounding boxes, or segmentation masks (Ren et al., 11 Dec 2025).
Time/Scale Decomposition: In time-series or structured signals, tokens may encode features at multiple resolutions or via learned frequency decompositions (LEFT (Wang et al., 9 Feb 2026), Kinematic Tokenization (Kearney, 15 Jan 2026)).
Tokenization Boundary Prediction: In text, tokens can represent learnable boundary predictors for dynamic, per-sample tokenization, as in FLEXITOKENS (Owodunni et al., 17 Jul 2025).
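The pooling and bottleneck roles listed above can be illustrated with a single-head cross-attention step in which $T$ learnable tokens act as queries over $N$ input features. This is a generic sketch under simplifying assumptions (no learned query, key, or value projections, and a hypothetical function name `token_pool`); it is not the mechanism of any specific cited method.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def token_pool(tokens, x):
    """Pool N input features into T summaries using tokens as queries.

    tokens: (T, d) learnable tokens
    x:      (N, d) input features
    Returns the pooled features (T, d) and attention weights (T, N).
    """
    d = x.shape[-1]
    scores = tokens @ x.T / np.sqrt(d)   # (T, N) scaled dot-product scores
    weights = softmax(scores, axis=-1)   # each token distributes attention over inputs
    return weights @ x, weights

rng = np.random.default_rng(0)
N, T, d = 50, 4, 16                      # illustrative sizes
x = rng.normal(size=(N, d))
tokens = rng.normal(0.0, 0.02, size=(T, d))

pooled, weights = token_pool(tokens, x)
print(pooled.shape, weights.shape)  # (4, 16) (4, 50)
```

Whatever the input length $N$, the output has a fixed size of $T \times d$, which is what makes the tokens a fixed-capacity bottleneck.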

3. Training Objectives and Optimization Schemes

The learning objective for tokens depends on their functional role and architectural integration:

Supervised Losses: Tokens may be directly supervised via standard cross-entropy or task-specific objectives applied to their outputs, e.g., classification, segmentation, retrieval (Jiang et al., 2024, Georgiou et al., 15 Apr 2025, Sun et al., 13 Apr 2026).
Distillation and Alignment: In continual learning or knowledge consolidation, tokens facilitate embedding alignment with additional regularization such as masked distillation losses or diversity penalties (Zhu et al., 9 Jun 2025).
Contrastive and Generative Information Condensation: Sequential, autoregressive, or contrastive losses may be imposed exclusively via tokens using specialized masks (as in “condensation masks” forcing all predictive signals through the bottleneck (Sun et al., 13 Apr 2026)).
Cycle, Consistency, or Diversity Regularization: Cycle-consistency (e.g., with analysis-synthesis loops across modalities (Wang et al., 9 Feb 2026)), orthogonality (e.g., forcing expert tokens to specialize (Wang et al., 2023)), or attention-diversity can be imposed to maximize token effectiveness.
Reinforcement Learning: Spatial/token compositional vocabularies can be finetuned with RL under multi-phase reward structures (grid-phase, offset-phase) for precise spatial grounding (Ren et al., 11 Dec 2025).
Unsupervised/Self-Supervised: For latent bottlenecks or feature summarization, tokens can be optimized with adversarial, reconstruction, or mutual information-based objectives (Wang et al., 9 Feb 2026, Cornelissen et al., 25 Mar 2026).

Token-specific hyperparameters—number, dimensionality, relative placement, regularization weight—substantially influence performance and are generally tuned empirically for each domain (Yi et al., 2024, Georgiou et al., 15 Apr 2025, Sun et al., 13 Apr 2026).
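To make the orthogonality and diversity regularizers mentioned above concrete, the sketch below penalizes the squared off-diagonal entries of the cosine-similarity (Gram) matrix of the tokens. This is one common generic form, assumed here for illustration; the cited works may use different formulations, and the function name `orthogonality_penalty` is hypothetical.

```python
import numpy as np

def orthogonality_penalty(tokens, eps=1e-8):
    """Sum of squared pairwise cosine similarities between distinct tokens.

    tokens: (T, d) array. Returns 0 when all tokens are mutually orthogonal.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / (norms + eps)             # unit-length rows
    gram = normed @ normed.T                    # (T, T) cosine similarities
    off_diag = gram - np.eye(tokens.shape[0])   # remove self-similarity
    return float((off_diag ** 2).sum())

orthogonal = np.eye(3, 5)    # three mutually orthogonal tokens in R^5
collapsed = np.ones((3, 5))  # three identical tokens (fully redundant)

print(orthogonality_penalty(orthogonal))  # close to 0
print(orthogonality_penalty(collapsed))   # close to 6 (six off-diagonal ones)
```

In training, a term of this kind would be added to the task loss with a regularization weight, which is one of the token-specific hyperparameters noted above.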

4. Empirical Benefits and Theoretical Trade-Offs

Empirical studies have consistently demonstrated that learnable tokens offer:

Compression and Efficiency: Replacing full self-attention over $N$ tokens ($O(N^2)$) with attention routed through $T \ll N$ learnable tokens yields significant computational savings in vision (Jiang et al., 2024) and time series (Wang et al., 9 Feb 2026), with negligible or positive accuracy impact.
Superior Adaptation/Fine-tuning: Token-based finetuning often matches or outperforms full- or head-only finetuning for parameter efficiency and generalization (Jiang et al., 2024, Zhu et al., 9 Jun 2025). In MLLMs, dedicated spatial or bottleneck tokens enable new capabilities (precision grounding, modular retrieval) absent in static <EOS>-pooling (Ren et al., 11 Dec 2025, Sun et al., 13 Apr 2026).
Flexibility and Compositionality: Modular tokens (e.g., safety, prompt, or context tokens) can be flexibly combined at inference—enabling conditional behaviors, task- or constraint-specific activation, or rapid domain adaptation without retraining entire backbones (Peng et al., 17 Mar 2026, Nguyen et al., 24 Mar 2026).
Interpretability and Specialization: Diversity/orthogonality losses or voting schemes yield tokens that specialize on distinct sub-tasks or features, enhancing interpretability and enabling attention analysis (Wang et al., 2023).
Information Preservation/Selective Pooling: Intermediate representation tokens (e.g., LGPT (Kim et al., 29 Jan 2025), SET (Yi et al., 2024)) balance fine- and coarse-grained information, outperforming naive mean-pooling or node-level prompts in graphs, tabular, or spatially structured data.
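The efficiency argument above can be checked with simple arithmetic. The sketch below counts pairwise attention scores only, assuming a design in which the inputs and the learnable tokens exchange information through two cross-attention passes (tokens attend to inputs, then inputs attend to tokens). That design and the sizes used are assumptions for illustration; the count is not a measured speedup for any cited model.

```python
def attention_score_counts(n_inputs, n_tokens):
    """Return (full self-attention scores, scores when routed through tokens).

    Full self-attention computes one score per input pair: N * N.
    Routing through T tokens with two cross-attention passes computes
    T * N scores (tokens -> inputs) plus N * T scores (inputs -> tokens).
    """
    full = n_inputs * n_inputs
    via_tokens = 2 * n_inputs * n_tokens
    return full, via_tokens

full, via_tokens = attention_score_counts(n_inputs=3136, n_tokens=16)
print(full, via_tokens, full / via_tokens)  # 9834496 100352 98.0
```

The ratio is $N / (2T)$, so the saving grows linearly with the input length for a fixed number of tokens.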

5. Domain-Specific Implementations

Vision

LeMeViT employs a small set of learnable meta tokens, initialized by cross-attention and refined in dual cross-attention loops, achieving state-of-the-art accuracy with significant computational gains over prior sparse-token methods (Jiang et al., 2024).
SET introduces pairs of spectral tokens (amplitude/phase decomposition), with attention-based feature enhancement and standardized inference for domain-robust segmentation, achieving state-of-the-art mIoU in domain generalization (Yi et al., 2024).

Text and Tokenization

FLEXITOKENS embeds learnable boundary-prediction modules into byte-level LMs, enabling dynamic adaptation to domain, script, and language, overcoming subword tokenizer rigidity, and yielding higher compression and downstream task accuracy (Owodunni et al., 17 Jul 2025).
“Pause tokens” introduce computed delays in LM output, expanding the model’s “computational width” and providing measurable improvements on QA and reasoning tasks when jointly pre-trained and finetuned with delays (Goyal et al., 2023).

Multimodal and Retrieval

Bottleneck tokens, inserted at the output of decoder-only MLLMs, serve as fixed-size explicit pooling mechanisms; coupled with generative condensation masks, these tokens deliver substantial retrieval and QA gains (Sun et al., 13 Apr 2026).
GETok introduces specialized spatial token vocabularies (grid and offset) for 2D grounding in MLLMs, achieving new state-of-the-art results across referring, segmentation, and comprehension tasks while preserving inference efficiency (Ren et al., 11 Dec 2025).
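One way to picture a mask that forces predictive signal through bottleneck tokens is sketched below. It assumes a sequence laid out as inputs, then bottleneck tokens, then target positions, with a causal mask in which target positions are additionally blocked from attending to the raw inputs. This layout and the function name `condensation_mask` are assumptions for illustration; the exact masking scheme of the cited work is not specified here.

```python
import numpy as np

def condensation_mask(n_inputs, n_bottleneck, n_targets):
    """Boolean attention mask; mask[i, j] is True if position i may attend to j.

    Assumed layout: [inputs | bottleneck tokens | targets].
    """
    length = n_inputs + n_bottleneck + n_targets
    # Causal base: each position sees itself and earlier positions.
    mask = np.tril(np.ones((length, length), dtype=bool))
    # Targets may not look at raw inputs, only at bottleneck tokens
    # and at earlier targets.
    target_start = n_inputs + n_bottleneck
    mask[target_start:, :n_inputs] = False
    return mask

mask = condensation_mask(n_inputs=3, n_bottleneck=2, n_targets=2)
print(mask.astype(int))
```

Under this mask the bottleneck tokens still see every input, so any information a target position needs has to be carried by them.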

Control and Prompting

MOSAIC leverages small banks of modular, composable control tokens—optimized with order-based task sampling and distribution-level alignment—to enforce safety constraints at inference, enabling flexible and granular policy composition without parameter updates of the backbone LLM (Peng et al., 17 Mar 2026).
SOUPLE and other audio-visual methods replace fixed CLIP prompts with small, learnable context vectors, enabling conditional context and superior alignment in localization and segmentation (Nguyen et al., 24 Mar 2026).
LGPTs generalize from single- to multi-token graph pooling for LLM prompts, balancing local and global structural information and improving structured QA performance in LLMs (Kim et al., 29 Jan 2025).
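The idea of combining modular control or prompt tokens at inference can be sketched as selecting token banks by name and prepending them to the prompt embeddings, with the backbone left unchanged. The bank names, shapes, and the function `compose_control_tokens` are hypothetical, and prepending is an assumed placement; the cited methods may combine their tokens differently.

```python
import numpy as np

def compose_control_tokens(banks, active, prompt_emb):
    """Prepend the selected token banks to the prompt embeddings.

    banks:      dict mapping a name to a (T_k, d) array of trained tokens
    active:     list of bank names to enable for this request
    prompt_emb: (N, d) prompt embeddings
    """
    selected = [banks[name] for name in active]
    return np.concatenate(selected + [prompt_emb], axis=0)

rng = np.random.default_rng(0)
d = 8
banks = {
    "policy_a": rng.normal(size=(2, d)),  # hypothetical control-token bank
    "policy_b": rng.normal(size=(3, d)),  # hypothetical control-token bank
}
prompt_emb = rng.normal(size=(5, d))

seq = compose_control_tokens(banks, ["policy_a", "policy_b"], prompt_emb)
print(seq.shape)  # (10, 8)
```

Passing an empty `active` list returns the prompt embeddings alone, which corresponds to running the backbone with no control tokens enabled.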

6. Practical Considerations and Limitations

Token Count and Placement: The optimal number and placement of tokens must be determined empirically. Too few tokens risk undercapacity (loss of information), while too many diminish the compression benefits (Sun et al., 13 Apr 2026, Georgiou et al., 15 Apr 2025).
Initialization and Training: Token initialization, regularization, and curriculum (e.g., pretrained encoders for fusion) control convergence and prevent collapse or redundancy (Georgiou et al., 15 Apr 2025, Yi et al., 2024).
Overhead: While inference-time overhead is generally negligible (a small latency cost is reported in (Sun et al., 13 Apr 2026)), generative losses and multi-phase training can increase training cost by up to 38%, and some methods additionally require two-pass KV cache schemes or large multitask RL schedules (Sun et al., 13 Apr 2026, Ren et al., 11 Dec 2025).
Scalability: Tokens introduce only $T \cdot d$ extra parameters (typically <1% of model size), supporting scalable parameter-efficient transfer learning and continual learning (Zhu et al., 9 Jun 2025, Peng et al., 17 Mar 2026).
Generalization: Compositional tokens can be incrementally extended (e.g., to new safety categories) or recombined during inference, without catastrophic forgetting or retraining (Peng et al., 17 Mar 2026).
Limitations: For domains with radically different structural priors, token-design may require substantial reengineering (e.g., spatial tokens for 3D data, dynamic morphologies in language (Owodunni et al., 17 Jul 2025)).
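The parameter overhead noted under Scalability follows from the formulation in Section 1: $T$ tokens of dimension $d$ contribute $T \cdot d$ trainable values. The sketch below computes that count and its share of the total model size. The token count, model dimension, and backbone size are assumed values chosen only for illustration.

```python
def token_parameter_fraction(n_tokens, d_model, backbone_params):
    """Return (token parameter count, fraction of total parameters)."""
    token_params = n_tokens * d_model
    fraction = token_params / (backbone_params + token_params)
    return token_params, fraction

# Assumed example: 32 tokens of width 768 on a hypothetical 86M-parameter backbone.
token_params, fraction = token_parameter_fraction(
    n_tokens=32, d_model=768, backbone_params=86_000_000
)
print(token_params)        # 24576
print(f"{fraction:.4%}")   # well under 1% of the model
```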

7. Extensions, Generalization, and Open Directions

Arbitrary Structure Summarization: The token-pooling principle generalizes to time series, graphs, tables, and transformers over arbitrary modalities (images, video, speech), serving as universal compression and fusion interfaces (Kim et al., 29 Jan 2025, Georgiou et al., 15 Apr 2025, Wang et al., 9 Feb 2026).
Dynamic/Contextual Tokenization: Contextually adaptive or “dynamic token” mechanisms (FLEXITOKENS, TokenLearner) hold promise for continual model adaptation and efficient scaling.
Modular Control and Safety: Composable control tokens (as in MOSAIC) address the need for granular, user- or deployment-specific policy application in LLMs without entangling core capabilities.
Interpretable Representations: Orthogonal or diversity-regularized tokens specialize in ways that can be inspected and selected individually, opening directions in interpretability and selective inference.
Unexplored Modalities/Architectures: Potential extensions include 3D/temporal grounding, reinforcement-based token routing, and hybrid discrete/continuous token regimes.

Learnable tokens have emerged as a general and powerful architectural primitive for modern deep models, enabling efficiency, modularity, control, and extensibility across modalities and tasks, with strong empirical gains and growing theoretical support (Jiang et al., 2024, Sun et al., 13 Apr 2026, Peng et al., 17 Mar 2026, Ren et al., 11 Dec 2025).