Jasper Token Compression 600M: Efficient Transformer
- Jasper-Token-Compression-600M is a bilingual transformer embedding model that integrates a fully differentiable token compression block for efficient sequence reduction.
- It employs a two-layer SwiGLU-activated MLP and adaptive average pooling to achieve significant inference speedups, nearly matching an 8B teacher model’s performance.
- The model combines knowledge distillation and contrastive learning in a four-stage training pipeline, balancing compression ratios with embedding fidelity.
Jasper-Token-Compression-600M is an open-source bilingual (English and Chinese) transformer embedding model that introduces a fully differentiable, one-dimensional convolution-based token compression block for efficient sequence reduction. Developed as an extension of the Stella and Jasper distillation-based paradigms, this architecture leverages both knowledge distillation and contrastive learning to achieve high-quality embeddings with significant inference acceleration compared to conventional dense transformer models of similar parameter count (Zhang et al., 18 Nov 2025).
1. Model Overview and Motivation
Jasper-Token-Compression-600M is designed to address the high memory and compute overhead of deep transformer models processing long sequences. The core innovation is a token-compression module that reduces the sequence length prior to self-attention, yielding faster, more memory-efficient processing without substantial loss of embedding fidelity. The approach is motivated by the need for practical runtime efficiency, enabling a base 600M-parameter model to approach the performance of a full 8B-parameter teacher while offering substantial speed gains (Zhang et al., 18 Nov 2025). This is achieved by building on sequence-level convolutional compression concepts previously validated for deletion-based sentence compression (Hou et al., 2020).
2. Token Compression Block: Architecture and Operation
The Jasper-Token-Compression-600M architecture inserts a token-compression module between the word-piece embedding layer and transformer encoder blocks:
- Input: The model receives a sequence $X \in \mathbb{R}^{L \times d}$, where $L$ is the (possibly long) input length and $d$ the embedding width.
- Qwen3MLP Layer: A two-layer SwiGLU-activated feedforward network is applied:
- First linear mapping: $H = \mathrm{Dropout}_{0.1}\big(\mathrm{SwiGLU}(X W_1)\big) \in \mathbb{R}^{L \times d_{\mathrm{ff}}}$, with SwiGLU activation and dropout 0.1
- Second linear mapping: $X' = H W_2 \in \mathbb{R}^{L \times d}$
- The result $X' \in \mathbb{R}^{L \times d}$ preserves the original sequence length $L$.
- AdaptiveAvgPool1d: A parameter-free 1D average-pooling layer reduces $X'$ along the length dimension to the target length $L_c$, yielding $X_c \in \mathbb{R}^{L_c \times d}$. The pooling kernel size $k$ and stride $s$ are dynamically chosen such that $L_c = \lfloor (L - k)/s \rfloor + 1$, with no padding.
- Transformer Stack: The compressed sequence $X_c$ is then processed by the standard Qwen3 transformer blocks (attention and FFN modules, now operating at length $L_c$).
The whole module is end-to-end differentiable; only the MLP contains trainable parameters, and no additional masking or complicated memory management is required (Zhang et al., 18 Nov 2025).
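A minimal PyTorch sketch of such a compression block is shown below, assuming a Qwen3-style SwiGLU MLP (gate/up/down projections) and `torch.nn.functional.adaptive_avg_pool1d` for the pooling step; the module names, hidden size, and dropout placement are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenCompressionBlock(nn.Module):
    """Sketch of the compression module: a SwiGLU (Qwen3MLP-style) feedforward
    network followed by parameter-free adaptive average pooling over the
    sequence dimension. Names and sizes are illustrative assumptions."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # SwiGLU gate branch
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # SwiGLU value branch
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # project back to d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, target_len: int) -> torch.Tensor:
        # x: (batch, L, d_model) word-piece embeddings
        h = self.down_proj(self.dropout(F.silu(self.gate_proj(x)) * self.up_proj(x)))
        if target_len >= h.size(1):
            return h  # short sequences pass through uncompressed
        # adaptive_avg_pool1d pools over the last dimension, so move length there
        h = h.transpose(1, 2)                     # (batch, d_model, L)
        h = F.adaptive_avg_pool1d(h, target_len)  # (batch, d_model, L_c)
        return h.transpose(1, 2)                  # (batch, L_c, d_model)
```

Under these assumptions, a 1,024-token input with $\rho = 0.5$ would use `target_len = 512`, and the Qwen3 transformer stack then attends over the pooled 512-token sequence.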
3. Dynamic Compression Scheduling
The input length after compression, $L_c$, is determined by two hyperparameters:
- Compression ratio $\rho \in (0, 1]$: controls the aggressiveness of pooling.
- Threshold $\tau$: for short sequences ($L \le \tau$), no compression is applied ($L_c = L$); otherwise, $L_c = \lceil \rho \cdot L \rceil$.
During training, $\rho$ is dynamically sampled per batch from a fixed schedule of four compression settings, drawn with probabilities 0.1, 0.4, 0.3, and 0.2, respectively.
This exposes the network to a spectrum of compression rates, fostering robustness to variable-length sequences and variable compression (Zhang et al., 18 Nov 2025).
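The sketch below illustrates how such a schedule could be implemented; the pairing of specific $\rho$ values with the probabilities 0.1/0.4/0.3/0.2, the default threshold, and the rounding rule are assumptions of this example, not values confirmed by the paper.

```python
import math
import random

def compressed_length(L: int, rho: float, tau: int = 64) -> int:
    """Target sequence length after pooling. Sequences no longer than the
    threshold tau are left uncompressed; the ceiling rounding and the default
    tau=64 are assumptions of this sketch."""
    if L <= tau or rho >= 1.0:
        return L
    return max(tau, math.ceil(rho * L))

# Hypothetical per-batch schedule: the probabilities (0.1/0.4/0.3/0.2) follow
# the paper; the rho values paired with them here are illustrative placeholders.
RHO_SCHEDULE = [(1.0, 0.1), (0.5, 0.4), (0.33, 0.3), (0.2, 0.2)]

def sample_rho() -> float:
    ratios, probs = zip(*RHO_SCHEDULE)
    return random.choices(ratios, weights=probs, k=1)[0]
```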
4. Integration with Distillation and Contrastive Training
The token-compression block is embedded within a four-stage training pipeline:
- Stage 1: Plain knowledge distillation (KD), cosine loss only: $\mathcal{L}_{\cos} = 1 - \cos(\mathbf{e}_s, \mathbf{e}_t)$, where $\mathbf{e}_s$ and $\mathbf{e}_t$ denote the student and teacher embeddings.
- Stage 2: Fixed-ratio compression with KD; MLP weights updated, pooling static.
- Stage 3: Dynamic compression with structure-preserving distillation; adds a pairwise similarity loss $\mathcal{L}_{\mathrm{sim}}$ that aligns student and teacher pairwise similarities. Total loss: $\mathcal{L} = \mathcal{L}_{\cos} + \lambda\,\mathcal{L}_{\mathrm{sim}}$, with weighting coefficient $\lambda$.
- Stage 4: Contrastive retrieval fine-tuning with an InfoNCE objective $\mathcal{L}_{\mathrm{InfoNCE}}$ combined with a soft KL-distillation term $\mathcal{L}_{\mathrm{KL}}$.
All losses are backpropagated through the token-compression block with standard gradient flow. The average pooling is parameter-free; all learnable parameters reside in the MLP.
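A hedged PyTorch sketch of the named objectives (cosine distillation, pairwise-similarity distillation, InfoNCE, and soft KL-distillation) follows; the loss weightings, temperatures, and the exact formulation of each term in the paper may differ from these standard forms.

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Stage-1 style cosine distillation on (batch, dim) embeddings."""
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def pairwise_similarity_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Structure-preserving term: match in-batch cosine-similarity matrices of
    student and teacher (one plausible form of a pairwise similarity loss)."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def info_nce_loss(queries: torch.Tensor, docs: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive retrieval objective with in-batch negatives: the i-th doc is
    the positive for the i-th query. The temperature value is an assumption."""
    logits = F.normalize(queries, dim=-1) @ F.normalize(docs, dim=-1).T / temperature
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)

def soft_kl_distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """Soft KL-distillation over similarity distributions (illustrative form)."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
```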
5. Performance Evaluation and Trade-offs
Jasper-Token-Compression-600M achieves strong performance and efficiency trade-offs, as measured on the Massive Text Embedding Benchmark (MTEB):
| Model | MTEB (en) | MTEB (zh) | Inference time (1K tokens) |
|---|---|---|---|
| Vanilla 0.6B | 70.70 | 66.33 | 24.24 ms |
| Jasper-TC-600M (ρ = 0.5) | 74.75 | 73.51 | 13.11 ms (≈46% faster) |
| Jasper-TC-600M (ρ = 0.33) | 74.58 | – | 9.38 ms (×2.6 speedup) |
| Jasper-TC-600M (ρ = 0.2) | 74.21 | – | 6.56 ms (×3.7 speedup) |
| Jasper-TC-600M (ρ = 0.1) | – | – | 4.48 ms (×5.4 speedup) |
| 8B teacher | 75.22 | 73.84 | – |
Reducing $\rho$ results in minimal drops in Mean(Task) scores while offering linear-to-superlinear throughput gains. At the default $\rho = 0.5$, the 600M model nearly matches the 8B teacher's quality at almost twice the inference speed of the uncompressed 0.6B baseline, and more aggressive ratios push the speedup to ×3.7 and beyond with only minor metric loss (e.g., 74.21 vs. 74.75 MTEB (en) at $\rho = 0.2$) (Zhang et al., 18 Nov 2025).
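The shape of these gains is consistent with a simple cost model (a back-of-the-envelope estimate, not a figure reported in the paper): self-attention scales quadratically and the FFN linearly in sequence length, so compressing to $L_c = \rho L$ gives

$$\text{cost} \;\approx\; \underbrace{O(\rho^2 L^2 d)}_{\text{attention}} \;+\; \underbrace{O(\rho L\, d^2)}_{\text{FFN}},$$

i.e., the attention term shrinks superlinearly in $\rho$ while the FFN term shrinks linearly.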
6. Comparative Context: Related Token Compression Methods
Prior approaches to token compression in NLP tasks—particularly deletion-based models—leveraged 1D convolutional encoder-decoder networks, employing U-Net style architectures with skip connections for retaining fine-grained token information (Hou et al., 2020). However, these models focused primarily on sentence-level binary masking for deletion, producing a retained/deleted mask per token using a block of stacked 1D convolutions, max-pooling, and upsampling layers.
By contrast, Jasper-Token-Compression-600M applies a learnable MLP followed by a non-parametric adaptive pooling operation, reducing the entire sequence length prior to standard transformer processing and enabling highly efficient memory and compute profiles. A plausible implication is that this strategy avoids the masking and alignment complications found in probabilistic or hard-selection compression, while maintaining end-to-end differentiability and interpretability.
7. Significance and Potential Impacts
Jasper-Token-Compression-600M establishes that deep transformer models can incorporate trainable, fully differentiable compression modules to flexibly trade sequence length for efficiency, without catastrophic loss in embedding performance. Its approach integrates seamlessly into distillation and contrastive learning frameworks, offering practical deployment options for large-scale retrieval, embedding, and multilingual tasks. The methodology further generalizes the utility of convolution-inspired sequence reduction in modern transformer pipelines beyond application-specific sentence compression (Zhang et al., 18 Nov 2025, Hou et al., 2020).