1D Convolutional Token Compression

Updated 19 November 2025
  • The technique reduces token sequence lengths using 1D CNN operations to aggregate local context and decide which tokens to keep or drop.
  • It integrates trainable, deletion-based architectures with parameter-free pooling modules to achieve dynamic, efficient compression.
  • Empirical results demonstrate improved training speed and competitive F1 scores (≈0.80–0.84) on benchmark datasets like GoogleNews.

One-dimensional convolution-based token compression refers to neural architectures and algorithmic modules that employ one-dimensional (1D) convolutional operations to reduce variable-length sequences of token embeddings—typically in natural language processing—into shorter, information-preserving compressed representations. This paradigm is deployed both as an explicit learned deletion mechanism at the token level (e.g., for sentence compression) and as a parameter-free, dynamic pooling layer for sequence-length reduction upstream of deep encoders. Recent instantiations include trainable U-Net style CNNs for token-wise deletion and non-learned convolutional compression blocks such as the AdaptiveAvgPool1d-based module in Jasper-Token-Compression-600M.

1. 1D Convolutional Principles for Sequence Compression

One-dimensional convolution, formally defined for a sequence $\{x_1, \ldots, x_L\}$ as $h_i = f\left(\sum_{j=1}^{k} W_j x_{i+j-1} + b\right)$ with a nonlinearity $f(\cdot)$ such as ReLU, enables local context aggregation over a fixed receptive field, supporting the compression or transformation of sequence-based inputs for subsequent modeling stages. In deletion-based compression (as in sentence compression), these feature maps are further processed to output binary keep/delete decisions per position, while in pooling-based compression, they directly yield shortened sequences.

The convolution operation's window size ($k$), stride, and choice of padding control the granularity of local information aggregation, and thus the quality and resolution of the compressed outputs. In module designs that use parameter-free averaging, the kernel is chosen to partition and downsample the sequence uniformly.
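
As an illustration (not drawn from either cited paper), a single strided 1D convolution in PyTorch already performs a simple form of length reduction; the channel sizes, kernel width, and stride below are arbitrary choices:

```python
import torch
import torch.nn as nn

# Illustrative only: a strided 1D convolution that aggregates a local window of
# tokens and roughly halves the sequence length. All sizes are arbitrary.
L, d_in, d_out = 128, 100, 128                  # tokens, input/output channels
conv = nn.Conv1d(d_in, d_out, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, d_in, L)                     # Conv1d expects (batch, channels, length)
h = torch.relu(conv(x))                         # nonlinearity f(.) = ReLU
print(h.shape)                                  # torch.Size([1, 128, 64])
```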

2. Trainable Token-wise CNN Compression Architectures

A canonical example of trainable, token-level deletion-based compression is presented in "A Token-wise CNN-based Method for Sentence Compression" (Hou et al., 2020). Given an input matrix $X \in \mathbb{R}^{L \times d_c}$ (with $d_c$ denoting embedding channels, e.g., GloVe 100d or concatenated BERT hidden states), the model applies a U-Net–style architecture (sketched in code after the list below):

  • Down-path: Two 1D convolutions (kernel 5 then 3, stride 1), followed by a 1D max-pool (reducing length $L \to L/2$).
  • Bottleneck: Two further 1D convolutions (both kernel size 3), expanding to a latent 128d feature.
  • Up-sampling: Nearest-neighbor or transposed convolution restores the sequence to length $L$, concatenated via skip connection with the last down-path output.
  • Up-path: Three 1D convolutions (kernels 3, 3, 1), outputting logits for each token position.
  • Output: Token-wise softmax over binary classes (keep/delete) applied to generate compression masks.
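
The following PyTorch sketch reflects this layout under stated assumptions: the kernel sizes follow the list above, but the channel widths, padding choices, and the use of nearest-neighbor upsampling (rather than a transposed convolution) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenwiseCompressionCNN(nn.Module):
    """Sketch of the U-Net-style token deletion model (channel widths assumed)."""
    def __init__(self, d_c: int, hidden: int = 64):
        super().__init__()
        # Down-path: kernel-5 then kernel-3 convolutions, then halve the length.
        self.down1 = nn.Conv1d(d_c, hidden, kernel_size=5, padding=2)
        self.down2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)
        # Bottleneck: two kernel-3 convolutions expanding to a 128d latent feature.
        self.mid1 = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.mid2 = nn.Conv1d(128, 128, kernel_size=3, padding=1)
        # Up-path: after upsampling and skip concatenation, kernels 3, 3, 1.
        self.up1 = nn.Conv1d(128 + hidden, hidden, kernel_size=3, padding=1)
        self.up2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Conv1d(hidden, 2, kernel_size=1)       # keep/delete logits

    def forward(self, x):                   # x: (B, L, d_c), L assumed even
        x = x.transpose(1, 2)               # -> (B, d_c, L)
        d = F.relu(self.down2(F.relu(self.down1(x))))
        m = self.pool(d)                    # (B, hidden, L/2)
        m = F.relu(self.mid2(F.relu(self.mid1(m))))
        u = F.interpolate(m, scale_factor=2, mode="nearest")  # back to length L
        u = torch.cat([u, d], dim=1)        # skip connection with down-path output
        u = F.relu(self.up2(F.relu(self.up1(u))))
        return self.out(u).transpose(1, 2)  # (B, L, 2) token-wise logits
```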

Training minimizes token-wise cross-entropy loss. Empirical results on the GoogleNews dataset indicate that CNN+Multi-layer-BERT achieves F1=0.80 (Acc=0.82) on a 10k test set, with marginally lower accuracy but ∼10× faster convergence than BiLSTM+BERT baselines.
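
Continuing the sketch above, a minimal training step with token-wise cross-entropy might look as follows; the batch, labels, and hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

# Uses the TokenwiseCompressionCNN sketch above; labels are random placeholders.
model = TokenwiseCompressionCNN(d_c=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 32, 100)                # (batch, tokens, embedding channels)
labels = torch.randint(0, 2, (8, 32))      # 1 = keep, 0 = delete, per token

optimizer.zero_grad()
logits = model(x)                          # (8, 32, 2)
loss = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
optimizer.step()
```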

Ablations clarify the importance of skip connections, multi-gram kernels, and upsampling. The mostly local nature of convolutional aggregation, and the lack of explicit modeling for long-range dependencies, constitute acknowledged limitations in such architectures (Hou et al., 2020).

3. Non-Learned 1D Convolutional Pooling for Dynamic Token Compression

Jasper-Token-Compression-600M (Zhang et al., 18 Nov 2025) exemplifies a parameter-free pooling module that can be interpreted as a 1D convolution with a uniform averaging kernel (equal weights summing to one), sliding over token positions with variable kernel size and stride. This mechanism operates on the output of a small, trainable feature transformation block (Qwen3MLP using SwiGLU activations):

  • Incoming representations: $X^{(0)} \in \mathbb{R}^{B \times L_\text{in} \times D}$.
  • Qwen3MLP: Two-layer feedforward block producing $X^{(1)} \in \mathbb{R}^{B \times L_\text{in} \times D}$.
  • AdaptiveAvgPool1d: Downsamples $L_\text{in}$ to $L_\text{tgt}$ via

$$Y_{b,i,d} = \frac{1}{|S_i|} \sum_{j \in S_i} X^{(1)}_{b,j,d}$$

where $S_i$ partitions the $L_\text{in}$ token positions among the $L_\text{tgt}$ outputs.

The only learnable parameters reside in the Qwen3MLP; all pooling weights are fixed. The resulting module enables controlled, runtime-adjustable sequence-length reduction immediately upstream of the Transformer block, conferring highly flexible speed–accuracy trade-offs.
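
A minimal sketch of such a block is given below, assuming a SwiGLU-style gated feed-forward as a stand-in for Qwen3MLP; the hidden sizes and MLP internals are illustrative, while the pooling step uses PyTorch's adaptive average pooling as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTokenCompressor(nn.Module):
    """Sketch of a Jasper-style block: gated MLP + parameter-free average pooling.
    Hidden size and the exact Qwen3MLP internals are assumptions."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # SwiGLU-style gated feed-forward (stand-in for Qwen3MLP).
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor, target_len: int) -> torch.Tensor:
        # x: (B, L_in, D) token representations.
        h = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
        # Parameter-free pooling: average disjoint windows down to target_len.
        h = h.transpose(1, 2)                     # (B, D, L_in) for pooling
        y = F.adaptive_avg_pool1d(h, target_len)  # (B, D, L_tgt)
        return y.transpose(1, 2)                  # (B, L_tgt, D)

# Example: compress 512 tokens down to roughly a third.
x = torch.randn(2, 512, 1024)
compressor = ConvTokenCompressor(d_model=1024, d_hidden=4096)
print(compressor(x, target_len=170).shape)        # torch.Size([2, 170, 1024])
```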

4. Dynamic Compression Rates and Curriculum Strategies

The Jasper approach employs a piecewise mechanism for selecting the compression ratio $r \in (0, 1]$ per mini-batch, transitioning from a fixed $r = 0.33$ (Stage 2) to dynamic ratios in later stages (Stages 3–4). For an input sequence of original length $L_\text{in}$ and a set threshold $L_\text{th} = 80$, the target compressed length is

$$L_\text{tgt}(L_\text{in}, r) = \begin{cases} L_\text{in}, & L_\text{in} \leq L_\text{th} \\ \lfloor L_\text{th} + (L_\text{in} - L_\text{th}) \cdot r \rfloor, & L_\text{in} > L_\text{th} \end{cases}$$

Compression ratios $r$ are sampled so that most batches fall in $[0.33, 1]$, with a lower-probability tail reaching more aggressive settings (as low as $0.1$). At inference, an arbitrary $r$ can be set to balance computational latency and representation fidelity as needed (Zhang et al., 18 Nov 2025).
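
The target-length rule translates directly into a short helper; the sampler for $r$ below is only illustrative, since the paper's exact sampling distribution is not reproduced here.

```python
import math
import random

L_TH = 80  # threshold below which sequences are left uncompressed

def target_length(l_in: int, r: float) -> int:
    """Piecewise rule from above: short sequences pass through unchanged."""
    if l_in <= L_TH:
        return l_in
    return math.floor(L_TH + (l_in - L_TH) * r)

def sample_ratio() -> float:
    """Illustrative sampler: mostly r in [0.33, 1], occasionally as low as 0.1."""
    if random.random() < 0.9:
        return random.uniform(0.33, 1.0)
    return random.uniform(0.1, 0.33)

print(target_length(512, 0.33))   # 222 = floor(80 + 432 * 0.33)
```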

5. Integration with Pretrained Language Representations

Both deletion-based (token-wise CNN) and pooling-based (Jasper) approaches leverage features from pretrained language models but differ in strategy:

  • Token-wise CNN: Utilizes frozen BERT (base, 12-layer, uncased, 768d) hidden states as token representations, optionally concatenating multiple upper-layer outputs to form input channels. No BERT fine-tuning occurs; all compression learning takes place in the CNN (see the feature-extraction sketch after this list).

BERT layer integration ablation reveals that aggregating multiple layers marginally increases F1 and accuracy but with diminishing returns beyond 2–3 layers (Hou et al., 2020).

  • Pooling-based compression (Jasper): Receives token embeddings (possibly already preprocessed/embedded) and compresses length before Transformer attention. This suggests the pooling operation can be applied as a plug-in to a broad array of pretrained encoder pipelines (Zhang et al., 18 Nov 2025).
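
The sketch below illustrates the frozen-BERT feature extraction referenced in the first bullet; the choice of which upper layers to concatenate is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Frozen-BERT feature extraction: concatenate several upper-layer hidden states
# to form the CNN's input channels. Using the last two layers is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # frozen: no fine-tuning, no gradients below

sentence = "The quick brown fox jumped over the lazy dog yesterday."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs, output_hidden_states=True)

# hidden_states: tuple of 13 tensors (embeddings + 12 layers), each (1, L, 768).
upper_layers = outputs.hidden_states[-2:]
x = torch.cat(upper_layers, dim=-1)      # (1, L, 1536) -> CNN input channels d_c
print(x.shape)
```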

6. Evaluation, Empirical Results, and Limitations

In deletion-based compression, CNN+Multi-layer-BERT achieves F1=0.80 (Acc=0.82) on GoogleNewsSmall and F1 ≈0.84 on GoogleNewsLarge, with an order-of-magnitude improvement in training speed over RNN-based methods. The Jasper compression module, tested on MTEB tasks, increases English Mean(Task) from 70.70 (uncompressed 0.6B-parameter baseline) to 74.75, with Chinese rising from 66.33 to 73.51. Encoder latency drops by nearly 2× at a compression ratio of 0.33, with minimal degradation in embedding quality.

Ablation studies indicate that the U-Net-style upsampling and skip connections (token-wise CNN), as well as the Qwen3MLP for feature transformation (Jasper), are integral to optimal performance. Notably, the parameter-free AdaptiveAvgPool1d yields significant computational savings without loss of representational power in downstream embedding tasks.

Limitations stem from the typically local nature of convolutional pattern extraction—hindering direct modeling of long-range token dependencies—and, in deletion settings, from reliance on auto-labeled training data, which may introduce noise. Future directions highlighted include hybrid models integrating CNN-based encoders as fast RL policy networks, joint fine-tuning of frozen language representations with CNN compression modules, and adaptive, task-specific control over compression rates (Hou et al., 2020, Zhang et al., 18 Nov 2025).

7. Applications and Prospects

One-dimensional convolution-based token compression, in both deletion and pooling formulations, is leveraged for high-throughput sentence compression, efficient retrieval, and memory-limited text encoding. Its capability for dynamic, per-sample control over compression ratio, in particular, enables flexible deployment in large-scale multilingual settings and interactive or resource-constrained applications. Extensions to adaptively tuned or reinforcement learning–integrated compression, as well as incorporation within retrieval-optimized architectures, represent areas of current and future research interest (Hou et al., 2020, Zhang et al., 18 Nov 2025).

References

  • Hou et al. (2020). A Token-wise CNN-based Method for Sentence Compression.
  • Zhang et al. (18 Nov 2025). Jasper-Token-Compression-600M.