One-D-Piece: Variable-Length Image Tokenizer
- One-D-Piece is a variable-length image compression framework that uses a one-dimensional sequence of discrete tokens, enabling adjustable quality through Tail Token Drop.
- It employs a ViT-based encoder and TiTok-inspired tokenization with cross-entropy and reconstruction losses to effectively prioritize information in early tokens.
- Empirical results demonstrate superior rate-distortion performance and enhanced downstream task compatibility compared to traditional codecs like JPEG and WebP.
One-D-Piece is a discrete image tokenizer that emits a one-dimensional sequence of tokens (a "piece") of variable length, enabling quality-controllable compression. The primary technical reference is "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" (Miwa et al., 17 Jan 2025), which introduces and analyzes a variable-length, discrete token-based image compression architecture grounded in the TiTok framework. The concept represents a new paradigm in learned compression: image reconstructions of adjustable quality from a single model, with efficient downstream compatibility.
1. Architectural Foundations of One-D-Piece
One-D-Piece is constructed around a TiTok-derived encoder-decoder pipeline augmented for variable-length tokenization. The encoder is a Vision Transformer (ViT): it ingests an image, partitions it into patches, embeds the patches into continuous latent vectors, and appends learnable "latent" tokens. These latent tokens are then vector-quantized: each is mapped to a discrete code via a learned codebook.
Formally, the quantized token sequence is $z = (z_1, z_2, \dots, z_K)$. The decoder leverages MaskGIT and CNN components to reconstruct the image from $z$. The central architectural element unique to One-D-Piece is the Tail Token Drop mechanism: during both training and inference, $z$ can be arbitrarily truncated in length, enabling explicit control over the quality-rate tradeoff. At inference, a user can select any prefix length $m \le K$, yielding a compressed representation of $1.5m$ bytes ($12$ bits per token).
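The vector-quantization step can be sketched as a nearest-neighbor lookup against the codebook. This is a minimal NumPy illustration, not the paper's implementation; the codebook size of 4096 follows from the stated 12 bits per token ($2^{12} = 4096$), while the latent dimension and random initialization are placeholder assumptions.

```python
import numpy as np

# Illustrative sketch of vector quantization: each continuous latent
# vector is mapped to the index of its nearest codebook entry.
# CODEBOOK_SIZE = 4096 matches 12 bits/token; DIM = 16 is an assumption.
rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM = 4096, 16
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map each row of a (K, DIM) latent array to its nearest codebook index."""
    # Squared Euclidean distance between every latent and every code.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # shape (K,), values in [0, CODEBOOK_SIZE)

K = 256  # maximum token sequence length
tokens = quantize(rng.normal(size=(K, DIM)))
```

Each of the `K` tokens is then an integer index, which is what makes the sequence a discrete, truncatable bitstream.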
2. Tail Token Drop Regularization
Tail Token Drop is the key regularization innovation in One-D-Piece. For each batch during training, a truncation length $n$ is sampled. Only the prefix $z_{1:n}$ is provided to the decoder, which must reconstruct the image from this truncated sequence. This stochastic truncation compels the tokenizer to concentrate information in the leading tokens, imposing information prioritization without any explicit loss term beyond the standard reconstruction objectives.
This approach effectively folds what could be an explicit, length-averaged reconstruction regularizer,

$$\mathcal{L}_{\text{drop}} = \frac{1}{K}\sum_{n=1}^{K} \mathcal{L}_{\text{rec}}\big(x,\ \mathrm{Dec}(z_{1:n})\big),$$

into the fundamental loss by randomizing the number of tokens per training instance, thereby making the model robust to arbitrary cut-off lengths at inference.
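The training-time mechanism above can be sketched in a few lines. This is a toy illustration under assumed shapes; `decode` is a hypothetical stand-in for the trained decoder, and the uniform sampling of the cut-off is an assumption for illustration.

```python
import numpy as np

# Sketch of Tail Token Drop during training: sample a cut-off n for the
# batch, keep only the token prefix, and reconstruct from it.
rng = np.random.default_rng(0)
K = 256  # full token sequence length

def decode(prefix: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: the real decoder maps a token prefix to an image."""
    return np.zeros((256, 256, 3))

def tail_token_drop_step(tokens: np.ndarray):
    n = int(rng.integers(1, K + 1))  # sampled truncation length in [1, K]
    prefix = tokens[:n]              # truncated sequence z_{1:n}
    recon = decode(prefix)           # decoder must cope with any prefix length
    return n, recon

n, recon = tail_token_drop_step(np.arange(K))
```

Because `n` varies across batches, gradients repeatedly penalize reconstructions from short prefixes, which is what pushes information toward the head of the sequence.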
3. Training Procedures and Objectives
The training protocol follows a two-stage TiTok scheme:
- Stage 1 (Token Prediction): A cross-entropy loss is minimized between the predicted distribution over tokens and target codes from a pretrained tokenizer across all positions, $\mathcal{L}_{\text{CE}} = -\sum_{i=1}^{K} \log p_\theta(\hat{z}_i \mid x)$, where $\hat{z}_i$ denotes the target code at position $i$.
- Stage 2 (Reconstruction): The encoder is frozen; training optimizes a composite loss,

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_P \mathcal{L}_{\text{percep}} + \lambda_G \mathcal{L}_{\text{GAN}},$$

where $\mathcal{L}_{\text{rec}}$ is a pixel-space reconstruction loss, $\mathcal{L}_{\text{percep}}$ is a perceptual loss (with a ConvNeXT feature projector), and $\mathcal{L}_{\text{GAN}}$ is the typical PatchGAN adversarial loss, with weighting coefficients $\lambda_P$ and $\lambda_G$. During this stage, Tail Token Drop is active, so the model learns to optimize reconstructions under varying token count constraints.
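The Stage-2 objective can be sketched as a weighted sum of the three terms. The perceptual and adversarial terms are stand-ins here (computing them requires the frozen feature network and the discriminator), and the weights `lam_p` and `lam_g` are illustrative placeholders, not the paper's coefficients.

```python
import numpy as np

# Sketch of the composite Stage-2 loss. Only the pixel reconstruction
# term is computed for real; the other two are hypothetical stand-ins.
def composite_loss(x: np.ndarray, x_hat: np.ndarray,
                   lam_p: float = 1.0, lam_g: float = 0.1) -> float:
    l_rec = float(((x - x_hat) ** 2).mean())  # pixel-space reconstruction (MSE)
    l_percep = 0.0  # stand-in: perceptual distance via frozen feature projector
    l_gan = 0.0     # stand-in: PatchGAN generator loss
    return l_rec + lam_p * l_percep + lam_g * l_gan

x = np.ones((8, 8, 3))
x_hat = np.zeros_like(x)
loss = composite_loss(x, x_hat)  # 1.0 for these all-ones vs all-zeros inputs
```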
4. Compression Efficiency and Quality Control
The variable-length property allows explicit, token-by-token quality-budget steering. For a given prefix length $m$, the decoder reconstructs $\hat{x} = \mathrm{Dec}(z_{1:m})$. As $m$ increases, quality metrics and perceptual fidelity uniformly improve. Empirically, recognizable reconstructions are achieved at small $m$, and near-lossless rendering is seen as $m$ approaches the full 256-token budget [(Miwa et al., 17 Jan 2025), Table 1, Figure 1].
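The byte cost of any prefix follows directly from the 12-bits-per-token rate: $m$ tokens occupy $12m/8 = 1.5m$ bytes.

```python
# Byte budget of a token prefix at 12 bits per token.
def prefix_bytes(m: int) -> float:
    return m * 12 / 8  # 1.5 bytes per token

# Consistent with the rates quoted in the text:
# 128 tokens -> 192 bytes, 256 tokens -> 384 bytes.
sizes = {m: prefix_bytes(m) for m in (1, 128, 256)}
```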
Quantitative metrics (rFID, PSNR, LPIPS, SSIM, and pixel-wise error) establish that One-D-Piece consistently surpasses traditional codecs (JPEG, WebP) and existing neural tokenizers at equivalent or lower byte rates, in both distortion-oriented and perceptual metrics. At 256 tokens (384 bytes), rFID is $1.08$ versus $31.98$ for WebP at a comparable rate, confirming major improvements in perceptual similarity.
5. Downstream Task Performance
One-D-Piece's reconstructions demonstrate high compatibility with computer vision pipelines. Off-the-shelf networks (ConvNeXT for classification, YOLO11x for detection, SERE for segmentation, Depth Anything for depth estimation, CLIP for embedding retrieval) were evaluated on outputs produced with varying prefix length $m$. For example, at 128 tokens (192 bytes), ImageNet-1K top-1 classification accuracy reaches $0.779$ versus $0.664$ for WebP. In semantic segmentation (mIoU), the gain is similar: $0.572$ for One-D-Piece-L (128 tokens) versus $0.410$ for WebP [(Miwa et al., 17 Jan 2025), Table 3]. This trend reflects strong preservation of task-relevant semantic and structural detail at low bitrates.
6. Analytical Insights and Ablation Studies
Several ablations demonstrate that information is effectively packed into the earliest tokens:
- Token Contribution: Replacing individual tokens $z_i$ with random codes and measuring the resulting reconstruction degradation yields a monotonic decay in impact from the first token downward, a far stronger prioritization than in baseline TiTok (Figure 2).
- First-token Clustering: Images with an identical first token $z_1$ share global appearance; swapping $z_1$ between images swaps their scene layouts, indicating that the leading tokens encode coarse information (Figure 3).
- Linear Probing: Probing encoder features with a linear classifier on ImageNet-1K, One-D-Piece-L achieves $0.389$ accuracy, surpassing TiTok and confirming the semantic informativeness of the head tokens.
These findings underscore the mechanism by which Tail Token Drop fosters robust learning of prioritization across the token sequence.
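The token-contribution ablation can be sketched as follows. The `decode` function here is a toy stand-in whose output weights earlier tokens more heavily, mimicking the prioritization the real ablation reveals; the actual study runs the trained decoder on corrupted token sequences.

```python
import numpy as np

# Sketch of the token-contribution ablation: replace token i with a
# random code, decode, and measure the output change.
rng = np.random.default_rng(0)
K, CODEBOOK_SIZE = 16, 4096

def decode(tokens: np.ndarray) -> np.ndarray:
    # Toy stand-in decoder: earlier tokens get exponentially larger weight,
    # so corrupting them perturbs the "reconstruction" more.
    weights = 0.5 ** np.arange(len(tokens))
    return tokens * weights

def contribution(tokens: np.ndarray, i: int) -> float:
    corrupted = tokens.copy()
    corrupted[i] = rng.integers(CODEBOOK_SIZE)  # random replacement code
    return float(np.abs(decode(tokens) - decode(corrupted)).sum())

tokens = rng.integers(CODEBOOK_SIZE, size=K)
impacts = [contribution(tokens, i) for i in range(K)]
# With this toy decoder, impact tends to decay with position, echoing Figure 2.
```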
7. Applications and Implications
One-D-Piece is particularly suited to hardware-constrained and adaptive environments: on-device inference, variable-bitrate video/image transmission, vision-LLM (VLM) inputs, and generative pipelines. Its ability to produce meaningful representations at any prefix length up to the full token budget with a single model obviates retraining or codebook relearning as requirements shift. The superiority in both rate-distortion and perceptual utility, along with the confirmed downstream task performance without fine-tuning, sets a new baseline for quality-controllable, variable-rate image tokenizers (Miwa et al., 17 Jan 2025).
A plausible implication is that variable-length, one-dimensional tokenization will become foundational in future vision backbones and multimodal pipelines, especially where content-adaptive compression and inference on budget-constrained systems are critical.