
One-D-Piece: Variable-Length Image Tokenizer

Updated 2 February 2026
  • One-D-Piece is a variable-length image compression framework that uses a one-dimensional sequence of discrete tokens, enabling adjustable quality through Tail Token Drop.
  • It employs a ViT-based encoder and TiTok-inspired tokenization with cross-entropy and reconstruction losses to effectively prioritize information in early tokens.
  • Empirical results demonstrate superior rate-distortion performance and enhanced downstream task compatibility compared to traditional codecs like JPEG and WebP.

One-D-Piece refers to a family of methods and ideas in which the key structural element is a one-dimensional sequence (“piece”)—most notably, in the context of recent machine learning, a discrete image tokenizer with variable-length outputs for quality-controllable compression. The primary technical reference for One-D-Piece is "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" (Miwa et al., 17 Jan 2025), which introduces and analyzes a variable-length, discrete token-based image compression architecture grounded in the TiTok framework. The concept is central to a new paradigm in learned compression, enabling image reconstructions of adjustable quality with efficient downstream compatibility.

1. Architectural Foundations of One-D-Piece

One-D-Piece is constructed around a TiTok-derived encoder-decoder pipeline augmented for variable-length tokenization. The encoder is a Vision Transformer (ViT) that ingests an image $X\in\mathbb{R}^{H\times W\times C}$, partitions it into $16\times16$ patches, embeds the patches, and appends $N$ learnable "latent" tokens whose encoder outputs form a set of $N$ continuous latent vectors. These latents are vector-quantized: each $z_i$ is mapped to a discrete code $q_i\in\{1,\ldots,K\}$ via a learned codebook.

Formally, the quantized token sequence is $\mathbf{q} = [q_1, \ldots, q_N]$. The decoder leverages MaskGIT and CNN components to reconstruct the image from $\mathbf{q}$. The central architectural element unique to One-D-Piece is the Tail Token Drop mechanism: during both training and inference, $\mathbf{q}$ can be arbitrarily truncated in length, enabling explicit control over the quality-rate tradeoff. At inference, a user can select any prefix length $m\in[1,N]$, yielding a compressed representation of $1.5m$ bytes ($12$ bits per token).
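The prefix-truncation interface and the byte arithmetic above can be sketched as follows; `truncate_tokens` and `compressed_size_bytes` are illustrative names for this sketch, not the paper's API:

```python
import math

TOKEN_BITS = 12  # each token indexes a codebook of K = 2**12 = 4096 entries

def truncate_tokens(tokens, m):
    """Keep only the first m tokens (Tail Token Drop at inference)."""
    if not 1 <= m <= len(tokens):
        raise ValueError("prefix length m must lie in [1, N]")
    return tokens[:m]

def compressed_size_bytes(m, token_bits=TOKEN_BITS):
    """Byte cost of an m-token prefix: ceil(m * 12 / 8) = 1.5 * m bytes."""
    return math.ceil(m * token_bits / 8)

tokens = list(range(256))              # a full-length N = 256 token sequence
prefix = truncate_tokens(tokens, 8)    # heavily compressed 8-token prefix
print(len(prefix), compressed_size_bytes(8))   # 8 tokens -> 12 bytes
print(compressed_size_bytes(256))              # 256 tokens -> 384 bytes
```

Because every prefix of $\mathbf{q}$ is itself a valid code, a single stored sequence serves every rate point from $1.5$ to $1.5N$ bytes.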

2. Tail Token Drop Regularization

Tail Token Drop is the key regularization innovation in One-D-Piece. For each batch during training, a value $k\sim U(0,N-1)$ is sampled. Only the prefix $\mathbf{q}'=[q_1,\ldots,q_{N-k}]$ is provided to the decoder, which must reconstruct the image from this truncated sequence. This stochastic truncation compels the tokenizer to concentrate information in the leading tokens, imposing information prioritization without an explicit loss term beyond standard reconstruction objectives.

This approach effectively merges what could be an explicit, $k$-averaged reconstruction regularizer,

$$L_{\mathrm{rec}}(\mathbf{q}_{1:N-k}) = \| X - \hat{X}(\mathbf{q}_{1:N-k}) \|^2_2,$$

into the fundamental loss by randomizing the number of tokens per training instance, thereby making the model robust to arbitrary cut-off lengths at inference.
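A minimal sketch of the truncation step inside a training loop, assuming token sequences are plain lists and ignoring the actual encoder and decoder:

```python
import random

def tail_token_drop(tokens, rng=random):
    """Sample k ~ U(0, N-1) and drop the last k tokens, as during training."""
    n = len(tokens)
    k = rng.randint(0, n - 1)       # inclusive on both ends
    return tokens[: n - k]

# Each training step decodes from a randomly truncated prefix, so the
# encoder is pressured to pack the most useful information up front.
random.seed(0)
batch = [list(range(256)) for _ in range(4)]
truncated = [tail_token_drop(seq) for seq in batch]
assert all(1 <= len(t) <= 256 for t in truncated)
```

Note that the prefix length $N-k$ is resampled per batch, so over training the decoder sees every cut-off length.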

3. Training Procedures and Objectives

The training protocol follows a two-stage TiTok scheme:

  • Stage 1 (Token Prediction): Cross-entropy loss is minimized between the predicted distribution over tokens and targets from a pretrained tokenizer across $N$ positions,

$$L_{\mathrm{stage1}} = -\sum_{i=1}^N \log p(q_i \mid X).$$

  • Stage 2 (Reconstruction): The encoder is frozen; training optimizes a composite loss,

$$L_{\mathrm{stage2}} = L_{\ell_2} + \lambda_{\mathrm{perc}} L_{\mathrm{perceptual}} + \lambda_{\mathrm{GAN}} L_{\mathrm{GAN}},$$

where $L_{\ell_2} = \| X - \hat{X} \|^2_2$, $L_{\mathrm{perceptual}} = \| F(X) - F(\hat{X}) \|_1$ (with $F$ a ConvNeXT feature projector), and $L_{\mathrm{GAN}}$ is the typical PatchGAN adversarial loss ($\lambda_{\mathrm{perc}}=0.1$, $\lambda_{\mathrm{GAN}}=0.01$). During this stage, Tail Token Drop is active, so the model learns to optimize reconstructions under varying token-count constraints.
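The composite objective can be mocked up in plain Python; the feature extractor and GAN term below are toy stand-ins for the ConvNeXT features and PatchGAN discriminator, and only the weighting structure matches the paper:

```python
# Stage-2 composite loss, sketched with flat Python lists standing in for
# image tensors. Weights are the values reported in the paper.
LAMBDA_PERC, LAMBDA_GAN = 0.1, 0.01

def l2_loss(x, x_hat):
    """Mean squared error between flattened images."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def perceptual_loss(x, x_hat, feature_fn):
    """L1 distance in the feature space of a frozen network."""
    fx, fy = feature_fn(x), feature_fn(x_hat)
    return sum(abs(a - b) for a, b in zip(fx, fy)) / len(fx)

def stage2_loss(x, x_hat, feature_fn, gan_loss):
    return (l2_loss(x, x_hat)
            + LAMBDA_PERC * perceptual_loss(x, x_hat, feature_fn)
            + LAMBDA_GAN * gan_loss)

# Toy stand-ins: pairwise sums in place of ConvNeXT features, and a fixed
# scalar in place of the PatchGAN discriminator output.
feature_fn = lambda img: [img[i] + img[i + 1] for i in range(len(img) - 1)]
x = [0.1, 0.4, 0.3, 0.8]
x_hat = [0.2, 0.3, 0.5, 0.6]
loss = stage2_loss(x, x_hat, feature_fn, gan_loss=0.5)
print(round(loss, 4))
```

The GAN term enters only through its weighted scalar, so swapping `gan_loss` values shifts the total by exactly $\lambda_{\mathrm{GAN}}$ times the difference.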

4. Compression Efficiency and Quality Control

The variable-length property allows explicit, token-by-token steering of the quality budget. For a given $m$, the decoder reconstructs $\hat{X}_m=\mathrm{Decoder}([q_1,\ldots,q_m])$. As $m$ increases, distortion and perceptual-fidelity metrics improve monotonically. Empirically, recognizable reconstructions are achieved at $m=8$, and near-lossless rendering is seen at $m=256$ [(Miwa et al., 17 Jan 2025), Table 1, Figure 1].
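In practice, a sender with a fixed byte budget can invert the $1.5m$-byte cost to pick a prefix length; `max_prefix_for_budget` is a hypothetical helper written for this sketch, not part of the paper:

```python
def max_prefix_for_budget(budget_bytes, n_tokens=256, bytes_per_token=1.5):
    """Largest prefix length m whose 1.5*m-byte encoding fits the budget,
    clamped to the valid range [1, N]."""
    m = int(budget_bytes // bytes_per_token)
    return max(1, min(m, n_tokens))

print(max_prefix_for_budget(192))     # 128-token prefix at 192 bytes
print(max_prefix_for_budget(10_000))  # capped at the full 256-token sequence
```

Because quality improves monotonically in $m$, taking the largest affordable prefix is always the right choice under a byte constraint.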

Quantitative metrics (rFID, PSNR, LPIPS, SSIM, pixel-wise $L_1$/$L_2$) establish that One-D-Piece consistently surpasses traditional codecs (JPEG, WebP) and existing neural tokenizers at equivalent or lower byte rates, in both distortion-oriented and perceptual metrics. At 256 tokens (384 bytes), rFID is $1.08$ (WebP: $31.98$), confirming major improvements in perceptual similarity.

5. Downstream Task Performance

One-D-Piece's reconstructions demonstrate high compatibility with computer vision pipelines. Off-the-shelf networks (ConvNeXT for classification, YOLO11x for detection, SERE for segmentation, Depth Anything for depth estimation, CLIP for embedding retrieval) were evaluated on outputs produced with varying $m$. For example, at 128 tokens (192 bytes), ImageNet-1K classification top-1 accuracy rises to $0.779$ versus WebP's $0.664$. In semantic segmentation (mIoU), the gain is similar: $0.572$ for One-D-Piece-L (128 tokens) versus $0.410$ for WebP [(Miwa et al., 17 Jan 2025), Table 3]. This trend reflects strong preservation of task-relevant semantic and structural detail at low bitrates.

6. Analytical Insights and Ablation Studies

Several ablations demonstrate that information is effectively packed into the earliest tokens:

  • Token Contribution: Replacing $q_i$ with random codes and measuring the impact $\Delta_i=\| \hat{X} - \hat{X}^{(i)} \|_1$ yields contributions that decay monotonically as $i$ increases, far more sharply than in baseline TiTok (Figure 2).
  • First-Token Clustering: Images with identical $q_1$ share global appearance; swapping $q_1$ swaps scene layout, indicating that leading tokens encode coarse information (Figure 3).
  • Linear Probing: Probing encoder features with a linear classifier for ImageNet-1K, One-D-Piece-L achieves $0.389$ accuracy, surpassing TiTok, confirming the semantic informativeness of the head tokens.
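The token-contribution ablation can be sketched with a toy decoder standing in for the real one; only the measurement procedure (randomize token $i$, average the reconstruction change) follows the paper:

```python
import random

def token_contribution(tokens, decode, dist, codebook_size=4096,
                       trials=8, seed=0):
    """Estimate Delta_i: change in the reconstruction when token i is
    replaced by random codes, averaged over several trials."""
    rng = random.Random(seed)
    baseline = decode(tokens)
    deltas = []
    for i in range(len(tokens)):
        total = 0.0
        for _ in range(trials):
            corrupted = list(tokens)
            corrupted[i] = rng.randrange(codebook_size)
            total += dist(baseline, decode(corrupted))
        deltas.append(total / trials)
    return deltas

# Toy decoder that weights early tokens more heavily, so the measured
# contribution decays with token position, mimicking the trend in Figure 2.
decode = lambda toks: sum(t / (i + 1) for i, t in enumerate(toks))
dist = lambda a, b: abs(a - b)
deltas = token_contribution(list(range(16)), decode, dist)
assert deltas[0] > deltas[-1]
```

With the real decoder, the same loop corrupts one code at a time and compares pixel-space reconstructions.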

These findings underscore the mechanism by which Tail Token Drop fosters robust learning of prioritization across the token sequence.

7. Applications and Implications

One-D-Piece is particularly suited to hardware-constrained and adaptive environments: on-device inference, variable-bitrate video/image transmission, vision-language model (VLM) inputs, and generative pipelines. Its ability to produce meaningful representations at $m=1,\ldots,256$ tokens with a single model obviates retraining or codebook relearning as requirements shift. The superiority in both rate-distortion and perceptual utility, along with confirmed downstream task performance without fine-tuning, sets a new baseline for quality-controllable, variable-rate image tokenizers (Miwa et al., 17 Jan 2025).

A plausible implication is that variable-length, one-dimensional tokenization will become foundational in future vision backbones and multimodal pipelines, especially where content-adaptive compression and inference on budget-constrained systems are critical.
