Object Recognition as Next Token Prediction (2312.02142v4)

Published 4 Dec 2023 in cs.CV

Abstract: We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained LLM. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp

Citations (6)

Summary

  • The paper introduces a novel framework that reinterprets object recognition as a next token prediction problem using language decoders.
  • It leverages custom non-causal attention masking and one-shot sampling to efficiently generate object labels with improved recall.
  • The method truncates a pre-trained language model to reduce complexity while achieving competitive performance versus models like CLIP.

Object recognition is a core task in computer vision: converting visual image data into identifiable object labels. This paper frames the task as a next token prediction problem, a formulation traditionally used in language processing.

The central idea is to use a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. Image embeddings produced by an encoder such as a Vision Transformer (ViT) are treated as a prefix sequence of tokens, after which the tokens of the object labels are appended as predictions.
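
A minimal sketch of this prefix construction is shown below; the module names (image_encoder, proj, token_embed) and dimensions are placeholders for illustration, not identifiers from the released code:

```python
# Sketch: image embeddings as a prefix for a language decoder (assumed names/dims).
import torch
import torch.nn as nn

batch, n_img_tokens, vit_dim, dec_dim, vocab = 2, 196, 768, 2048, 32000

image_encoder = nn.Linear(3 * 16 * 16, vit_dim)   # stand-in for a ViT patch encoder
proj = nn.Linear(vit_dim, dec_dim)                # map image embeddings to decoder width
token_embed = nn.Embedding(vocab, dec_dim)        # decoder's token embedding table

# image embeddings -> prefix tokens in the decoder's embedding space
patches = torch.randn(batch, n_img_tokens, 3 * 16 * 16)
img_embeds = proj(image_encoder(patches))         # [B, N_img, dec_dim]

# label tokens (e.g. tokenized "dog", "frisbee", ...) appended after the prefix
label_ids = torch.randint(0, vocab, (batch, 8))
label_embeds = token_embed(label_ids)             # [B, N_lbl, dec_dim]

decoder_input = torch.cat([img_embeds, label_embeds], dim=1)  # [B, N_img + N_lbl, dec_dim]
```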

A custom non-causal attention mask is pivotal to the decoder. It treats tokens from different labels as independent of one another, while keeping every label token conditioned on the image embeddings, which act as a fully visible prefix. This masking also enables efficient sampling at inference time.
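
The following is a hedged sketch of such a mask, assuming the decoder sees n_img image tokens followed by several label segments; the exact layout in the authors' code may differ. True marks an allowed attention edge:

```python
# Sketch: image tokens form a fully visible prefix; each label attends only to
# the prefix and to its own earlier tokens; different labels never see each other.
import torch

def build_mask(n_img: int, label_lens: list[int]) -> torch.Tensor:
    n = n_img + sum(label_lens)
    mask = torch.zeros(n, n, dtype=torch.bool)  # True = attention allowed

    mask[:, :n_img] = True                      # everyone attends to the image prefix

    start = n_img
    for length in label_lens:                   # per-label causal block on the diagonal
        block = torch.tril(torch.ones(length, length, dtype=torch.bool))
        mask[start:start + length, start:start + length] = block
        start += length
    return mask

# 4 image tokens, two labels of 2 and 3 tokens
print(build_mask(4, [2, 3]).int())
```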

A key strategy introduced is "one-shot sampling", which samples the tokens of multiple labels in parallel rather than sequentially, as in conventional auto-regressive decoding. The generated labels are then ranked by their probabilities, avoiding the repetition problems common in sequential generation.
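
A toy sketch of the idea follows; the "decoder" is a stand-in linear layer and the state update is purely illustrative, so only the control flow (parallel first tokens, independent completion of each label, ranking by probability) reflects the described method:

```python
# Sketch of one-shot sampling with a toy decoder (not the paper's model).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dec_dim, k, max_len, eos = 100, 32, 5, 4, 0
decoder = torch.nn.Linear(dec_dim, vocab)        # toy "decoder": hidden -> logits

def next_token_probs(hidden):
    return F.softmax(decoder(hidden), dim=-1)

prefix_hidden = torch.randn(dec_dim)             # stand-in summary of the image prefix

# 1) one forward pass proposes the first tokens of k candidate labels in parallel
p0 = next_token_probs(prefix_hidden)
first_probs, first_tokens = p0.topk(k)

labels, scores = [], []
for tok, p in zip(first_tokens.tolist(), first_probs.tolist()):
    tokens, score = [tok], p
    hidden = prefix_hidden + 0.01 * tok          # toy state update, illustrative only
    # 2) each label is completed independently of the others
    for _ in range(max_len - 1):
        p_next = next_token_probs(hidden)
        nxt = int(p_next.argmax())
        score *= float(p_next[nxt])
        if nxt == eos:
            break
        tokens.append(nxt)
        hidden = hidden + 0.01 * nxt
    labels.append(tokens)
    scores.append(score)

# 3) rank the generated labels by probability
for s, t in sorted(zip(scores, labels), reverse=True):
    print(f"{s:.4f}  {t}")
```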

For further efficiency, the paper compacts the language decoder without sacrificing performance, a method the authors call "truncating the decoder": intermediate blocks of a pre-trained LLM are discarded, reducing model complexity while maintaining effectiveness.
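
A hedged sketch of such truncation is below, using a toy decoder and a hypothetical keep_first/keep_last split; the paper's exact block selection and module names may differ:

```python
# Sketch: drop intermediate transformer blocks from a pretrained decoder.
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, n_blocks=12, dim=64):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_blocks)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def truncate(decoder: TinyDecoder, keep_first: int, keep_last: int) -> TinyDecoder:
    """Keep the first/last blocks (with their weights) and discard the middle ones."""
    kept = list(decoder.blocks[:keep_first]) + list(decoder.blocks[-keep_last:])
    decoder.blocks = nn.ModuleList(kept)
    return decoder

model = truncate(TinyDecoder(n_blocks=12), keep_first=3, keep_last=3)
print(len(model.blocks))  # 6 blocks remain
```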

Comparisons with related work, including CLIP and various auto-regressive vision-language models, indicate that this framework generates more relevant labels with higher recall while also being considerably more efficient. The paper also discusses the remaining challenges of the method, such as defining the label space, managing noisy training data, and the competition issue during parallel sampling.

The authors suggest that future research could explore mitigation strategies for the competition issue in one-shot sampling and alternative adaptations of their method for more specialized tasks like fine-grained recognition or single-label predictions.

Overall, this framework contributes to object recognition by reimagining it through the lens of auto-regressive language modeling, paving the way for more efficient processing and more diverse label generation.
