- The paper introduces a novel framework that reinterprets object recognition as a next token prediction problem using language decoders.
- It leverages custom non-causal attention masking and one-shot sampling to efficiently generate object labels with improved recall.
- The method truncates a pre-trained language model to reduce complexity while achieving competitive performance versus models like CLIP.
Object recognition is a core task in computer vision: converting visual data from an image into identifiable object labels. A recently explored approach frames this task as next token prediction, a formulation traditionally used in natural language processing.
The central idea of the method is to use a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. Image embeddings produced by a vision encoder, such as a Vision Transformer (ViT), are treated as a prefix sequence of tokens, to which the tokens of object labels are appended as prediction targets.
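The prefix idea above can be sketched as a simple concatenation: image embeddings come first, and label token embeddings follow as the part the decoder must predict. The function name and shapes here are illustrative assumptions, not the paper's exact interface.

```python
import numpy as np

def build_input_sequence(image_embeds, label_token_embeds):
    """Treat image embeddings as a prefix and append label token
    embeddings, forming the sequence the decoder conditions on.
    (Illustrative sketch; shapes and names are assumptions.)"""
    # image_embeds: (num_image_tokens, dim)
    # label_token_embeds: (num_label_tokens, dim)
    return np.concatenate([image_embeds, label_token_embeds], axis=0)
```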
A custom non-causal attention mask is pivotal within the decoder. It makes tokens from distinct labels independent of one another, while every label token remains conditioned on the image embeddings. The same mask also enables efficient sampling at inference time.
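One way to realize such a mask is sketched below: label tokens see the full image prefix and attend causally within their own label, but tokens of different labels are mutually invisible. This is a minimal numpy sketch under those assumptions; the paper's exact mask layout may differ.

```python
import numpy as np

def build_noncausal_mask(n_prefix, label_lens):
    """Boolean attention mask (True = may attend).
    - every label token attends to the whole image prefix
    - tokens attend causally within their own label
    - tokens of different labels cannot attend to each other
    (Illustrative sketch of a non-causal mask, not the paper's exact one.)"""
    n = n_prefix + sum(label_lens)
    mask = np.zeros((n, n), dtype=bool)
    # Prefix tokens attend causally among themselves (assumption).
    mask[:n_prefix, :n_prefix] = np.tril(np.ones((n_prefix, n_prefix), bool))
    start = n_prefix
    for length in label_lens:
        end = start + length
        mask[start:end, :n_prefix] = True  # label tokens see the image prefix
        mask[start:end, start:end] = np.tril(np.ones((length, length), bool))
        start = end
    return mask
```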
One of the distinctive strategies introduced is "one-shot sampling," which samples the tokens of multiple labels in parallel rather than sequentially, as in traditional auto-regressive decoding. This predicts label tokens efficiently and ranks labels by their likelihood, without the repetition problems common in sequential sampling.
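A simplified sketch of the core step: from a single forward pass, take the k highest-probability first tokens at once; each candidate then seeds one label sequence decoded in parallel and ranked by its probability (the parallel completion is omitted here). This is an assumption-laden illustration, not the paper's full algorithm.

```python
import numpy as np

def one_shot_first_tokens(logits, k):
    """Pick the top-k candidate first tokens from one vocabulary
    distribution, returned ranked by probability. (Simplified sketch of
    one-shot sampling: in the full method, each candidate seeds a label
    decoded in parallel and labels are ranked by joint token likelihood.)"""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the vocabulary
    top = np.argsort(probs)[::-1][:k]     # k candidates in a single pass
    return [(int(t), float(probs[t])) for t in top]
```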
For further efficiency, the paper proposes compacting the language decoder without sacrificing performance, a method the authors call "truncating the decoder": intermediate blocks are removed from a pre-trained LLM, reducing model complexity while maintaining effectiveness.
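Conceptually, truncation amounts to dropping a contiguous run of intermediate transformer blocks while keeping the early and late ones. The sketch below illustrates this on a plain list of blocks; which blocks the paper actually keeps is not specified here, so the split is an assumption.

```python
def truncate_blocks(blocks, keep_first, keep_last):
    """Drop the intermediate blocks of a decoder, keeping the first
    `keep_first` and last `keep_last`. (Illustrative sketch; the exact
    choice of retained blocks in the paper is an assumption here.)"""
    if keep_first + keep_last >= len(blocks):
        return list(blocks)  # nothing to remove
    return list(blocks[:keep_first]) + list(blocks[-keep_last:])
```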
Comparisons with related work, including CLIP and various auto-regressive visual-language models, indicate that the framework generates more relevant labels with higher recall while being considerably more efficient. The paper also discusses remaining challenges, such as defining the label set, managing noisy training data, and the competition issue during parallel sampling.
The authors suggest that future research could explore mitigation strategies for the competition issue in one-shot sampling and alternative adaptations of their method for more specialized tasks like fine-grained recognition or single-label predictions.
Overall, this framework makes a significant contribution to object recognition by reimagining it through the lens of auto-regressive language modeling, paving the way for more efficient processing and more diverse label generation.