Neighboring Autoregressive Modeling for Efficient Visual Generation (2503.10696v1)

Published 12 Mar 2025 in cs.CV and eess.IV

Abstract: Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet $256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the training data. Code is available at https://github.com/ThisisBillhe/NAR.

Summary

  • The paper introduces Neighboring Autoregressive Modeling (NAR), a paradigm that improves visual generation efficiency by predicting spatially and temporally close neighbors first, using a progressive outpainting method.
  • Empirical evaluations show NAR significantly improves throughput (e.g., 8.6x on UCF-101) and achieves better quality metrics like FID/FVD compared to baseline approaches.
  • NAR offers a scalable methodology for handling large-scale visual generation tasks and highlights potential for future research integrating dimension-oriented decoding and advanced tokenizers.

Neighboring Autoregressive Modeling for Efficient Visual Generation

The paper "Neighboring Autoregressive Modeling for Efficient Visual Generation" introduces a novel paradigm known as Neighboring Autoregressive Modeling (NAR) that seeks to enhance both efficiency and quality in visual generation tasks. Traditional autoregressive models have relied on a raster-order "next-token prediction" method, which overlooks the intrinsic spatial and temporal locality present in visual content. Visual tokens are inherently more correlated with their neighbors than with distant tokens. The NAR approach strategically reformulates autoregressive visual generation as a progressive outpainting procedure, which involves expanding the decoded region by predicting spatially and temporally close neighbors first.

NAR employs a "next-neighbor prediction" method that begins with an initial token and decodes subsequent tokens based on their proximity, using Manhattan distance to determine decoding order. This method supports parallel prediction of adjacent tokens, significantly reducing the number of model forward steps required for image and video generation tasks. A set of dimension-oriented decoding heads enables this parallelism, each targeting predictions along distinct orthogonal dimensions within spatial-temporal space.
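A hedged PyTorch sketch of how such heads might look for the 2D image case follows; the module and attribute names are hypothetical and are not taken from the NAR repository. It illustrates the key idea that a single backbone forward pass yields features from which separate linear heads predict the next token along each orthogonal axis.

```python
import torch
import torch.nn as nn

class DimensionOrientedHeads(nn.Module):
    """Toy stand-in for dimension-oriented decoding heads (2D case)."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # One output projection per spatial axis; a video model would add
        # a third head for the temporal axis.
        self.head_right = nn.Linear(hidden_dim, vocab_size)  # neighbor at (r, c + 1)
        self.head_down = nn.Linear(hidden_dim, vocab_size)   # neighbor at (r + 1, c)

    def forward(self, h: torch.Tensor):
        # h: (batch, num_decoded_tokens, hidden_dim) features from a shared
        # autoregressive backbone; each decoded token proposes logits for
        # both of its undecoded neighbors from the same features.
        return self.head_right(h), self.head_down(h)

heads = DimensionOrientedHeads(hidden_dim=256, vocab_size=1024)
h = torch.randn(1, 8, 256)  # e.g., features for 8 already-decoded tokens
logits_right, logits_down = heads(h)
print(logits_right.shape, logits_down.shape)  # (1, 8, 1024) each
```

Because both neighbor predictions share one backbone pass, every token on the decoded frontier can propose its unseen neighbors simultaneously, which is what allows the entire next wavefront to be filled in a single step.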

Empirical evaluations demonstrate that NAR delivers substantial improvements in both throughput and generation quality. On ImageNet and UCF-101, NAR achieves 2.4x and 8.6x higher throughput respectively, while producing better FID and FVD scores than the established PAR-4X approach. Notably, on the GenEval benchmark for text-to-image generation, NAR with merely 0.8 billion parameters surpasses Chameleon-7B while using just 0.4% of the training data.

The practical and theoretical implications of NAR are significant. The paradigm not only delivers a tangible improvement in visual generation efficiency and quality but also provides a scalable methodology for large-scale and high-resolution tasks, applying to both image and video data. The dimension-oriented decoding heads point to future architectures that further exploit the spatial and temporal correlations inherent in visual data, and NAR opens avenues for deeper integration with multimodal networks and enhanced tokenization strategies serving domains beyond conventional visual tasks.

The paper acknowledges limitations, including the need for more advanced visual tokenizers and larger-scale evaluation datasets to further validate the model's potential. These open questions chart future research directions for bringing NAR into closer alignment with the state of the art in autoregressive and generative modeling.