NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (2207.09814v2)

Published 20 Jul 2022 in cs.CV

Abstract: In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. An autoregressive over autoregressive generation mechanism is proposed to deal with this variable-size generation task, where a global patch-level autoregressive model considers the dependencies between patches, and a local token-level autoregressive model considers dependencies between visual tokens within each patch. A Nearby Context Pool (NCP) is introduced to cache related patches already generated as the context for the current patch being generated, which can significantly save computation costs without sacrificing patch-level dependency modeling. An Arbitrary Direction Controller (ADC) is used to decide suitable generation orders for different visual synthesis tasks and learn order-aware positional embeddings. Compared to DALL-E, Imagen and Parti, NUWA-Infinity can generate high-resolution images with arbitrary sizes and support long-duration video generation additionally. Compared to NUWA, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation. The GitHub link is https://github.com/microsoft/NUWA. The homepage link is https://nuwa-infinity.microsoft.com.

Citations (61)

Summary

The paper introduces a dual autoregressive framework that divides visual synthesis into global patch-level and local token-level generation for infinite image and video creation.
It employs a Nearby Context Pool and an Arbitrary Direction Controller to reduce computation costs and manage patch ordering dynamically.
Experimental results show superior performance in FID, CLIP-SIM, Block-FID, and FVD scores compared to baselines like Taming Transformer and MaskGIT.

Overview of "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis"

The paper "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis" presents a method for high-resolution image and video generation. The authors introduce NUWA-Infinity, a model that can produce visual content of arbitrary size, unlike earlier models such as DALL·E, Imagen, and Parti, which generate fixed-size outputs.

Key Contributions

NUWA-Infinity leverages an autoregressive over autoregressive framework, dissecting the synthesis process into two levels: global patch-level and local token-level generation. This dual-layer approach effectively models dependencies both between patches and within patches, enabling the creation of consistent and detailed visual outputs.

Autoregressive Mechanism: A global patch-level autoregressive model handles dependencies between patches, while a local token-level autoregressive model handles dependencies between visual tokens within each patch, keeping large images and long videos consistent.
Nearby Context Pool (NCP): The NCP caches already generated patches that are related to the current patch and supplies them as its context, which saves computation without sacrificing patch-level dependency modeling.
Arbitrary Direction Controller (ADC): This component decides a suitable patch generation order for each visual synthesis task and learns order-aware positional embeddings, which supports tasks such as outpainting in different directions.
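
The nesting of the two autoregressive levels can be illustrated with a toy sketch. This is not the authors' implementation: `raster_order` is a hypothetical stand-in for the ADC (a fixed left-to-right, top-to-bottom order), `sample_token` is a placeholder for a learned token-level model, and the neighbourhood size `extent` and the eviction rule are assumptions chosen only to show how a context pool can stay bounded as the canvas grows.

```python
import random


def raster_order(rows, cols):
    """Hypothetical stand-in for the ADC: fixed row-by-row patch order."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def nearby(pos, pool, extent):
    """Select cached patches within `extent` grid steps of `pos`."""
    r, c = pos
    return [tokens for (pr, pc), tokens in pool.items()
            if abs(pr - r) <= extent and abs(pc - c) <= extent]


def sample_token(context_patches, prefix, vocab_size, rng):
    """Placeholder for a token-level model conditioned on context and prefix."""
    summary = sum(sum(p) for p in context_patches) + sum(prefix)
    return (summary + rng.randrange(vocab_size)) % vocab_size


def generate(rows, cols, tokens_per_patch=4, vocab_size=16, extent=1, seed=0):
    rng = random.Random(seed)
    pool = {}    # toy Nearby Context Pool: patch position -> tokens
    canvas = {}  # everything generated so far
    for pos in raster_order(rows, cols):        # global, patch-level loop
        context = nearby(pos, pool, extent)
        tokens = []
        for _ in range(tokens_per_patch):       # local, token-level loop
            tokens.append(sample_token(context, tokens, vocab_size, rng))
        canvas[pos] = tokens
        pool[pos] = tokens
        # Assumed eviction rule for raster order: rows more than `extent`
        # above the current row can never be a neighbour of a later patch.
        for old in [p for p in pool if p[0] < pos[0] - extent]:
            del pool[old]
    return canvas, pool


canvas, pool = generate(rows=4, cols=3)
print(len(canvas), "patches generated;", len(pool), "still cached")
```

In this sketch the pool never holds more than `(extent + 1) * cols` patches, so the context used for each patch stays constant in size no matter how many rows are generated.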

Experimental Evaluation

The model is evaluated on five tasks: Unconditional Image Generation (HD), Text-to-Image (HD), Image Outpainting (HD), Image Animation (HD), and Text-to-Video (HD). NUWA-Infinity outperforms alternative approaches such as Taming Transformer and MaskGIT in generating high-resolution imagery, with improved visual quality and semantic consistency.

For Text-to-Image (HD), NUWA-Infinity improves both FID and CLIP-SIM scores, even when the generated outputs extend well beyond the training image dimensions.
In Image Outpainting (HD), the model extends images directionally and achieves better Block-FID scores than the baselines.
In Image Animation (HD), the model generates temporally consistent videos, as evidenced by lower FVD scores.
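
For readers unfamiliar with the metrics, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images, and lower is better. The sketch below is a simplified illustration rather than the evaluation code used in the paper: it assumes diagonal covariances (standard FID uses full covariance matrices and a matrix square root), and it takes already extracted feature vectors as plain lists, since the feature extractor is outside its scope. The function names are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mu = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mu[d]) ** 2 for f in features) / (n - 1)
           for d in range(dim)]
    return mu, var


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between Gaussians with diagonal covariances.

    Simplification (assumption): with diagonal covariances the trace term
    Tr(S1 + S2 - 2 (S1 S2)^(1/2)) reduces to a per-dimension sum.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


# Toy feature sets standing in for real and generated image features.
real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[1.0, 1.0], [3.0, 3.0]]
mu_r, var_r = mean_and_var(real)
mu_f, var_f = mean_and_var(fake)
print(frechet_distance_diag(mu_r, var_r, mu_f, var_f))
```

The distance is zero when the two fitted Gaussians coincide and grows as their means or variances diverge, which is why lower scores indicate generated samples that are statistically closer to real ones.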

Implications and Future Directions

The advancement presented by NUWA-Infinity is pertinent for applications requiring scalable and varied visual content generation, such as virtual design, multimedia production, and augmented reality. Its ability to seamlessly extend images and construct long-duration videos while maintaining high fidelity is particularly advantageous in these domains.

Future developments could focus on optimizing the model’s computational efficiency further, potentially integrating non-autoregressive elements to accelerate inference time. Additionally, expansion of training datasets could enhance the model’s generalization capabilities, thereby facilitating broader real-world applicability.

Conclusion

The introduction of NUWA-Infinity marks a significant progression in visual synthesis technology, addressing limitations in scalability and resolution found in previous models. Through its autoregressive framework, NUWA-Infinity not only improves upon existing methods but also sets a foundation for future research into infinitely scalable visual content generation.