Integration of Autoregression and Rectified Flow Models in JanusFlow
The paper "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation" presents an innovative approach to unify image understanding and generation within a single framework. This is achieved by combining autoregressive LLMs with rectified flow methods. JanusFlow stands out for its minimalist architecture, which maintains efficacy without requiring intricate architectural changes.
Key Architectural Choices
JanusFlow introduces two key strategies that improve the efficacy of unified multimodal models: decoupling the understanding and generation encoders, so each can address its task-specific requirements, and aligning their representations during training to keep semantic processing coherent (see the sketch below). This design allows rectified flow, known for its generative modeling capabilities, to integrate smoothly into an LLM framework that traditionally excels at sequence prediction.
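To make these two strategies concrete, here is a minimal sketch of how separate understanding and generation pathways might share an LLM backbone while an auxiliary loss aligns their intermediate features. All module and parameter names (und_encoder, gen_encoder, proj, alignment_loss) are hypothetical stand-ins for illustration, not the paper's actual architecture or API.

```python
import torch
import torch.nn.functional as F

class UnifiedModel(torch.nn.Module):
    """Hypothetical sketch of a decoupled-encoder, representation-aligned design."""

    def __init__(self, und_encoder, gen_encoder, llm, dim):
        super().__init__()
        self.und_encoder = und_encoder  # encoder for the understanding pathway
        self.gen_encoder = gen_encoder  # separate encoder for the generation pathway
        self.llm = llm                  # shared autoregressive backbone
        # Projection mapping generation features into the understanding feature space.
        self.proj = torch.nn.Linear(dim, dim)

    def alignment_loss(self, image):
        # Encourage the generation pathway's features to agree semantically
        # with the understanding encoder's features for the same image.
        und_feat = self.und_encoder(image).detach()   # alignment target; not updated here
        gen_feat = self.proj(self.gen_encoder(image))
        return 1.0 - F.cosine_similarity(gen_feat, und_feat, dim=-1).mean()
```

In training, an auxiliary term like this would be added to the main objectives, e.g. total_loss = generation_loss + align_weight * alignment_loss, so the decoupled encoders stay semantically consistent rather than drifting apart.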
Performance Evaluation
JanusFlow demonstrates strong results across various benchmark datasets. On text-to-image generation benchmarks such as MJHQ-30K (measured by FID) and GenEval, it surpasses established models like SDv1.5 and SDXL; specifically, it achieves an FID of 9.51 on MJHQ-30K, indicating its ability to generate high-quality images. On multimodal understanding benchmarks (MMBench, SeedBench, and GQA), JanusFlow outperforms specialized understanding models of comparable scale, a notable achievement for a compact architecture of only 1.3 billion parameters.
Theoretical and Practical Implications
From a theoretical perspective, the successful integration of rectified flow within an autoregressive framework challenges existing paradigms in multimodal model design. JanusFlow's decoupled-but-aligned encoder design points to new directions for reducing task interference in unified models. In practice, JanusFlow offers a more streamlined approach to multimodal tasks, potentially reducing the computational burden compared to more complex architectures that maintain fully separate components for distinct tasks.
Future Speculations
Looking ahead, JanusFlow's framework could influence future AI development by providing a template for integrating disparate modeling paradigms into a cohesive whole. Future work might extend this integration principle to other modalities, or apply similar representation alignment to additional architectures, such as those incorporating reinforcement learning or causal inference mechanisms.
Conclusion
JanusFlow marks a substantial advance in unified multimodal models, capitalizing on the strengths of both autoregressive LLMs and rectified flow. By balancing simplicity with performance, it demonstrates that architectural elegance need not compromise capability, paving the way for research into efficient model integrations that seamlessly bridge understanding and generation.