Partition Generative Modeling: Masked Modeling Without Masks (2505.18883v1)

Published 24 May 2025 in cs.LG

Abstract: We introduce "Partition Generative Models" (PGMs), a novel approach to masked generative modeling (MGMs), particularly effective for masked diffusion language models (MDLMs). PGM divides tokens into two distinct groups and employs sparse attention patterns to prevent cross-group information exchange. Hence, the model is trained to predict tokens in one group based solely on information from the other group. This partitioning strategy eliminates the need for MASK tokens entirely. While traditional MGMs inefficiently process MASK tokens during generation, PGMs achieve greater computational efficiency by operating exclusively on unmasked tokens. Our experiments on OpenWebText with a context length of 1024 tokens demonstrate that PGMs deliver at least 5x improvements in both latency and throughput compared to MDLM when using the same number of sampling steps, while generating samples with better generative perplexity than MDLM. Finally, we show that PGMs can be distilled with Self-Distillation Through Time (SDTT), a method originally devised for MDLM, in order to achieve further inference gains.

Summary

Partition Generative Modeling: Masked Modeling Without Masks

The paper introduces Partition Generative Models (PGMs), a novel approach to masked generative modeling that eliminates MASK tokens, improving both the computational efficiency and the generative performance of language models. PGMs are particularly effective for masked diffusion language modeling (MDLM): they partition tokens into two groups and use sparse attention to prevent cross-group information exchange, so the model predicts the tokens in one group based solely on information from the other. This removes the inefficiency of processing MASK tokens that traditional MGMs incur during generation.
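
As a concrete illustration of the partition idea, the sketch below builds the block-sparse attention mask implied by a random two-way split of sequence positions. This is a minimal sketch, not the authors' implementation: the function name, the 50/50 split, and the use of PyTorch are assumptions for illustration.

```python
# Minimal sketch (assumed PyTorch, illustrative only): construct the attention
# mask implied by a random two-way partition of sequence positions, so that
# queries never attend to keys in the other group.
import torch

def partition_attention_mask(seq_len: int, p_group_a: float = 0.5):
    """Assign each position to group A or B and return (assignment, mask),
    where mask[i, j] is True iff position i may attend to position j."""
    group_a = torch.rand(seq_len) < p_group_a            # True -> group A, False -> group B
    mask = group_a.unsqueeze(0) == group_a.unsqueeze(1)  # same-group pairs only
    return group_a, mask

group_a, attn_mask = partition_attention_mask(seq_len=8)
print(group_a.int())    # e.g. tensor([1, 0, 0, 1, 1, 0, 1, 0])
print(attn_mask.int())  # block pattern: no attention across the two groups
```

The sketch only shows the masking that keeps the two groups isolated inside the transformer; how the paper reads out predictions for one group from the other group's representations is a design detail not reproduced here.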

Key Contributions

  1. Partitioning Strategy: PGMs form two token groups without any MASK tokens and use sparse attention so that the groups cannot exchange information; each group is predicted solely from the other. Because PGMs operate only on unmasked tokens, they avoid the cost of the full-length masked sequences that MGMs must process (a back-of-envelope cost comparison follows this list).
  2. Experimental Validation: The efficacy of PGMs was validated through experiments on OpenWebText, demonstrating significant improvement over MDLMs. Specifically, PGMs showed at least a fivefold improvement in both latency and throughput compared to MDLM when using an equivalent number of sampling steps, while also achieving better generative perplexity.
  3. Model Distillation: The authors further improve inference efficiency by leveraging Self-Distillation Through Time (SDTT), originally devised for MDLMs. This method allowed PGMs to achieve additional inference gains, enhancing their utility for practical applications.
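
To make the efficiency argument concrete, here is a small, purely illustrative calculation (assumed numbers, not the paper's measurements) of how the quadratic self-attention cost per sampling step shrinks when MASK tokens are skipped and only the observed tokens are processed.

```python
# Illustrative back-of-envelope sketch (assumed mask fractions, not paper results):
# self-attention cost per step scales roughly quadratically with the number of
# tokens actually processed. MDLM-style models process the full sequence,
# including MASK tokens; PGM-style models process only the unmasked tokens.
def attention_cost(num_tokens: int) -> int:
    return num_tokens ** 2  # rough proxy for attention score computation

context_len = 1024  # matches the context length used in the paper's experiments
for frac_unmasked in (0.25, 0.50, 0.75):
    full_cost = attention_cost(context_len)                      # MDLM: all positions
    pgm_cost = attention_cost(int(context_len * frac_unmasked))  # PGM: observed positions only
    print(f"{frac_unmasked:.0%} unmasked -> cost ratio {pgm_cost / full_cost:.2f}")
```

Realized speedups also depend on the feed-forward layers, kernel efficiency, and the sampling schedule, which is why the paper reports measured latency and throughput rather than estimates of this kind.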

Results and Implications

The numerical results indicate that PGMs can substantially reduce inference time and resource usage while matching or improving the generative perplexity of MDLM baselines. This efficiency makes them appealing for real-time applications where low latency is critical, and the ability to reach comparable quality with fewer sampling steps, particularly after SDTT distillation, suggests benefits for large-scale deployments where compute is at a premium.

Theoretically, the approach supports a shift away from purely autoregressive generation, paving the way for non-autoregressive models in a wider range of generative tasks. The partitioning strategy may also inspire architectural changes for other modalities, such as images and audio, where masked generative modeling is already prominent.

Future Developments in AI

The potential of PGMs to reshape generative-model efficiency opens several avenues for future work: extending PGMs to multimodal settings, developing distillation techniques tailored to PGMs, and applying them to domains such as video and audio synthesis. Scaling model size and extending context length are natural next steps, in line with broader efforts to scale model capabilities.

In conclusion, Partition Generative Modeling presents a compelling efficiency-focused alternative to existing generative modeling techniques, with strong potential for further development and adoption in AI technologies.