- The paper demonstrates that models like SUBS (fine-tuned) and Caduceus show robust performance across tasks, with SUBS achieving 0.795 on Mouse Enhancers and 0.940 on Human NonTATA Promoters.
- It reveals that no single model excels in every benchmark, underscoring the need for tailored model selection for specific genomic tasks.
- The study implies that fine-tuning techniques can significantly enhance prediction accuracy, paving the way for refined genomic prediction applications.
Evaluation and Benchmarking of Genomic Prediction Models
Introduction
The paper presents an extensive evaluation and benchmarking of several genomic prediction models on a diverse set of genomic datasets. Its central objective is to compare how well different models identify enhancer regions, distinguish coding from intergenomic sequences, and handle various other genomic classification tasks.
Methodology
The models evaluated include Mamba, SUBS (from scratch and fine-tuned), SEDD, Caduceus, Plaid, and D3PM. These models were tested across multiple benchmarks: "Mouse Enhancers," "Coding vs. Intergenomic," "Human vs. Worm," "Human Enhancers Cohn," "Human Enhancers Ensembl," "Human Regulatory," "Human OCR Ensembl," and "Human NonTATA Promoters."
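The results that follow are reported as mean ± standard deviation across runs. A minimal sketch of that aggregation step is below; the seed scores and the `summarize` helper are hypothetical illustrations, not the authors' code:

```python
# Hypothetical sketch of the per-benchmark reporting convention: each model is
# scored over several runs (e.g. random seeds) and the mean and standard
# deviation are reported. The seed scores below are illustrative only.
from statistics import mean, stdev

def summarize(scores):
    """Aggregate per-run scores into the 'mean ± sd' form used in the results."""
    return mean(scores), stdev(scores)

# Example: three hypothetical runs of one model on one benchmark.
m, s = summarize([0.79, 0.80, 0.77])
print(f"{m:.3f} ± {s:.3f}")
```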
Results
The paper presents the results in a tabular format, summarizing the performance metrics (likely AUC or accuracy) with their respective standard deviations for various models across different benchmarks. Here are some noteworthy findings:
| Benchmark | Best model(s) | Score |
| --- | --- | --- |
| Mouse Enhancers | SUBS (fine-tuned) | 0.795 ± 0.029 |
| Coding vs. Intergenomic | SUBS (fine-tuned), SEDD, Caduceus (tie) | 0.913 |
| Human vs. Worm | Caduceus | 0.971 ± 0.001 |
| Human Enhancers Cohn | SEDD | 0.746 ± 0.015 |
| Human Enhancers Ensembl | Caduceus | 0.907 ± 0.000 |
| Human Regulatory | Caduceus | 0.874 ± 0.003 |
| Human OCR Ensembl | SUBS (fine-tuned) | 0.823 ± 0.008 |
| Human NonTATA Promoters | SUBS (fine-tuned) | 0.940 ± 0.007 |
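To make the "no single winner" observation concrete, the best-per-benchmark results reported above can be tallied directly. The scores are the means quoted in the results; the dictionary layout itself is just an illustration:

```python
# Tally which model posts the best reported score on each benchmark.
# Scores are the mean values quoted in the results section above.
from collections import Counter

best = {
    "Mouse Enhancers": ("SUBS (fine-tuned)", 0.795),
    "Coding vs. Intergenomic": ("SUBS (fine-tuned) / SEDD / Caduceus", 0.913),
    "Human vs. Worm": ("Caduceus", 0.971),
    "Human Enhancers Cohn": ("SEDD", 0.746),
    "Human Enhancers Ensembl": ("Caduceus", 0.907),
    "Human Regulatory": ("Caduceus", 0.874),
    "Human OCR Ensembl": ("SUBS (fine-tuned)", 0.823),
    "Human NonTATA Promoters": ("SUBS (fine-tuned)", 0.940),
}

wins = Counter(model for model, _ in best.values())
for model, n in wins.most_common():
    print(f"{model}: best on {n} of {len(best)} benchmarks")
```

No model tops more than three of the eight benchmarks, which is the paper's motivation for task-specific model selection.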
Discussion
The comparison highlights that no single model consistently outperformed the others across all benchmarks. However, the SUBS (fine-tuned) and Caduceus models generally showed robust, high performance across multiple tasks, indicating their potential suitability for broader genomic applications.
Implications
- Practical: The differential performance across benchmarks indicates the necessity of model selection tailored to specific genomic tasks for optimal outcomes. The fine-tuning approach also demonstrates substantial performance gains, suggesting that customized training can significantly enhance prediction capabilities.
- Theoretical: The consistently high performance of models like SUBS (fine-tuned) and Caduceus suggests that their underlying methodologies capture essential genomic features effectively. This finding motivates further research into the unique architectural advantages these models may possess.
Future Directions
Future research can aim to fine-tune and adapt high-performing models for more specific genomic tasks outside the current benchmarks. Additionally, integrating these models with emerging genomic datasets and evaluating their transferability and generalization capabilities could further validate their robustness.
Conclusion
The paper provides a valuable benchmarking comparison of various genomic prediction models, elucidating the strengths and weaknesses of each across different genomic classification tasks. It lays the groundwork for future advancement and adaptation of machine learning models in genomics.