Best Practices and Lessons Learned on Synthetic Data (2404.07503v2)

Published 11 Apr 2024 in cs.CL

Abstract: The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy LLMs.

Exploring the Frontiers of Synthetic Data in AI Development

Introduction to Synthetic Data

Synthetic data has moved to the center of AI development as a practical answer to data scarcity, privacy constraints, and the steep costs of data acquisition and annotation. Generated by algorithms, generative models, or simulations, it mirrors the statistical properties of real-world data and can substantially improve model training. Realizing that potential, however, requires ensuring the generated data is factually accurate, free of harmful biases, and faithful to the target distribution.

Synthetic Data Utilization in Model Training

Applications Across Domains

  • Mathematical Reasoning: Synthetic question-answer generation has driven notable gains on math-related tasks. Scaling synthetic math data is relatively straightforward; verifying the correctness of the generated problems and solutions remains the main hurdle.
  • Code Reasoning: Unlike math, code can be executed, so synthetic samples come naturally paired with execution results that serve as a verification signal. Actor-critic approaches and self-improvement strategies have made substantial progress here (see the sketch after this list).
  • Tool-use Learning and Planning: Synthetic interaction trajectories have proven effective for teaching LMs to use tools, as in LaMDA and Toolformer. Similarly, synthetic environments have helped LLMs learn complex planning tasks with a considerable degree of autonomy.
  • Multi-modal Data Generation: From reverse rendering (recovering textual or structured representations from images) to multi-modal instruction following, synthetic data has been used to build high-quality, diverse datasets for training vision-language models.
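
To make the execution-based verification idea from the code-reasoning bullet concrete, here is a minimal sketch, assuming synthetic samples arrive as (problem, solution, tests) dictionaries from some generator model; the sample format and the filter_synthetic_code helper are illustrative assumptions, not the paper's implementation.

```python
import subprocess
import sys
import tempfile


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution together with its synthetic tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        # Assertion failures or exceptions in the tests yield a nonzero exit code.
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def filter_synthetic_code(samples: list[dict]) -> list[dict]:
    """Keep only samples whose generated solution passes its generated tests.

    Each sample is assumed to look like {"problem": ..., "solution": ..., "tests": ...},
    produced upstream by a generator model (not shown here).
    """
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]
```

Only the triples that execute cleanly would survive into the training set, which is what makes code data easier to verify at scale than synthetic math data.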

Evaluating Synthetic Data's Role

The application of synthetic data extends beyond training to evaluation, where it serves crucial roles in assessing factuality, safety, and the overall performance of AI models. Techniques have evolved from basic statistical measures to more sophisticated model-based and real-time simulation methods, substantially enriching the evaluation landscape.
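
As a rough illustration of evaluating a model on synthetic test items (the protocols surveyed in the paper are more sophisticated and typically model-based), the sketch below scores a model's answers against synthetic question-reference pairs; model_answer and the example items are hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical synthetic evaluation items; a real set would be generated by an LM
# and checked for contamination against the training data.
SYNTHETIC_EVAL_SET: List[Dict[str, str]] = [
    {"question": "Which planet is third from the Sun?", "reference": "Earth"},
    {"question": "What is 17 * 24?", "reference": "408"},
]


def evaluate(model_answer: Callable[[str], str]) -> float:
    """Return the fraction of synthetic items the model answers correctly.

    Substring matching against the reference is a deliberately crude stand-in
    for the model-based judges discussed in the survey.
    """
    hits = 0
    for item in SYNTHETIC_EVAL_SET:
        prediction = model_answer(item["question"])
        if item["reference"].lower() in prediction.lower():
            hits += 1
    return hits / len(SYNTHETIC_EVAL_SET)


# Example: a trivial "model" that always answers "Earth" scores 0.5 on this set.
print(evaluate(lambda question: "Earth"))
```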

Challenges and Future Directions

Synthetic data is not without pitfalls. Concerns range from the potential proliferation of misinformation and the ambiguities it introduces for AI alignment to the difficulty of decontaminating evaluation benchmarks. Future work is needed in several areas:

  • Quality and diversity improvements in synthetic data generation, aiming for high fidelity and real-world resemblance, with attributes that closely mimic target domains.
  • Efficient scalable oversight that utilizes synthetic data for robust monitoring of AI systems, addressing the need for comprehensive governance frameworks.

Conclusion

Synthetic data harbors transformative potential for AI development, offering a versatile solution to several longstanding challenges. It makes it possible to generate abundant, diverse, and controlled training datasets while navigating privacy and ethical constraints. Looking ahead, refining generative techniques will be paramount to realizing the full benefits of synthetic data and to ensuring AI models are more robust, inclusive, and aligned with human values and societal norms. The path forward holds real challenges, but also significant promise.

Authors (11)
  1. Ruibo Liu
  2. Jerry Wei
  3. Fangyu Liu
  4. Chenglei Si
  5. Yanzhe Zhang
  6. Jinmeng Rao
  7. Steven Zheng
  8. Daiyi Peng
  9. Diyi Yang
  10. Denny Zhou
  11. Andrew M. Dai