FLAME: A small language model for spreadsheet formulas (2301.13779v2)
Abstract: Spreadsheets are a vital tool for end-user data management. Using large language models (LLMs) for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and trained on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME outperforms much larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and GraphCodeBERT.
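The abstract mentions curating the training corpus with sketch deduplication. The snippet below is a minimal, hypothetical sketch of that idea, assuming a formula's "sketch" is obtained by replacing cell references and literal constants with placeholders so that formulas differing only in those details collapse together; the regexes, placeholder names, and helper functions are illustrative assumptions, not the paper's actual implementation.

```python
import re
from typing import Iterable

# Illustrative sketch-based deduplication (assumption: a sketch abstracts away
# cell references and literal constants, keeping only the formula's structure).
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+(:\$?[A-Z]{1,3}\$?\d+)?")  # A1, $B$2, A1:C3
STRING = re.compile(r'"[^"]*"')
NUMBER = re.compile(r"\b\d+(\.\d+)?\b")

def sketch(formula: str) -> str:
    """Replace cell references, string literals, and numbers with placeholders."""
    s = CELL_REF.sub("<ref>", formula)
    s = STRING.sub("<str>", s)
    s = NUMBER.sub("<num>", s)
    return s

def deduplicate(formulas: Iterable[str]) -> list[str]:
    """Keep one representative formula per distinct sketch."""
    seen, kept = set(), []
    for f in formulas:
        key = sketch(f)
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept

if __name__ == "__main__":
    corpus = ['=SUM(A1:A10)', '=SUM(B2:B20)', '=IF(C1>5,"hi","lo")']
    # The two SUM formulas share the sketch '=SUM(<ref>)', so only one is kept.
    print(deduplicate(corpus))
```

Deduplicating by sketch rather than by exact string removes near-duplicate formulas that differ only in which cells or constants they reference, the kind of redundancy that prior work on code duplication (Allamanis 2019, cited below) identifies as harmful to trained models.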
- GoalDebug: A spreadsheet debugger for end users. In 29th International Conference on Software Engineering (ICSE’07), 251–260. IEEE.
- Allamanis, M. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, 143–153.
- Fuse: A reproducible, extendable, internet-scale corpus of spreadsheets. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, 486–489. IEEE.
- Neurosymbolic Repair for Low-Code Formula Languages. Proc. ACM Program. Lang., 6(OOPSLA2).
- TFix: Learning to fix coding errors with a text-to-text transformer. In International Conference on Machine Learning, 780–791. PMLR.
- Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932.
- Evaluating Large Language Models Trained on Code.
- SpreadsheetCoder: Formula prediction from semi-structured context. In International Conference on Machine Learning, 1661–1672. PMLR.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311.
- Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. arXiv preprint arXiv:2201.09745.
- Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1536–1547. Online: Association for Computational Linguistics.
- InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999.
- Dependence language model for information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 170–177.
- GitHub. 2021. GitHub Copilot. https://github.com/features/copilot/. [Online; accessed 09-January-2023].
- GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference on Learning Representations.
- Learning to complete code with sketches. In International Conference on Learning Representations.
- DeepFix: Fixing common C language errors by deep learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- HermEs: Interactive Spreadsheet Formula Prediction via Hierarchical Formulet Expansion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8356–8372.
- Enron’s spreadsheets and related emails: A dataset and analysis. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 2, 7–16. IEEE.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2790–2799. PMLR.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Repair is nearly generation: Multilingual program repair with LLMs. arXiv preprint arXiv:2208.11640.
- Deduplicating training data mitigates privacy risks in language models. arXiv preprint arXiv:2202.06539.
- Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 7871–7880. Association for Computational Linguistics.
- Automating code review activities by large-scale pre-training. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1035–1047.
- TAPEX: Table pre-training via learning a neural SQL executor. arXiv preprint arXiv:2107.07653.
- Morgan Stanley. 2015. Morgan Stanley Technology, Media & Telecom Conference.
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis.
- Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
- Benchmarking spreadsheet systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 1589–1599.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992.
- Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. Berlin, Germany: Association for Computational Linguistics.
- FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language. arXiv preprint arXiv:2310.17306.
- A general language model for information retrieval. In Proceedings of the eighth international conference on Information and knowledge management, 316–321.
- Language model information retrieval with document expansion. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, 407–414.
- UL2: Unifying Language Learning Paradigms.
- Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.
- A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 1–10.
- Graph-based, self-supervised program repair from diagnostic feedback. In International Conference on Machine Learning, 10799–10808. PMLR.
- Break-it-fix-it: Unsupervised learning for program repair. In International Conference on Machine Learning, 11941–11952. PMLR.
- A proximity language model for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 291–298.