Empowering Character-level Text Infilling by Eliminating Sub-Tokens (2405.17103v2)

Published 27 May 2024 in cs.CL and cs.AI

Abstract: In infilling tasks, sub-tokens, i.e., instances where a complete token is split into two parts, often emerge at the boundaries of prefixes, middles, and suffixes. Traditional methods focus on training models at the token level, which leads to sub-optimal performance on character-level infilling tasks at inference time. Alternatively, some approaches consider character-level infilling but rely on predicting sub-tokens during inference; this strategy degrades character-level infilling ability because the model exhibits high perplexity on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling by using a line-level format so that no sub-token needs to be predicted at inference. In addition, we incorporate two special tokens to signify the remainders of the incomplete lines, thereby improving generation guidance. Extensive experiments demonstrate that the proposed approach surpasses previous methods by a significant margin. Code is available at https://github.com/SenseLLM/FIM-SE.
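The abstract describes the core idea: instead of asking the model to complete sub-tokens at the prefix/middle/suffix boundaries, the infilling region is reformatted at line granularity, and the residual fragments of the two incomplete boundary lines are supplied as constraints via special tokens. The sketch below illustrates one way such a prompt could be assembled; the special-token names (PRE/SUF/START/END/MID) and the exact layout are assumptions for illustration only, not the paper's published format (see the linked repository for the authoritative implementation).

```python
# Minimal sketch of a FIM-SE-style prompt builder (token names are illustrative
# assumptions, not the paper's exact vocabulary).

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"
START, END = "<START>", "<END>"  # mark the residual fragments of the incomplete lines


def build_fim_se_prompt(prefix: str, suffix: str) -> str:
    """Split the prefix/suffix at line boundaries so the model never has to
    predict a sub-token: the trailing fragment of the prefix's last line and
    the leading fragment of the suffix's first line are passed as constraints
    instead of being generated as boundary sub-tokens."""
    # Complete lines of the prefix, plus the trailing incomplete fragment.
    last_newline = prefix.rfind("\n")
    prefix_lines = prefix[: last_newline + 1] if last_newline != -1 else ""
    l_prefix = prefix[last_newline + 1 :]  # incomplete last line of the prefix

    # Leading incomplete fragment of the suffix, plus its remaining complete lines.
    first_newline = suffix.find("\n")
    f_suffix = suffix if first_newline == -1 else suffix[:first_newline]
    suffix_lines = "" if first_newline == -1 else suffix[first_newline:]

    # The model is asked to generate the spanned region at line level, guided by
    # the two fragments, so only whole tokens appear in its prediction targets.
    return (
        f"{PRE}{prefix_lines}"
        f"{SUF}{suffix_lines}"
        f"{START}{l_prefix}"
        f"{END}{f_suffix}"
        f"{MID}"
    )


if __name__ == "__main__":
    # Example: infill between an incomplete function body line and the rest of it.
    print(build_fim_se_prompt("def add(a, b):\n    ret", "a + b\n"))
```

In this sketch the complete prefix and suffix lines go into the PRE/SUF regions, while the incomplete-line fragments travel behind the START/END markers as generation constraints, which is the role the abstract attributes to the two added special tokens.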

Authors (4)
  1. Houxing Ren (16 papers)
  2. Mingjie Zhan (23 papers)
  3. Zhongyuan Wu (4 papers)
  4. Hongsheng Li (340 papers)
