Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training (2212.09897v2)

Published 19 Dec 2022 in cs.CL

Abstract: Language tasks involving character-level manipulations (e.g., spelling corrections, arithmetic operations, word games) are challenging for models operating on subword units. To address this, we develop a causal intervention framework to learn robust and interpretable character representations inside subword-based language models. Our method treats each character as a typed variable in a causal model and learns such causal structures by adapting the interchange intervention training method of Geiger et al. (2021). We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context. While character-level models still perform best on purely form-based tasks like string reversal, our method outperforms character-level models on more complex tasks that blend form, meaning, and context, such as spelling correction in context and word search games. Compared with standard subword-based models, our approach also significantly improves robustness on unseen token sequences and leads to human-interpretable internal representations of characters.
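At its core, interchange intervention training swaps the internal representation aligned with a causal variable (here, a character type at a given position) from a forward pass on a source input into a forward pass on a base input, and then trains the model to produce the counterfactual output that the causal model prescribes. The sketch below illustrates one such training step under stated assumptions: `model.encode`, `model.decode`, and the position alignment `char_pos` are hypothetical names for exposition, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def interchange_intervention_step(model, base_ids, source_ids, char_pos,
                                  counterfactual_labels, optimizer):
    """One interchange-intervention training step for a character variable.

    Hypothetical sketch: swap the hidden state aligned with one character
    position from the source input's forward pass into the base input's
    forward pass, then train the model to emit the counterfactual output
    (the base word with that character replaced by the source character).
    """
    # Encode both inputs; gradients flow through both runs, since the same
    # parameters produce the source and base representations.
    source_hidden = model.encode(source_ids)          # (seq_len, d_model)
    base_hidden = model.encode(base_ids).clone()      # (seq_len, d_model)

    # Intervene: overwrite the slice aligned with the character variable.
    base_hidden[char_pos] = source_hidden[char_pos]

    # Decode from the intervened representation and fit the counterfactual label.
    logits = model.decode(base_hidden)                # (seq_len, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           counterfactual_labels.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this interchange loss is combined with the ordinary task loss, so the model learns both to solve the task and to localize each character in the aligned slice of its hidden state.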

References (44)
  1. Char2Subword: Extending the subword embedding space using robust character compositionality. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1640–1651, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  2. Approximate causal abstractions. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 606–615. PMLR.
  3. Sander Beckers and Joseph Y. Halpern. 2019. Abstracting causal models. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):2678–2685.
  4. Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
  5. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
  6. Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, Online. Association for Computational Linguistics.
  7. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  8. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73–91.
  9. Cicero Dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In International Conference on Machine Learning, pages 1818–1826. PMLR.
  10. Cryptonite: A cryptic crossword benchmark for extreme ambiguity in language. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4186–4192, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  11. CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6903–6915, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  12. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 9574–9586.
  13. Faithful, interpretable model explanations via causal abstraction. Stanford AI Lab Blog.
  14. Inducing causal structure for interpretable neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338. PMLR.
  15. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.
  16. DeBERTa: Decoding-enhanced BERT with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  17. Itay Itzhak and Omer Levy. 2022. Models in a spelling bee: Language models implicitly learn the character composition of tokens. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5061–5068, Seattle, United States. Association for Computational Linguistics.
  18. Ayush Kaushal and Kyle Mahowald. 2022. What do tokens know about their characters and how do they know it? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2487–2507, Seattle, United States. Association for Computational Linguistics.
  19. Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
  20. Why don’t people use character-level machine translation? arXiv preprint arXiv:2110.08191.
  21. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  22. Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063, Berlin, Germany. Association for Computational Linguistics.
  23. Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.
  24. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv preprint arXiv:2112.10508.
  25. George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
  26. AmbiPun: Generating humorous puns with ambiguous context. arXiv preprint arXiv:2205.01825.
  27. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  28. Yuval Pinter. 2021. Integrating approaches to word representation. arXiv preprint arXiv:2109.04876.
  29. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 102–112, Copenhagen, Denmark. Association for Computational Linguistics.
  30. Will it unblend? In Proceedings of the Society for Computation in Linguistics 2021, pages 474–476, Online. Association for Computational Linguistics.
  31. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  32. Noisy UGC translation at the character level: Revisiting open-vocabulary capabilities and robustness of char-based models. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 199–211, Online. Association for Computational Linguistics.
  33. Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP. In Advances in Neural Information Processing Systems.
  34. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  35. Timo Schick and Hinrich Schütze. 2019. Attentive mimicking: Better word embeddings by attending to informative contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 489–494, Minneapolis, Minnesota. Association for Computational Linguistics.
  36. Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.
  37. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  38. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations.
  39. Automated crossword solving. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3073–3085, Dublin, Ireland. Association for Computational Linguistics.
  40. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  41. Causal Proxy Models for concept-based model explanations. arXiv preprint arXiv:2209.14279.
  42. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306.
  43. Homophonic pun generation with lexically constrained rewriting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2870–2876, Online. Association for Computational Linguistics.
  44. OPT: Open Pre-trained Transformer language models. arXiv preprint arXiv:2205.01068.
Authors (4)
  1. Jing Huang (140 papers)
  2. Zhengxuan Wu (37 papers)
  3. Kyle Mahowald (40 papers)
  4. Christopher Potts (113 papers)
Citations (12)