
Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model (2311.17487v1)

Published 29 Nov 2023 in cs.CL and cs.AI

Abstract: In the realm of LLMs, the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces Taiwan LLM, a pioneering LLM that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. Taiwan LLM represents the first of its kind, a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that Taiwan LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models that are predominantly trained on Simplified Chinese or English. The open-source release of Taiwan LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well-served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.

Taiwan-LLM: A Culturally Aligned LLM for Traditional Chinese

The paper introduces Taiwan-LLM, an LLM specifically designed for Traditional Chinese as used in Taiwan. The model addresses linguistic and cultural aspects of Traditional Chinese that differ significantly from the Simplified Chinese and English data dominating existing LLMs.

Methodological Approach

The development of Taiwan-LLM encompasses a three-phase methodology: Continue-Pretraining (cPT), Supervised Fine-Tuning (SFT), and Feedback Supervised Fine-Tuning (Feedback SFT).

  • Continue-Pretraining (cPT): This phase involves enhancing a base model with a comprehensive Taiwanese corpus to capture the intricacies of Traditional Chinese.
  • Supervised Fine-Tuning (SFT): Utilizing a multi-turn dialogue dataset, this phase hones the model's conversational abilities, emphasizing cultural nuances.
  • Feedback Supervised Fine-Tuning (Feedback SFT): Incorporating user feedback aligns the model with user preferences and improves its cultural and linguistic relevance.
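
The Feedback SFT data step can be sketched as filtering collected dialogues by user ratings before fine-tuning on them. This is a minimal illustration only; the field names (`messages`, `rating`), the 1-5 scale, and the threshold are assumptions, not the paper's actual schema.

```python
# Hypothetical Feedback SFT data-selection step: keep only dialogues
# that users rated highly, then fine-tune on the survivors.
# Schema and threshold are illustrative assumptions.

def filter_by_feedback(dialogues, min_rating=4):
    """Keep dialogues users rated at or above min_rating (1-5 scale)."""
    return [d for d in dialogues if d["rating"] >= min_rating]

logs = [
    {"messages": ["台北今天天氣如何？", "今天台北多雲時晴。"], "rating": 5},
    {"messages": ["介紹台灣夜市", "（離題的回答）"], "rating": 2},
]
kept = filter_by_feedback(logs)
print(len(kept))  # 1
```

Only the highly rated dialogue survives; the filtered set would then be used as SFT training data.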

Experimental Results

Taiwan-LLM performs competitively with proprietary models such as GPT-3.5 Turbo. On the TC-Eval benchmark suite, the 13-billion-parameter version achieves an average score of 53.99%, comparable to the proprietary models while handling Traditional Chinese more reliably.

The results underscore the impact of the continue-pretraining phase, which improves linguistic accuracy across tasks. In contrast, adding filtered CommonCrawl data did not help, underscoring the importance of high-quality, culturally relevant datasets.
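
In the spirit of this data-quality finding, a crawl-filtering pass might keep only documents that look like Traditional Chinese. The character lists and threshold below are our own simplified assumptions, not the paper's actual pipeline:

```python
# Toy heuristic: compare counts of characters that differ between
# Simplified and Traditional forms, keeping Traditional-leaning docs.
# Character sets and margin are illustrative assumptions.

SIMPLIFIED_ONLY = set("国说门读书对们这")   # common Simplified forms
TRADITIONAL_ONLY = set("國說門讀書對們這")  # corresponding Traditional forms

def looks_traditional(text, min_margin=1):
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    return trad - simp >= min_margin

docs = ["這是一本繁體中文書。", "这是一本简体中文书。"]
kept = [d for d in docs if looks_traditional(d)]
print(len(kept))  # 1
```

A production filter would use far richer signals (language ID, perplexity, domain lists), but the principle is the same: prefer curated, culturally relevant text over raw crawl volume.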

Contribution and Implications

Taiwan-LLM is significant within the landscape of NLP, offering an open-source solution that invites collaboration and further development. The model sets a precedent in addressing the linguistic diversity of Traditional Chinese, offering equitable access to language technologies.

Future Directions

The development of Taiwan-LLM opens avenues for refining similar models tailored to other underrepresented languages. Further exploration of advanced training methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) is suggested for further performance gains.
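
DPO, one of the suggested directions, trains directly on preference pairs with a simple per-pair loss. A minimal sketch of that loss follows; all numeric values are made up for illustration:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are the policy's log-probs of the chosen/rejected responses;
    ref_* are the frozen reference model's log-probs of the same responses.
    """
    # How much more the policy prefers "chosen" than the reference does.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response more than the reference: low loss.
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))  # ≈ 0.40
```

The loss shrinks as the policy widens its preference for the chosen response relative to the reference model, which is what pushes the fine-tuned model toward user-preferred behavior without an explicit reward model.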

Conclusion

Taiwan-LLM marks a crucial step in bridging the technological divide for Traditional Chinese speakers. By focusing on the linguistic nuances and cultural context of its target demographic, it establishes a benchmark for culturally aligned LLMs.

This work signifies progress towards inclusive language representation in AI, ensuring the preservation and accessibility of linguistic diversity within technological advancements.

Authors
  1. Yen-Ting Lin
  2. Yun-Nung Chen