Taiwan-LLM: A Culturally Aligned LLM for Traditional Chinese
The paper introduces Taiwan-LLM, an LLM specifically designed for Traditional Chinese as used in Taiwan. This work addresses linguistic and cultural aspects of Traditional Chinese that differ significantly from Simplified Chinese and English, the languages that dominate existing LLMs, and that have largely been overlooked.
Methodological Approach
The development of Taiwan-LLM follows a three-phase methodology: Continue-Pretraining (cPT), Supervised Fine-Tuning (SFT), and Feedback Supervised Fine-Tuning (Feedback SFT); a code sketch of the pipeline follows the list below.
- Continue-Pretraining (cPT): The base model is further pretrained on a comprehensive Taiwanese corpus so that it captures the vocabulary, usage, and cultural context of Traditional Chinese.
- Supervised Fine-Tuning (SFT): A multi-turn dialogue dataset hones the model's conversational abilities, with an emphasis on cultural nuances.
- Feedback Supervised Fine-Tuning (Feedback SFT): Training on user feedback aligns the model with user preferences, improving its linguistic and cultural relevance.
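The sketch below outlines the pipeline under the assumption of a Hugging Face-style training stack. The base checkpoint, dataset file names, and hyperparameters are illustrative placeholders, not the authors' actual configuration; all three phases share a causal language-modeling objective and differ only in the data they see.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-13b-hf"  # placeholder base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# Llama-style tokenizers have no pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)


def tokenize(batch):
    # Truncate long documents/dialogues to a fixed context length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)


def run_stage(data_file, output_dir):
    """Run one training phase: a causal-LM pass over the given corpus."""
    dataset = load_dataset("json", data_files=data_file)["train"]
    dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


# Phase 1 (cPT): raw Traditional Chinese corpus.
run_stage("taiwan_corpus.jsonl", "ckpt-cpt")
# Phase 2 (SFT): multi-turn dialogues serialized to plain text.
run_stage("dialogues_sft.jsonl", "ckpt-sft")
# Phase 3 (Feedback SFT): user-preferred responses, same objective.
run_stage("feedback_sft.jsonl", "ckpt-feedback-sft")
```

In practice the SFT and feedback-SFT corpora would be multi-turn dialogues rendered through a chat template before tokenization; the paper's exact formatting is not reproduced here.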
Experimental Results
Taiwan-LLM exhibits competitive performance, particularly in comparison to proprietary models such as GPT-3.5 Turbo. On the TC-Eval benchmark suite, the 13-billion-parameter version achieves an average score of 53.99%, comparable to that proprietary baseline while handling Traditional Chinese more faithfully.
The results underscore the impact of the continue-pretraining phase, which improves linguistic accuracy across tasks. In contrast, adding filtered CommonCrawl data did not help, highlighting the importance of high-quality, culturally relevant training data.
Contribution and Implications
Taiwan-LLM is a notable contribution to the NLP landscape: it is released as an open-source model, inviting collaboration and further development. The model sets a precedent for addressing the linguistic diversity of Traditional Chinese and provides more equitable access to language technologies.
Future Directions
The development of Taiwan-LLM opens avenues for similar models tailored to other underrepresented languages. Further exploration of alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) is suggested for further performance gains; a sketch of the DPO objective follows.
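As a rough illustration of the preference-based direction, the snippet below implements the standard DPO objective (Rafailov et al., 2023) over per-response log-probabilities. The function name, batch layout, and beta value are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over summed per-response log-probabilities.

    Each argument is a 1-D tensor with one entry per (prompt, response) pair;
    beta=0.1 is a common default, not a value reported in the paper.
    """
    # Implicit rewards: how much the policy prefers each response
    # relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice, off-the-shelf trainers (e.g., the DPOTrainer in Hugging Face TRL) wrap this loss, so a follow-up preference-tuning stage would not need a hand-rolled implementation.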
Conclusion
Taiwan-LLM marks a crucial step in bridging the technological divide for Traditional Chinese speakers. By focusing on the language's nuances and cultural context, it meets the needs of its target users and establishes a benchmark for culturally aligned LLMs.
This work signifies progress towards inclusive language representation in AI, helping preserve linguistic diversity and keep it accessible as the technology advances.