Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding (2404.00862v1)

Published 1 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated exceptional performance in various NLP applications. However, the majority of existing open-source LLMs are pre-trained primarily on English data, with only a small portion drawn from other languages. This deficiency in multilingual training data results in suboptimal performance when the models are applied to languages with fewer available resources. Furthermore, enhancing the performance of LLMs on low-resource languages by full-parameter fine-tuning with additional data requires substantial computational resources, posing a barrier for research organizations and individual researchers. Consequently, several techniques such as parameter-efficient tuning and advanced embedding initialization have been proposed to address these challenges. In this work, we combine them to facilitate cross-lingual transfer on an English-dominated open-source LLM. To effectively enhance the model's proficiency in Traditional Chinese, we conduct secondary pre-training on Llama 2 7B with Traditional Chinese data by leveraging QLoRA and our proposed zip-tie embedding initialization. The resulting model, Bailong, stands for Bilingual trAnsfer learnIng based on qLOra and zip-tie embeddiNG. We present Bailong-instruct 7B, a fine-tuned version of Bailong 7B optimized for multi-turn dialogue scenarios. Recognizing the inadequacy of benchmark datasets in Traditional Chinese, we further introduce Bailong-bench to assess models' alignment with human preferences and their ability to follow instructions in both Traditional Chinese and English tasks. In our evaluation, Bailong-instruct 7B exhibits competitive performance on Bailong-bench and other benchmark datasets when compared to other open-source models of similar or even larger parameter sizes. Bailong-instruct 7B and Bailong-bench are publicly available, with the aim of empowering the community to build upon our efforts.
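
The abstract combines two ingredients for secondary pre-training: QLoRA (a 4-bit quantized, frozen base model with trainable low-rank adapters) and the proposed zip-tie embedding initialization for a vocabulary extended with Traditional Chinese tokens. The sketch below is a minimal illustration of that kind of pipeline using Hugging Face transformers and peft; it is not the paper's implementation. The token list, checkpoint names, and hyperparameters are assumptions, and the new embedding rows are filled with plain subword-average initialization as a stand-in, since the paper's zip-tie scheme is its own method and may differ in detail.

```python
# A minimal sketch (not the paper's code) of QLoRA-style secondary pre-training
# with an extended vocabulary. Model paths, the token list, and all
# hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Llama-2-7b-hf"

# 1) Extend the tokenizer with (hypothetical) Traditional Chinese tokens.
old_tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)
new_tokens = ["台灣", "語言", "模型"]  # a real extension would add many more tokens
tokenizer.add_tokens(new_tokens)

# 2) Initialize the new embedding rows. Here each new token starts from the
#    mean of the embeddings of the pieces the original tokenizer splits it
#    into, applied identically to the input and output embeddings. Bailong's
#    zip-tie initialization is the paper's own scheme and may differ.
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))
emb_in = model.get_input_embeddings().weight.data
emb_out = model.get_output_embeddings().weight.data
for tok in new_tokens:
    new_id = tokenizer.convert_tokens_to_ids(tok)
    piece_ids = old_tokenizer(tok, add_special_tokens=False)["input_ids"]
    emb_in[new_id] = emb_in[piece_ids].mean(dim=0)
    emb_out[new_id] = emb_out[piece_ids].mean(dim=0)
model.save_pretrained("llama2-7b-zhtw-extended")
tokenizer.save_pretrained("llama2-7b-zhtw-extended")

# 3) Reload the extended checkpoint in 4-bit NF4 and attach LoRA adapters
#    (the QLoRA recipe), so only the low-rank adapters are trained during
#    secondary pre-training on Traditional Chinese text.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b-zhtw-extended", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```

Under this setup the 7B base weights stay frozen in 4-bit precision and only the adapters receive gradients, which is what makes secondary pre-training feasible without the full-parameter fine-tuning the abstract identifies as a computational barrier. Training the newly added embedding rows as well would require marking them trainable explicitly, for example via `modules_to_save` in `LoraConfig`.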

Authors (2)
  1. Lung-Chuan Chen (2 papers)
  2. Zong-Ru Li (1 paper)