TigerBot: An Open Multilingual Multitask LLM (2312.08688v2)

Published 14 Dec 2023 in cs.CL and cs.AI

Abstract: We release and introduce the TigerBot family of LLMs, consisting of base and chat models with 7, 13, 70, and 180 billion parameters. We develop our models starting from Llama-2 and BLOOM, and push the boundary further in data, training algorithms, infrastructure, and application tools. Our models yield meaningful performance gains over SOTA open-source models such as Llama-2: a 6% gain in English and a 20% gain in Chinese. The TigerBot family also achieves leading performance on major academic and industrial benchmarks and leaderboards. We believe that TigerBot represents just a snapshot of the lightning-fast progress in the LLM open-source community. We are therefore thrilled to give back by publicly releasing our models and reporting the approach behind them, with additional emphasis on building SOTA LLMs in a democratized way and making LLMs useful in real-world applications.

Introduction

LLMs have transformed the AI landscape with capabilities that edge ever closer to artificial general intelligence (AGI). Their functionality spans many domains, from simple question answering to complex coding tasks. The evolution of LLMs has been driven chiefly by advances in foundational capability, computational efficiency, and readiness for real-world applications; this typically involves pretraining on an extensive corpus and then refining the model through supervised fine-tuning and reinforcement learning. TigerBot joins this cohort of LLMs, following an established lineage of models while carving out its own niche in both performance and application diversity.

TigerBot Models

TigerBot is a family of LLMs ranging from 7B to 180B parameters, designed for multilingual, multitask applications. It is openly available for both research and commercial use, and ships with developer tools and a developer-friendly API. The models are trained on roughly 500 billion tokens of data vetted for quality and diversity, including a significant portion of Chinese-language data and a wide range of tasks. TigerBot also integrates with contemporary search engines and knowledge bases.
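
For concreteness, a released TigerBot checkpoint can be loaded with the standard Hugging Face Transformers API. The sketch below is illustrative only: the model ID and prompt format are assumptions, so the official TigerBot release should be consulted for the exact checkpoint names and chat template.

```python
# Minimal sketch of loading a TigerBot chat checkpoint with Hugging Face
# Transformers. The model ID below is an assumption (based on the
# TigerResearch organization on the Hugging Face Hub); check the official
# release for the exact name and recommended prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TigerResearch/tigerbot-13b-chat"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto",          # spread layers across available devices
)

# "Introduce large language models in three sentences." (Chinese prompt,
# reflecting the model's emphasis on Chinese-language data.)
prompt = "请用三句话介绍一下大语言模型。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```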

Training Methods and Data

Underpinning TigerBot's abilities is a diverse dataset of public and proprietary sources, curated for quality and multilingual coverage. To expand language coverage, the tokenizer vocabulary is augmented by blending in vocabularies from prominent multilingual models such as BLOOM, while training efficiency comes from extensive experimentation with parallelism strategies. TigerBot's multilingual, multitask coverage, combined with a suite of algorithmic and infrastructure enhancements, places it among the strongest open-source LLMs. This has been achieved at a modest computational cost and with an emphasis on a low carbon footprint, in keeping with the goal of democratizing LLM development.
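
As one way to make the vocabulary-blending idea concrete, the sketch below shows a generic recipe for extending a base tokenizer with tokens drawn from a multilingual donor tokenizer such as BLOOM's, then resizing the model's embeddings to match. This is not the authors' exact procedure, and the model IDs are placeholders; the newly added embedding rows would still need to be learned during continued pretraining on multilingual data.

```python
# Generic sketch of tokenizer vocabulary extension (not TigerBot's exact recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-2-7b-hf"   # placeholder base model
DONOR_ID = "bigscience/bloom-560m"     # donor tokenizer with broad multilingual coverage

base_tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
donor_tokenizer = AutoTokenizer.from_pretrained(DONOR_ID)

# Collect donor tokens the base tokenizer does not already have.
base_vocab = set(base_tokenizer.get_vocab())
new_tokens = [tok for tok in donor_tokenizer.get_vocab() if tok not in base_vocab]

num_added = base_tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens to the base vocabulary")

model = AutoModelForCausalLM.from_pretrained(BASE_ID)
# New embedding rows (and LM-head rows) are randomly initialized; they must be
# trained on multilingual data before they are useful.
model.resize_token_embeddings(len(base_tokenizer))
```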

Applications and Safety

TigerBot's versatility shows in applications ranging from long-context QA and online search augmentation to more specialized uses such as role-playing and function calling. The model also addresses safety concerns through comprehensive filtering during training and at runtime, so that outputs remain aligned with human values. With safety as a priority, TigerBot stays aligned with core social values and legal requirements through iterative, monitored alignment on real user data.
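
Search augmentation, in particular, follows the familiar retrieval-augmented prompting pattern: retrieve passages from a search engine or knowledge base, prepend them to the user's question, and have the chat model answer from that context. The sketch below is a generic illustration; the `web_search` helper is hypothetical, since the paper does not specify the retrieval API.

```python
# Generic sketch of search-augmented QA prompting (not TigerBot's internal API).
from typing import List


def web_search(query: str, k: int = 3) -> List[str]:
    """Hypothetical retrieval backend returning the top-k passages for a query."""
    raise NotImplementedError("Plug in a real search engine or knowledge base here.")


def build_augmented_prompt(question: str) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved passages."""
    passages = web_search(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below, and cite passage "
        "numbers where relevant.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```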

Conclusion

In summary, TigerBot brings together state-of-the-art training techniques, a broad data strategy, and real-world application readiness, underpinned by a commitment to accessibility and safety. As the journey of LLMs toward AGI continues, TigerBot illustrates both the field's potential and its ongoing challenges, reinforcing the need for continued innovation and prudent application development. Its open-source release contributes to the AI community and helps shape the trajectory of future research and development.

References (38)
  1. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245 [cs.CL], 05 2023.
  2. Anthropic. Claude 2. https://www.anthropic.com/index/claude-2, 06 2023.
  3. Semantic parsing on Freebase from question-answer pairs. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
  4. Language models are few-shot learners. arXiv:2005.14165v4 [cs.CL], 05 2020.
  5. Walking down the memory maze: Beyond context limit through interactive reading. arXiv:2310.05029 [cs.CL], 10 2023.
  6. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI], 03 2018.
  7. OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. GitHub repository, 2023.
  8. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv:2304.08177 [cs.CL], 04 2023.
  9. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135 [cs.LG], 05 2022.
  10. Google. SentencePiece. GitHub repository, 2023.
  11. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685 [cs.CL], 06 2021.
  12. Hugging Face. Text Generation Inference. GitHub repository, 2023.
  13. Hugging Face. Transformers. GitHub repository, 2023.
  14. Dense passage retrieval for open-domain question answering. EMNLP 2020, 04 2020.
  15. ChatHaruhi: Reviving anime character in reality via large language model. arXiv:2308.09597 [cs.CL], 2023.
  16. Microsoft. Megatron-DeepSpeed. GitHub repository, 2023.
  17. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv:2104.04473 [cs.CL], 04 2021.
  18. NVIDIA. TensorRT open source software. GitHub repository, 2023.
  19. Training language models to follow instructions with human feedback. arXiv:2203.02155v1 [cs.CL], 03 2022.
  20. O. Peckham. Meta completes Research SuperCluster, announces next-gen datacenter. HPCwire: https://www.hpcwire.com/2023/05/18/meta-completes-research-supercluster-announces-next-gen-datacenter/, 05 2023.
  21. YaRN: Efficient context window extension of large language models. arXiv:2309.00071 [cs.CL], 09 2023.
  22. S. Pichai. An important next step on our AI journey. https://blog.google/technology/ai/bard-google-ai-search-updates/, 02 2023.
  23. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv:2108.12409 [cs.CL], 08 2021.
  24. Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290 [cs.LG], 05 2023.
  25. ZeRO: Memory optimizations toward training trillion parameter models. arXiv:1910.02054 [cs.LG]; also in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), 10 2019.
  26. Know what you don't know: Unanswerable questions for SQuAD. arXiv:1806.03822 [cs.CL], 06 2018.
  27. BLOOM: A 176B-parameter open-access multilingual language model. arXiv:2211.05100 [cs.CL], 11 2022.
  28. N. Shazeer. GLU variants improve Transformer. arXiv:2002.05202 [cs.LG], 02 2020.
  29. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864 [cs.CL], 04 2021.
  30. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv:1811.00937 [cs.CL], 11 2018.
  31. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository, 2023.
  32. LLaMA: Open and efficient foundation language models. arXiv:2302.13971 [cs.CL], 02 2023.
  33. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288 [cs.CL], 07 2023.
  34. turboderp. ExLlamaV2. GitHub repository, 2023.
  35. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 06 2017.
  36. Recursively summarizing books with human feedback. arXiv:2109.10862 [cs.CL], 09 2021.
  37. Effective long-context scaling of foundation models. arXiv:2309.16039 [cs.CL], 09 2023.
  38. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv:2206.01861 [cs.CL], 06 2022.
Authors (6)
  1. Ye Chen (52 papers)
  2. Wei Cai (130 papers)
  3. Liangmin Wu (1 paper)
  4. Xiaowei Li (63 papers)
  5. Zhanxuan Xin (1 paper)
  6. Cong Fu (24 papers)
Citations (8)