Juru: Legal Brazilian Large Language Model from Reputable Sources (2403.18140v1)

Published 26 Mar 2024 in cs.CL and cs.AI

Abstract: The high computational cost of pretraining LLMs limits research on them. Two strategies have emerged to address this issue: domain specialization and pretraining on high-quality data. To explore these strategies, we specialized the Sabiá-2 Small model on 1.9 billion unique tokens from reputable Brazilian legal sources and conducted few-shot evaluations on legal and general-knowledge exams. Our model, Juru, demonstrates the benefits of domain specialization with a reduced amount of pretraining data. However, this specialization comes at the cost of degraded performance in other knowledge areas within the same language. This study adds to the growing body of evidence that pretraining data selection can enhance the performance of LLMs, enabling the exploration of these models at lower cost.
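The specialization step described in the abstract is continued pretraining of an existing checkpoint on a domain corpus. Below is a minimal sketch of that idea using the Hugging Face transformers API. It is an assumption-laden illustration, not the paper's actual pipeline: Sabiá-2 Small is not a public checkpoint, so a placeholder base model (gpt2) stands in, the corpus file name (legal_corpus.txt) is illustrative, and the hyperparameters are not taken from the paper.

```python
# Hypothetical sketch of domain-adaptive continued pretraining, in the
# spirit of the paper's specialization step. The base model, corpus
# path, and hyperparameters are all placeholders, not the paper's.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "gpt2"  # placeholder; the paper continues from Sabiá-2 Small

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Illustrative corpus of Brazilian legal text, one document per line.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="juru-sketch",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=1e-5,  # a small LR is typical for continued pretraining
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that, as the abstract reports, this kind of specialization trades general-domain performance for in-domain gains rather than strictly adding capability.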
