
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property (2402.16389v1)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive performance in various NLP tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we develop a new IP-oriented multilingual LLM (called MoZi), a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM, and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE, and ChatGLM by a noticeable margin, while it still scores lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark leaves much room for improvement, and even the most powerful ChatGPT does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.
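
Since IPQuiz is a multiple-choice task, model outputs can be scored by simple accuracy against an answer key. Below is a minimal sketch of such a scorer; the item fields ("id", "answer") and the prediction format are illustrative assumptions, not the schema of the released MoZIP data.

```python
"""Minimal sketch: accuracy scoring for an IPQuiz-style multiple-choice task."""

def score_ipquiz(predictions, items):
    """Return accuracy of `predictions` (item id -> chosen option letter)
    against `items` (dicts carrying the gold answer). The field names
    "id" and "answer" are assumptions for illustration only."""
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if predictions.get(item["id"]) == item["answer"]
    )
    return correct / len(items)


if __name__ == "__main__":
    # Hypothetical items in the assumed format; the real MoZIP data is
    # multilingual and distributed via the linked repository.
    items = [
        {"id": "q1", "options": ["A", "B", "C", "D"], "answer": "B"},
        {"id": "q2", "options": ["A", "B", "C", "D"], "answer": "D"},
    ]
    predictions = {"q1": "B", "q2": "A"}  # e.g., parsed from model outputs
    print(f"IPQuiz accuracy: {score_ipquiz(predictions, items):.2f}")  # 0.50
```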

Authors (12)
  1. Shiwen Ni (34 papers)
  2. Minghuan Tan (15 papers)
  3. Yuelin Bai (13 papers)
  4. Fuqiang Niu (9 papers)
  5. Min Yang (239 papers)
  6. Bowen Zhang (161 papers)
  7. Ruifeng Xu (66 papers)
  8. Xiaojun Chen (100 papers)
  9. Chengming Li (28 papers)
  10. Xiping Hu (46 papers)
  11. Ye Li (155 papers)
  12. Jianping Fan (51 papers)
Citations (6)
