Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset (2405.10542v1)

Published 17 May 2024 in cs.CL and cs.AI

Abstract: In light of recent breakthroughs in LLMs that have revolutionized NLP, there is an urgent need for new benchmarks that keep pace with their rapid development. In this paper, we propose CFLUE, the Chinese Financial Language Understanding Evaluation benchmark, designed to assess the capability of LLMs across multiple dimensions. Specifically, CFLUE provides datasets tailored for both knowledge assessment and application assessment. For knowledge assessment, it comprises 38K+ multiple-choice questions with associated solution explanations; these questions serve dual purposes: answer prediction and question reasoning. For application assessment, CFLUE features 16K+ test instances spanning distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation. Using CFLUE, we conduct a thorough evaluation of representative LLMs. The results reveal that only GPT-4 and GPT-4-turbo achieve an accuracy exceeding 60% in answer prediction for knowledge assessment, suggesting that there is still substantial room for improvement in current LLMs. In application assessment, although GPT-4 and GPT-4-turbo are the top two performers, their considerable advantage over lightweight LLMs is noticeably diminished. The datasets and scripts associated with CFLUE are openly available at https://github.com/aliyun/cflue.
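To make the knowledge-assessment protocol concrete, below is a minimal sketch of how answer-prediction accuracy (the metric behind the 60% figure above) could be computed over CFLUE-style multiple-choice questions. The JSONL schema, field names, and filename here are assumptions for illustration only; the actual format is defined by the data files and scripts in the CFLUE repository.

```python
import json

def load_mcq(path):
    """Load CFLUE-style multiple-choice questions from a JSON-lines file.

    Assumes each line is a JSON object with "question", "options"
    (a letter-to-text mapping), and "answer" (the gold letter).
    This schema is hypothetical; consult the CFLUE repository for
    the actual data format.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def answer_prediction_accuracy(questions, predict):
    """Fraction of questions whose predicted letter matches the gold answer.

    `predict` is any callable mapping a question dict to a choice letter,
    e.g. a thin wrapper around an LLM API.
    """
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return correct / len(questions)

if __name__ == "__main__":
    questions = load_mcq("cflue_knowledge_test.jsonl")  # hypothetical filename
    always_a = lambda q: "A"  # trivial baseline: always pick option A
    print(f"Accuracy: {answer_prediction_accuracy(questions, always_a):.1%}")
```

The companion question-reasoning task, by contrast, presumably compares a model's generated explanation against the dataset's reference solution explanation, which calls for text-generation metrics rather than the exact-match scoring sketched here.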

Authors (4)
  1. Jie Zhu (127 papers)
  2. Junhui Li (51 papers)
  3. Yalong Wen (2 papers)
  4. Lifan Guo (4 papers)