
Consistency Matters: Explore LLMs Consistency From a Black-Box Perspective (2402.17411v2)

Published 27 Feb 2024 in cs.CL

Abstract: Both commercial and open-source LLMs have become mainstream in NLP. However, there is still a lack of research on LLM consistency: the property that a model's internal parameters and capabilities remain unchanged throughout the various stages of research and deployment. The problem affects both industry and academia, and checking consistency by hand is time-consuming and labor-intensive, with the additional cost of secondary deployment causing economic and time losses. To fill this gap, we build a dataset for the LLM consistency task, design several baselines, and choose models of diverse scales for the main experiments. Specifically, in the LightGBM experiment we use traditional NLG metrics (i.e., ROUGE, BLEU, METEOR) as the features for model training. The resulting model outperforms manual evaluation, GPT-3.5, and the other models in the main experiment, achieving the best performance. Finally, we use the best-performing LightGBM model as the base model for an evaluation tool that can effectively assist in the deployment of business models. Our code and tool demo are available at https://github.com/heavenhellchen/Consistency.git
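The pipeline the abstract describes lends itself to a short sketch: score each pair of model outputs (a reference response recorded before deployment and a candidate response collected afterwards) with ROUGE, BLEU, and METEOR, then train a LightGBM classifier on those scores to predict whether the pair is consistent. The code below is a minimal illustration under stated assumptions, not the authors' exact setup: the feature set, hyperparameters, and function names (pair_features, train_consistency_model) are hypothetical, and it assumes the rouge-score, nltk, and lightgbm packages.

    # Minimal sketch of a black-box consistency classifier: NLG-metric
    # features fed to LightGBM. Feature choices and hyperparameters are
    # illustrative assumptions, not the paper's reported configuration.
    import lightgbm as lgb
    import numpy as np
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
    from nltk.translate.meteor_score import meteor_score  # needs NLTK's wordnet data
    from rouge_score import rouge_scorer

    _rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    _smooth = SmoothingFunction().method1

    def pair_features(reference: str, candidate: str) -> list[float]:
        """NLG-metric feature vector for one (reference, candidate) response pair."""
        ref_tok, cand_tok = reference.split(), candidate.split()
        rouge = _rouge.score(reference, candidate)
        return [
            rouge["rouge1"].fmeasure,
            rouge["rouge2"].fmeasure,
            rouge["rougeL"].fmeasure,
            sentence_bleu([ref_tok], cand_tok, smoothing_function=_smooth),
            meteor_score([ref_tok], cand_tok),
        ]

    def train_consistency_model(pairs, labels):
        """pairs: list of (reference, candidate) strings; labels: 1 = consistent, 0 = not."""
        X = np.array([pair_features(r, c) for r, c in pairs])
        clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
        clf.fit(X, np.array(labels))
        return clf

At inference time, calling predict_proba on a new pair's feature vector yields a consistency score that a deployment check can threshold, which is roughly the role the abstract describes for the evaluation tool.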

References (60)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Qwen technical report. arXiv preprint arXiv:2309.16609.
  3. Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
  4. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  5. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
  6. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
  7. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
  8. ChatGPT's one-year anniversary: Are open-source large language models catching up? arXiv preprint arXiv:2311.16989.
  9. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431.
  10. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
  11. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177.
  12. NeuroX: A toolkit for analyzing individual neurons in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9851–9852.
  13. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
  14. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  15. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  16. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  17. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
  18. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
  19. exBERT: A visual analysis tool to explore learned representations in transformer models. arXiv preprint arXiv:1910.05276.
  20. Who is ChatGPT? Benchmarking LLMs' psychological portrayal using PsychoBench. arXiv preprint arXiv:2310.01386.
  21. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
  22. Human heuristics for AI-generated language are flawed. Proceedings of the National Academy of Sciences, 120(11):e2208839120.
  23. Mistral 7B. arXiv preprint arXiv:2310.06825.
  24. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  25. Estimating the personality of white-box language models. arXiv preprint arXiv:2204.12000.
  26. Bidimensional leaderboards: Generate and evaluate language hand in hand. arXiv preprint arXiv:2112.04139.
  27. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
  28. Donald E. Knuth. 1992. Two notes on notation. The American Mathematical Monthly, 99(5):403–422.
  29. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.
  30. Open models, closed minds? On agents capabilities in mimicking human personalities through open large language models. arXiv preprint arXiv:2401.07115.
  31. Visual analytics for generative transformer models. arXiv preprint arXiv:2311.12418.
  32. Is GPT-3 a psychopath? Evaluating large language models from a psychological perspective. arXiv preprint arXiv:2212.10529.
  33. ConsistTL: Modeling consistency in transfer learning for low-resource neural machine translation. arXiv preprint arXiv:2212.04262.
  34. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  35. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
  36. Unsupervised data augmentation with naive augmentation and without unlabeled data. arXiv preprint arXiv:2010.11966.
  37. Who is GPT-3? An exploration of personality, values and demographics. arXiv preprint arXiv:2209.14338.
  38. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993.
  39. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  40. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  41. jiant: A software toolkit for research on general-purpose text understanding models. arXiv preprint arXiv:2003.02249.
  42. Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  43. Can AI-generated text be reliably detected? arXiv preprint arXiv:2303.11156.
  44. Personality traits in large language models. arXiv preprint arXiv:2307.00184.
  45. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 341–351. Springer.
  46. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363.
  47. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676.
  48. LMdiff: A visual diff tool to compare language models. arXiv preprint arXiv:2111.01582.
  49. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  50. The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models. arXiv preprint arXiv:2008.05122.
  51. Evil geniuses: Delving into the safety of LLM-based agents. arXiv preprint arXiv:2311.11855.
  52. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  53. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.
  54. Fengshenbang 1.0: Being the foundation of Chinese cognitive intelligence. CoRR, abs/2209.02970.
  55. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
  56. Wang Yuxin, Sun Qingxuan, and He Sicheng. 2023. M3E: Moka massive mixed embedding model.
  57. R-Drop: Regularized dropout for neural networks. Advances in Neural Information Processing Systems, 34:10890–10905.
  58. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33:6256–6268.
  59. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
  60. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
