Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models (2402.19465v2)
Abstract: Ensuring the trustworthiness of LLMs is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve their trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding}, who show that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a two-phase phenomenon of fitting and compression~\citep{shwartz2017opening} in this setting. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
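To make the probing and steering steps concrete, here is a minimal Python sketch of the two techniques the abstract describes: a linear probe fit on hidden states extracted from a pre-training checkpoint, and a mean-difference steering vector built from contrastive activations. The function names, the logistic-regression probe, and the mean-difference construction are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def linear_probe_accuracy(hidden_states, labels):
    """Fit a linear probe on checkpoint activations and report held-out accuracy.

    hidden_states: (n_examples, hidden_dim) activations from one layer of a
    pre-training checkpoint; labels: binary concept labels for one
    trustworthiness dimension (e.g. toxic vs. non-toxic).
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


def mean_difference_steering_vector(pos_states, neg_states):
    """One common steering-vector construction: the difference between the
    mean activations of the two concept classes."""
    return np.asarray(pos_states).mean(axis=0) - np.asarray(neg_states).mean(axis=0)


def steer(hidden_state, steering_vector, alpha=1.0):
    """Add the (scaled) steering vector to a layer's activation at inference time."""
    return hidden_state + alpha * steering_vector
```

A high held-out probe accuracy is the signal the abstract refers to: the checkpoint's representations already separate the two concept classes linearly, and the same contrastive activations can be reused to build a steering vector that is added to the model's hidden states at inference time.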
- Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328.
- Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
- Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
- A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
- Understanding probe behaviors through variational bounds of mutual information. arXiv preprint arXiv:2312.10019.
- European Commission. 2021b. Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, Pub. L. No. COM(2021) 206 final.
- Ethics guidelines for trustworthy AI. Publications Office.
- Thomas M Cover. 1999. Elements of information theory. John Wiley & Sons.
- AI Verify Foundation. 2023. Catalogue of llm evaluations.
- A framework for few-shot language model evaluation.
- Bernhard C Geiger. 2021. On information plane analyses of neural network classifiers–a review. IEEE Transactions on Neural Networks and Learning Systems.
- Ziv Goldfeld and Yury Polyanskiy. 2020. The information bottleneck problem and its applications in machine learning. IEEE Journal on Selected Areas in Information Theory, 1(1):19–38.
- Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory, pages 63–77.
- Wes Gurnee and Max Tegmark. 2023. Language models represent space and time. arXiv preprint arXiv:2310.02207.
- Investigating learning dynamics of BERT fine-tuning. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 87–92. Association for Computational Linguistics.
- Toxigen: A large-scale machine-generated dataset for implicit and adversarial hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
- Jamie Hayes. 2020. Trade-offs between membership privacy & adversarially robust learning. arXiv preprint arXiv:2006.04622.
- John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
- An empirical study of metrics to measure representational harms in pre-trained language models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 121–134.
- Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Improving activation steering in language models with mean-centring. arXiv preprint arXiv:2312.03813.
- Estimating mutual information. Physical review E, 69(6):066138.
- Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737.
- Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Tuning language models by proxy.
- On the impact of hard adversarial instances on overfitting in adversarial training. arXiv preprint arXiv:2112.07324.
- Towards the difficulty for a deep neural network to learn concepts of different complexities. In Thirty-seventh Conference on Neural Information Processing Systems.
- Trustworthy ai: A computational perspective. ACM Transactions on Intelligent Systems and Technology, pages 1–59.
- In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668.
- Trustworthy llms: A survey and guideline for evaluating large language models’ alignment.
- RoBERTa: A robustly optimized BERT pretraining approach.
- Llm360: Towards fully transparent open-source llms. arXiv preprint arXiv:2312.06550.
- Information bottleneck: Exact analysis of (quantized) neural networks. arXiv preprint arXiv:2106.12912.
- The hsic bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI conference on artificial intelligence, pages 5085–5092.
- Karttikeya Mangalam and Vinay Uday Prabhu. 2019. Do deep neural networks learn shallow learnable examples first?
- Differential privacy has bounded impact on fairness in classification. In International Conference on Machine Learning, pages 23681–23705.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
- What happens to bert embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 33–44.
- Can llms keep a secret? Testing privacy implications of language models via contextual integrity theory.
- An emulator for fine-tuning large language models using small language models.
- On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2502–2516.
- StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371.
- Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941.
- Jessica Newman. 2023. A taxonomy of trustworthiness for artificial intelligence: Connecting properties of trustworthiness with risk management and the ai lifecycle.
- Scalable mutual information estimation using dependence graphs. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2962–2966. IEEE.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534.
- The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
- Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622.
- On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180.
- Language models are unsupervised multitask learners.
- Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 464–483. IEEE.
- Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3363–3377.
- Identifying semantic induction heads to understand in-context learning. arXiv preprint arXiv:2402.13055.
- Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681.
- On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020.
- Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
- The curious case of hallucinatory unanswerability: Finding truths in the hidden states of over-confident large language models. arXiv preprint arXiv:2310.11877.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Daniel J Solove. 2005. A taxonomy of privacy. U. Pa. L. Rev., 154:477.
- Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561.
- Elham Tabassi. 2023. Artificial intelligence risk management framework (ai rmf 1.0).
- What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
- Joma: Demystifying multilayer transformers via joint dynamics of mlp and attention. arXiv preprint arXiv:2310.00535.
- Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.
- How does bert answer questions? A layer-wise analysis of transformer representations. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1823–1832.
- Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Adversarial glue: A multi-task benchmark for robustness evaluation of language models. In Advances in Neural Information Processing Systems.
- Haoran Wang and Kai Shu. 2023. Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433.
- On the robustness of chatgpt: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095.
- Inferaligner: Inference-time alignment for harmlessness through cross-model guidance. arXiv preprint arXiv:2401.11206.
- To be robust or to be fair: Towards fairness in adversarial training. In International conference on machine learning, pages 11492–11501.
- Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523.
- Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology.
- Yichu Zhou and Vivek Srikumar. 2022. A closer look at how fine-tuning changes bert. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1046–1061.
- On strengthening and defending graph reconstruction attack with markov chain approximation. In International Conference on Machine Learning.
- Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.
Authors: Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao