
A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift (2311.14743v7)

Published 21 Nov 2023 in cs.CL and cs.LG

Abstract: Foundation models, specifically LLMs, have lately gained widespread attention and adoption. Reinforcement Learning from Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align LLMs. These reward models are additionally used at inference time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance, measured via accuracy and calibration (i.e., alignment between accuracy and confidence), is affected by distribution shift. We show novel calibration patterns and accuracy drops due to OOD prompts and responses, and that the reward model is more sensitive to shifts in responses than to shifts in prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting to detect these distribution shifts in prompts and responses.
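To make the evaluation concrete, below is a minimal sketch (not the authors' code) of the two quantities the abstract measures for a pairwise reward model: accuracy on (prompt, chosen, rejected) preference triples, and expected calibration error (ECE) over the model's Bradley-Terry confidence. The `score` function is a hypothetical stand-in for any scalar reward model, and the MSP-style OOD score at the end is one plausible adaptation of a standard classification baseline, assumed here for illustration rather than taken from the paper.

    import numpy as np

    def pairwise_eval(score, triples, n_bins=10):
        # score(prompt, response) -> scalar reward
        # triples: list of (prompt, chosen, rejected) human preference labels
        margins = np.array([score(p, c) - score(p, r) for p, c, r in triples])
        conf = 1.0 / (1.0 + np.exp(-np.abs(margins)))  # Bradley-Terry confidence in the predicted winner (>= 0.5)
        correct = margins > 0                          # prediction agrees with the human label
        accuracy = correct.mean()

        # ECE: bin by confidence, weight each bin's |accuracy - confidence| gap by its size.
        edges = np.linspace(0.5, 1.0, n_bins + 1)
        ece = 0.0
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            mask = (conf >= lo) & (conf < hi) if i < n_bins - 1 else (conf >= lo)
            if mask.any():
                ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
        return accuracy, ece

    def msp_style_ood_score(score, prompt, resp_a, resp_b):
        # Hypothetical adaptation of the maximum-softmax-probability OOD baseline:
        # treat the Bradley-Terry probability over the two responses as a two-class
        # softmax and take its maximum; low values flag the pair as potentially OOD.
        p = 1.0 / (1.0 + np.exp(-(score(prompt, resp_a) - score(prompt, resp_b))))
        return max(p, 1.0 - p)

Under this framing, a well-calibrated reward model's per-bin accuracy tracks its confidence, and distribution shift shows up as a growing gap between the two alongside lower OOD scores.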

Authors (4)
  1. Will LeVine (7 papers)
  2. Sean Hendryx (12 papers)
  3. Benjamin Pikus (3 papers)
  4. Anthony Chen (22 papers)
Citations (4)