Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective (2312.01957v3)

Published 4 Dec 2023 in cs.CL and cs.LG

Abstract: This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Requiring only synthetic data, dSC is evaluated in experiments on safety, sentiment, and privacy control, showing it can be a viable and cheap alternative for aligning LLMs. Code released at https://github.com/vicgalle/distilled-self-critique.
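A minimal sketch of the pipeline the abstract describes: alternate between sampling a response and a critique of it (a Gibbs sampler over response-critique pairs), then keep the refined responses as synthetic data for fine-tuning. This assumes a generic `llm(prompt) -> str` completion function; the prompt wording and helper names (`gibbs_self_critique`, `build_distillation_data`) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of distilled Self-Critique (dSC), under the assumptions above.
# NOTE: `llm` is an assumed generic completion function, not a real API.

from typing import Callable, List, Tuple

def gibbs_self_critique(
    llm: Callable[[str], str],
    task_prompt: str,
    num_steps: int = 3,
) -> List[Tuple[str, str]]:
    """Run the self-critique chain for `num_steps` Gibbs sweeps,
    returning the (response, critique) pair from each sweep."""
    trace: List[Tuple[str, str]] = []
    response = llm(task_prompt)  # initial draw of the response variable
    for _ in range(num_steps):
        # Sample a critique conditioned on the current response.
        critique = llm(
            f"Task: {task_prompt}\nResponse: {response}\n"
            "Critique this response for safety and quality:"
        )
        # Re-sample the response conditioned on the critique.
        response = llm(
            f"Task: {task_prompt}\nPrevious response: {response}\n"
            f"Critique: {critique}\nRevise the response to address the critique:"
        )
        trace.append((response, critique))
    return trace

def build_distillation_data(
    prompts: List[str], llm: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Pair each prompt with its final refined response; fine-tuning on
    these pairs distills the sampler into a single forward pass."""
    return [(p, gibbs_self_critique(llm, p)[-1][0]) for p in prompts]
```

Fine-tuning the base model on the resulting (prompt, refined response) pairs, e.g. with a parameter-efficient method such as LoRA, would then yield the distilled model that no longer needs the iterative critique loop at inference time.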

