Unresolved Issues: Summarizing Contradictory Knowledge and Robust Safety Alignment
Develop techniques that enable large language models trained via probabilistic modeling to summarize contradictory training knowledge into coherent outputs and to maintain robust safety alignment that resists jailbreak attacks.
References
However, some issues caused by probabilistic modeling remain unsolved. At a minimum, the model is still unable to summarize contradictory knowledge in the training data as intended, and its safety alignment is not robust enough: alignment strategies can be bypassed by methods such as jailbreaking.
— Open Problems and a Hypothetical Path Forward in LLM Knowledge Paradigms
(2504.06823 - Ye et al., 9 Apr 2025) in Section 3.3 (Internal Knowledge Conflicts in LLMs)