Assessing risk severity in unconditioned generation by concept-trained LLMs
Assess the severity of unwanted or harmful behaviors, such as hallucinations, when large language models trained with the concept-level objective described in this paper are used for unconditioned text generation, a setting the study did not evaluate.
References
As with next-token-prediction (NTP) models, concept-trained models may behave in an unwanted or harmful manner, such as producing hallucinations. In this work, we did not explore using our concept-trained models for unconditioned generation, and thus the severity of these risks is unknown for our models.
— Concept Training for Human-Aligned Language Models
(2603.29123 - Zhang et al., 31 Mar 2026) in Ethical Considerations