Do LLMs commonly produce unpredictable outputs that pose substantive threats to human safety

Determine whether large language models commonly produce unpredictable outputs that could pose substantive threats to human safety, specifically outputs that imply or promote direct harm to human survival beyond restating known facts from existing human knowledge.

Background

The paper argues that most jailbreak studies elicit unsafe responses that restate known facts (e.g., instructions on making a bomb) and therefore rarely pose direct real-world threats. The authors highlight a gap concerning whether LLMs can produce unpredictable, novel content that would imply or promote direct harm to human survival—an existential risk.

To investigate this uncertainty, the authors introduce ExistBench, a benchmark that uses prefix completion to bypass safety constraints and assess whether LLM-generated suffixes express hostility toward humans or actions with severe threat (e.g., executing a nuclear strike). They also examine tool-calling behavior to assess real-world risks. The open question motivates the creation of this benchmark and the subsequent analyses.

References

However, it remains unclear whether LLMs commonly produce unpredictable outputs that could pose substantive threats to human safety.