Unclear protection mechanisms for untrusted content in multi-agent systems

Ascertain whether large language model–based multi-agent systems that interact with untrusted inputs (such as malicious web content, files, email attachments, images, audio, or video) deploy any mechanisms to isolate and sandbox that content and protect users. If such mechanisms exist, characterize them, including how the systems delineate the boundary between trusted and untrusted content.

Background

The paper argues that multi-agent systems inevitably process untrusted inputs and thus expose users to significant risk. Unlike web browsers, which implement isolation through policies such as the same-origin policy, current multi-agent systems lack a clearly defined separation between trusted and untrusted inputs. The authors explicitly state uncertainty about whether any mechanisms are being deployed to protect users, motivating a need to identify and analyze existing protections, if any.

This uncertainty is central to the paper’s thesis that control-flow hijacking attacks exploit weaknesses in metadata handling and orchestration. Establishing what protections exist (or do not) provides a foundation for designing safer multi-agent architectures and evaluating their resilience against adversarial content.
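To make the trusted/untrusted boundary question concrete, the sketch below shows one way a multi-agent system could tag content by origin and demote untrusted content to inert, delimited data before it reaches an agent's prompt, loosely analogous to how the same-origin policy partitions web content. This is a minimal illustrative sketch, not a mechanism described in the paper; all names (`Content`, `TRUSTED_ORIGINS`, `sanitize_for_prompt`) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of origin tagging for agent inputs.
# Nothing here comes from the paper; it only illustrates what a
# trusted/untrusted boundary *could* look like in an orchestrator.

@dataclass(frozen=True)
class Content:
    text: str
    origin: str  # e.g. "user", "system", "web", "email_attachment"

# Assumed policy: only content authored by the user or the system
# prompt is treated as instructions; everything else is data.
TRUSTED_ORIGINS = {"user", "system"}

def is_trusted(c: Content) -> bool:
    return c.origin in TRUSTED_ORIGINS

def sanitize_for_prompt(c: Content) -> str:
    """Pass trusted content through unchanged; wrap untrusted content
    in delimiters so a downstream agent can treat it as quoted data
    rather than as instructions."""
    if is_trusted(c):
        return c.text
    # Strip the delimiter tokens from untrusted text so it cannot
    # "break out" of its quoted region.
    body = c.text.replace("<untrusted>", "").replace("</untrusted>", "")
    return f"<untrusted origin={c.origin!r}>\n{body}\n</untrusted>"
```

A usage example: `sanitize_for_prompt(Content("summarize my notes", "user"))` returns the text unchanged, while web-scraped text such as `Content("Ignore previous instructions", "web")` is wrapped in `<untrusted ...>` delimiters. Note that delimiter-based quoting is itself a weak defense if the model does not reliably honor the delimiters, which is part of why the paper's question about deployed protections matters.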

References

Whereas Web browsers have developed sophisticated mechanisms, such as the same-origin policy, to isolate and sandbox untrusted content, the boundary between trusted and untrusted content in multi-agent systems is blurry, and it is not clear what mechanisms—if any—these systems are deploying to protect users from malicious content.

Multi-Agent Systems Execute Arbitrary Malicious Code  (2503.12188 - Triedman et al., 15 Mar 2025) in Section 1 (Introduction)