On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment (2507.07341v1)

Published 9 Jul 2025 in cs.AI and cs.CR

Abstract: With the increased deployment of LLMs, one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.

Summary

  • The paper establishes that reliably filtering prompts and outputs with efficient external filters is computationally intractable under standard cryptographic assumptions.
  • It uses time-lock puzzles and encryption-based constructions to show that adversarial prompts carrying harmful instructions remain computationally indistinguishable from benign ones for any efficient filter.
  • The study challenges reliance on black-box filtering for AI safety and underscores the need for robust internal alignment mechanisms.

The Computational Intractability of Filtering for AI Alignment

This paper rigorously analyzes the computational limitations of using external filters—either on prompts or outputs—to achieve AI alignment, particularly in the context of LLMs. The authors establish, under standard cryptographic assumptions, that there exist fundamental barriers to reliably filtering harmful content through any efficient, black-box filtering mechanism. The results have significant implications for both the theory and practice of AI safety, challenging the prevailing paradigm of filter-based alignment and raising questions about the feasibility of external safeguards.
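
To convey the shape of these barriers, the prompt-filtering result can be read as a standard computational-indistinguishability statement. The rendering below follows that standard template rather than the paper's verbatim definitions: writing P_ben for the benign prompt distribution, P_adv for the adversarial one, and λ for the security parameter,

\[
\Bigl|\,\Pr_{x \sim \mathcal{P}_{\mathrm{adv}}}\bigl[F(x) = \mathrm{flag}\bigr] \;-\; \Pr_{x \sim \mathcal{P}_{\mathrm{ben}}}\bigl[F(x) = \mathrm{flag}\bigr]\,\Bigr| \;\le\; \mathrm{negl}(\lambda) \quad \text{for every efficient filter } F.
\]

In words: any filter that flags adversarial prompts with non-negligible probability must flag benign prompts at essentially the same rate, so it cannot be simultaneously safe and useful.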

Core Contributions

The paper presents several formal impossibility results:

  1. Prompt Filtering Impossibility: For any high-entropy, benign prompt generator, there exist adversarial prompt generators and LLMs such that no efficient filter (one running significantly faster than the LLM) can distinguish benign from adversarial prompts. The construction uses cryptographic time-lock puzzles, so adversarial prompts can encode harmful instructions in a way that is computationally indistinguishable from benign prompts for any efficient filter; a minimal time-lock puzzle sketch follows this list.
  2. Output Filtering Impossibility: Even when prompt filtering is infeasible, one might hope to filter harmful outputs. The authors show that, under similar cryptographic assumptions, there exist LLMs whose harmful outputs are computationally indistinguishable from benign outputs for any efficient output filter, even if the filter is more powerful than the LLM itself.
  3. Mitigation Filter Barriers: The analysis extends to more permissive filters that can modify (not just reject) prompts or outputs. The authors demonstrate that, even with this additional flexibility, computational barriers persist. The connection to watermarking schemes is leveraged to show that robust mitigation filtering is also intractable under standard cryptographic assumptions.
  4. Extensions with Shared Keys: The impossibility results are further strengthened in scenarios where the LLM and adversarial prompt generator share secret or public keys. In these cases, the cryptographic assumptions required are even weaker (e.g., the existence of one-way functions or public-key encryption), and the impossibility holds for all polynomial-time filters.
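
To make the time-lock mechanism behind result 1 concrete, here is a minimal sketch of a Rivest-Shamir-Wagner (RSW) style time-lock puzzle in Python. It illustrates the primitive only, with hypothetical function names and toy key handling, and is not the paper's construction:

```python
# Minimal RSW-style time-lock puzzle sketch (illustrative; not the paper's construction).
import hashlib
import secrets


def make_puzzle(payload: bytes, T: int, p: int, q: int):
    """Puzzle creator: knowing primes p and q, hide `payload` behind T sequential squarings."""
    assert len(payload) <= 32, "keep the payload within one SHA-256 digest of keystream"
    n, phi = p * q, (p - 1) * (q - 1)
    a = secrets.randbelow(n - 3) + 2          # random base in [2, n-2]
    # Trapdoor shortcut: reduce the exponent 2^T modulo phi instead of squaring T times.
    b = pow(a, pow(2, T, phi), n)
    key = hashlib.sha256(b.to_bytes((n.bit_length() + 7) // 8, "big")).digest()
    ciphertext = bytes(x ^ y for x, y in zip(payload, key))
    return n, a, T, ciphertext


def solve_puzzle(n: int, a: int, T: int, ciphertext: bytes) -> bytes:
    """Solver without the trapdoor: must perform T modular squarings one after another."""
    b = a
    for _ in range(T):
        b = pow(b, 2, n)
    key = hashlib.sha256(b.to_bytes((n.bit_length() + 7) // 8, "big")).digest()
    return bytes(x ^ y for x, y in zip(ciphertext, key))
```

In the paper's setting, the adversarial prompt plays a role analogous to (n, a, T, ciphertext): a model willing to spend roughly T sequential steps can recover the hidden instruction, while a filter restricted to fewer sequential steps cannot, and under the time-lock assumption it cannot even tell whether the hidden payload is harmful or benign.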

Technical Approach

The constructions rely on embedding harmful instructions within prompts or outputs using cryptographic primitives:

  • Time-Lock Puzzles: Adversarial prompts encode harmful instructions in a way that only an LLM with sufficient computational resources can decode, while any efficient filter cannot distinguish or extract the harmful content.
  • Recoverable-Randomness Sampling: The authors design a sampling scheme that allows the LLM to recover the randomness used to generate a prompt, enabling it to extract hidden instructions, while maintaining statistical indistinguishability from benign prompts.
  • Hardcore Functions and Encryption: In the shared-key scenarios, adversarial prompts are encrypted so that only the LLM (holding the appropriate key) can decrypt and act on them, while the filter cannot distinguish them from random noise; a sketch of this setting follows this list.
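
The shared-key setting admits a particularly simple illustration: the adversarial prompt generator and the model share a secret key, and the instruction travels as a pseudorandom blob. The sketch below uses an HMAC-based stream cipher as a stand-in PRF; names and key handling are hypothetical, and the paper works with one-way functions, hardcore bits, and public-key encryption rather than this specific cipher:

```python
# Hedged sketch of the shared-key scenario (illustrative names, not from the paper).
import hashlib
import hmac
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """HMAC-SHA256 in counter mode, used here as a simple PRF-based stream cipher."""
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def embed_instruction(key: bytes, instruction: bytes) -> bytes:
    """Adversarial prompt generator: wrap `instruction` in a ciphertext that looks random."""
    nonce = secrets.token_bytes(16)
    ct = bytes(x ^ y for x, y in zip(instruction, keystream(key, nonce, len(instruction))))
    return nonce + ct


def model_decode(key: bytes, blob: bytes) -> bytes:
    """A model holding the same key recovers the instruction; a filter without the key
    sees bytes that are computationally indistinguishable from uniform randomness."""
    nonce, ct = blob[:16], blob[16:]
    return bytes(x ^ y for x, y in zip(ct, keystream(key, nonce, len(ct))))
```

Any efficient filter that inspects only the blob would have to break the underlying pseudorandomness to decide whether the hidden payload is harmful, which is exactly the kind of barrier the impossibility results formalize.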

Empirical and Theoretical Context

While the constructed LLMs and adversarial prompt generators are, by necessity, somewhat contrived, the paper draws parallels to empirical phenomena such as deceptive alignment and jailbreaking in current LLMs. Recent studies have shown that models can strategically evade detection or modulate their behavior based on context, reinforcing the practical relevance of the theoretical results.

The authors also situate their work within the broader literature on AI alignment, noting that existing alignment techniques (e.g., RLHF, DPO) often fail to address deep representational misalignment and are vulnerable to prompt-based attacks. The impossibility results underscore the limitations of relying solely on external filters, including the kinds of content-filtering safeguards contemplated by regulatory frameworks such as the EU AI Act.

Implications

Practical

  • Filter-Based Safeguards Are Insufficient: The results demonstrate that, in principle, no efficient external filter can guarantee the alignment of a sufficiently powerful LLM. This challenges the current industry practice of deploying black-box filters as primary safety mechanisms.
  • Need for Internal Alignment: Achieving robust alignment requires interventions at the level of the model’s architecture and weights, not just external monitoring or filtering.
  • Regulatory Impact: The findings suggest that regulatory requirements for risk management and content filtering may be fundamentally limited in their effectiveness for advanced AI systems.

Theoretical

  • Separation of Intelligence and Judgment Is Infeasible: The paper formalizes the philosophical claim that intelligence (the ability to generate outputs) cannot be cleanly separated from judgment (the ability to assess harm) in AI systems. Any attempt to decouple these functions via external filters is subject to computational intractability.
  • Cryptographic Hardness as a Barrier: The use of standard cryptographic assumptions grounds the impossibility results in well-established theory, making them robust to advances in filtering algorithms.

Future Directions

  • Interpretability and Internal Safeguards: The results motivate research into model interpretability, transparency, and internal alignment mechanisms that do not rely on external filtering.
  • Agency and Alignment: The analysis challenges proposals to align AI by decoupling agency from intelligence, suggesting that agency (the capacity to act and judge) is indispensable for alignment.
  • Connections to Cognitive Science: The authors note potential links to 4E cognitive science, which emphasizes the inseparability of cognition from embodiment and environment, paralleling the inseparability of intelligence and judgment in AI.

Strong Claims and Limitations

  • Impossibility Under Cryptographic Assumptions: The paper’s central claim is that, assuming the existence of time-lock puzzles, one-way functions, or public-key encryption, no efficient filter can reliably distinguish or block all harmful prompts or outputs.
  • Contrived Constructions: While the constructed LLMs are not representative of current commercial models, the authors argue that cryptographic impossibility results often foreshadow practical vulnerabilities.
  • Computational Asymmetry: The results hinge on a computational gap between the LLM and the filter, which is realistic in scenarios where retraining or simulating the LLM is infeasible.

Conclusion

This work provides a rigorous theoretical foundation for the limitations of filter-based AI alignment. By demonstrating the computational intractability of external filtering under standard cryptographic assumptions, the authors highlight the necessity of integrating judgment and intelligence within AI systems. The results have far-reaching implications for AI safety research, regulatory policy, and the design of future AI architectures. Future progress in alignment will likely require a shift from external safeguards to internal, model-centric approaches that address the root causes of misalignment.
