Query Circuits in Language Models

Updated 2 October 2025
  • Query circuits are sparse subgraphs that capture the specific flow of information from a prompt to output in large language models.
  • They leverage edge scoring methods and efficient sampling techniques to isolate critical connections, recovering high performance with fewer than 5% of model edges.
  • Empirical benchmarks such as MMLU and ARC show that query circuits can recover 50-80% of model output quality, enhancing interpretability and debugging.

A query circuit is a precise, input-specific computational subgraph within an LLM that describes how information flows from input to output for a particular prompt. Unlike capability circuits, which capture the global implementation of an ability over many examples, query circuits trace the exact, minimal set of model connections responsible for a given output on a specific input. This local perspective enables more faithful, scalable explanations of model decision-making on individual prompts (Wu et al., 29 Sep 2025).

1. Definition and Rationale for Query Circuits

A query circuit is the unique, sparse sub-network of an LLM that transmits the relevant information from a single input prompt to its generated answer. Concretely, the model is treated as a directed acyclic graph (DAG) with nodes for MLP units or attention heads and edges for the computational connections between them. For a given prompt, edge importance scores are computed, and the top-N most critical edges define the query circuit for that prompt.
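
To make the graph view concrete, below is a minimal sketch of representing scored edges between components and keeping the top-N for one prompt. It is illustrative only: the `Edge` class, component names, and scores are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str      # upstream component, e.g. "attn_head_3.5" or "mlp_7"
    dst: str      # downstream component
    score: float  # importance of this edge for the current prompt

def query_circuit(edges: list[Edge], top_n: int) -> list[Edge]:
    """Return the top-N highest-scoring edges: the query circuit for this prompt."""
    return sorted(edges, key=lambda e: abs(e.score), reverse=True)[:top_n]

# Toy example: three scored edges, keep the two most important.
edges = [
    Edge("attn_head_0.1", "mlp_3", 0.92),
    Edge("mlp_3", "attn_head_5.2", 0.15),
    Edge("attn_head_5.2", "logits", 0.78),
]
print(query_circuit(edges, top_n=2))  # keeps the 0.92 and 0.78 edges
```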

This input-level local focus distinguishes query circuits from global “capability circuits,” which aggregate over many prompts to understand general behaviors such as indirect object identification or arithmetic skills. Query circuits answer the fundamental attribution question: "Why did the model produce this output for this specific prompt, and through which paths?"

2. Edge Importance and Circuit Identification Methodology

Identification of query circuits leverages a combination of edge scoring, subgraph selection, and rigorous faithfulness evaluation. The process works as follows:

  • For a prompt $q$, define the model $M$ as a DAG: every node is a model component (MLP neuron or attention head), and every edge represents a computational connection between components.
  • Assign each edge $e$ an importance score using techniques such as integrated gradients (EAP-IG). The original definition of an edge's indirect effect is:

$$a_e = \frac{1}{|D|} \sum_{q \in D} \Big[ L\big(M(q \mid do(e \leftarrow e'))\big) - L\big(M(q)\big) \Big]$$

where $L$ is a scalar performance metric (such as a logit difference), $do(\cdot)$ denotes edge ablation/patching, $e'$ is the ablated value, and $D$ is the set of paraphrased or related prompts.

  • Use an efficient, sampling-based approach (Best-of-N, or BoN, and its variants) to combine edge-score matrices from paraphrased prompts. Circuits are constructed by greedily selecting the highest-scoring edges under a given sparsity constraint; a minimal code sketch of this selection-and-evaluation step appears after this list.
  • Evaluate faithfulness via Normalized Deviation Faithfulness (NDF):

$$\mathrm{NDF}(\mathcal{C}_q) = 1 - \min\left( \left| \frac{L(M(q)) - L(\mathcal{C}_q(q))}{L(M(q)) - L(M(q'))} \right|,\ 1 \right)$$

Here, $q'$ is a corrupted version of $q$ lacking the critical cues. NDF is symmetric, always lies in $[0,1]$, and quantifies how well the recovered circuit's performance aligns with the full model's for that specific prompt.
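
A minimal sketch of the selection and evaluation steps follows, assuming edge scores are stored in a dense matrix and that the scalar metric $L$ has already been measured for the full model, the circuit, and the corrupted prompt. The function names and the boolean-mask representation are assumptions for illustration, not the paper's code.

```python
import numpy as np

def select_circuit(edge_scores: np.ndarray, sparsity: float) -> np.ndarray:
    """Greedily keep the highest-scoring edges.

    edge_scores: importance scores, one entry per (src, dst) edge.
    sparsity:    fraction of edges to keep (e.g. 0.013 for 1.3%).
    Returns a boolean mask; True marks an edge included in the query circuit.
    """
    n_keep = max(1, int(sparsity * edge_scores.size))
    threshold = np.sort(np.abs(edge_scores).ravel())[-n_keep]
    return np.abs(edge_scores) >= threshold

def ndf(L_model_q: float, L_circuit_q: float, L_model_q_corrupt: float) -> float:
    """Normalized Deviation Faithfulness for a single prompt q.

    L_model_q:         metric of the full model on the clean prompt q
    L_circuit_q:       metric when computation is restricted to the circuit
    L_model_q_corrupt: metric of the full model on the corrupted prompt q'
    """
    signal = L_model_q - L_model_q_corrupt
    if signal == 0:
        return 0.0  # degenerate prompt: no signal to recover
    deviation = abs((L_model_q - L_circuit_q) / signal)
    return 1.0 - min(deviation, 1.0)

# Toy usage: keep 25% of a 4x4 edge-score matrix, then score faithfulness.
scores = np.array([[0.0, 0.9, 0.1, 0.0],
                   [0.0, 0.0, 0.7, 0.2],
                   [0.0, 0.0, 0.0, 0.8],
                   [0.0, 0.0, 0.0, 0.0]])
mask = select_circuit(scores, sparsity=0.25)
print(int(mask.sum()), "edges kept")                                # 4 of 16
print(ndf(L_model_q=3.0, L_circuit_q=2.2, L_model_q_corrupt=-1.0))  # 0.8
```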

This framework enables circuit discovery that scales beyond toy or hand-labeled benchmarks, and is not reliant on surrogate approximations or sparse autoencoders.

3. Empirical Findings in Benchmark Tasks

Across various benchmarks—Indirect Object Identification (IOI), arithmetic (addition and multiplication), Massive Multitask Language Understanding (MMLU), and the ARC Challenge—the following empirical results emerge:

| Benchmark | % of Edges in Query Circuit | % Performance Recovery |
| --- | --- | --- |
| IOI | ≪ 5% | ≈ 80% |
| MMLU | 1.3% | ≈ 60% |
| ARC | ~2.3% | ≈ 50% |

Such results confirm that extremely sparse query circuits can account for much of the local decision process. For example, on MMLU, a circuit constructed from only 1.3% of model edges recovers 60% of the model’s specific logit advantage for a prompt.
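
For illustration with hypothetical numbers (not values from the paper): if the full model's logit difference on a prompt is $L(M(q)) = 4$, the corrupted prompt gives $L(M(q')) = -1$, and the circuit alone gives $L(\mathcal{C}_q(q)) = 2$, then $\mathrm{NDF} = 1 - \left|\frac{4 - 2}{4 - (-1)}\right| = 0.6$, i.e. the circuit recovers 60% of the model's prompt-specific signal.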

Sampling multiple paraphrases (BoN, iBoN, BoN-CSM) robustly outperforms single-query or simple averaging methods, indicating that combinatorial interactions among edge effects are substantial and that diverse sampling mitigates the resulting noise.
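
One plausible reading of the Best-of-N idea is sketched below: build a candidate circuit from each paraphrase's edge-score matrix and keep whichever candidate is most faithful on the original prompt. This mechanism is an assumption for illustration; the iBoN and BoN-CSM variants combine scores differently and are not reproduced here.

```python
def best_of_n(score_matrices, build_circuit, evaluate_ndf):
    """Best-of-N over paraphrase-derived edge scores (illustrative only).

    score_matrices: one edge-score matrix per paraphrased prompt.
    build_circuit:  callable mapping a score matrix to a candidate circuit
                    (e.g. the select_circuit sketch in Section 2).
    evaluate_ndf:   callable mapping a candidate circuit to its NDF on the
                    original prompt q.
    Returns the candidate circuit with the highest faithfulness.
    """
    candidates = [build_circuit(scores) for scores in score_matrices]
    return max(candidates, key=evaluate_ndf)
```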

4. Technical Foundations and Faithfulness Guarantees

Critical foundations include:

  • Edge importance is approximated by integrated gradients (a numerical sketch appears after this list):

$$a_e \approx (e - e')^{\top} \left( \frac{1}{m} \sum_{k=1}^{m} \nabla_e M\!\left(z' + \tfrac{k}{m}(z - z')\right) \right)$$

where $z$ is the activation vector with edge $e$ at its clean value and $z'$ is the same vector with $e$ set to its ablated value $e'$.

  • The NDF measure provides a rigorous faithfulness assessment:
    • Boundedness in $[0,1]$ ensures interpretability.
    • The denominator, $L(M(q)) - L(M(q'))$, represents the full model's "signal" for the prompt; the numerator, $L(M(q)) - L(\mathcal{C}_q(q))$, quantifies the drop when computation is restricted to the query circuit. NDF = 1 means perfect faithfulness; NDF = 0 means the circuit fails to recover the relevant model behavior.
    • Unlike the original NFS metric, NDF remains robust under adversarial or degenerate prompts.
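
A numerical sketch of the integrated-gradients approximation above is given below, assuming the rest of the forward pass is frozen inside a `model_metric` callable that maps one edge activation to the scalar metric $L$; the function name and signature are assumptions for illustration, not the paper's implementation.

```python
import torch

def edge_score_ig(model_metric, e: torch.Tensor, e_ablated: torch.Tensor, m: int = 16) -> float:
    """Integrated-gradients approximation of one edge's importance score a_e.

    model_metric: callable taking the edge activation and returning the scalar
                  metric L (e.g. a logit difference), with everything else in
                  the forward pass held fixed.
    e:            clean activation of the edge.
    e_ablated:    ablated / patched activation e'.
    m:            number of interpolation steps between e' and e.
    """
    grad_sum = torch.zeros_like(e)
    for k in range(1, m + 1):
        # Interpolated activation z' + (k/m)(z - z'), as in the formula above.
        z_k = (e_ablated + (k / m) * (e - e_ablated)).detach().requires_grad_(True)
        model_metric(z_k).backward()
        grad_sum += z_k.grad
    avg_grad = grad_sum / m
    return torch.dot((e - e_ablated).flatten(), avg_grad.flatten()).item()

# Toy check with a quadratic "metric": the score approaches L(e) - L(e') = 5.0
# as m grows, a standard sanity check for integrated gradients.
e, e_prime = torch.tensor([1.0, 2.0]), torch.zeros(2)
print(edge_score_ig(lambda z: (z ** 2).sum(), e, e_prime))
```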

The NDF metric is critical for consistently evaluating circuit quality on arbitrary datasets, not only synthetic probes.

5. Computational Efficiency and Interpretability Implications

The extreme sparsity of faithful circuits supports rigorous, scalable local interpretability. Advantages include:

  • Efficiency: Faithful sub-networks are orders of magnitude smaller than the full model, making visualization and detailed mechanistic inspection tractable.
  • Faithfulness: As query circuits are identified within the actual model rather than through external surrogate methods, the explanation reflects true internal computation.
  • Specificity: The edges a circuit does or does not identify for a given prompt directly signal where spurious or harmful features are driving a decision, which supports both model debugging and auditing.

This local circuit-centric view bridges mechanistic interpretability with causal explanations of specific outputs.

6. Applications and Forward-Looking Directions

Practical applications of query circuits include:

  • Debugging model decisions in sensitive domains (e.g., detecting spurious attributions or uncovering bias mechanisms),
  • Explaining model failures on specific queries in safety-critical applications (legal, medical AI),
  • Informing pruning or editing by identifying unnecessary or harmful connections,
  • Groundwork for automated mechanistic interpreters and agents leveraging circuit-level provenance.

Future research challenges involve:

  • Extension from single-token outputs to multi-token or multi-step generations.
  • Modeling higher-order combinatorial dependencies (beyond greedy selection, toward optimal subgraph search).
  • Automated interpretation and summarization of discovered circuits, integrating with agentic interpretability pipelines.

Overall, query circuits repurpose the conceptual lens of “circuit analysis” from global to local, enabling direct, faithful, and practical explanations of specific LLM outputs (Wu et al., 29 Sep 2025).
