Constructing counterfactual text pairs to isolate meaning in LLM probabilities
Develop a general method that, for an arbitrary text (including technical instructions), constructs a matched counterfactual text preserving style, grammar, length, and language while differing only in semantic content, so that the difference in the probabilities a Large Language Model assigns to the two texts isolates the contribution of meaning beyond toy examples.
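The comparison the method would enable can be sketched as follows. This is a minimal illustration, not the paper's procedure: the matching criterion (equal character and token counts) is a hypothetical stand-in for the style/grammar/length constraints, and the per-token log-probabilities are placeholder values that would in practice come from an LLM.

```python
def matched(a: str, b: str) -> bool:
    # Hypothetical surface-form matching check: same character length
    # and same number of whitespace tokens. A real method would also
    # need to certify matched style, grammar, and language.
    return len(a) == len(b) and len(a.split()) == len(b.split())

def meaning_delta(logprobs_a, logprobs_b):
    # Total log-probability difference between the two texts. If the
    # pair is matched on surface form, this difference is attributed
    # to semantic content alone.
    return sum(logprobs_a) - sum(logprobs_b)

# Toy pair in the spirit of the calf example: identical length and
# token count, different meaning.
orig, cf = "the calf drinks milk", "the calf thinks milk"
print(matched(orig, cf))

# Placeholder per-token log-probabilities (would come from an LLM).
lp_orig = [-1.2, -0.4, -2.1]
lp_cf = [-1.5, -0.9, -2.6]
print(round(meaning_delta(lp_orig, lp_cf), 2))  # → 1.3
```

The open problem is precisely the part this sketch stubs out: producing the counterfactual `cf` automatically for arbitrary texts while satisfying the matching constraints.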
References
Furthermore, disentangling the probability contribution of meaning by constructing a pair, as in the example of the calf, seems feasible only on toy examples: it is not clear how to construct the second element of the pair for arbitrary texts, such as the instructions in Figure~\ref{shipping-demo2}.
Norelli et al., "LLMs can hide text in other text of the same length," arXiv:2510.20075, 22 Oct 2025, Section: Discussion.