Constructing counterfactual text pairs to isolate meaning in LLM probabilities
Develop a general method that, for an arbitrary text (including technical instructions), constructs a matched counterfactual text preserving style, grammar, length, and language while differing only in semantic content, so that the difference in the probabilities a Large Language Model assigns to the two texts isolates the contribution of meaning beyond toy examples.
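The comparison the method would enable can be sketched as follows. This is a minimal illustration, not the paper's procedure: the matching criterion (equal character and token counts) is a hypothetical stand-in for the style/grammar/length constraints, and the per-token log-probabilities are placeholder values that would in practice come from an LLM.

```python
def matched(a: str, b: str) -> bool:
    # Hypothetical surface-form matching check: same character length
    # and same number of whitespace tokens. A real method would also
    # need to certify matched style, grammar, and language.
    return len(a) == len(b) and len(a.split()) == len(b.split())

def meaning_delta(logprobs_a, logprobs_b):
    # Total log-probability difference between the two texts. If the
    # pair is matched on surface form, this difference is attributed
    # to semantic content alone.
    return sum(logprobs_a) - sum(logprobs_b)

# Toy pair in the spirit of the calf example: identical length and
# token count, different meaning.
orig, cf = "the calf drinks milk", "the calf thinks milk"
print(matched(orig, cf))

# Placeholder per-token log-probabilities (would come from an LLM).
lp_orig = [-1.2, -0.4, -2.1]
lp_cf = [-1.5, -0.9, -2.6]
print(round(meaning_delta(lp_orig, lp_cf), 2))  # → 1.3
```

The open problem is precisely the part this sketch stubs out: producing the counterfactual `cf` automatically for arbitrary texts while satisfying the matching constraints.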
References
Furthermore, disentangling the probability contribution of meaning by constructing a pair, as in the example of the calf, seems feasible only on toy examples: it is not clear how to construct the second element of the pair for arbitrary texts, such as the instructions in Figure~\ref{shipping-demo2}.
Norelli et al., "LLMs can hide text in other text of the same length," arXiv:2510.20075, 22 Oct 2025, Section: Discussion.