Faithfulness of model-generated natural-language rules to internal reasoning

Ascertain and quantify how faithfully the natural-language rules generated by AI models for ConceptARC tasks represent the models’ actual internal reasoning procedures that produce the output grids.

Background

The paper asks models to output both an answer grid and a natural-language rule. The authors manually evaluate these rules to see whether they reflect intended abstractions, but they note uncertainty about whether the generated rules faithfully describe the model’s internal reasoning.

They observe that, especially in the textual setting, generated rules often align with outputs, suggesting some degree of faithfulness, but emphasize that further work is needed to quantify this alignment.
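One conceivable way to quantify this alignment (not a method from the paper, just an illustrative sketch): manually formalize each model-generated rule as an executable function, run it on the task inputs, and measure how often it reproduces the model's own output grids. The function and variable names below (`rule_output_agreement`, `rule_fn`, etc.) are hypothetical.

```python
from typing import Callable, List

Grid = List[List[int]]  # a ConceptARC-style grid of integer cell values

def rule_output_agreement(
    rule_fn: Callable[[Grid], Grid],
    inputs: List[Grid],
    model_outputs: List[Grid],
) -> float:
    """Fraction of inputs on which executing the formalized rule
    exactly reproduces the grid the model itself produced.

    A high score suggests the stated rule is at least behaviorally
    consistent with the model's outputs; it does not prove the rule
    describes the model's internal reasoning.
    """
    matches = sum(
        1 for inp, out in zip(inputs, model_outputs) if rule_fn(inp) == out
    )
    return matches / len(inputs)

# Toy example: suppose the model's stated rule is "reflect the grid
# horizontally", and we check it against the model's output grids.
reflect = lambda g: [row[::-1] for row in g]
inputs = [[[1, 0], [0, 2]]]
model_outputs = [[[0, 1], [2, 0]]]
print(rule_output_agreement(reflect, inputs, model_outputs))  # 1.0
```

Such a behavioral check can only provide evidence of consistency between rule and output, which is weaker than the faithfulness to internal reasoning that the open question asks about.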

References

We cannot be certain that the natural-language rules generated by the AI models we evaluated are faithful representations of the actual reasoning the models do to solve a task, though in general the output grids generated seem to align with the rules.

Do AI Models Perform Human-like Abstract Reasoning Across Modalities? (2510.02125 - Beger et al., 2 Oct 2025) in Section: Limitations