- The paper introduces SocRat, a causal framework that combines VAE-based input perturbations with Bayesian logistic regression to uncover input-output dependencies in black-box sequence-to-sequence (seq2seq) models.
- It demonstrates accurate token-level alignment in tasks such as dictionary-based character-to-phoneme mapping and machine translation, performing competitively with established alignment baselines.
- The framework improves model interpretability by surfacing causal relationships and detecting biases, offering actionable insights for error diagnosis and model refinement.
Causal Framework for Explaining Black-Box Sequence-to-Sequence Models
The paper presents a novel framework for interpreting the predictions of black-box sequence-to-sequence models by identifying causal relationships between structured inputs and outputs. Grounded in causal inference, the method aims to produce explanations that improve the interpretability of complex structured prediction tasks in NLP, such as machine translation, summarization, and speech recognition.
Methodology Overview
The authors propose a structured-output causal rationalizer (SocRat) that explains the predictions of black-box systems through causal inference, with a particular focus on sequence-to-sequence tasks. The approach comprises three primary components:
- Perturbation Model: A variational autoencoder (VAE) generates semantically similar but perturbed versions of the input. By probing the black-box model in a local neighborhood of the original input, these perturbations supply the evidence needed to infer dependencies between input and output tokens (see the first sketch after this list).
- Causal Model: Causal dependencies are estimated with Bayesian logistic regression, using the perturbed input-output pairs to quantify how strongly each input token influences each output token (also covered in the first sketch below).
- Explanation Selection: The dense bipartite dependency graph produced by the causal model is partitioned into a small set of coherent explanation components, using robust optimization to account for estimation uncertainty (see the second sketch below).
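A minimal sketch of the first two components under simplifying assumptions: `vae` is a hypothetical pre-trained sentence VAE exposing `encode`/`decode` methods, `black_box` is the seq2seq model under study, and scikit-learn's (non-Bayesian) L2-regularized logistic regression stands in for the paper's Bayesian formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_dependencies(x_tokens, vae, black_box, n_samples=200, scale=0.1, seed=0):
    """Estimate an input-token x output-token dependency matrix.

    vae       -- hypothetical pre-trained sentence VAE with encode()/decode()
    black_box -- the seq2seq model under study, mapping a token list to a token list
    """
    rng = np.random.default_rng(seed)

    # 1. Perturbation model: sample semantically similar inputs by adding
    #    small Gaussian noise to the latent code and decoding each sample.
    z = vae.encode(x_tokens)
    perturbed = [vae.decode(z + scale * rng.standard_normal(z.shape))
                 for _ in range(n_samples)]
    outputs = [black_box(p) for p in perturbed]

    # 2. Featurize each perturbed input as binary presence indicators over the
    #    tokens of the original input; targets are presence indicators over the
    #    tokens of the original output.
    y_tokens = black_box(x_tokens)
    X = np.array([[tok in p for tok in x_tokens] for p in perturbed], dtype=float)
    deps = np.zeros((len(x_tokens), len(y_tokens)))

    # 3. Causal model: for each output token, fit a logistic regression that
    #    predicts its presence from the input-token indicators; the learned
    #    coefficients serve as dependency strengths. (The paper uses a Bayesian
    #    formulation with uncertainty estimates; this is a simplified stand-in.)
    for j, out_tok in enumerate(y_tokens):
        target = np.array([out_tok in out for out in outputs], dtype=int)
        if target.min() == target.max():   # token always or never present
            continue
        clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, target)
        deps[:, j] = clf.coef_[0]
    return deps, y_tokens
```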
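The selection step can be approximated with off-the-shelf bipartite co-clustering. The sketch below uses scikit-learn's SpectralCoclustering as a stand-in for the paper's uncertainty-aware graph partitioning; the robust-optimization treatment of estimation uncertainty is not reproduced here.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def select_explanations(deps, x_tokens, y_tokens, n_clusters=3):
    """Partition the dense input-output dependency graph into a few coherent
    clusters, each serving as one explanation component."""
    # Co-clustering expects non-negative weights; keep only positive dependencies
    # and add a small epsilon so no row or column is entirely zero.
    weights = np.clip(deps, 0.0, None) + 1e-6
    n_clusters = min(n_clusters, len(x_tokens), len(y_tokens))
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0).fit(weights)

    clusters = []
    for k in range(n_clusters):
        rows = [x_tokens[i] for i in np.where(model.rows_[k])[0]]
        cols = [y_tokens[j] for j in np.where(model.columns_[k])[0]]
        clusters.append((rows, cols))
    return clusters
```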
Numerical Results and Experimental Evaluation
The framework was validated in several experiments. On a dictionary-based task, the method accurately recovered character-to-phoneme mappings, performing competitively with strong alignment baselines. In machine translation, SocRat produced coherent explanations of output predictions in terms of the input structure, yielding explanations that align with and complement traditional attention mechanisms.
Implications and Future Directions
The proposed method offers substantial insight into the workings of complex prediction models, supporting improved trust, error diagnosis, and model refinement. It reveals dependencies that traditional methods might overlook, pointing to potential advances in alignment and interpretability for NLP models.
Practically, these insights extend to model improvement. For instance, applying the framework to a translation system revealed gender-related biases in source-target mappings: SocRat identified strong associative dependencies that hint at underlying biases, which could then be addressed during model training or data collection (a hypothetical query of this kind is sketched below).
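As a purely hypothetical illustration (not code or data from the paper), dependency estimates like those produced above could be scanned for suspicious source-target links, for example gendered source tokens that exert unusually strong influence on particular target tokens:

```python
def flag_gendered_links(deps, src_tokens, tgt_tokens, threshold=1.0,
                        gendered=("he", "she", "his", "her", "him")):
    """Return (source token, target token, weight) triples where a gendered
    source token exceeds the dependency threshold for some target token."""
    flags = []
    for i, s in enumerate(src_tokens):
        if s.lower() not in gendered:
            continue
        for j, t in enumerate(tgt_tokens):
            if deps[i, j] > threshold:
                flags.append((s, t, float(deps[i, j])))
    # Strongest links first, for manual inspection of possible bias.
    return sorted(flags, key=lambda f: -f[2])
```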
Future work might extend the framework to other structured-data domains, such as image-to-text prediction. While VAE-based sampling is an effective way to generate perturbations, alternative perturbation methods tailored to specific domains could further broaden the applicability of the causal inference approach.
Conclusion
This research proposes a robust, general framework to interpret sequence-to-sequence model predictions, emphasizing causal relationships in structured data. Its application to both synthetic and real-world tasks demonstrates utility in enhancing transparency and detecting biases, marking a significant contribution to the field of model interpretability.