- The paper introduces ICX360, a unified toolkit that advances in-context explainability for LLMs by integrating hierarchical, contrastive, and gradient-based methods.
- The toolkit employs innovative methods such as MExGen, CELL, and Token Highlighter to generate coherent, context-sensitive explanations for model outputs.
- ICX360 supports both black-box and white-box settings, enhancing model auditability and fostering responsible AI deployment in high-stakes domains.
ICX360: An Open-Source Toolkit for In-Context Explainability of LLMs
Introduction and Motivation
The ICX360 toolkit addresses a critical gap in the field of explainable AI for LLMs: developing methods and tools to explain LLM-generated text with respect to the user-provided input context. With LLMs increasingly deployed in high-stakes domains (e.g., medical question answering, code generation, legal document analysis), robust, interpretable, and actionable explanations for model outputs are required for responsible AI adoption and system reliability.
Traditional explainability frameworks for deep learning, such as LIME and SHAP, face challenges when applied to the generative and context-grounded outputs of modern LLMs, especially given their unique input-output characteristics (e.g., long text inputs, autoregressive generation, hierarchical linguistic structure). ICX360 systematically advances the state-of-the-art by providing a unified toolkit for input-based (in-context) explainability tailored to LLMs, supporting both black-box and white-box settings.
ICX360 encompasses three primary explanation methods, each instantiated as an abstraction within the toolkit:
- MExGen: Multi-Level Explanations for Generative LLMs, extending perturbation-based attribution (Ã la LIME/SHAP) to generative LLMs while introducing a hierarchical, multi-level strategy. It innovatively leverages "scalarizers" to convert output texts to evaluation scores and supports efficient handling of long, structured inputs via hierarchical segmentation [monteiro-paes-2025-multi].
- CELL: Contrastive Explanations for LLMs, generating explanations by searching for minimal modifications to the user prompt that elicit contrastive properties (e.g., preferability, contradiction) in the LLM outputs. CELL provides both myopic (mCELL) and budget-constrained global search algorithms, supporting variable desiderata in contrastive analysis (Luss et al., 2024).
- Token Highlighter: A white-box, gradient-based approach computing importance scores for each input token via gradients of the log-likelihood (or Affirmation Loss), with the capability to aggregate scores into coarser linguistic units (e.g., words, phrases, sentences). Beyond explanation, it provides actionable insight for defending against adversarial jailbreaking [hu2025tokenhighlighter].
Landscape of In-Context Explanations
ICX360 situates its methods in a multidimensional landscape defined by:
- Level of LLM Access: Ranges from truly black-box (text-only output), grey-box (token probability/logits access), to white-box (requiring internal model gradients).
- Input Granularity: From token-level (fine granular) to sentence/paragraph/document-level (coarse granular), exploiting the hierarchical structure of language; MExGen operates top-down, Token Highlighter bottom-up, and CELL at a user-defined granularity.
- Output Granularity: All current ICX360 methods operate at phrase or sentence output granularity, thus providing explanations for coherent textual chunks rather than at the fragmented token level.
The multidimensional nature of this landscape is depicted below, where ICX360’s methods are positioned according to their access requirements and input granularity.
Figure 1: A two-dimensional view of the in-context explanation space, with ICX360 methods mapped by LLM access and input granularity.
Implementation and Design Abstractions
ICX360 is implemented as an extensible Python package, featuring:
- Explainer Abstractions: Unified APIs for local black-box (LocalBBExplainer) and white-box (LocalWBExplainer) explainers. Specialized subclasses implement MExGen, CELL, and Token Highlighter, supporting multiple algorithmic variants.
- Model Wrappers: Encapsulate interactions with both Hugging Face and vLLM-API-driven LLMs. The interface harmonizes input preprocessing, output management, and scalability concerns.
- Infillers and Perturbation: Modular infillers (e.g., BART, T5) for controlled input perturbation are a core mechanism in both MExGen and CELL.
- Scalarizers and Metrics: A diverse set of scalarizers convert LLM outputs into interpretable metrics—enabling model-agnostic evaluation of output variation, preference, contradiction, or entailment.
- Segmenters: Linguistically-inspired segmenters using spaCy facilitate flexible aggregations from tokens to larger textual units.
- Evaluation: Faithfulness of explanations is empirically assessed via perturbation curves and impact evaluation.
Example code snippets for each method streamline adoption:
Figure 2: Token Highlighter code snippet, demonstrating gradient-based token attribution.
Figure 3: MExGen code snippet, illustrating hierarchical perturbation-based explanation for generative LLMs.
Figure 4: CELL code snippet, showing invocation for contrastive prompt explanation.
Comparative Advantages and Claims
ICX360 delivers several substantive advances over existing interpretability libraries (e.g., SHAP, Captum, Inseq):
- Coherent Attribution Units: Unlike token-level approaches, ICX360 provides attributions for semantically meaningful linguistic segments, improving interpretability for end-users and practitioners.
- Contrastive Explanations: Currently, ICX360 is unique in offering black-box, contrastive explanations at the prompt level, enabling users to explore counterfactuals and preferences—a capability absent from existing toolkits.
- End-to-End Sequence Explanations: The toolkit supports direct attribution for full output sequences, overcoming the fragmentation problem endemic to token-by-token explanation schemes.
Practical and Theoretical Implications
Practically, ICX360 empowers developers and domain experts to audit LLM outputs, identify and mitigate adversarial jailbreaks, and generate user-directed, fine-grained explanations for context-rich applications. The inclusion of both perturbation- and gradient-based methods, coupled with flexible access modalities, makes the toolkit broadly applicable across a spectrum of LLM deployment scenarios.
Theoretically, the multi-dimensional taxonomy provided by ICX360 clarifies the design space for attribution and explanation algorithms tailored to LLMs. This fosters future research in efficient black-box explainer algorithms, hybrid attention-based approaches, and context-sensitive evaluation metrics.
Prospective developments include amortized inference for perturbation-based explainers [covert2024stochastic, Yang2023EfficientSV], extension to attention-based explanations, and richer compositional strategies for hierarchical text explanation.
Conclusion
ICX360 provides a comprehensive, open-source platform for in-context explainability in LLMs, unifying a spectrum of state-of-the-art explanation methods under a modular and extensible software framework (2511.10879). Through the introduction of hierarchical, contrastive, and gradient-based techniques, ICX360 addresses key challenges in interpretability for generative LLMs, setting a new standard for explanatory depth, flexibility, and user-alignment. The framework will likely catalyze further advances in explainable language technology, with broad implications for trustworthy deployment in critical domains.