Interpretability-Guided Bi-objective Optimization
- The paper introduces IGBO, a framework that uses Pareto optimization to balance predictive fidelity with human interpretability.
- It integrates techniques such as template-based synthesis, evolutionary algorithms, and gradient projection to systematically incorporate human feedback and structured priors.
- Empirical findings show that small drops in accuracy can yield significant interpretability gains, making IGBO effective across diverse applications.
Interpretability-Guided Bi-objective Optimization (IGBO) is a family of frameworks for synthesizing models, explanations, or policies that jointly optimize predictive fidelity (accuracy, correctness) and human interpretability. It unifies multiple research directions—template-based synthesis, feature attribution and sensitivity, evolutionary search, and interactive human feedback—under the principle that interpretability should be treated as a first-class, explicitly modeled objective throughout the learning pipeline (Torfah et al., 2021, Schneider et al., 2023, Pillai et al., 2024, Fouladi et al., 2 Jan 2026, Lage et al., 2018, Virgolin et al., 2021). The core methodological advance of IGBO is the rigorous treatment of accuracy–interpretability trade-offs via Pareto-optimal bi-objective optimization formulations.
1. Formal Problem Foundations
Interpretability-Guided Bi-objective Optimization operates on two objectives, typically denoted as accuracy/correctness and interpretability/explainability. The general problem statement is:

$$\max_{M \in \mathcal{M}} \; \big( F_{\mathrm{acc}}(M),\ F_{\mathrm{exp}}(M) \big),$$

where $F_{\mathrm{acc}}$ quantifies predictive fidelity (e.g., accuracy of the synthesized interpretation on a finite sample from the black-box model) and $F_{\mathrm{exp}}$ quantifies explainability using syntactic or structural proxies (e.g., size, simplicity of decision diagrams, ease of predicates) (Torfah et al., 2021). In more expressive settings, $F_{\mathrm{acc}}$ or $F_{\mathrm{exp}}$ may be personalized to the user (via neural estimators or human feedback) (Virgolin et al., 2021, Lage et al., 2018).
The feasible set $\mathcal{M}$ (or hypothesis space $\mathcal{H}$) may be highly structured—for example, bounded multi-valued decision diagrams, symbolic regression trees, feature groupings, or neural architectures with domain constraints. The optimization seeks the Pareto frontier of non-dominated solutions, providing practitioners with multiple trade-offs rather than a single scalarized optimum.
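To make the non-domination criterion concrete, the following minimal Python sketch (illustrative only, not taken from any of the cited frameworks) filters a set of candidate $(F_{\mathrm{acc}}, F_{\mathrm{exp}})$ pairs down to its Pareto frontier; the candidate scores are hypothetical.

```python
import numpy as np

def pareto_frontier(scores: np.ndarray) -> np.ndarray:
    """Return indices of non-dominated rows of `scores`, where each row
    is a (fidelity, interpretability) pair and larger is better on both."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is >= on every objective and > on at least one.
        dominates_i = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        if dominates_i.any():
            keep[i] = False
    return np.flatnonzero(keep)

# Hypothetical candidate scores: (fidelity, interpretability).
candidates = np.array([[0.93, 0.20], [0.92, 0.35], [0.80, 0.70], [0.78, 0.65]])
print(pareto_frontier(candidates))  # [0 1 2]; the last point is dominated by [0.80, 0.70]
```

Full IGBO systems couple this filtering with structured search over $\mathcal{M}$ (Section 3) rather than scoring an exhaustive enumeration of candidates.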
2. Interpretation and Operationalization of Objectives
Correctness / Fidelity: Accuracy metrics are computed as the fraction of samples for which the synthesized interpretation matches the black-box output, i.e., $F_{\mathrm{acc}}(M) = \tfrac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{M(x_i) = B(x_i)\}$ for a sample $\{x_i\}_{i=1}^{N}$ queried from the black box $B$. In tabular or feature-selection frameworks, generalization error or AUC may be used, estimated via out-of-sample validation (Schneider et al., 2023).
Explainability: Metrics for explainability are diverse:
- Size-based proxies: inverse of model complexity, number of unused nodes, template sparsity (Torfah et al., 2021); a minimal proxy sketch appears at the end of this section.
- Structural proxies: sparsity of features, sparsity of feature–feature interactions, monotonicity constraints (Schneider et al., 2023).
- Attribution-based scores: Shapley value, Integrated Gradients, Temporal Integrated Gradients (for time-series) (Fouladi et al., 2 Jan 2026, Pillai et al., 2024).
- Personalized estimators: learned neural networks mapping model representations to user-specific interpretability scores, trained via active preference feedback (Virgolin et al., 2021).
- Human-in-the-loop priors: average human simulation time or cognitive load, empirically measured in user studies (Lage et al., 2018).
A central finding in human-centered work is that interpretability preferences are user- and task-specific; a model considered interpretable under one proxy (e.g., minimum number of features) may not be preferred under another (e.g., shortest response time) (Lage et al., 2018).
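To illustrate the size-based proxies referenced above, here is a minimal scikit-learn sketch; the specific choice of $1/\text{node count}$ as the proxy is an assumption for illustration, not the measure used in the cited work.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def size_proxy(tree: DecisionTreeClassifier) -> float:
    """Inverse-complexity proxy: 1.0 for a single-node stump, shrinking as the tree grows."""
    return 1.0 / tree.tree_.node_count

# Deeper trees buy training accuracy at the cost of the interpretability proxy.
for depth in (2, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(depth, round(model.score(X, y), 3), round(size_proxy(model), 4))
```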
3. Optimization Frameworks and Algorithms
IGBO synthesizes solutions via diverse bi-objective optimization schemes:
- Template-based Max-SAT Reduction: Candidate interpretations are encoded as bounded decision diagrams; the multi-objective problem is discretized and reduced to a weighted Max-SAT instance whose soft clauses correspond to correctness and explainability (Torfah et al., 2021). The Pareto frontier is recovered via recursive subregion exploration (ExplorePOI algorithm), guaranteeing completeness and soundness.
- Evolutionary Algorithms and Group-Based Genetic Search: Feature-composition, interaction, and monotonicity groupings are encoded directly in the genetic representation of candidate models; NSGA-II-style non-dominated sorting and crowding-distance criteria then select competitive trade-offs (Schneider et al., 2023, Virgolin et al., 2021).
- Gradient-Based Scalarization and Geometric Projection: For differentiable model classes, interpretability priors are imposed via additional loss terms (e.g., a feature-importance hierarchy encoded as a DAG), and gradient updates are projected using geometric combination rules to ensure descent in both objectives (Fouladi et al., 2 Jan 2026); see the sketch after this list.
- Attribution and Sensitivity Integration: Local (Shapley/DeepSHAP) and global (Sobol indices) explanations are aggregated into a scalar interpretability objective, guiding combinatorial optimization over feature-value configurations (Pillai et al., 2024).
- Human-in-the-Loop Bayesian Optimization: Candidate models pre-filtered for high accuracy are then actively sampled for human response measurements, allowing expensive, noisy interpretability signals to drive surrogate model search (Lage et al., 2018).
Common features include Pareto-front extraction, trade-off visualization, and mechanisms for scaling up the optimization via constraint pruning, search-space augmentation, and computational acceleration.
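The geometric projection idea can be illustrated with the classical two-objective min-norm combination (the two-gradient special case of MGDA). This is a standard construction offered as a sketch; the cited work may use a different combination rule, and the example gradients below are hypothetical.

```python
import numpy as np

def combined_descent_direction(g_acc: np.ndarray, g_exp: np.ndarray) -> np.ndarray:
    """Min-norm point of the segment between the two objective gradients.

    For two objectives, the MGDA rule reduces to a 1-D projection: choose
    gamma in [0, 1] minimizing ||gamma*g_acc + (1-gamma)*g_exp||^2.
    Stepping along -d (with d the returned vector) decreases both
    objectives to first order whenever d != 0.
    """
    diff = g_acc - g_exp
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: any convex combination works
        return g_acc
    gamma = float((g_exp - g_acc) @ g_exp) / denom
    gamma = float(np.clip(gamma, 0.0, 1.0))
    return gamma * g_acc + (1.0 - gamma) * g_exp

# Hypothetical gradients of the accuracy loss and the interpretability penalty.
g1 = np.array([1.0, 0.5])
g2 = np.array([-0.2, 0.8])
d = combined_descent_direction(g1, g2)
print(d, d @ g1 > 0, d @ g2 > 0)  # positive inner products: common descent along -d
```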
4. Representations, Syntactic Constraints, and Structured Priors
IGBO frameworks operate over highly structured model classes, frequently defined by user or domain knowledge:
- Decision Diagrams and Bounded Templates: Multi-valued DAG structures with node-level predicate selection and branching; template size and predicate complexity are tunable within optimization (Torfah et al., 2021).
- Feature Group Structures: Explicit encoding of allowed interactions, partitioning of features into groups with monotonicity tags, binary masks for selection, etc. (Schneider et al., 2023).
- Causal and Domain Hierarchies: Directed acyclic graphs encoding that certain features must be "more important" than others (with explicit interval constraints on attribution scores) (Fouladi et al., 2 Jan 2026).
- User-Personalized Complexity Metrics: Syntax trees, operation counts, and composition depths supply input for neural preference estimators (Virgolin et al., 2021).
In time-series or other sequential domains, Temporal Integrated Gradients offers an attribution scheme where the integration path is learned by an Optimal Path Oracle to stay close to the data manifold, mitigating out-of-distribution gradient pathologies (Fouladi et al., 2 Jan 2026).
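For orientation, the sketch below implements plain Integrated Gradients along the straight-line path, which is the baseline that Temporal Integrated Gradients generalizes by learning the integration path; the toy model and finite-difference gradients are illustrative assumptions, not the cited method.

```python
import numpy as np

def model(x: np.ndarray) -> float:
    """Toy differentiable scalar model (stand-in for a trained network)."""
    w = np.array([0.8, -0.5, 0.3])
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def grad(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Central finite differences, so the sketch avoids an autodiff dependency."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (model(x + d) - model(x - d)) / (2 * eps)
    return g

def integrated_gradients(x: np.ndarray, baseline: np.ndarray, steps: int = 64) -> np.ndarray:
    """IG along the straight-line path from `baseline` to `x`
    (midpoint Riemann approximation of the path integral)."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = sum(grad(baseline + a * (x - baseline)) for a in alphas)
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0, -1.0])
attr = integrated_gradients(x, baseline=np.zeros(3))
# Completeness check: attributions sum to the change in model output.
print(attr, attr.sum(), model(x) - model(np.zeros(3)))
```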
5. Empirical Evaluation and Key Findings
Empirical work spans synthetic, tabular, sequential, and real-world datasets:
- Benchmark Diversity: Airplane perception, bank-loan approval, theorem proving, OpenML binary classification, MIMIC-III medical records, financial time series, text data, and agriculture settings (multi-pathogen optimization) (Torfah et al., 2021, Schneider et al., 2023, Pillai et al., 2024, Fouladi et al., 2 Jan 2026).
- Pareto Fronts: Typical experiments yield rich trade-off sets (4–7 points): e.g., in airplane perception, a 0.01 accuracy drop buys a 0.1 gain in explainability (Torfah et al., 2021); in tabular models, high AUC (≈0.91) requires many features/interactions, but interpretable models achieve ≈0.80 AUC with only 1–5% of features (Schneider et al., 2023).
- Personalization and Adaptivity: User-specific models discovered via ML-PIE and human-in-the-loop IGBO show significant preference improvement over fixed heuristics, with tangible reductions in mean response time and higher survey preference rates (Virgolin et al., 2021, Lage et al., 2018).
- Trade-off Efficiency: Approaches that exploit interpretability guidance prune vast portions of search space, often converging in <20% of the iterations required by blind multi-objective search (Pillai et al., 2024).
- Constraint Satisfaction: DAG-based IGBO achieves >80% satisfaction rate of feature importance hierarchies, with accuracy drop <5% compared to unconstrained models (Fouladi et al., 2 Jan 2026).
- Scalability: Max-SAT and evolutionary IGBO implementations scale efficiently for bounded template sizes and population settings; computational bottlenecks arise in attribution and sensitivity estimation, especially for large-scale or sequential data (Torfah et al., 2021, Fouladi et al., 2 Jan 2026).
6. Theoretical Properties, Limitations, and Future Directions
IGBO exhibits several formal properties:
- Completeness and Soundness: Max-SAT-based IGBO frameworks guarantee enumeration of all Pareto-optimal interpretations in the discrete case (Torfah et al., 2021).
- Universality of Pareto Sets: The optimum of any monotonic scalarization of the two objectives is attained on the recovered Pareto frontier, so the frontier subsumes all weighted single-objective formulations (Torfah et al., 2021, Schneider et al., 2023).
- Convergence Guarantees: Gradient-based projection updates yield descent in both objectives and attain Pareto-stationary points under stochastic noise (Fouladi et al., 2 Jan 2026).
- Statistical Bounds: When the interpretation class is finite or PAC-learnable, standard uniform convergence results bound the sample size required for fidelity estimation (Torfah et al., 2021); a standard instance is spelled out below.
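As a concrete instance of such a bound for a finite interpretation class, a standard Hoeffding-plus-union-bound argument (stated here generically, not as a result specific to the cited paper) gives:

```latex
% With probability at least 1 - \delta, the empirical fidelity of every
% M in a finite class \mathcal{M} satisfies
%   |\hat{F}_{\mathrm{acc}}(M) - F_{\mathrm{acc}}(M)| \le \epsilon,
% provided the sample size N satisfies
\[
  N \;\ge\; \frac{1}{2\epsilon^{2}} \,\ln\frac{2\,|\mathcal{M}|}{\delta}.
\]
```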
Limitations and open directions include:
- Computational overhead from multiple gradient or attribution evaluations per sample (Fouladi et al., 2 Jan 2026).
- Necessity of explicit domain inputs (e.g., DAGs, syntactic constraints) that may be labor-intensive to specify.
- Generalization to more than two objectives (e.g., balancing accuracy, explainability, robustness, fairness) with geometric gradient projection schemes is unresolved.
- Human-in-the-loop approaches remain sensitive to the cost and noise in user feedback, requiring adaptive query strategies and surrogate modeling (Lage et al., 2018, Virgolin et al., 2021).
- The characterization of the Pareto frontier under model misspecification, real-world distribution shift, and various interpretability proxies is a subject for further theoretical refinement.
7. Practical Deployment and Guidance
Practitioners implementing IGBO frameworks should consider:
- Template or population size tuning for algorithmic scalability.
- Selection and definition of interpretability metrics suited to the application and user population.
- Integration of domain knowledge via structured priors, group definitions, or explicit importance constraints.
- Active learning and query budgeting for human-in-the-loop scenarios (see the sketch after this list).
- Efficient attribution and sensitivity estimation via sampling, early stopping, or parallelization (Schneider et al., 2023, Fouladi et al., 2 Jan 2026, Pillai et al., 2024, Virgolin et al., 2021).
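The active-learning item above can be sketched with plain uncertainty sampling on a Gaussian-process surrogate; the candidate encoding, noisy rating oracle, and budget below are hypothetical stand-ins for real human feedback, not the protocol of the cited studies.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Hypothetical encoding: each row summarizes a candidate model (e.g., size stats);
# asking a human to rate one candidate consumes one unit of the query budget.
candidates = rng.uniform(size=(200, 3))

def human_oracle(rows: np.ndarray) -> np.ndarray:
    """Stand-in for noisy human interpretability ratings."""
    return 1.0 - rows[:, 0] + 0.2 * rng.normal(size=len(rows))

budget = 15
queried = [0, 1]  # seed queries
ratings = list(human_oracle(candidates[queried]))

gp = GaussianProcessRegressor(normalize_y=True)
while len(queried) < budget:
    gp.fit(candidates[queried], ratings)
    _, std = gp.predict(candidates, return_std=True)
    std[queried] = -np.inf          # never re-query the same candidate
    nxt = int(np.argmax(std))       # uncertainty sampling
    queried.append(nxt)
    ratings.append(float(human_oracle(candidates[[nxt]])[0]))

print("surrogate's favorite candidate:", int(np.argmax(gp.predict(candidates))))
```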
IGBO provides a principled, scalable, and extensible paradigm for deploying models that balance domain-specific desiderata for predictive performance and interpretability, with rigorous optimization and measurable trade-offs. Its multi-objective methodology, spanning symbolic, sub-symbolic, and interactive domains, continues to expand in sophistication and practical impact across the machine learning landscape.