
Learning Interpretable Rules for Scalable Data Representation and Classification (2310.14336v3)

Published 22 Oct 2023 in cs.LG and cs.AI

Abstract: Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. A novel design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on ten small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.


Summary

  • The paper introduces the Rule-based Representation Learner (RRL), a novel model designed to achieve both interpretability and scalability in rule-based learning, particularly for high-stakes domains.
  • RRL employs Gradient Grafting and continuous approximation of discrete logical operations, enabling end-to-end differentiable training for its hierarchical rule structure.
  • Empirical evaluations show RRL outperforms existing interpretable models and competes with complex methods like random forests, demonstrating its potential for scalable, transparent AI in practical applications.

An Expert Overview of "Learning Interpretable Rules for Scalable Data Representation and Classification"

The paper "Learning Interpretable Rules for Scalable Data Representation and Classification" introduces a novel classification model referred to as the Rule-based Representation Learner (RRL). Designed to balance the often competing objectives of interpretability and scalability, this model targets environments that demand transparent decision-making processes, such as medical and financial applications.

Motivation and Problem Statement

Rule-based models, including decision trees, are favored in domains requiring high interpretability because of their clear decision pathways. However, such models are hard to optimize on large datasets because of their discrete parameters and non-differentiable structures. Common remedies, such as ensemble methods or soft/fuzzy rules, improve performance at the cost of interpretability. The paper's chief aim is to achieve interpretability and scalability simultaneously, a dual objective not fully addressed by existing methods.

Methodological Innovations

Rule-based Representation Learner (RRL)

The primary innovation of the paper is the RRL, which learns interpretable, non-fuzzy rules for data representation and classification. Because the discrete model is not directly differentiable, the RRL is projected into a continuous space and trained with a novel method called Gradient Grafting, which optimizes the discrete model directly using gradient information derived from both its discrete and continuous versions.

  1. Model Structure: The RRL is a hierarchical model consisting of a binarization layer, a stack of logical layers, and a final linear classification layer. Each logical layer combines conjunctions and disjunctions of its inputs, so stacked layers can express rules in forms such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF).
  2. Continuous Approximation: To enable gradient-based training, the discrete logical operations within the RRL are projected into a continuous space via logical activation functions. Notably, the paper introduces novel forms of these functions to mitigate the vanishing gradients that plague product-based formulations on high-dimensional inputs (a minimal sketch follows this list).
  3. Gradient Grafting: This training paradigm leverages the learnability of the continuous model while directly optimizing the discrete one. By grafting the gradient of the loss, evaluated at the discrete model's output, onto the continuous model's computation graph layer by layer, RRL keeps updates directed toward improving the discrete model's accuracy (see the second sketch below).
  4. End-to-End Feature Discretization: Through its binarization layer, the RRL discretizes continuous features end-to-end, selecting meaningful feature partitions during training rather than relying on fixed pre-discretization (see the third sketch below).
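
To make the continuous relaxation concrete, the snippet below sketches the product-based soft conjunction and disjunction that are the standard starting point for this family of models. It is a minimal PyTorch illustration, not the paper's improved activation design; the tensor shapes and weight parameterization are assumptions.

```python
import torch

def soft_conjunction(h: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Soft AND. h: (batch, in_dim) node values in [0, 1];
    W: (out_dim, in_dim) membership weights in [0, 1].
    With binary h and W this reduces to an exact logical AND over the
    inputs selected by W, which keeps the discrete model readable."""
    return torch.prod(1.0 - W.unsqueeze(0) * (1.0 - h.unsqueeze(1)), dim=-1)

def soft_disjunction(h: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Soft OR: 1 - prod_i (1 - W[j, i] * h[i])."""
    return 1.0 - torch.prod(1.0 - W.unsqueeze(0) * h.unsqueeze(1), dim=-1)
```

Products of many factors in [0, 1] shrink toward zero as the input dimension grows, so gradients vanish; this is precisely the scalability bottleneck the paper's novel logical activation functions are designed to remove.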
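
Gradient Grafting can be approximated at the output level with a detach trick, in the spirit of straight-through estimators: the forward value comes from the discrete rules while the backward pass flows through their continuous relaxation. The layer below is a hedged sketch under that simplification; the paper's actual method synchronizes gradients layer by layer rather than only at the output, and the 0.5 binarization threshold is an assumption.

```python
import torch
import torch.nn as nn

class GraftedAndLayer(nn.Module):
    """Soft-AND layer whose forward value uses binarized weights while
    gradients flow through the continuous weights (output-level grafting)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w_cont = torch.sigmoid(self.weight)   # continuous weights in (0, 1)
        w_disc = (w_cont > 0.5).float()       # discrete rule weights in {0, 1}
        h = h.unsqueeze(1)                    # (batch, 1, in_dim)
        y_cont = torch.prod(1.0 - w_cont * (1.0 - h), dim=-1)
        y_disc = torch.prod(1.0 - w_disc * (1.0 - h), dim=-1)
        # Forward value = discrete rules; gradient = continuous relaxation.
        return y_cont + (y_disc - y_cont).detach()
```

Because the loss is then computed on the discrete output, training progress reflects the accuracy of the rules that will ultimately be read out, not merely that of a soft surrogate.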
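
End-to-end feature discretization can likewise be sketched as learnable threshold comparisons relaxed with a sigmoid for gradient flow. The parameterization below (random initial bounds, a fixed temperature) is purely illustrative, not the paper's exact binarization layer.

```python
import torch
import torch.nn as nn

class BinarizationLayer(nn.Module):
    """Maps each continuous feature to k soft indicators of (x > bound),
    with the bounds learned jointly with the rest of the model."""

    def __init__(self, n_features: int, k: int, temperature: float = 10.0):
        super().__init__()
        self.bounds = nn.Parameter(torch.randn(n_features, k))
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) -> (batch, n_features * k) soft indicators.
        diff = x.unsqueeze(-1) - self.bounds  # broadcast over the k bounds
        return torch.sigmoid(self.temperature * diff).flatten(start_dim=1)
```

At readout time the sigmoid is replaced by a hard comparison, so each learned bound becomes a human-readable condition such as "age > 63".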

Empirical Evaluation

The effectiveness of RRL is validated on 14 datasets, ten small and four large, drawn from diverse domains. The results indicate that RRL outperforms existing interpretable models and achieves accuracy competitive with complex models such as random forests and gradient-boosted trees (e.g., XGBoost, LightGBM).

Implications and Future Directions

The RRL opens new avenues for developing interpretable models capable of handling large volumes of data. Its ability to maintain transparency while managing scalability challenges has practical implications in high-stakes environments where decisions must be justifiable.

On the theoretical side, the work demonstrates that discrete and continuous paradigms can be bridged within a single training framework, offering insights into optimizing non-differentiable models more broadly. Future research may refine the logical activation functions to further reduce computational overhead, or extend the RRL architecture to handle multi-modal data directly. Additionally, integrating domain-specific constraints during training could tailor RRL outputs more closely to individual application needs.

Overall, the Rule-based Representation Learner represents a significant advance in interpretable machine learning, a field where the demand for models that are both understandable and performant continues to grow. This work lays a foundation on which further innovations in model transparency and scalability can be built.