Boosted top tagging and its interpretation using Shapley values

Published 22 Dec 2022 in hep-ph and hep-ex | (2212.11606v2)

Abstract: Top tagging has emerged as a fast-evolving subject due to the top quark's significant role in probing physics beyond the standard model. For the reconstruction of top jets, machine learning models have shown a substantial improvement in classification performance compared to previous methods. In this work, we build top taggers using $N$-Subjettiness ratios and several Energy Correlation observables as input features to train the eXtreme Gradient BOOSTed decision tree (XGBOOST). The study finds that tighter parton-level matching leads to more accurate tagging. However, in real experimental data, where parton-level information is unavailable, this matching cannot be performed. We train the XGBOOST models without performing this matching and show that the difference affects the taggers' effectiveness. Additionally, we test the tagger under different simulation conditions, including changes in center-of-mass energy, parton distribution functions (PDFs), and pileup effects, demonstrating its robustness with performance deviations of less than 1%. Furthermore, we use the SHapley Additive exPlanation (SHAP) framework to calculate the importance of the features of the trained models. This allows us to estimate how much each feature of the data contributes to the model's prediction and which regions of each input variable matter most. Finally, we combine all the tagger variables to form a hybrid tagger and interpret the results using the Shapley values.
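To make the idea of a per-feature Shapley contribution concrete, the following is a minimal sketch that computes exact Shapley values by enumerating feature subsets. Everything in it is an illustrative assumption and not the paper's setup: `toy_score` is a hypothetical three-feature function standing in for a trained tagger, the all-zero `BASELINE` is an arbitrary reference point, and "absent" features are simply replaced by their baseline value. In practice the SHAP framework uses efficient algorithms for tree ensembles instead of this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_score(x):
    """Hypothetical stand-in for a tagger output (NOT the paper's model)."""
    return 2.0 * x[0] + x[1] * x[2]


# Assumed example point and reference point, chosen only for illustration.
X = (1.0, 2.0, 3.0)
BASELINE = (0.0, 0.0, 0.0)


def value_fn(present):
    """Score with features outside `present` replaced by the baseline."""
    mixed = tuple(X[j] if j in present else BASELINE[j] for j in range(len(X)))
    return toy_score(mixed)


phi = shapley_values(value_fn, len(X))
print(phi)  # one attribution per feature
```

In this toy case the additive term is attributed entirely to the first feature, while the product term is shared equally between the other two. The attributions sum to the difference between the score at `X` and the score at `BASELINE`, which is the additivity property that gives SHAP its name.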