Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities (2309.05035v3)
Abstract: Community Question Answering (CQA) platforms in many domains are growing rapidly, driven by the availability of several platforms and the large volume of information that users share on them. As these platforms grow, the sheer amount of archived data makes it difficult for moderators to retrieve possible duplicates of a new question and to identify and confirm existing question pairs as duplicates in a timely manner. The problem is even more acute in CQAs for large software systems such as askubuntu, where moderators need to be experts to recognize a question as a duplicate. The prime challenge on such platforms is that the moderators, being experts, are usually extremely busy and their time is extraordinarily expensive. To ease the moderators' task, we tackle two significant problems for the askubuntu CQA platform: (1) retrieval of duplicate questions given a new question and (2) duplicate question confirmation time prediction. In the first task, we retrieve duplicate questions from a question pool for a newly posted question. In the second task, we solve a regression problem to rank question pairs that could potentially take a long time to be confirmed as duplicates. For duplicate question retrieval, we propose a Siamese neural network based approach that exploits both text and network-based features and outperforms several state-of-the-art baseline techniques; in particular, it outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate confirmation time prediction, we use both standard machine learning models and neural networks with text and graph-based features. We obtain Spearman's rank correlations of 0.20 and 0.213 (statistically significant) for text and graph-based features respectively.
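The confirmation time task is evaluated with Spearman's rank correlation, which measures how well the ordering of predicted values matches the ordering of the true values. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code: the function names and the sample numbers are hypothetical, ties are handled by average ranks, and the significance test mentioned in the abstract is not shown.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Hypothetical data: true confirmation times (days) and model predictions.
actual_days = [1.0, 3.0, 10.0, 30.0, 90.0]
predicted_days = [2.0, 1.5, 12.0, 20.0, 60.0]
rho = spearman(actual_days, predicted_days)
```

Because only the ordering matters, a model can score well on this metric even when its predicted times are off in absolute terms, which suits a setting where the goal is to rank pairs likely to wait longest for confirmation.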
Authors: Rima Hazra, Debanjan Saha, Amruit Sahoo, Somnath Banerjee, Animesh Mukherjee