Journal: Volume 28, No. 4, 2023
Pages: 59 – 69
DOI: https://doi.org/10.62660/2306-4412.4.2023.59-69

Deduplication of error reports in software malfunction: Algorithms for comparing call stacks

Serhii Pavlenko, Petro Kuliabko
Received 31.08.2023
Revised 16.11.2023
Accepted 18.12.2023

Abstract

In the software industry, automatic fault monitoring systems are recognised as standard practice and are considered mandatory to implement. Given the constant development of technologies and the high complexity of programmes, optimising the processes for detecting and eliminating errors is a relevant task, driven by the need for reliable and stable software. The purpose of this study is to analyse in detail the existing algorithms for deduplicating reports produced by automatic systems that collect information about software failures. The algorithms considered include the longest common subsequence method, Levenshtein distance, deep learning methods, Siamese neural networks, and hidden Markov models. The results indicate considerable potential for optimising error detection and elimination in software. The comprehensive approach developed for analysing and detecting duplicate call stacks in failure reports allows such issues to be addressed effectively. Deep learning methods and hidden Markov models demonstrated their effectiveness and feasibility for real-world applications. Effective methods for comparing key report parameters were identified, which helps to identify and group recurring errors. Call stack comparison algorithms proved critical for accurately identifying similar error cases in products with large audiences and highly parallel execution. Siamese neural networks and the S3M algorithm are used to determine the similarity of call stacks, including with recurrent neural networks (long short-term memory and bidirectional long short-term memory). Optimising report processing and clustering improves the speed and efficiency of responding to new failure cases, allowing developers to improve system stability and focus on high-priority issues.
The study is useful for software developers, software development companies, system administrators, research groups, algorithm and tool development companies, cybersecurity professionals, and educational institutions.
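
As an illustration of two of the classical measures named in the abstract, the following is a minimal sketch of Levenshtein distance and longest common subsequence computed over whole call stack frames rather than characters. It is not the authors' implementation: the function names, the example frame names, and the normalisation used in `stack_similarity` are assumptions made here for illustration only.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two call stacks, counted in whole frames."""
    prev = list(range(len(b) + 1))  # distance from an empty stack to b[:j]
    for i, frame_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to an empty stack
        for j, frame_b in enumerate(b, start=1):
            cost = 0 if frame_a == frame_b else 1
            curr.append(min(prev[j] + 1,          # delete a frame from a
                            curr[j - 1] + 1,      # insert a frame from b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of frames."""
    prev = [0] * (len(b) + 1)
    for frame_a in a:
        curr = [0]
        for j, frame_b in enumerate(b, start=1):
            if frame_a == frame_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def stack_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalisation: 1.0 for identical stacks, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical stacks from two failure reports, outermost frame first.
stack_a = ["main", "parse", "read_token", "abort"]
stack_b = ["main", "parse", "peek_token", "abort"]
```

In a deduplication pipeline, two reports would typically be placed in the same group when such a similarity score exceeds a chosen threshold; the threshold itself has to be tuned on labelled data and is not specified here.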

Keywords

Suggested citation

Pavlenko, S., & Kuliabko, P. (2023). Deduplication of error reports in software malfunction: Algorithms for comparing call stacks. Bulletin of Cherkasy State Technological University, 28(4), 59-69. https://doi.org/10.62660/2306-4412.4.2023.59-69