Factors complicating the identification and processing of duplicates in bibliographic records: A theoretical perspective
Abstract
This article examined the factors that complicate the identification and processing of duplicate bibliographic records, which are a crucial component of the information systems of libraries, archives, and publishers. The study explored issues arising from typographical errors, variations in transliteration, the use of special characters, homoglyphs, differing word abbreviation rules, inconsistent spellings of author names, and shortcomings in the application of standard identifiers such as ISBN and ISSN. Particular attention was given to how discrepancies between international and local MARC standards, including UNIMARC and MARC 21, affect the creation and processing of bibliographic data. The analysis demonstrated that improper handling of bibliographic records can degrade the quality of information retrieval for users, introduce inaccuracies into source citations, and increase the time spent on cataloguing and indexing. Inconsistencies between standards further impair the management of bibliographic data in multinational systems. The article also examined the consequences of these issues for bibliographic systems themselves: less accurate search results, difficulties in integrating data across catalogues, and higher time and resource costs for record processing. A set of solutions was proposed: the adoption of unified record standards, the implementation of adaptive search algorithms that account for linguistic and technical discrepancies, and stronger authority control when bibliographic records are created. The findings have practical implications for information system developers, cataloguers, and library professionals, as they contribute to improving bibliographic databases, reducing duplicate records, and enhancing information retrieval quality for end users.
Keywords
data processing; transliteration; library systems; information technology; authority control; cataloguing
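Illustrative sketch
Several of the factors named in the abstract (special characters, homoglyphs, typographical errors) can be made concrete with a small matching routine. The Python sketch below is not the method proposed in the article; it is a minimal illustration under stated assumptions. The function names `match_key` and `likely_duplicates`, the eight-entry homoglyph table, and the 0.9 similarity threshold are all hypothetical choices made for this example, and `difflib.SequenceMatcher` stands in for whatever string-similarity measure a real system would use.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny homoglyph table: Cyrillic letters that are
# visually identical to Latin ones. A production table would be far larger.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0441": "c",  # Cyrillic es
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
    "\u0456": "i",  # Cyrillic Ukrainian i
})


def match_key(text: str) -> str:
    """Reduce a title to a comparison key (assumed normalisation steps)."""
    # Unify compatibility forms (ligatures, full-width letters) and case.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Replace look-alike Cyrillic letters with their Latin counterparts.
    text = text.translate(HOMOGLYPHS)
    # Treat punctuation as a separator, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def likely_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two titles as probable duplicates if their keys are similar."""
    ratio = SequenceMatcher(None, match_key(a), match_key(b)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    clean = "Metadata in practice"
    homoglyph = "Metadata in pra\u0441tice."   # Cyrillic letter inside a Latin word
    typo = "Metadata in practise"
    other = "The organization of information"
    print(likely_duplicates(clean, homoglyph))
    print(likely_duplicates(clean, typo))
    print(likely_duplicates(clean, other))
```

Because the same mapping is applied to both records, a title that is genuinely in Cyrillic still compares consistently with another copy of itself, although its key is no longer readable. The key is therefore suitable for comparison only, not for display, and any record pair flagged this way would still need confirmation by a cataloguer or by identifier checks such as ISBN or ISSN.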