Study of the impact of different categorical feature encoding techniques on cluster structures
Abstract
Categorical features are common in data analysis, but their non-metric nature makes it difficult to apply standard clustering algorithms. The study is motivated by the need to assess how different methods of recoding (digitising) such features affect the effectiveness of cluster analysis. The purpose of the study was to investigate how different techniques for processing categorical data affect the quality and structure of clusters. The methodology comprised three models with different approaches to variable coding: one that ignored domain specifics, one that took the content of the features into account, and one that changed the order in which clustering and dimensionality reduction were applied. LabelEncoder, OrdinalEncoder, One-Hot Encoding, Mapping, and MultiLabelBinarizer were used for coding. In each model, clustering was performed with two algorithms, K-Means and agglomerative clustering, which allowed their sensitivity to changes in data representation to be compared. The t-SNE dimensionality reduction method was used to visualise the cluster structure in two-dimensional space. Clustering quality was evaluated using the Silhouette Score, Dunn Index, Davies-Bouldin Index, and Calinski-Harabasz Index. The data were obtained from an open source and contained information about the psycho-emotional state of students. The study found that basic recoding of categorical features without regard to their semantics and context degraded clustering quality, reducing the accuracy of the partition and complicating interpretation of the results. By contrast, domain-oriented coding approaches produced clusters with clearer boundaries and a more logical internal structure.
In addition, changing the sequence of clustering and dimensionality reduction was found to affect the preservation of local relationships in the data. The different approaches changed both the number and the quality of clusters, as reflected in the values of the evaluation metrics. The practical significance of the results lies in their potential application by data analysts and machine learning specialists to improve the accuracy of segmentation of complex categorical data.
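The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn, SciPy, pandas, and NumPy are available. It is not the study's code. The toy data, the column names (`stress`, `sleep`, `gender`), the helper names `dunn_index` and `evaluate`, and the choice of three clusters are all hypothetical. The sketch contrasts a naive integer coding, which orders categories alphabetically, with a domain-aware coding that keeps the natural order of ordinal features and one-hot encodes the nominal one. It then scores K-Means and agglomerative clustering with the four metrics named above. The Dunn Index is computed by hand because scikit-learn does not provide it. The t-SNE visualisation step is left out.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy survey data; NOT the dataset used in the study.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "stress": rng.choice(["low", "medium", "high"], size=n),
    "sleep": rng.choice(["poor", "average", "good"], size=n),
    "gender": rng.choice(["female", "male"], size=n),
})

# Naive coding: integer codes in alphabetical order, so "high" < "low" < "medium".
naive = OrdinalEncoder().fit_transform(df)

# Domain-aware coding: explicit order for ordinal features, one-hot for nominal ones.
ordered = OrdinalEncoder(
    categories=[["low", "medium", "high"], ["poor", "average", "good"]]
).fit_transform(df[["stress", "sleep"]])
nominal = OneHotEncoder().fit_transform(df[["gender"]]).toarray()
domain = np.hstack([ordered, nominal])


def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    max_diameter = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")


def evaluate(X, k=3):
    """Cluster X with both algorithms and return the four quality metrics."""
    models = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
        "agglomerative": AgglomerativeClustering(n_clusters=k),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = {
            "silhouette": silhouette_score(X, labels),
            "dunn": dunn_index(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
        }
    return results


for coding_name, X in {"naive": naive, "domain": domain}.items():
    for algo, metrics in evaluate(X).items():
        print(coding_name, algo, {m: round(v, 3) for m, v in metrics.items()})
```

Higher Silhouette, Dunn, and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters. Because the toy data are random, the printed numbers only show the mechanics of the comparison and say nothing about the study's findings.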
Keywords
data analysis; machine learning; unsupervised learning; automatic object grouping; segmentation