A method for keyword recognition in voice signals in resource-constrained computer systems
Received 17.06.2025, Revised 20.10.2025, Accepted 15.12.2025
Abstract
Keyword spotting on embedded platforms must balance accuracy and strict resource limits while remaining independent of network connectivity. The aim of the study was to develop and experimentally validate a classical, frugal recognition method that increases feature informativeness without increasing model complexity and is suitable for autonomous use on edge devices that rely only on a central processing unit. A weighted acoustic fingerprinting mechanism was proposed. Mel-frequency cepstral coefficients, together with their derivatives, were reweighted, aggregated, and serialised into compact discrete “fingerprints”, which were then classified using the Levenshtein edit distance. Experiments were carried out on a Ukrainian-language command corpus from six native speakers (three male, three female), recorded with both headsets and far-field microphones; lexicons of 10, 100, and 200 words were evaluated under speaker-independent splits of 70%/15%/15%. The methodology comprised fixed parametrisation of mel-frequency cepstral coefficients, construction of a static weighting vector, voice activity detection with spectral subtraction, uniform quantisation and serialisation, and deterministic edit-distance classification; for comparison, equal-weight baselines, hidden Markov models with Gaussian mixture emissions, dynamic time warping, a lightweight convolutional neural network, and a reference depthwise-separable convolutional neural network were considered. The proposed method achieved macro-averaged harmonic means of precision and recall of 0.96/0.92/0.89 for 10/100/200-word lexicons in clean audio, and 0.78 at a signal-to-noise ratio of 5 decibels (100-word lexicon). The implementation required approximately 250 kilobytes of memory and operated with a real-time factor of 0.005 on a Raspberry Pi 4 with 4 gigabytes of RAM, i.e., faster than real time. Superiority over equal-weight baselines, hidden Markov models with Gaussian mixture emissions, and dynamic time warping was demonstrated, with performance approaching that of a compact convolutional neural network. It is concluded that weighted acoustic fingerprinting provides a robust, efficient, and autonomous keyword-spotting solution for deployments that use only a central processing unit.
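To make the pipeline described above concrete, the following is a minimal sketch of the fingerprinting and classification stages: per-coefficient reweighting of a mel-frequency cepstral feature matrix, aggregation of each frame, uniform quantisation, serialisation into a symbol string, and nearest-template matching by Levenshtein distance. The mean pooling, the 16-level quantiser, the clipping range, the character alphabet, and the length-normalised distance are illustrative assumptions, not the exact parameters reported in the paper.

```python
# Illustrative sketch of the weighted acoustic fingerprinting pipeline.
# All numeric settings (pooling, 16 quantisation levels, clip range) are
# assumptions for demonstration only, not the authors' published values.
import numpy as np

def make_fingerprint(features, weights, n_levels=16, lo=-3.0, hi=3.0):
    """Turn a (frames x coefficients) feature matrix into a discrete symbol string.

    features : MFCCs stacked with their derivatives, one row per frame.
    weights  : static per-coefficient weighting vector (assumed shape: (coefficients,)).
    """
    weighted = features * weights                      # reweight each coefficient
    pooled = weighted.mean(axis=1)                     # aggregate a frame to one value (assumed pooling)
    clipped = np.clip(pooled, lo, hi)
    # uniform quantisation into n_levels bins, then map bins to characters
    bins = np.floor((clipped - lo) / (hi - lo) * (n_levels - 1)).astype(int)
    return "".join(chr(ord("A") + b) for b in bins)    # serialised fingerprint

def levenshtein(a, b):
    """Standard edit distance between two fingerprint strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def classify(features, templates, weights):
    """Nearest-template classification by length-normalised edit distance."""
    query = make_fingerprint(features, weights)
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        d = levenshtein(query, template) / max(len(query), len(template))
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist
```

In such a scheme the reference `templates` would hold one fingerprint string per keyword, so the memory footprint grows only with the lexicon size and fingerprint length rather than with model parameters, which is consistent with the compact footprint reported in the abstract.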
Keywords:
embedded edge computing; acoustic fingerprinting; feature reweighting; edit-distance-based classification; robust speech commands; resource-constrained devices