TECHNOLOGICAL BASIS OF “INDUSTRY 4.0”

On the acoustic unit choice for the keyword spotting problem

  • 1 Department of Information Technologies, Vilnius Gediminas Technical University, Vilnius, Lithuania

Abstract

In this paper we examine the results of using different acoustic units for building a keyword spotting system. The choice of acoustic unit greatly influences the quality of the resulting system given the dataset and the model complexity. Decomposing a keyword into simpler acoustic units requires prior knowledge. This knowledge may make the task easier (so the resulting accuracy will be higher), but even slightly incorrect priors can mislead the model, so the quality may drop significantly. We compare phonemes, syllables, words, and several synthetic acoustic units for the Russian language. We show that for modern keyword spotting systems phonemes are a robust and high-quality choice, especially in low-resource settings.
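The trade-off described in the abstract can be illustrated with a minimal sketch. The keyword, its syllable split, and its phoneme transcription below are invented for illustration (a real system would obtain them from a pronunciation lexicon or a grapheme-to-phoneme model); the point is only that the choice of unit inventory changes how many distinct acoustic classes the model must learn.

```python
# Hypothetical example: decomposing one keyword into different
# acoustic-unit inventories. The specific units are illustrative,
# not taken from the paper.

KEYWORD = "marusya"

# Three candidate decompositions of the same keyword.
DECOMPOSITIONS = {
    "word":      [KEYWORD],                            # one unit per keyword
    "syllables": ["ma", "ru", "sya"],                  # fewer, longer units
    "phonemes":  ["m", "a", "r", "u", "s", "j", "a"],  # many short units
}

def num_distinct_units(decomposition):
    """Each distinct unit needs its own acoustic model / output class."""
    return len(set(decomposition))

for name, units in DECOMPOSITIONS.items():
    print(f"{name:>9}: {units} -> {num_distinct_units(units)} distinct units")
```

Shorter units (phonemes) are shared across many words, so each unit is seen often in training data, which is one intuition for why they remain robust in low-resource settings; longer units (whole words) are seen only as often as the keyword itself occurs.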

