On the acoustic unit choice for the keyword spotting problem

    Industry 4.0, Vol. 6 (2021), Issue 1, pg(s) 7-9

    In this paper we examine the results of using different acoustic units for building a keyword spotting system. The choice of acoustic unit greatly influences the quality of the resulting system for a given dataset and model complexity. Decomposing a keyword into simple acoustic units requires prior knowledge. This knowledge can make the task easier and so raise the resulting accuracy, but even slightly incorrect priors can mislead the model and cause quality to drop significantly. We compare phonemes, syllables, words, and several synthetic acoustic units for the Russian language. We show that for modern keyword spotting systems phonemes are a robust, high-quality choice, especially in a low-resource setting.