End to End based Nepali Speech Recognition System
Keywords:
Nepali speech recognition, Automatic speech recognition, End to end speech recognition, Gated recurrent unit, Convolution neural network, CNN, GRU
Abstract
Today, technology is an indispensable part of life, and Automatic Speech Recognition (ASR) systems play an important role in making technology accessible. For the Nepali language, the lack of an adequate spoken corpus has limited research, and no well-performing ASR model is available. This paper presents an approach to constructing an end-to-end Nepali ASR system, along with the necessary data (spoken corpus) for the Nepali language. The system transcribes spoken Nepali into its textual representation. It is built using MFCC feature extraction, a CNN for spatial feature extraction, GRUs to construct the acoustic model, and CTC for decoding. The best model is obtained by tuning the batch size and varying the number of GRU units and GRU layers. Without a language model, this model yields a WER of 49.85%, 46.39%, and 52.89% on the train, validation, and test data, respectively. With a uni-gram language model, the final model yields a WER of 35.40%, 37.50%, and 39.72% on the train, validation, and test data, respectively.
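The results above are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed, assuming whitespace tokenization and the standard word-level Levenshtein distance (the function name `word_error_rate` and the sample sentences are illustrative and not taken from this paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; no text normalization
    is applied before comparison.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, a WER of 39.72% corresponds to roughly four word-level errors (substitutions, deletions, or insertions) per ten reference words. WER can exceed 100% when the hypothesis contains many inserted words.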