End to End based Nepali Speech Recognition System

Basanta Joshi; Bharat Bhatta; Ram Krishna Maharjan

Authors

Basanta Joshi Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Tribhuvan University, Nepal https://orcid.org/0000-0003-1648-7776
Bharat Bhatta Department of Electronics and Computer Engineering, Sagarmatha Engineering College, Institute of Engineering, Tribhuvan University, Nepal
Ram Krishna Maharjan Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Tribhuvan University, Nepal

Keywords:

Nepali speech recognition, Automatic speech recognition, End to end speech recognition, Gated recurrent unit, Convolution neural network, CNN, GRU

Abstract

Today, technology is an indispensable part of life, and Automatic Speech Recognition (ASR) systems play an important role in making it accessible. For the Nepali language, the lack of an adequate spoken corpus has limited research, and no well-performing ASR model is available. This paper presents an approach for constructing an end-to-end Nepali ASR system, along with the necessary data (a spoken corpus) for the Nepali language. The system converts spoken Nepali into its textual representation. It is built using MFCC feature extraction, a CNN for spatial feature extraction, GRUs to construct the acoustic model, and CTC for decoding. The best model is obtained by tuning the batch size and varying the number of GRU units and GRU layers. Without a language model, this model yields a WER of 49.85%, 46.39%, and 52.89% on the train, validation, and test data, respectively. With a uni-gram language model, the final model yields a WER of 35.40%, 37.50%, and 39.72% on the train, validation, and test data, respectively.
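For readers unfamiliar with the two evaluation-side terms in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the simplest form of CTC decoding (greedy best-path: merge repeated labels, then drop blanks) and the standard definition of WER as word-level edit distance divided by the number of reference words. The function names, the blank index of 0, and the whitespace tokenization are illustrative assumptions.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank=0):
    """Apply the CTC best-path rule to per-frame label indices.

    Consecutive repeated labels are merged first, then blank labels are
    removed. The blank index (0) is an assumption for this sketch.
    """
    return [label for label, _ in groupby(frame_labels) if label != blank]


def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Words are obtained by whitespace splitting, which is an assumption;
    a real system may normalize the text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that substitutes one word and drops another out of a four-word reference scores a WER of 0.5, so the reported figures of roughly 40% to 53% correspond to about two errors in every four to five reference words.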
