End to End based Nepali Speech Recognition System

Authors

  • Basanta Joshi, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Tribhuvan University, Nepal, https://orcid.org/0000-0003-1648-7776
  • Bharat Bhatta, Department of Electronics and Computer Engineering, Sagarmatha Engineering College, Institute of Engineering, Tribhuvan University, Nepal
  • Ram Krishna Maharjan, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Tribhuvan University, Nepal

Keywords:

Nepali speech recognition, Automatic speech recognition, End to end speech recognition, Gated recurrent unit, Convolutional neural network, CNN, GRU

Abstract

Today, technology is an indispensable part of life, and Automatic Speech Recognition (ASR) systems play an important role in making technology accessible to people. For the Nepali language, the lack of an adequate spoken corpus has limited research, and no well-performing ASR model is available. This paper presents an approach for constructing an end-to-end Nepali ASR system, together with the necessary data (a spoken corpus) for the Nepali language. The Nepali ASR system transcribes spoken Nepali into its textual representation. The system uses MFCC for feature extraction, a CNN for spatial feature extraction, GRUs to construct the acoustic model, and CTC for decoding. The best model was obtained by tuning the batch size and varying the number of GRU units and GRU layers. Without a language model, this model achieves a word error rate (WER) of 49.85%, 46.39%, and 52.89% on the train, validation, and test data, respectively. With a uni-gram language model, the final model achieves a WER of 35.40%, 37.50%, and 39.72% on the train, validation, and test data, respectively.
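
The WER figures above can be read against the standard definition of the metric: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognized text, divided by the number of reference words. The abstract does not say how the authors computed WER, so the following is only a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. The paper's own
    tokenization and scoring procedure are not given in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the paper's corpus.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 39.72% means roughly four word-level errors for every ten reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.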

Published

2023-04-12

How to Cite

[1]
B. Joshi, B. Bhatta, and R. K. Maharjan, “End to End based Nepali Speech Recognition System”, JIE, vol. 17, no. 1, pp. 102-109, Apr. 2023.