Speech Recognition By Using Recurrent Neural Networks
« on: August 20, 2011, 10:01:54 am »
Author : Dr.R.L.K.Venkateswarlu, Dr. R. Vasantha Kumari,  G.Vani JayaSri
International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011
ISSN 2229-5518

Abstract - Automatic speech recognition is the process by which a computer converts a speech signal into the corresponding sequence of characters in text. In real-life applications, however, speech recognizers are used in adverse environments, and recognition performance typically degrades when the training and testing environments differ. Speech recognition and understanding have been studied for many years. The aim of this study was to observe the differences among English alphabet characters from the E-set to the AH-set, that is, the differences among their phonemes. Neural networks are well known for their ability to classify nonlinear problems, and much research has applied them to speech recognition. Although positive results have been obtained from this continuing work, minimizing the error rate still attracts considerable attention. This research uses the Recurrent Neural Network (RNN), one of the neural network techniques, to observe the differences among alphabet characters from the E-set to the AH-set, and compares it with backpropagation through a Multilayer Perceptron, with the aim of improving understanding of phonemes and words. Six speakers (a mixture of male and female) were recorded in a quiet environment. The English language poses a number of challenges for speech recognition [4]. This paper shows that the performance of the Recurrent Neural Network is better than that of the Multilayer Perceptron neural network.
Keywords: Frames, Mel-frequency cepstral coefficient, Multi Layer Perceptron (MLP), Neural Networks, Performance, Recurrent Neural Network (RNN), Utterances.

Speech is humans' most efficient communication modality. Beyond efficiency, humans are comfortable and familiar with speech; other modalities require more concentration, restrict movement, and cause body strain due to unnatural positions. Research on English speech recognition, although lagging behind that on other languages, is becoming more intensive than before, and several studies have been published in the last few years [11]. Automatic speech recognition is the process by which a machine identifies speech. The conventional method of speech recognition consists in representing each word by its feature vector and pattern-matching it against the statistically available vectors using a neural network [3]. A promising technique for speech recognition is the neural network based approach. Artificial Neural Networks (ANNs) are biologically inspired tools for information processing [15]. Speech recognition modeling by artificial neural networks does not require a priori knowledge of the speech process, which quickly made this technique an attractive alternative to HMMs [19]. An RNN can learn the temporal relationships in speech data and is capable of modeling time-dependent phonemes [5].
Conventional neural networks of the Multilayer Perceptron (MLP) type have been used increasingly for speech recognition and for other speech processing applications. These networks work very well as effective classifiers for vowel sounds with stationary spectra, but their phoneme-discriminating power deteriorates considerably for consonants, which are characterized by variations in their short-term spectra. This may be attributable to the fact that feedforward multilayer neural networks are inherently unable to deal with time-varying information, such as the time-varying spectra of speech sounds. One way to cope with this problem is to incorporate a feedback structure into the networks to give them the ability to memorize incoming time-varying information. Incorporating feedback into feedforward networks yields so-called Recurrent Neural Networks (RNNs), which have feedback connections between units of different layers or self-loop connections [6]. Speech recognition is the process of converting an acoustic signal, captured by a microphone or telephone, into a set of characters. The recognized characters can be the final result, as in applications such as command and control, data entry, and document preparation; they can also serve as the input to further linguistic processing in order to achieve speech understanding [19]. Speech recognition performs its task in a manner similar to the human brain, starting from phonemes, then syllables, words, and finally sentences, which form the input to the speech recognition system [14].
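The feedback structure described above can be sketched as a minimal Elman-style recurrent layer. This is an illustrative sketch, not the paper's actual network: the layer sizes (13 inputs, 8 hidden units) and random weights are assumptions chosen for demonstration. The point is the feedback term `W_rec @ h`, which lets past frames influence the current output and which a plain MLP layer lacks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 13, 8  # e.g. 13 features per frame (illustrative sizes)

W_in = rng.standard_normal((n_hidden, n_in)) * 0.1       # input -> hidden
W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden -> hidden (feedback)

def rnn_forward(frames):
    """Run a sequence of feature frames through the recurrent layer."""
    h = np.zeros(n_hidden)
    outputs = []
    for x in frames:
        # The W_rec @ h feedback is what distinguishes an RNN from an MLP:
        # the hidden state carries information from earlier frames forward.
        h = np.tanh(W_in @ x + W_rec @ h)
        outputs.append(h)
    return np.stack(outputs)

frames = rng.standard_normal((5, n_in))  # 5 time frames of dummy features
H = rnn_forward(frames)
print(H.shape)  # (5, 8): one hidden vector per input frame
```

Training such a network with backpropagation through time, as the paper does, would add an output layer and unroll this loop over the utterance.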

The objective of speech recognition is to determine the sequence of sound units from the speech signal so that the linguistic message in the form of text can be decoded from the speech signal. The steps used in the present speech recognition system are discussed below.
2.1 Input Acquisition
After the speech is captured with a microphone, the speech data is saved in .wav files. The recorded signal is digitized using the Praat software tool and converted into a mono signal sampled at 11 kHz.
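The recording format described here (mono, 11 kHz, 8-bit .wav) can be reproduced and verified with Python's standard `wave` module. This is a hypothetical sketch: the filename and the synthetic silent signal are illustrative, not part of the paper's corpus.

```python
import wave

# Write a short synthetic mono, 11 025 Hz, 8-bit .wav file and read its
# header back, mirroring the recording format used in the paper.
path = "utterance.wav"  # illustrative filename
rate = 11025

with wave.open(path, "wb") as w:
    w.setnchannels(1)    # mono
    w.setsampwidth(1)    # 1 byte per sample = 8-bit audio
    w.setframerate(rate)
    # 150 ms of silence (128 is the midpoint for unsigned 8-bit samples)
    w.writeframes(bytes([128]) * int(0.15 * rate))

with wave.open(path, "rb") as w:
    info = (w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())

print(info)  # (1, 1, 11025, 1653)
```

Checking these header fields on each file is a cheap way to catch recordings that were accidentally saved in stereo or at a different rate before they enter the training pipeline.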
2.2 Front – End Analysis
The acoustic speech signal exists as pressure variations in the air. The microphone converts these pressure variations into an electric current related to the pressure, much as the ear converts them into a series of nerve impulses transmitted to the brain. The selection of features is very important for the speech recognition task: good features are required to achieve good recognition results. A basic problem in speech recognition is identifying the proper features for the task and devising a strategy to extract them from the speech signal.
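Two standard front-end steps that typically precede MFCC computation (the features named in the keywords) are pre-emphasis and framing. The sketch below uses conventional default values (a 0.97 pre-emphasis coefficient and a 25 ms / 10 ms frame geometry) that are assumptions, not values stated in the paper.

```python
import numpy as np

def preemphasize(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    out = np.empty_like(signal)
    out[0] = signal[0]
    out[1:] = signal[1:] - coeff * signal[:-1]
    return out

def frame_signal(signal, rate, frame_ms=25, step_ms=10):
    """Split the signal into overlapping fixed-length frames."""
    frame_len = int(rate * frame_ms / 1000)
    step = int(rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step : i * step + frame_len]
                     for i in range(n_frames)])

rate = 11025
# 1 second of a 440 Hz test tone stands in for a recorded utterance
signal = np.sin(2 * np.pi * 440 * np.arange(rate) / rate)
frames = frame_signal(preemphasize(signal), rate)
print(frames.shape)  # (98, 275): 98 frames of 275 samples each
```

Each frame would then be windowed and passed through the mel filterbank and DCT to yield the MFCC vectors that form the network's input.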
2.3 The Speech Utterance (Data Collection)
The source of data is a database consisting of 18 characters taken from 4 major sets, each spoken 10 times by 6 speakers: 3 males and 3 females of various ages. The four major sets are
E set: B C D E P T G V Z
A set: J K
EH set: M N F S
AH set: I Y R
The data, which is speaker dependent, is used for the training and testing phases. In the speaker-dependent setting, the first four utterances of each of the 18 characters spoken by every speaker are used to train the network, and the remaining utterances are used to test it. The speech database therefore contains 1080 utterances in total: 432 (18 characters x 4 utterances x 6 speakers) for training and the remaining 648 for testing. These characters are recorded as follows:
1. The Praat software, with a sampling rate of 11 kHz, 8-bit resolution, and mono recording, is used to record the utterances.
2. The same microphone is used to record the spoken characters in a closed room.
3. The files are saved in .wav format.
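The corpus arithmetic implied by this setup can be checked in a few lines; with 18 characters, 10 repetitions, and 6 speakers, reserving the first 4 repetitions of each character for training gives the split below.

```python
# Corpus arithmetic for the speaker-dependent split described in the text.
characters, repetitions, speakers = 18, 10, 6
train_reps = 4  # first four utterances of each character train the network

total = characters * repetitions * speakers
train = characters * train_reps * speakers
test = total - train

print(total, train, test)  # 1080 432 648
```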
2.4 Preprocessing
The speech signals are recorded in a low-noise environment with good-quality recording equipment and sampled at 11 kHz. Reasonable results can be achieved in isolated word recognition when the input data is surrounded by silence.
2.5 Sampling Rate
150 samples are chosen, with a sampling rate of 11 kHz, which is adequate to represent all speech sounds.
