International Journal of Scientific & Engineering Research, Volume 5, Issue 9, September-2014 34

ISSN 2229-5518

Large Vocabulary Isolated Word Recognition

Using Syllable, HMM and Normal Fit

Hemakumar G., Punitha P.

Abstract— This paper addresses the problem of large-vocabulary, speaker-dependent recognition of isolated Kannada words using syllables, the Hidden Markov Model (HMM) and the normal fit method. The experiment covers 5.5 million of the 10 million words in the Hampi text corpus. A 3-state Baum–Welch algorithm is used for training. The λ(A, B, π) parameters output for two consecutive models are combined and passed to the normal fit, and the resulting normal fit parameters are labelled as a syllable or sub-word. Our model is compared with the Gaussian Mixture Model (GMM) and with the HMM (3-state Baum–Welch algorithm) alone. The results show that applying the normal fit to the HMM reduces the memory needed to store the speech models while keeping a high recognition rate. The average word recognition rate (WRR) is 91.22% and the average word error rate (WER) is 8.78%. All computations are done in MATLAB.

Index Terms— ASR, Voice Detection, Speaker Dependent, Segmentation, LPC, Normal fit and Baum-Welch Algorithm.

—————————— ——————————

1 INTRODUCTION

Automatic speech recognition (ASR) is the process by which a computer maps an acoustic speech signal to text. The goal of speech recognition is to develop techniques and systems that enable computers to accept speech input and translate spoken words into text and commands. The problem has been actively studied since the 1950s, and it is natural to ask why one should continue to study it. Speech is the primary way for human beings to communicate, so it is only natural to use speech as the primary method of entering information into a computational device or any object that otherwise needs manual input. Speech recognition is a branch of human-centric computing, which aims to make technology as user-friendly as possible and to integrate it completely into human life by adapting it to humans' specifications. Currently, computers force humans to adapt to them, which is contrary to the spirit of human-centric computing. Speech recognition can help humans communicate easily with computers and reap the maximum benefit from them. Its performance has improved dramatically owing to recent advances in speech service and computer technology, with continually improving algorithms and faster computing.
A speech recognition system may be viewed as working in four stages: conversion of the analog speech signal into digital form (the normalization part), feature extraction, speech model building, and testing. Feature extraction is the problem of reducing the dimensionality of the input vector while maintaining the discriminating power of the signal. It follows from the fundamental formulation of speech recognition systems that the number of training and test vectors needed for the classification problem grows with the dimension of the input, so feature extraction techniques are needed. Many methods exist for extracting features from the speech signal, but Linear Predictive Coding (LPC) coefficients and Mel-Frequency Cepstral Coefficients (MFCC) remain the most commonly used [1][5][6].
The objective of a modelling technique is to generate speech models from speaker-specific feature vectors. Speech recognition is divided into two modes, speaker dependent and speaker independent. In the speaker-independent mode, the computer should ignore the speaker-specific characteristics of the speech signal and extract the intended message. In the speaker-dependent mode, on the other hand, the machine should extract the speaker characteristics from the acoustic signal. Many techniques exist for developing speech models, namely the acoustic-phonetic approach, the pattern recognition approach, template-based approaches, dynamic time warping, knowledge-based approaches, statistical approaches, learning-based approaches, the artificial intelligence approach and the stochastic approach [5][6][7].
This paper discusses large-vocabulary, speaker-dependent isolated Kannada word recognition using syllables, the HMM and the normal fit technique, and compares it with the HMM and the GMM in terms of the memory required to store the speech models and the accuracy of recognition.
The remainder of the paper is organized into four sections. Section 2 deals with the text corpus and speech database creation. Section 3 deals with the proposed model. Section 4 deals with the experimentation. Section 5 deals with the discussion and conclusion.

————————————————

• Hemakumar G., Research Scholar, Bharathiar University; Department of Computer Science, Government College for Women, Mandya. hemakumar7@yahoo.com

• Dr. Punitha, Professor, Department of MCA, PESIT, Bangalore.

2 TEXT CORPUS AND SPEECH DATABASE CREATION

A text corpus of 10 million words was obtained from Dr. K. Narayana Murthy, Professor, Department of Computer and Information Science, University of Hyderabad, Hyderabad, India, in 2011. The 10,000 most frequently occurring words were taken from this corpus. These 10,000 words occur 6 million times in the Hampi text corpus. Each of the 10,000 words was recorded at a sampling rate of 8 kHz, 16 bits per sample, mono channel, by one adult male speaker, three times for training, and was recorded once more for testing. The signals were recorded in a slightly noisy environment using GoldWave software and a mini microphone with a frequency response of 50–12,500 Hz.

3 PROPOSED METHOD

In this experiment we have designed an algorithm in five stages for speaker-dependent isolated Kannada word recognition. The proposed model works in offline mode: all speech signals are pre-recorded and stored in the speech database, and are then passed to our algorithm for training or for testing an unknown signal.
The first stage is pre-processing. The analog speech signal is sampled and quantized at a rate of 8,000 samples/s; $S(n)$ denotes the digitized value. The DC component is then removed from the digitized samples using $S(n) = S(n) - \mathrm{mean}(S)$. A first-order (high-pass) pre-emphasis network, $\hat{s}(n) = S(n) - \tilde{a}\,S(n-1)$, is used to compensate for the spectral fall-off of speech at higher frequencies; it approximates the inverse of the mouth transmission frequency response. Finally, the entire set of values is standardized to a standard amplitude. This step increases or decreases the amplitude of the speech signal using $S(n) = \hat{s}(n) / \max(|\hat{s}|)$. We use the constant $\tilde{a} = 0.9955$.
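The three pre-processing steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' MATLAB code; the function name `preprocess` is hypothetical, the first sample is passed through unchanged (assuming S(-1) = 0), and peak normalization by division is assumed for the standardization step.

```python
def preprocess(samples, alpha=0.9955):
    """DC removal, first-order pre-emphasis and peak normalization.

    `samples` is a list of digitized amplitudes S(n); `alpha` is the
    pre-emphasis constant (0.9955 in the text).
    """
    # Remove the DC component: S(n) = S(n) - mean(S)
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]

    # Pre-emphasis: s_hat(n) = S(n) - alpha * S(n - 1), with S(-1) taken as 0
    emphasized = [centered[0]] + [
        centered[n] - alpha * centered[n - 1] for n in range(1, len(centered))
    ]

    # Scale so that the largest absolute amplitude becomes 1
    peak = max(abs(v) for v in emphasized)
    if peak == 0:
        return emphasized  # silent signal, nothing to scale
    return [v / peak for v in emphasized]
```

After this step every signal has a peak absolute amplitude of 1, so frames from loud and quiet recordings are compared on the same scale.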
The second stage is the detection of the voiced and unvoiced parts of the speech signal, also called speech signal segmentation. For this purpose we have designed an algorithm, based on a dynamic threshold approach, that automatically segments the speech signal into sub-words or syllables [11]. It combines the short-time energy and the magnitude of the frame. A dynamic threshold is computed for each frame, and the frame is then checked for voiced content against that threshold. This is achieved by the following steps:
$$Thr_{STE} = \left( \frac{\sum_{i=1}^{n} STE_i}{n} - \left[ \min(STE) \times 0.5 \right] \right) + \min(STE) \tag{3.1}$$

$$Thr_{msf} = \left( \frac{\sum_{i=1}^{n} msf_i}{n} - \left[ \min(msf) \times 0.6 \right] \right) + \min(msf) \tag{3.2}$$

$$\text{if } (STE \ge Thr_{STE}) \text{ then marked as } Voiced_{STE} = 1 \tag{3.3}$$

$$\text{if } (msf > Thr_{msf}) \text{ then marked as } Voiced_{msf} = 1 \tag{3.4}$$

$$\text{if } (Voiced_{STE} \times Voiced_{msf} = 1) \text{ then that frame contains voice, otherwise it is an unvoiced frame}$$

where STE is the short-time energy, msf is the magnitude of the frame, and n is the number of samples in the frame. Fig. 1 shows the voiced part detected and segmented at the syllable, sub-word or word level.
Feature extraction is the third stage. The voiced part of the signal is selected and blocked into frames of N samples, with adjacent frames spaced M samples apart. Typical values of N and M correspond to frames of 20 ms duration with adjacent frames overlapping by 6.5 ms. A Hamming window of the same size as the frame is applied to each frame. Next, autocorrelation is applied to each windowed frame, and the LPC method is used to obtain the LPC coefficients, which are then converted into real cepstrum coefficients. The output data have size p × L, where p is the LPC order, which is constant, and L is the number of frames in the voiced segment, which varies. In our experiment the LPC order is p = 24.
The fourth stage is speech model building. At this stage the real cepstrum coefficients form a p × L matrix. This matrix is passed to the k-means algorithm with k = 3, and the output values are passed to the 3-state Baum–Welch algorithm, with which each syllable or sub-word is trained. In the Baum–Welch re-estimation procedure, the stochastic constraints of the HMM parameters,

$$\sum_{i=1}^{N} \bar{\pi}_i = 1 \tag{3.5}$$

$$\sum_{j=1}^{N} \bar{A}_{ij} = 1, \quad 1 \le i \le N \tag{3.6}$$

$$\sum_{k=1}^{M} \bar{B}_j(k) = 1, \quad 1 \le j \le N \tag{3.7}$$

are automatically incorporated at each iteration. The parameter estimation problem can be viewed as a constrained optimization of P(O | λ). Based on a standard Lagrange optimization setup using Lagrange multipliers, P is maximized by

$$\pi_i = \frac{\pi_i \, (\partial P / \partial \pi_i)}{\sum_{k=1}^{N} \pi_k \, (\partial P / \partial \pi_k)} \tag{3.8}$$

$$A_{ij} = \frac{A_{ij} \, (\partial P / \partial A_{ij})}{\sum_{k=1}^{N} A_{ik} \, (\partial P / \partial A_{ik})} \tag{3.9}$$

$$B_j(m) = \frac{B_j(m) \, (\partial P / \partial B_j(m))}{\sum_{l=1}^{M} B_j(l) \, (\partial P / \partial B_j(l))} \tag{3.10}$$
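To make the role of the stochastic constraints (3.5) to (3.7) concrete, the following is a minimal sketch of one Baum–Welch re-estimation pass. It rests on assumptions that are not in the paper: a discrete-observation HMM whose symbols are integer labels (for instance cluster indices), a single short training sequence, and no scaling of the forward and backward variables. The function names are illustrative, and the update is written in the usual forward-backward form rather than the gradient form of Eqs. (3.8) to (3.10).

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: probability of o_1..o_t and being in state i at time t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of o_{t+1}..o_T given state i at time t."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(A, B, pi, obs):
    """One re-estimation pass; returns new (A, B, pi) and P(O | old model)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    prob = sum(alpha[-1])
    # gamma[t][i]: posterior probability of state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(n)]
             for t in range(T)]
    new_pi = gamma[0][:]
    new_A = []
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([
            sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                for t in range(T - 1)) / (prob * visits)
            for j in range(n)])
    new_B = []
    for j in range(n):
        total = sum(gamma[t][j] for t in range(T))
        new_B.append([sum(gamma[t][j] for t in range(T) if obs[t] == k) / total
                      for k in range(m)])
    return new_A, new_B, new_pi, prob


# Hypothetical 3-state model and symbol sequence, for illustration only
A0 = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
B0 = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
PI0 = [0.5, 0.3, 0.2]
OBS = [0, 0, 1, 1, 2, 2, 1, 0, 2, 2]
```

Each returned row of A and B, and the vector pi, sums to 1 without any explicit normalization step, which is what "automatically incorporated at each iteration" refers to. Repeating the step does not decrease P(O | λ).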

Normal fit is applied to two consecutive HMM parameter sets λ(A, B, π), and the normal fit parameters are computed. Here the two consecutive trained λ(A, B, π) are treated as sample data. We thus have a sample $(x_1, \ldots, x_n)$, for which the normal parameters $N(\mu, \hat{\sigma}^2)$ are computed using

$$\mu = \bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{\sigma}^2 \sim \frac{\sigma^2}{n} \, \chi^2_{n-1} \tag{3.11}$$

TABLE 1

Memory required to store the speech models as word representatives, in kilobytes, for different vocabulary sizes.

The labelled $\mu$ and $\hat{\sigma}^2$ values are classified according to acoustic classes and then stored. These data serve as representatives of the syllables or sub-words in that particular class. For the language model, we have designed a bi-syllable language model for each word.
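The normal fit step can be illustrated with a short sketch. It assumes, beyond what the paper states, that the two consecutive parameter sets are simply flattened and concatenated into one sample, and that the variance is the maximum-likelihood estimate (division by n), which is the estimator whose distribution appears in Eq. (3.11). MATLAB's `normfit` reports the square root of the unbiased (n - 1) estimate instead, so `statistics.variance` may be the closer match to the authors' code. The helper names are hypothetical.

```python
from statistics import fmean, pvariance


def flatten_hmm(A, B, pi):
    """Concatenate all entries of lambda(A, B, pi) into one flat list."""
    return ([v for row in A for v in row]
            + [v for row in B for v in row]
            + list(pi))


def normal_fit(lambda_1, lambda_2):
    """Fit N(mu, sigma^2) to the pooled entries of two HMM parameter sets.

    Each argument is a tuple (A, B, pi). Returns (mu, sigma_squared),
    the two numbers stored as the syllable or sub-word representative.
    """
    sample = flatten_hmm(*lambda_1) + flatten_hmm(*lambda_2)
    mu = fmean(sample)
    return mu, pvariance(sample, mu)
```

Under these assumptions two full parameter sets collapse into two floating-point numbers per representative, which is consistent with the memory saving the paper reports.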
The fifth stage is recognition, that is, testing of an unknown signal. First, the HMM parameters of the unknown speech signal are computed and passed to the normal fit method. The resulting $\mu$ and $\hat{\sigma}^2$ values are then matched against the trained data within retained threshold values. The output syllables or sub-words are matched against the bi-syllable language model and concatenated to build the word. On this basis, the top-ranked candidate is taken as the recognized word.
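The matching and word-building logic of this stage could look roughly as follows. This is a speculative sketch, since the paper does not specify the distance measure, the threshold values or the data structure of the bi-syllable language model. Here the distance is an absolute difference on both parameters, `tol` is a hypothetical threshold, and the language model is stood in for by a lexicon that maps each word to its syllable sequence. All names and numbers are illustrative.

```python
def match_syllable(mu, var, models, tol):
    """Return the stored labels within `tol` of (mu, var), best match first.

    `models` maps a syllable label to its trained (mu, sigma_squared) pair.
    """
    scored = []
    for label, (m, v) in models.items():
        distance = abs(mu - m) + abs(var - v)
        if distance <= tol:
            scored.append((distance, label))
    return [label for _, label in sorted(scored)]


def recognise(segments, models, lexicon, tol=0.01):
    """Pick the word whose syllable sequence best matches the segments.

    `segments` is a list of (mu, sigma_squared) pairs, one per detected
    syllable; `lexicon` maps each word to its list of syllable labels.
    Returns None when no word is consistent with the candidates.
    """
    candidates = [match_syllable(mu, var, models, tol) for mu, var in segments]
    best = None
    for word, syllables in lexicon.items():
        if len(syllables) != len(candidates):
            continue
        ranks = []
        for syllable, ranked in zip(syllables, candidates):
            if syllable not in ranked:
                break
            ranks.append(ranked.index(syllable))
        else:
            score = sum(ranks)  # lower means closer to the top-ranked labels
            if best is None or score < best[0]:
                best = (score, word)
    return best[1] if best else None


# Hypothetical trained representatives and lexicon
MODELS = {"ka": (0.30, 0.020), "ma": (0.25, 0.030), "la": (0.35, 0.010)}
LEXICON = {"kamala": ["ka", "ma", "la"], "kalama": ["ka", "la", "ma"]}
```

A word is accepted only if every segment has the expected syllable among its within-threshold candidates; ties between words are broken by how highly the expected syllables were ranked.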

4 EXPERIMENTATION

In this paper, experiments on the recognition of isolated Kannada words are carried out using the HMM (3-state Baum–Welch algorithm alone) and the GMM, and both are compared with the proposed model on the same speech database. The programs are written in MATLAB and run on an Intel Core i5 processor at 2.67 GHz with 3 GB of RAM. Table 1 gives the memory required to store the speech models for different vocabulary sizes, in kilobytes; it shows that our model requires less memory to store the speech models than the other two. Table 2 shows the average accuracy rate for different vocabulary sizes.

5 DISCUSSION AND CONCLUSION

In this paper, an ASR model is designed by combining the HMM and the normal fit method, and is tested on the recognition of isolated Kannada words. Our ASR model is compared with the HMM (3-state Baum–Welch algorithm alone) and the GMM on the same speech database. Storing the model data as syllable or sub-word representatives requires more memory with the HMM and the GMM than storing the normal fit parameters. The normal fit method also shows a better accuracy rate than the other two methods (Table 2: 91.22% on average, against 81.59% for the HMM and 90.03% for the GMM). The experiment shows that an ASR model can be designed using the normal fit (normal parameter estimation), and that it takes less space with a good accuracy rate compared with the GMM and HMM models. Using our model, ASR can be designed for small, medium and large vocabularies.

TABLE 2

Average accuracy rate measured for different vocabulary sizes.

Vocabulary size    HMM + Normal fit    HMM       GMM
1000 words         92.98%              83.45%    91.90%
2000 words         92.13%              82.99%    91.54%
3000 words         92.02%              82.21%    91.00%
4000 words         92.73%              82.01%    90.78%
5000 words         92.69%              81.90%    90.12%
6000 words         91.11%              81.77%    90.01%
7000 words         90.30%              81.01%    89.05%
8000 words         89.44%              80.32%    88.75%
9000 words         89.42%              80.15%    88.66%
10000 words        89.39%              80.05%    88.45%
Average            91.22%              81.59%    90.03%
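The averages in Table 2, and the WER quoted in the abstract, can be checked directly from the per-vocabulary figures. The sketch below assumes that the average is the unweighted mean over the ten vocabulary sizes and that, for isolated words, WER is simply 100% minus WRR; the names are illustrative.

```python
# Accuracy (%) per vocabulary size, 1000 to 10000 words, copied from Table 2
TABLE_2 = {
    "HMM + Normal fit": [92.98, 92.13, 92.02, 92.73, 92.69,
                         91.11, 90.30, 89.44, 89.42, 89.39],
    "HMM": [83.45, 82.99, 82.21, 82.01, 81.90,
            81.77, 81.01, 80.32, 80.15, 80.05],
    "GMM": [91.90, 91.54, 91.00, 90.78, 90.12,
            90.01, 89.05, 88.75, 88.66, 88.45],
}


def average_wrr(values):
    """Unweighted mean word recognition rate, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


def word_error_rate(wrr_percent):
    """WER in percent for isolated words, taken as 100 - WRR."""
    return round(100.0 - wrr_percent, 2)
```

With these definitions the three column means come out as 91.22, 81.59 and 90.03, and the WER of the proposed model as 8.78, matching the reported values.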

ACKNOWLEDGMENT

The authors would like to thank Bharathiar University for the opportunity to pursue a part-time PhD degree. The authors also thank Prof. M.R. Nandan, former Principal, GCWM, and all our friends, the reviewers and the editorial staff for their help during the preparation of this paper.

REFERENCES

[1] Hemakumar G. and Punitha P. (2013), Speaker Independent Isolated Kannada Word Recognizer, in P. P. Swamy and D. S. Guru (eds.), Multimedia Processing, Communication and Computing Applications, Lecture Notes in Electrical Engineering 213, Springer India, pp. 333-345, DOI: 10.1007/978-81-322-1143-3_27.

[2] Bishnu Prasad Das and Ranjan Parekh (2012), Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers, International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, May-June 2012, pp. 854-858.

[3] Siva Prasad Nandyala and T. Kishore Kumar (2012), Real Time Isolated Word Recognition using Adaptive Algorithm, International Conference on Industrial and Intelligent Information (ICIII 2012), IPCSIT Vol. 31, IACSIT Press, Singapore.

[4] http://www.mathworks.in/help/stats/statset.html.

[5] Hemakumar G. and Punitha P. (2013), Speech Recognition Technology: A Survey on Indian Languages, International Journal of Information Science and Intelligent System, Vol. 2, No. 4, 2013, pp. 1-38.

[6] Santosh K. Gaikwad et al. (November 2010), A Review on Speech Recognition Technique, International Journal of Computer Applications (0975 – 8887), Volume 10, No. 3.

[7] Rabiner L. and Juang B.-H. (1993), Fundamentals of Speech Recognition, Pearson Education (Singapore) Private Limited, Indian Branch, 482 F.I.E. Patparganj, Delhi 110092, India.

[8] David Doria (2009), Expectation-Maximization: Application to Gaussian Mixture Model Parameter Estimation, lecture notes published on April 23.

[9] Lawrence R. Rabiner et al. (1979), Speaker-Independent Recognition of Isolated Words Using Clustering Techniques, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 4, August 1979, pp. 336-349.

[10] Carlo Tomasi, Estimating Gaussian Mixture Densities with EM – A Tutorial, Duke University.

[11] Dimo Dimov and Ivan Azmanov (2005), Experimental specifics of using HMM in isolated word speech recognition, International Conference on Computer Systems and Technologies – CompSysTech.

[12] Sukhminder Singh Grewal and Dinesh Kumar (2010), Isolated Word Recognition System for English Language, International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No. 2, pp. 447-450.

[13] A. Revathi et al. (2009), Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach, International Journal of Computer Science & Information Technology (IJCSIT), Vol. 1, No. 2, November 2009.

[14] Hemakumar G. and Punitha P. (2013), Automatic Segmentation of Kannada Speech Signal into Syllables and Sub-words: Noised and Noiseless signals, International Journal of Scientific & Engineering Research, Volume 5, Issue 1, January-2014, pp. 1707-1711.

IJSER © 2014 http://www.ijser.org