Abstract: This paper investigates the performance of a speech recognizer in an interactive voice response system on speech signals coded with a vector quantization technique, namely Multi Switched Split Vector Quantization (MSSVQ). Recognition of the coded output can be used in voice banking applications. The coded speech signals are recognized using the Hidden Markov Model (HMM) technique. The spectral distortion performance, computational complexity, and memory requirements of MSSVQ, together with the performance of the speech recognizer, have been computed at various bit rates. The results show that the speech recognizer performs better at 24 bits/frame and that the recognition percentage varies from 100% to 93.33% across the bit rates tested.
Keywords: Linear predictive coding, Speech Recognition, Voice banking, Multi Switched Split Vector Quantization, Hidden Markov Model, Linear Predictive Coefficients.
INTRODUCTION
This paper takes voice banking as its application and examines the performance of a speech recognizer in an interactive voice response system on coded output obtained with the Multi Switched Split Vector Quantization (MSSVQ) technique at various bit rates. MSSVQ has already been shown to have better spectral distortion performance, lower computational complexity, and lower memory requirements than other product code vector quantization techniques, so this paper uses MSSVQ as the vector quantization technique for coding.
Voice banking is a telephone banking service that keeps users in touch with their account information and other banking services 24 hours a day, 365 days a year, through a simple phone call. In voice banking, customers can speak their choices or use a touch-tone keypad to enter selections.
The speech techniques involved in voice banking are speech coding, speech enhancement, and speech recognition. This paper investigates the performance of a speech recognizer using the hidden Markov model (HMM) technique ([1],[2],[3]) on coded outputs obtained with a hybrid vector quantization technique, the Multi Switched Split Vector Quantization (MSSVQ) technique ([4],[5],[6],[7]). The speech parameters used for coding are the line spectral frequencies (LSF) ([8],[9],[10]), chosen to ensure filter stability. The codebooks used for coding are generated with the Linde-Buzo-Gray (LBG) algorithm [14]. Codebook generation is a tedious and time-consuming process that requires large amounts of memory for both generation and storage; the memory required increases with the number of training vectors, the number of samples per vector, and the number of bits used for codebook generation.
Recognition uses the hidden Markov model technique. HMM modeling brings together several statistical techniques: the transition probability matrix is estimated with the Baum-Welch algorithm ([1],[2]), while the emission matrix is generated with the K-means clustering algorithm and then re-estimated with the Baum-Welch algorithm. The Viterbi algorithm can also be used to estimate the transition and emission matrices. For a given observation sequence, the most likely state path is found with the Viterbi algorithm ([1],[2]), and the probability of a particular sequence is computed with the forward algorithm or the backward algorithm.
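The forward and Viterbi computations mentioned above can be sketched for a small discrete HMM. This is a minimal illustration, not the recognizer used in this paper: the function names, the two-state model, and all probability values are assumptions chosen only to make the example runnable.

```python
def forward_probability(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    n_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


def viterbi_path(obs, start, trans, emit):
    """Return the most likely state sequence and its joint probability."""
    n_states = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s
            best_prev = max(range(n_states), key=lambda r: delta[r] * trans[r][s])
            step.append(best_prev)
            new_delta.append(delta[best_prev] * trans[best_prev][s] * emit[s][o])
        backpointers.append(step)
        delta = new_delta
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state model over two observation symbols (illustrative values).
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

best_path, best_prob = viterbi_path([0, 0, 0], START, TRANS, EMIT)
sequence_prob = forward_probability([0, 0, 0], START, TRANS, EMIT)
```

In an isolated-word recognizer of this kind, one model would typically be trained per utterance, and the model with the highest sequence probability would be chosen as the recognized word.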
The aim of this article is to investigate the performance of the HMM-based speech recognizer on coded output obtained with the multi switched split vector quantization technique at different bit rates. The speech parameters that can be used for recognition are the linear predictive coefficients (LPC) and the Mel-frequency cepstral coefficients (MFCC). In this paper, LPC coefficients were used for recognition and line spectral frequencies were used for coding. Energy, delta, and acceleration coefficients would improve recognition performance, but they were not used here because they complicate the generation of codebooks during coding.
Speech coding and recognition
This paper targets the voice banking application, which requires speech coding, enhancement, and recognition. The enhancement technique used is spectral subtraction ([11],[12],[13]), the coding technique is Multi Switched Split Vector Quantization (MSSVQ), and the recognition technique is the hidden Markov model. The steps involved in speech coding and recognition for voice banking are:
Firstly, the silent part of the speech signal is removed using voice activity detection, and the channel noise in the speech signal is then removed using an enhancement technique.
Secondly, the speech signal is coded using the MSSVQ technique.
Thirdly, the coded output with added channel noise is enhanced using the spectral subtraction technique.
Next, the enhanced speech signal is passed to the voice bank recognizer, which recognizes the coded output.
Finally, the percentage of recognition is computed as a measure of the recognition accuracy.
Using these speech techniques, the recognition accuracy is found to vary from 100% to 93.33% for the coded outputs at different bit rates.
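The enhancement step can be illustrated with a minimal sketch of power spectral subtraction applied to the magnitude spectrum of one frame. The transform to and from the frequency domain is omitted, and the function name, the `over_subtraction` factor, and the `floor` value are assumptions for illustration; the paper does not state which variant or parameter values were used.

```python
import math


def spectral_subtract(noisy_mag, noise_mag, over_subtraction=1.0, floor=0.01):
    """Subtract an estimated noise power spectrum from a noisy power spectrum.

    noisy_mag and noise_mag hold per-bin magnitudes for one frame.
    Bins whose subtracted power falls below floor * noise power are clamped
    to that floor, which avoids negative power values.
    """
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        power = y * y - over_subtraction * n * n
        cleaned.append(math.sqrt(max(power, floor * n * n)))
    return cleaned


# Hypothetical magnitudes for two frequency bins (illustrative values only).
enhanced = spectral_subtract([5.0, 1.0], [3.0, 2.0])
```

The enhanced magnitudes would normally be recombined with the phase of the noisy signal before the inverse transform.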
MULTI SWITCHED SPLIT VECTOR QUANTIZATION
In MSSVQ, the generation of codebooks at the different stages for a particular switch is shown in Fig 1.
Initially, the codebook at the first stage is generated using the Linde, Buzo and Gray (LBG) algorithm [14], with the set of training vectors as input.
Secondly, the training difference vectors are computed from the input training vectors and the quantized training vectors of the first stage.
Finally, the training difference vectors are used to generate the codebook of the second stage.
This procedure is continued for the required number of stages; the number of codebooks generated equals the number of stages used for quantization.
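The stage-by-stage codebook generation described above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a basic splitting form of LBG with an additive perturbation `eps` and a fixed number of refinement passes, it ignores the switch and split structure of MSSVQ, and all function names and the toy training data are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the code vector with minimum squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg(training, bits, eps=0.01, iterations=20):
    """Grow a codebook of 2**bits vectors by repeated splitting and refinement."""
    codebook = [centroid(training)]
    for _ in range(bits):
        # Split every code vector into two slightly perturbed copies.
        codebook = [[c + sign * eps for c in cv]
                    for cv in codebook for sign in (1, -1)]
        for _ in range(iterations):
            cells = [[] for _ in codebook]
            for vec in training:
                cells[nearest(codebook, vec)].append(vec)
            # Keep the old code vector when a cell receives no training vectors.
            codebook = [centroid(cell) if cell else cv
                        for cell, cv in zip(cells, codebook)]
    return codebook


def stage_codebooks(training, bits_per_stage):
    """Generate one codebook per stage from the residuals of the previous stage."""
    codebooks, residuals = [], training
    for bits in bits_per_stage:
        cb = lbg(residuals, bits)
        codebooks.append(cb)
        # Training difference vectors: input minus its quantized version.
        residuals = [[a - b for a, b in zip(vec, cb[nearest(cb, vec)])]
                     for vec in residuals]
    return codebooks


# Hypothetical two-dimensional training set with two clusters.
TRAINING = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
            [10.0, 10.0], [10.2, 10.0], [10.0, 10.2]]
books = stage_codebooks(TRAINING, [1, 1])
```

Each refinement pass revisits every training vector, which is consistent with the remark that codebook generation is time-consuming and memory-hungry for large training sets.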
A $p \times m \times s$ MSSVQ is shown in Fig 2, where $p$ corresponds to the number of stages, $m$ corresponds to the number of switches, and $s$ corresponds to the number of splits.
Each input vector $x$ that is to be quantized (denoted $x_1$ at the first stage) is applied to the SSVQ at the first stage so as to obtain the approximate vectors at each codebook of the first stage.
Extract the approximate vector with minimum distortion from the set of approximate vectors at the first stage, i.e. $\hat{x}_1 = Q[x_1]$.
Compute the error vector resulting from the first stage of quantization, $e_1 = x_1 - Q[x_1]$.
The error vector at the first stage is given as an input to the second stage so as to obtain the quantized version of the error vector, $Q[e_1]$.
This process is continued for the required number of stages. Finally, the decoder takes the indices $I_i$ from each stage and adds the quantized vectors at each stage so as to obtain the reconstructed vector $\hat{x} = Q[x_1] + Q[e_1] + Q[e_2] + \dots$, where $Q[x_1]$ is the quantized input vector at the first stage, $Q[e_1]$ is the quantized error vector at the second stage, $Q[e_2]$ is the quantized error vector at the third stage, and so on. Because this process quantizes the error vectors and sums them with the approximate vector of the first stage, the spectral distortion performance can be greatly improved compared to SSVQ and SVQ.
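The encode and decode loop across stages can be sketched as below. This is a simplified illustration that keeps only the multistage (residual) part of MSSVQ: each stage is a plain full-search codebook rather than a switched split quantizer, and the function names and codebook values are assumptions.

```python
def nearest_index(codebook, vec):
    """Index of the code vector with minimum squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)))


def multistage_encode(x, codebooks):
    """Quantize x stage by stage, passing the error vector to the next stage."""
    indices, residual = [], list(x)
    for cb in codebooks:
        i = nearest_index(cb, residual)
        indices.append(i)
        # Error vector of this stage: input minus its quantized version.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def multistage_decode(indices, codebooks):
    """Sum the selected code vectors of every stage to rebuild the input."""
    recon = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        recon = [r + c for r, c in zip(recon, cb[i])]
    return recon


# Hypothetical two-stage codebooks for two-dimensional vectors.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],      # stage 1: coarse approximation
    [[0.1, 0.1], [-0.1, -0.1]],    # stage 2: quantized error vectors
]
indices = multistage_encode([0.9, 0.95], CODEBOOKS)
reconstructed = multistage_decode(indices, CODEBOOKS)
```

Only the indices need to be transmitted; the decoder holds the same codebooks and rebuilds the vector by summation.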
SPECTRAL DISTORTION
In order to objectively measure the distortion between a coded and an uncoded LPC parameter vector, the spectral distortion is often used in narrowband speech coding. For the $i$th frame, the spectral distortion $SD_i$ (in dB) [5] is defined as
$$SD_i = \sqrt{\frac{1}{f_2 - f_1} \int_{f_1}^{f_2} \left[ 10 \log_{10} P_i(f) - 10 \log_{10} \hat{P}_i(f) \right]^2 df}$$
where $P_i(f)$ and $\hat{P}_i(f)$ are the LPC power spectra of the uncoded and coded $i$th frame, respectively, $f$ is the frequency in Hz, and the frequency range is given by $f_1$ and $f_2$. $F_S$ is the sampling frequency, which bounds the usable range ($f_2 \le F_S/2$). The frequency range used in practice is 0-4000 Hz. The average spectral distortion $SD$ over $N$ frames is given by
$$SD = \frac{1}{N} \sum_{i=1}^{N} SD_i$$
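The per-frame spectral distortion, the root mean square difference between the log power spectra of the uncoded and coded frame over the band $f_1$ to $f_2$, can be approximated numerically as in the sketch below. Several points are assumptions made for illustration: the LPC power spectrum is taken as $1/|A(e^{j\omega})|^2$ with unit gain and $A(z) = 1 + \sum_k a_k z^{-k}$, the default sampling frequency is 8000 Hz, the integral is replaced by a midpoint sum over `points` frequencies, and the function names are hypothetical.

```python
import cmath
import math


def lpc_power_spectrum(a, f, fs):
    """LPC power spectrum 1 / |A(e^{jw})|^2 at frequency f, unit gain assumed.

    a holds the coefficients a_1..a_p of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    """
    w = 2 * math.pi * f / fs
    A = 1 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return 1.0 / abs(A) ** 2


def spectral_distortion(a_ref, a_coded, fs=8000.0, f1=0.0, f2=4000.0, points=512):
    """Approximate the spectral distortion (dB) between two LPC frames."""
    step = (f2 - f1) / points
    total = 0.0
    for n in range(points):
        f = f1 + (n + 0.5) * step  # midpoint of each frequency slice
        diff = (10 * math.log10(lpc_power_spectrum(a_ref, f, fs))
                - 10 * math.log10(lpc_power_spectrum(a_coded, f, fs)))
        total += diff * diff
    # Mean over the band, then square root, mirrors 1/(f2 - f1) * integral.
    return math.sqrt(total / points)


def average_spectral_distortion(ref_frames, coded_frames):
    """Mean of the per-frame spectral distortions."""
    values = [spectral_distortion(r, c) for r, c in zip(ref_frames, coded_frames)]
    return sum(values) / len(values)


# Hypothetical first-order LPC frames (illustrative coefficients only).
sd_example = spectral_distortion([-0.9], [-0.85])
```

Since the coder in this paper quantizes line spectral frequencies, the coded LSF vector would first have to be converted back to LPC coefficients before such a comparison.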
The conditions for transparent speech from narrowband LPC parameter quantization are:
The average spectral distortion (SD) must be less than or equal to 1 dB.
There must be no outlier frames having a spectral distortion greater than 4 dB.
The number of outlier frames with a spectral distortion between 2 and 4 dB must be less than 2%.
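The three transparency conditions can be checked over a list of per-frame spectral distortion values, as in the sketch below. The function name and the sample values are hypothetical, and treating the 2 dB and 4 dB boundaries as inclusive is an assumption, since the text does not say how boundary values are counted.

```python
def is_transparent(sd_per_frame):
    """Check the three transparency conditions on per-frame SD values (dB)."""
    n = len(sd_per_frame)
    average = sum(sd_per_frame) / n
    # Percentage of frames with 2 dB <= SD <= 4 dB (boundaries assumed inclusive).
    outliers_2_to_4 = 100.0 * sum(1 for sd in sd_per_frame if 2.0 <= sd <= 4.0) / n
    outliers_above_4 = sum(1 for sd in sd_per_frame if sd > 4.0)
    return average <= 1.0 and outliers_above_4 == 0 and outliers_2_to_4 < 2.0


# Hypothetical per-frame values: 99 frames at 0.8 dB and one outlier at 2.5 dB.
example_ok = is_transparent([0.8] * 99 + [2.5])
```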
RESULTS
Tables 1 to 4 give the probability of recognizing the utterance ONE at bit rates of 24, 23, 22, and 21 bits/frame. The tables show that the recognition accuracy varies from 100% to 93.33% across these bit rates and is good at 24 and 23 bits/frame. MSSVQ was chosen because it has better spectral distortion performance, lower computational complexity, and lower memory requirements than other product code vector quantization techniques, as can be observed from Tables 5 to 8. As a result, a product using MSSVQ will cost less and can have better marketability. The decrease in spectral distortion, complexity, and memory requirements for MSSVQ can also be observed from Figs 3 to 5. Spectral distortion is measured in decibels (dB), computational complexity in kflops/frame, and memory requirements in floats.
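The percentage of recognition is simply the share of test utterances that the recognizer labels correctly. The sketch below shows the arithmetic; the function name, the labels, and the count of 15 trials are purely hypothetical and are not taken from the tables of this paper.

```python
def recognition_percentage(expected, recognized):
    """Percentage of utterances whose recognized label matches the expected one."""
    correct = sum(1 for e, r in zip(expected, recognized) if e == r)
    return 100.0 * correct / len(expected)


# Hypothetical trial: 15 repetitions of one utterance, one of them misrecognized.
expected_labels = ["ONE"] * 15
recognized_labels = ["ONE"] * 14 + ["TWO"]
accuracy = recognition_percentage(expected_labels, recognized_labels)
```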
CONCLUSION
The speech recognizer using HMM performs well on the coded output obtained with MSSVQ: the percentage of recognition varies from 100% to 93.33% for different bit rates. Another advantage of MSSVQ is that it provides a better trade-off between bit rate and spectral distortion performance, computational complexity, and memory requirements than other product code vector quantization schemes such as Split vector quantization (SVQ), Multi stage vector quantization (MSVQ), and Switched Split vector quantization (SSVQ). MSSVQ is therefore shown to be the better LPC coding technique for the voice banking application. The performance can be improved further by increasing the number of training vectors and bits for codebook generation, by increasing the number of states of an utterance, by using an efficient algorithm for generating the emission matrix that takes the entire training set into account (unlike K-means clustering, which randomly picks vectors from the training set), and by using software with a greater degree of precision. With Matlab it is difficult to obtain a greater degree of precision when a large number of states is used for a particular utterance.
Acknowledgments
The authors place on record their grateful thanks to the authorities of Chalapathi Institute of Technology, Mothadaka, Guntur, A.P., India; R.V.R. & J.C. College of Engineering, Guntur, A.P., India; K L College of Engineering, Guntur, A.P., India; and Jawaharlal Nehru Technological University, College of Engineering, Hyderabad, India, for providing the facilities.
References
[1] Rabiner Lawrence, Juang Bing-Hwang, Fundamentals of Speech Recognition, Prentice Hall, New Jersey, 1993, ISBN 0-13-015157-2.
[2] Lawrence R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, Vol. 77, no. 2, Feb 1989, pp. 154-161.
[3] Rabiner L.R., Levinson S.E., Rosenberg A.E. & Wilpon J.G., Speaker independent recognition of isolated words using clustering techniques, IEEE Trans. Acoustics, Speech, Signal Proc., 1979, pp. 336-349.
[4] M. Satya Sai Ram, P. Siddaiah, & M. Madhavi Latha, Multi Switched Split Vector Quantization of Narrow Band Speech Signals, Proceedings World Academy of Science, Engineering and Technology, WASET, Vol. 27, Feb 2008, pp. 236-239.
[5] M. Satya Sai Ram, P. Siddaiah, & M. Madhavi Latha, Multi Switched Split Vector Quantizer, International Journal of Computer, Information, and Systems Science, and Engineering, IJCISSE, WASET, Vol. 2, no. 1, May 2008, pp. 1-6.
[6] Paliwal K.K., Atal B.S., Efficient vector quantization of LPC Parameters at 24 bits/frame, IEEE Trans. Speech Audio Process, 1993, pp. 3-14.
[7] Stephen So & Paliwal K.K., Efficient product code vector quantization using switched split vector quantizer, Digital Signal Processing journal, Elsevier, Vol. 17, Jan 2007, pp. 138-171.
[8] Bastiaan Kleijn W., Tom Backstrom, & Paavo Alku, On Line Spectral Frequencies, IEEE Signal Processing Letters, Vol. 10, no. 3, 2003.
[9] Soong F., Juang B., Line spectrum pair (LSP) and speech data compression, IEEE Conference on Acoustics, Speech, Signal Processing, Vol. 9, no. 1, Mar 1984, pp. 37-40.
[10] P. Kabal & P. Rama Chandran, The Computation of Line Spectral Frequencies Using Chebyshev Polynomials, IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 34, no. 6, 1986, pp. 1419-1426.
[11] P. Lockwood and J. Boudy, Experiments with a Nonlinear Spectral Subtraction (NSS), Hidden Markov Models and the Projection, for Robust Speech Recognition in Cars, Speech Communication, Vol. 11, 1992, pp. 215-228.
[12] S.F. Boll, Suppression of Acoustic Noise in Speech using Spectral Subtraction, IEEE Trans. on ASSP, Vol. 27(2), 1979, pp. 113-120.
[13] M. Berouti, R. Schwartz, and J. Makhoul, Enhancement of Speech Corrupted by Acoustic Noise, in Proc. ICASSP, 1979, pp. 208-211.
[14] Linde Y., Buzo A., & Gray R.M., An Algorithm for Vector Quantizer Design, IEEE Trans. Commun., 28, Jan. 1980, pp. 84-95.
M. Satya Sai Ram obtained his B.Tech degree in Electronics and Communication Engineering from Nagarjuna University, Guntur, in 2003. He received his M.Tech degree from Nagarjuna University, Guntur, in 2005. He started his career as a lecturer at R.V.R. & J.C. College of Engineering, Guntur, A.P., India, in 2005 and was promoted to Sr. Lecturer in 2007. At present M. Satya Sai Ram is working as an Associate Professor in the department of Electronics and Communication Engineering at Chalapathi Institute of Technology, Mothadaka, Guntur, A.P., India. He is actively involved in research and in guiding projects for postgraduate students in the area of Speech & Signal Processing. He has taught a wide variety of courses for UG students and guided several projects. He has published more than six papers in International Conferences and Journals.
Dr. P. Siddaiah obtained his B.Tech degree in Electronics and Communication Engineering from JNTU College of Engineering in 1988. He received his M.Tech degree from SV University, Tirupathi, and completed his PhD at JNTU, Hyderabad. He is the chief investigator for several outsourcing projects sponsored by Defense organizations and AICTE. He started his career as a lecturer at SV University in 1993. At present Dr. P. Siddaiah is working as Professor & HOD in the department of Electronics and Communication Engineering, KL College of Engineering, and is actively involved in research and in guiding students in the area of Antennas, Speech & Signal Processing. He has taught a wide variety of courses for UG & PG students and guided several projects. Several scholars are pursuing their PhD degrees under his guidance. He has published several papers in National and International Journals and Conferences. He is a life member of FIETE, IE, and MISTE.
M. Madhavi Latha received her B.Tech from NU in 1986, her M.Tech from JNTU in 1993, and her Ph.D from JNTU in 2002. She has been actively involved in research and in guiding students in the area of Signal & Image Processing, VLSI (Mixed Signal design), and hardware implementation of Speech CODECs. She has published more than 30 papers in National/International Conferences and Journals. Currently, she is working as Professor in ECE, JNTU College of Engineering, Hyderabad, Andhra Pradesh. She is a life member of FIETE, MISTE, MIEEE.