A Mutual Information Algorithm for Text-Independent Voice Conversion

International Journal of Engineering Trends and Technology (IJETT)
© 2015 by IJETT Journal
Volume 30, Number 8
Year of Publication: 2015
Authors: Seyed Mehdi Iranmanesh, Behnam Dehghan
DOI: 10.14445/22315381/IJETT-V30P275


Seyed Mehdi Iranmanesh, Behnam Dehghan, "A Mutual Information Algorithm for Text-Independent Voice Conversion", International Journal of Engineering Trends and Technology (IJETT), V30(8), 400-404, December 2015. ISSN: 2231-5381. www.ijettjournal.org. Published by Seventh Sense Research Group.

Most voice conversion systems require parallel corpora for their training stage, meaning that the source and target speakers must utter the same sentences. In many practical applications, however, parallel corpora cannot be obtained. Text-independent voice conversion was introduced to address this problem, and its main challenge is data alignment. In this paper we introduce a novel data-alignment algorithm based on mutual information that achieves results similar to those of text-dependent systems. The algorithm requires no phonetic labeling and is therefore suitable for practical applications.
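The paper itself gives no code, but the core idea of mutual-information-based frame alignment can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes frames are vectors of spectral coefficients (e.g. mel-cepstra), estimates mutual information between two frames with a simple joint histogram, and greedily pairs each source frame with the target frame of highest estimated mutual information. The function names and binning choice are illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based mutual information estimate (in nats) between two
    1-D sample vectors of equal length, e.g. the coefficients of two frames."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()          # joint probability estimate
    px = pxy.sum(axis=1)               # marginal of x
    py = pxy.sum(axis=0)               # marginal of y
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            if pxy[i, j] > 0:          # skip empty bins (0 * log 0 := 0)
                mi += pxy[i, j] * np.log(pxy[i, j] / (px[i] * py[j]))
    return mi

def align_frames(src, tgt, bins=8):
    """Greedily pair each source frame (row of src) with the target frame
    (row of tgt) that maximizes the mutual information estimate."""
    pairs = []
    for i, s in enumerate(src):
        scores = [mutual_information(s, t, bins) for t in tgt]
        pairs.append((i, int(np.argmax(scores))))
    return pairs
```

A real system would work on full utterances with many frames per speaker, and the histogram estimator is noisy for the short vectors typical of cepstral frames; it is shown here only to make the alignment criterion concrete.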



Keywords: text-independent voice conversion, mutual information, frame alignment, mel-cepstral frequency warping.