Personalized Learning of Academic Mandarin Pronunciation for Tibetan Learners, Integrating Mask-GCT Voice Cloning Technology
DOI: https://doi.org/10.54097/ew3r3e84

Keywords: Deep Neural Networks, Pronunciation Error Detection, Mandarin Pronunciation

Abstract
This study proposes a personalized Mandarin pronunciation learning framework for Tibetan learners that integrates Mask‑GCT voice‑cloning technology as a back‑end data‑augmentation module. By leveraging deep neural networks, the voice‑cloning component reconstructs key speaker characteristics—timbre, intonation, and prosody—from limited samples, generating high‑fidelity, individualized speech data. These synthetic samples not only alleviate labeled‑data scarcity but also introduce diverse pronunciation scenarios, particularly modeling the tonal, vowel, and consonant errors characteristic of Tibetan students’ Mandarin production. Adjustable cloning parameters enable simulation of multiple error patterns and subtle phonetic variations, thereby enriching training data and enhancing the model’s capacity to detect, adapt to, and correct a wide range of pronunciation deviations. Experimental results demonstrate that our approach significantly improves error recognition accuracy and model generalization compared to baseline systems lacking voice‑cloning augmentation. The flexible, controllable synthesis process provides empirical support for targeted pronunciation remediation, offering a scalable methodology for assisting Tibetan learners in mastering academic Mandarin pronunciation.
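To make the augmentation idea concrete, the sketch below shows one way a Mask-GCT-style cloning back end could be driven to produce labeled pronunciation-error variants for detector training. It is an illustrative outline, not the authors' released code: the error-pattern table, the inject_errors and augment helpers, and the commented synthesizer.clone(...) call are hypothetical stand-ins, since the paper does not specify MaskGCT's programming interface. The synthesis step is therefore left as a comment so the skeleton runs on its own.

    """
    Illustrative sketch (assumed interfaces, not the authors' code): using a
    Mask-GCT-style voice-cloning back end as a data-augmentation module.
    Canonical pinyin sequences are corrupted with Tibetan-L1 error patterns,
    and each corrupted sequence would be re-synthesized in the learner's own
    timbre from a few-shot reference recording.
    """
    from dataclasses import dataclass
    import random

    # Hypothetical error patterns typical of Tibetan learners' Mandarin,
    # expressed as pinyin-level substitutions (tone digits 1-4).
    ERROR_PATTERNS = {
        "tone_2_to_3": lambda syl: syl[:-1] + "3" if syl.endswith("2") else syl,
        "tone_4_to_1": lambda syl: syl[:-1] + "1" if syl.endswith("4") else syl,
        "zh_to_z":     lambda syl: "z" + syl[2:] if syl.startswith("zh") else syl,
    }

    @dataclass
    class AugmentedSample:
        pinyin: list        # canonical pronunciation
        corrupted: list     # pronunciation with injected errors
        error_tags: list    # which pattern (if any) was applied per syllable
        speaker_ref: str    # path to the few-shot reference recording

    def inject_errors(pinyin, p=0.3, rng=random):
        """Apply one randomly chosen error pattern per syllable with probability p."""
        corrupted, tags = [], []
        for syl in pinyin:
            if rng.random() < p:
                name, fn = rng.choice(list(ERROR_PATTERNS.items()))
                new = fn(syl)
                corrupted.append(new)
                tags.append(name if new != syl else "none")
            else:
                corrupted.append(syl)
                tags.append("none")
        return corrupted, tags

    def augment(pinyin, speaker_ref, n_variants=5):
        """Generate n labeled error variants for one utterance and one speaker."""
        samples = []
        for _ in range(n_variants):
            corrupted, tags = inject_errors(pinyin)
            samples.append(AugmentedSample(pinyin, corrupted, tags, speaker_ref))
            # In a full pipeline, each corrupted sequence would be passed to the
            # cloning model here, e.g. synthesizer.clone(corrupted, speaker_ref)
            # (hypothetical call), yielding audio in the learner's own voice.
        return samples

    if __name__ == "__main__":
        utt = ["zhong1", "wen2", "fa1", "yin1"]   # "中文发音"
        for s in augment(utt, "refs/learner_001.wav", n_variants=3):
            print(s.corrupted, s.error_tags)

Because the error injection is parameterized (pattern set, per-syllable probability), the same skeleton can emulate the "adjustable cloning parameters" described above: varying p and the pattern inventory yields training sets that range from mildly to heavily accented, with ground-truth error tags attached to every synthetic utterance.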
License
Copyright (c) 2025 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.