Personalized Learning of Academic Mandarin Pronunciation for Tibetan Learners, Integrating Mask-GCT Voice Cloning Technology
DOI: https://doi.org/10.54097/ew3r3e84

Keywords: Deep Neural Networks, Pronunciation Error Detection, Mandarin Pronunciation

Abstract
This study proposes a personalized Mandarin pronunciation learning framework for Tibetan learners that integrates Mask‑GCT voice‑cloning technology as a back‑end data‑augmentation module. By leveraging deep neural networks, the voice‑cloning component reconstructs key speaker characteristics—timbre, intonation, and prosody—from limited samples, generating high‑fidelity, individualized speech data. These synthetic samples not only alleviate labeled‑data scarcity but also introduce diverse pronunciation scenarios, particularly modeling the tonal, vowel, and consonant errors characteristic of Tibetan students’ Mandarin production. Adjustable cloning parameters enable simulation of multiple error patterns and subtle phonetic variations, thereby enriching training data and enhancing the model’s capacity to detect, adapt to, and correct a wide range of pronunciation deviations. Experimental results demonstrate that our approach significantly improves error recognition accuracy and model generalization compared to baseline systems lacking voice‑cloning augmentation. The flexible, controllable synthesis process provides empirical support for targeted pronunciation remediation, offering a scalable methodology for assisting Tibetan learners in mastering academic Mandarin pronunciation.
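To make the augmentation idea concrete, the sketch below shows one way a Mask-GCT-style cloning back end could be driven to produce labeled pronunciation-error variants for detector training. It is an illustrative outline, not the authors' released code: the error-pattern table, the inject_errors and augment helpers, and the commented synthesizer.clone(...) call are hypothetical stand-ins, since the paper does not specify MaskGCT's programming interface. The synthesis step is therefore left as a comment so the skeleton runs on its own.

    """
    Illustrative sketch (assumed interfaces, not the authors' code): using a
    Mask-GCT-style voice-cloning back end as a data-augmentation module.
    Canonical pinyin sequences are corrupted with Tibetan-L1 error patterns,
    and each corrupted sequence would be re-synthesized in the learner's own
    timbre from a few-shot reference recording.
    """
    from dataclasses import dataclass
    import random

    # Hypothetical error patterns typical of Tibetan learners' Mandarin,
    # expressed as pinyin-level substitutions (tone digits 1-4).
    ERROR_PATTERNS = {
        "tone_2_to_3": lambda syl: syl[:-1] + "3" if syl.endswith("2") else syl,
        "tone_4_to_1": lambda syl: syl[:-1] + "1" if syl.endswith("4") else syl,
        "zh_to_z":     lambda syl: "z" + syl[2:] if syl.startswith("zh") else syl,
    }

    @dataclass
    class AugmentedSample:
        pinyin: list        # canonical pronunciation
        corrupted: list     # pronunciation with injected errors
        error_tags: list    # which pattern (if any) was applied per syllable
        speaker_ref: str    # path to the few-shot reference recording

    def inject_errors(pinyin, p=0.3, rng=random):
        """Apply one randomly chosen error pattern per syllable with probability p."""
        corrupted, tags = [], []
        for syl in pinyin:
            if rng.random() < p:
                name, fn = rng.choice(list(ERROR_PATTERNS.items()))
                new = fn(syl)
                corrupted.append(new)
                tags.append(name if new != syl else "none")
            else:
                corrupted.append(syl)
                tags.append("none")
        return corrupted, tags

    def augment(pinyin, speaker_ref, n_variants=5):
        """Generate n labeled error variants for one utterance and one speaker."""
        samples = []
        for _ in range(n_variants):
            corrupted, tags = inject_errors(pinyin)
            samples.append(AugmentedSample(pinyin, corrupted, tags, speaker_ref))
            # In a full pipeline, each corrupted sequence would be passed to the
            # cloning model here, e.g. synthesizer.clone(corrupted, speaker_ref)
            # (hypothetical call), yielding audio in the learner's own voice.
        return samples

    if __name__ == "__main__":
        utt = ["zhong1", "wen2", "fa1", "yin1"]   # "中文发音"
        for s in augment(utt, "refs/learner_001.wav", n_variants=3):
            print(s.corrupted, s.error_tags)

Because the error injection is parameterized (pattern set, per-syllable probability), the same skeleton can emulate the "adjustable cloning parameters" described above: varying p and the pattern inventory yields training sets that range from mildly to heavily accented, with ground-truth error tags attached to every synthetic utterance.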
License
Copyright (c) 2025 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.