GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
Abstract
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that was never seen in the target language during training.
This task poses two challenges:
1) timbre and pronunciation are correlated since multilingual speech of a specific speaker is usually hard to obtain;
2) style and pronunciation are mixed because the speech style contains language-agnostic and language-specific parts.
To address these challenges, we propose GenerTTS, which makes two main contributions:
1) we carefully design a HuBERT-based information bottleneck to disentangle timbre and pronunciation/style;
2) we minimize the mutual information (MI) between style and language to discard the language-specific information in the style embedding modeled in the text encoder.
Experiments indicate that GenerTTS outperforms baseline systems in terms of style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.
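The abstract does not say which MI estimator is used, so the following is only a minimal sketch of how an MI penalty between style and language could be computed. It assumes a CLUB-style sampled upper bound and a hypothetical language classifier that maps each style embedding to language logits; the function names, shapes, and the choice of estimator are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis of an [N, L] array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def club_upper_bound(logits: np.ndarray, lang_ids: np.ndarray) -> float:
    """CLUB-style sampled upper bound on MI(style; language).

    logits   : [N, L] outputs of a hypothetical language classifier q(lang | style),
               one row per style embedding in the batch (assumption).
    lang_ids : [N] integer language label of each utterance.
    """
    log_q = log_softmax(logits)                   # log q(y | x_i) for every language y
    n = logits.shape[0]
    positive = log_q[np.arange(n), lang_ids]      # log q(y_i | x_i), matched pairs
    negative = log_q[:, lang_ids].mean(axis=1)    # mean over j of log q(y_j | x_i)
    return float((positive - negative).mean())


# Toy batch: two languages, four utterances.
lang_ids = np.array([0, 1, 0, 1])
uninformative = np.zeros((4, 2))                  # style reveals nothing about language
informative = 5.0 * np.eye(2)[lang_ids]           # style strongly predicts language

bound_uninformative = club_upper_bound(uninformative, lang_ids)
bound_informative = club_upper_bound(informative, lang_ids)
```

In a training loop of this kind, the classifier would typically be fit to predict language from the style embedding, while the style encoder is updated to push the bound toward zero, which is one way to strip language-specific cues from the style embedding.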
Figure 1: The overall pipeline of our system.
Figure 2: The overall architecture of the proposed system.
System Evaluation
Cross-lingual timbre and style text-to-speech (TTS) aims to synthesize speech with a specific reference timbre and style that were never seen in the target language during training.
Para: Parallel Tacotron.
M3: A multi-speaker, multi-style, multi-language speech synthesis baseline system.
Ours: The proposed system, which is based on HuBERT and a style adaptor.