GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

Abstract

Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that was never seen in the target language during training. It faces two challenges: 1) timbre and pronunciation are correlated, since multilingual speech from a single speaker is usually hard to obtain; 2) style and pronunciation are entangled, because speech style contains both language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which makes two main contributions: 1) we carefully design a HuBERT-based information bottleneck to disentangle timbre from pronunciation/style; 2) we minimize the mutual information (MI) between style and language to discard the language-specific information in the style embedding modeled in the text encoder. Experiments indicate that GenerTTS outperforms baseline systems in terms of style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.
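To make the quantity in the second contribution concrete, the following is a minimal sketch of mutual information for paired discrete labels. It is an illustration only and not the GenerTTS training objective: this page does not specify the estimator used on continuous style embeddings, and the function name `mutual_information`, the style cluster ids, and the language tags below are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical style cluster ids paired with language tags.
style_ids = [0, 0, 1, 1]
independent = mutual_information(style_ids, ["zh", "en", "zh", "en"])
dependent = mutual_information(style_ids, ["zh", "zh", "en", "en"])
```

In this toy setting, `independent` is zero because the style id says nothing about the language, while `dependent` equals `math.log(2)` because the style id fully determines the language. Driving the MI toward zero corresponds to the first case.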



Figure 1: The overall pipeline of our system.


Figure 2: The overall architecture of the proposed system.


System Evaluation

Cross-lingual timbre and style generalizable text-to-speech (TTS) synthesizes speech with a specific reference timbre and style that were never seen in the target language during training.

Target Timbre
Text
zh-CN: 社会你金姨罕见上课发火。
zh-CN: 我有一个小癖好,看完一部电影后,很喜欢到处找影评,看看人家是什么样的观点,是不是有什么全新的理解,有没有哪些我忽视的小细节。
en-US: The drinking started my freshman year of college.
en-US: Patrick, you're in charge of Ireland.
Target Style 1: Novel narrator style.
Para
M3
Ours
Target Style 2: Customer service style.
Para
M3
Ours
Target Style 3: Children's style.
Para
M3
Ours
Target Style 4: Gentle style.
Para
M3
Ours
Target Style 5: Dynamic style.
Para
M3
Ours

HuBERT Analysis

We perform voice conversion based on HuBERT representations for unseen speakers to verify that the HuBERT representation removes timbre while preserving style information.

Text
嗯,一般的话呢,就是因为像他们这个做的比较多,所以库存量呢就比较大。
只有你肯讲狠话敲打他。
Source Female/Male
Target Female/Male
F2F/M2F
F2M/M2M