Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Zaïdi, Julian; Seuté, Hugo; van Niekerk, Benjamin; Carbonneau, Marc-André

doi:10.21437/Interspeech.2022-10761

arXiv:2108.02271 (cs)

[Submitted on 4 Aug 2021 (v1), last revised 5 Apr 2022 (this version, v2)]

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state of the art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information into all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, as well as higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation and consistently generates speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
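As an illustration of the FiLM conditioning mentioned in the abstract, the following is a minimal sketch of the generic feature-wise linear modulation idea: a conditioning vector predicts a per-channel scale (gamma) and shift (beta) that are applied to intermediate features. It is not taken from the released Daft-Exprt code; the function names, shapes, and toy weights are all hypothetical, and the actual model may parameterize the modulation differently.

```python
# Sketch of generic FiLM (feature-wise linear modulation).
# Not the Daft-Exprt implementation: every name, shape, and weight
# below is an assumption made for illustration only.

def linear(weights, bias, vector):
    """Affine map producing one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def film(features, prosody_embedding, gamma_layer, beta_layer):
    """Scale and shift each feature channel using a conditioning vector.

    features: list of frames, each a list of channel values.
    gamma_layer, beta_layer: (weights, bias) pairs that predict the
    per-channel scale and shift from the prosody embedding.
    """
    gamma = linear(*gamma_layer, prosody_embedding)
    beta = linear(*beta_layer, prosody_embedding)
    # The same gamma and beta are applied to every frame.
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)]
            for frame in features]

# Toy usage: 2 frames, 3 channels, 2-dimensional prosody embedding.
features = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
prosody_embedding = [1.0, 2.0]
gamma_layer = ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
beta_layer = ([[0.0, 0.0], [1.0, 0.0], [0.0, -1.0]], [0.5, 0.0, 0.0])
modulated = film(features, prosody_embedding, gamma_layer, beta_layer)
```

In this toy setup the embedding changes only the scale and shift of each channel, which is why a single prosody representation can influence several layers without altering their structure.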

Comments:	Submitted to Interspeech 2022, 5 pages, 5 figures, 2 tables
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2108.02271 [cs.SD]
	(or arXiv:2108.02271v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2108.02271
Journal reference:	Proc. Interspeech (2022) 4591-4595
Related DOI:	https://doi.org/10.21437/Interspeech.2022-10761

Submission history

From: Julian Zaidi
[v1] Wed, 4 Aug 2021 20:13:00 UTC (299 KB)
[v2] Tue, 5 Apr 2022 18:06:21 UTC (316 KB)
