PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Ren, Yi; Liu, Jinglin; Zhao, Zhou

arXiv:2109.15166 (eess)

[Submitted on 30 Sep 2021 (v1), last revised 13 Feb 2022 (this version, v5)]

Abstract: Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture. 2) To further compress the model size and memory footprint, we introduce a grouped parameter sharing mechanism to the affine coupling layers in the post-net. 3) To improve the expressiveness of synthesized speech and reduce the dependency on accurate fine-grained alignment between text and speech, we propose a linguistic encoder with mixture alignment combining hard inter-word alignment and soft intra-word alignment, which explicitly extracts word-level semantic information. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective.
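
As a rough illustration of point 2) in the abstract, the Python sketch below shows one way parameters could be shared across groups of affine coupling steps in a flow. Everything in it is an assumption made for illustration: the grouping rule (consecutive steps share one parameter set), the tiny linear "network" that predicts scale and shift, and all function names are hypothetical and are not taken from the PortaSpeech paper or its source code.

```python
import math
import random


def make_params(dim, rng):
    """Hypothetical stand-in for a coupling network: two small linear maps."""
    half = dim // 2
    return {
        "w_s": [[rng.uniform(-0.1, 0.1) for _ in range(half)] for _ in range(half)],
        "w_t": [[rng.uniform(-0.1, 0.1) for _ in range(half)] for _ in range(half)],
    }


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def coupling_forward(x, p):
    """Affine coupling: transform one half conditioned on the other, then swap."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s = [math.tanh(v) for v in matvec(p["w_s"], xa)]  # bounded log-scale
    t = matvec(p["w_t"], xa)
    yb = [b * math.exp(s) + ti for b, s, ti in zip(xb, log_s, t)]
    return yb + xa  # swap halves so the next step transforms the other half


def coupling_inverse(y, p):
    """Exact inverse of coupling_forward."""
    half = len(y) // 2
    yb, xa = y[:half], y[half:]
    log_s = [math.tanh(v) for v in matvec(p["w_s"], xa)]
    t = matvec(p["w_t"], xa)
    xb = [(b - ti) * math.exp(-s) for b, s, ti in zip(yb, log_s, t)]
    return xa + xb


def build_flow(num_steps, num_groups, dim, seed=0):
    """Assumed grouping: consecutive steps in a group reuse one parameter set."""
    if dim % 2 != 0 or num_steps % num_groups != 0:
        raise ValueError("dim must be even and num_steps divisible by num_groups")
    rng = random.Random(seed)
    group_params = [make_params(dim, rng) for _ in range(num_groups)]
    steps_per_group = num_steps // num_groups
    return [group_params[k // steps_per_group] for k in range(num_steps)]


def count_unique_params(flow):
    """Count scalars, counting each shared parameter set only once."""
    unique = {id(p): p for p in flow}
    return sum(
        sum(len(row) for row in p["w_s"]) + sum(len(row) for row in p["w_t"])
        for p in unique.values()
    )


def flow_forward(x, flow):
    for p in flow:
        x = coupling_forward(x, p)
    return x


def flow_inverse(y, flow):
    for p in reversed(flow):
        y = coupling_inverse(y, p)
    return y
```

In this toy setup, a 12-step flow with 3 groups stores a quarter of the parameters of the same flow with one parameter set per step, and the flow stays exactly invertible because sharing changes only which parameters a step reads, not the coupling transform itself. The ratio here is a property of the toy configuration and says nothing about the compression figures reported in the abstract.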

Comments:	Accepted by NeurIPS 2021. Source code: this https URL
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2109.15166 [eess.AS]
	(or arXiv:2109.15166v5 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2109.15166

Submission history

From: Yi Ren
[v1] Thu, 30 Sep 2021 14:35:47 UTC (1,487 KB)
[v2] Mon, 8 Nov 2021 07:02:00 UTC (1,492 KB)
[v3] Fri, 21 Jan 2022 02:20:22 UTC (1,489 KB)
[v4] Sun, 30 Jan 2022 02:38:59 UTC (1,489 KB)
[v5] Sun, 13 Feb 2022 09:00:40 UTC (1,489 KB)
