Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Varadhan, Praveen Srinivasa; Gulati, Amogh; Sankar, Ashwin; Anand, Srija; Gupta, Anirudh; Mukherjee, Anirudh; Marepally, Shiva Kumar; Bhatia, Ankur; Jaju, Saloni; Bhooshan, Suvrat; Khapra, Mitesh M.

arXiv:2411.12719 (cs)

[Submitted on 19 Nov 2024]

Abstract: Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 471 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 47,100 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.
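As an illustration of what "variance across raters" could mean in practice, the following is a minimal Python sketch that summarises MUSHRA-style ratings per system. It is not the authors' analysis code and does not reflect the MANGO schema: the `(rater_id, system, score)` tuple layout, the function name `summarize_ratings`, and the choice of population variance over per-rater means are all assumptions made for this example. Scores are assumed to lie on the usual 0 to 100 MUSHRA scale.

```python
from collections import defaultdict
from statistics import mean, pvariance


def summarize_ratings(ratings):
    """Summarise MUSHRA-style ratings per system.

    `ratings` is an iterable of (rater_id, system, score) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    # system -> rater -> list of scores given by that rater
    by_system = defaultdict(lambda: defaultdict(list))
    for rater_id, system, score in ratings:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of MUSHRA range: {score}")
        by_system[system][rater_id].append(score)

    summary = {}
    for system, per_rater in by_system.items():
        # Average each rater's scores first, so a rater who scored
        # more utterances does not dominate the system-level numbers.
        rater_means = [mean(scores) for scores in per_rater.values()]
        summary[system] = {
            "mean": mean(rater_means),
            # Spread of the per-rater means: lower values suggest
            # that raters agree more closely about this system.
            "rater_variance": pvariance(rater_means),
        }
    return summary


if __name__ == "__main__":
    # Toy ratings, invented for this example only.
    toy = [
        ("r1", "A", 80), ("r1", "A", 90), ("r2", "A", 75),
        ("r1", "B", 60), ("r2", "B", 60),
    ]
    for system, stats in summarize_ratings(toy).items():
        print(system, stats)
```

Under these assumptions, comparing `rater_variance` for the same systems under two test protocols gives a rough sense of which protocol leaves raters less room for disagreement.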

Comments:	19 pages, 12 Figures
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2411.12719 [cs.CL]
	(or arXiv:2411.12719v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.12719

Submission history

From: Praveen S V
[v1] Tue, 19 Nov 2024 18:37:45 UTC (5,032 KB)
