USEV: Universal Speaker Extraction with Visual Cue

Pan, Zexu; Ge, Meng; Li, Haizhou

arXiv:2109.14831 (eess)

[Submitted on 30 Sep 2021 (v1), last revised 30 Aug 2022 (this version, v2)]

Abstract: A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies focus mostly on speaker extraction from highly overlapped multi-talker speech mixtures. However, in natural speech communication the overlap ratio between the target and interfering speakers can vary over a wide range from 0% to 100%, and the target speaker may even be absent from the mixture. Speech mixtures in such universal multi-talker scenarios are described as general speech mixtures. A speaker extraction algorithm requires an auxiliary reference, such as a video recording or pre-recorded speech, to form top-down auditory attention on the target speaker. We argue that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, as the auxiliary reference for disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker extraction network with a visual cue that works for all multi-talker scenarios. In addition, we propose a scenario-aware differentiated loss function for network training, to balance the network performance over different target-interference speaker pairing scenarios. Experimental results show that the proposed method outperforms various competitive baselines on general speech mixtures in terms of signal fidelity.

Comments:	Accepted by TASLP
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2109.14831 [eess.AS]
	(or arXiv:2109.14831v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2109.14831

Submission history

From: Zexu Pan [view email]
[v1] Thu, 30 Sep 2021 03:37:10 UTC (606 KB)
[v2] Tue, 30 Aug 2022 18:41:57 UTC (812 KB)