WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

Li, Feng; Luo, Jiusong; Xia, Wanjun

arXiv:2412.05558 (cs)

[Submitted on 7 Dec 2024]

Abstract: Speech emotion recognition (SER) remains a challenging yet crucial task due to the inherent complexity and diversity of human emotions. To address this problem, researchers have attempted to fuse information from other modalities via multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of cross-modal interactions, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal speech emotion recognition framework that addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning. WavFusion leverages a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning. Our work highlights the importance of capturing nuanced cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results on two benchmark datasets, IEMOCAP and MELD, show that WavFusion outperforms existing state-of-the-art methods on emotion recognition.

Comments:	Accepted by 31st International Conference on MultiMedia Modeling (MMM2025)
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.05558 [cs.SD]
	(or arXiv:2412.05558v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2412.05558

Submission history

From: Feng Li
[v1] Sat, 7 Dec 2024 06:43:39 UTC (132 KB)
