Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Padi, Sarala; Sadjadi, Seyed Omid; Manocha, Dinesh; Sriram, Ram D.

arXiv:2108.02510 (cs)

[Submitted on 5 Aug 2021 (v1), last revised 16 Aug 2021 (this version, v4)]

Abstract: Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model, including a statistics pooling layer, from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation, which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique that generates additional training data samples by applying random time-frequency masks to log-mel spectrograms, in order to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches each improve SER performance and, when combined, achieve state-of-the-art results.
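The two mechanisms named in the abstract, random time-frequency masking of a log-mel spectrogram and statistics pooling over a variable number of frames, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the mask-width limits, the choice to fill masked cells with the spectrogram mean, and the use of mean plus standard deviation as the pooled statistics are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=random):
    """Return a copy of `spec` with one random frequency band and one random
    time band overwritten.

    `spec` is a list of n_mels rows, each a list of n_frames log-mel values.
    Widths and the fill value (global mean) are illustrative assumptions.
    """
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    fill = sum(map(sum, spec)) / (n_mels * n_frames)

    # Draw a band width, then a start index that keeps the band in range.
    f = rng.randint(0, min(max_freq_width, n_mels))
    f0 = rng.randint(0, n_mels - f)
    t = rng.randint(0, min(max_time_width, n_frames))
    t0 = rng.randint(0, n_frames - t)

    for m in range(n_mels):
        for n in range(n_frames):
            if f0 <= m < f0 + f or t0 <= n < t0 + t:
                out[m][n] = fill
    return out


def statistics_pooling(frames):
    """Collapse a variable-length sequence of frame vectors into one
    fixed-length vector: per-dimension means followed by per-dimension
    (population) standard deviations.
    """
    n = len(frames)
    dims = list(zip(*frames))  # transpose to one tuple per feature dimension
    means = [sum(d) / n for d in dims]
    stds = [
        math.sqrt(sum((x - mu) ** 2 for x in d) / n)
        for d, mu in zip(dims, means)
    ]
    return means + stds
```

Because `statistics_pooling` always returns a vector of length twice the feature dimension, utterances of 50 frames and 500 frames yield inputs of the same size to later layers, which is why no truncation is needed. Calling `mask_spectrogram` several times on the same utterance with different random draws produces distinct training samples.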

Comments:	Accepted at ACM/SIGCHI ICMI'21
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2108.02510 [cs.SD]
	(or arXiv:2108.02510v4 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2108.02510

Submission history

From: Omid Sadjadi
[v1] Thu, 5 Aug 2021 10:39:39 UTC (1,189 KB)
[v2] Sun, 8 Aug 2021 19:53:52 UTC (1,148 KB)
[v3] Wed, 11 Aug 2021 14:12:36 UTC (725 KB)
[v4] Mon, 16 Aug 2021 14:47:00 UTC (725 KB)