MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Liu, Shansong; Hussain, Atin Sakkeer; Wu, Qilong; Sun, Chenshuo; Shan, Ying

arXiv:2412.06660 (cs)

[Submitted on 9 Dec 2024]

Abstract: Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.

Subjects:	Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.06660 [cs.SD]
	(or arXiv:2412.06660v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2412.06660

Submission history

From: Qilong Wu
[v1] Mon, 9 Dec 2024 16:59:35 UTC (36,195 KB)
