X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Li, Yehao; Pan, Yingwei; Chen, Jingwen; Yao, Ting; Mei, Tao

arXiv:2108.08217 (cs)

[Submitted on 18 Aug 2021]

Abstract: With the rise of deep learning over the past decade, a steady stream of innovations and breakthroughs has pushed the state of the art in cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching between them. This design naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, with the aim of facilitating rapid development in the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects, and pre-trained models are available online: this https URL.
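
The staged, switchable design described in the abstract can be illustrated with a minimal sketch. This is not X-modaler's actual API: the registry, the `register` decorator, `build_pipeline`, the module names, and the toy sample data are all hypothetical, and the arithmetic stands in for real neural network layers. It only shows the general idea of selecting one interchangeable module per stage from a configuration.

```python
from typing import Any, Callable, Dict

Sample = Dict[str, Any]
Module = Callable[[Sample], Sample]

# Stage names follow the abstract; everything else here is a hypothetical illustration.
STAGES = ("preprocess", "encoder", "interaction", "decoder", "decode_strategy")
REGISTRY: Dict[str, Dict[str, Module]] = {stage: {} for stage in STAGES}


def register(stage: str, name: str) -> Callable[[Module], Module]:
    """Register a module under a stage so a config can select it by name."""
    def wrap(fn: Module) -> Module:
        REGISTRY[stage][name] = fn
        return fn
    return wrap


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


@register("preprocess", "identity")
def identity(s: Sample) -> Sample:
    return dict(s)


@register("encoder", "mean_pool")
def mean_pool(s: Sample) -> Sample:
    # Average the region features column by column.
    cols = list(zip(*s["region_feats"]))
    return {**s, "visual": [sum(c) / len(c) for c in cols]}


@register("encoder", "max_pool")
def max_pool(s: Sample) -> Sample:
    return {**s, "visual": [max(c) for c in zip(*s["region_feats"])]}


@register("interaction", "elementwise_gate")
def elementwise_gate(s: Sample) -> Sample:
    # Toy cross-modal interaction: gate visual features with a text query vector.
    return {**s, "fused": [v * q for v, q in zip(s["visual"], s["query"])]}


@register("decoder", "dot_logits")
def dot_logits(s: Sample) -> Sample:
    # Score every vocabulary word against the fused representation.
    return {**s, "logits": {w: dot(s["fused"], e) for w, e in s["word_embs"].items()}}


@register("decode_strategy", "greedy")
def greedy(s: Sample) -> Sample:
    return {**s, "output": max(s["logits"], key=s["logits"].get)}


@register("decode_strategy", "top2")
def top2(s: Sample) -> Sample:
    ranked = sorted(s["logits"], key=s["logits"].get, reverse=True)
    return {**s, "output": ranked[:2]}


def build_pipeline(config: Dict[str, str]) -> Module:
    """Compose one registered module per stage, in stage order."""
    modules = [REGISTRY[stage][config[stage]] for stage in STAGES]

    def run(sample: Sample) -> Sample:
        for module in modules:
            sample = module(sample)
        return sample

    return run


# Toy inputs (made up for this sketch).
SAMPLE = {
    "region_feats": [[1.0, 3.0], [3.0, 5.0]],
    "query": [1.0, 1.0],
    "word_embs": {"a": [0.5, 0.5], "dog": [0.0, 1.0], "cat": [1.0, 0.0]},
}
CONFIG = {
    "preprocess": "identity",
    "encoder": "mean_pool",
    "interaction": "elementwise_gate",
    "decoder": "dot_logits",
    "decode_strategy": "greedy",
}
```

With this configuration, `build_pipeline(CONFIG)(SAMPLE)["output"]` yields `"dog"`; changing only the `decode_strategy` entry to `"top2"` yields `["dog", "a"]` without touching any other stage, which is the kind of switching the abstract describes.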

Comments: Accepted by 2021 ACMMM Open Source Software Competition. Source code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as: arXiv:2108.08217 [cs.CV] (or arXiv:2108.08217v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2108.08217

Submission history

From: Ting Yao
[v1] Wed, 18 Aug 2021 16:05:30 UTC (1,115 KB)
