MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Chen, Jiawei; Ho, Chiu Man

arXiv:2108.09322 (cs)

[Submitted on 20 Aug 2021 (v1), last revised 12 Nov 2021 (this version, v2)]

Abstract: This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes that rely solely on decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs on par with or better than state-of-the-art CNN counterparts that rely on computationally heavy optical flow.
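
To give a rough sense of why factorizing self-attention across space, time and modality helps with large token counts, the sketch below counts query-key pairs under joint attention versus attention applied along one axis at a time. It is a back-of-the-envelope illustration, not the authors' implementation: the function names and the sizes (4 modalities, 8 frames, 196 patches) are hypothetical, and it assumes every modality contributes the same token grid, which is a simplification (an audio waveform, for instance, would likely be tokenized differently).

```python
def joint_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when every token attends to every other token."""
    n = num_modalities * num_frames * num_patches
    return n * n


def factorized_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when attention is applied along one axis at a time.

    Each token attends to tokens sharing its frame and modality (space),
    then to tokens sharing its patch position and modality (time), then to
    tokens sharing its frame and patch position (modality).
    """
    n = num_modalities * num_frames * num_patches
    return n * (num_patches + num_frames + num_modalities)


# Hypothetical sizes chosen for illustration; not taken from the paper.
M, T, S = 4, 8, 196
joint = joint_attention_pairs(M, T, S)
factorized = factorized_attention_pairs(M, T, S)
print(f"joint: {joint}, factorized: {factorized}, ratio: {joint / factorized:.1f}")
```

Under these assumptions the joint count grows quadratically in the total number of tokens, while the factorized count grows with the sum of the three axis lengths per token, which is the general motivation for axis-wise attention in video transformers.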

Comments:	Winter Conference on Applications of Computer Vision (WACV) 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2108.09322 [cs.CV]
	(or arXiv:2108.09322v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.09322

Submission history

From: Jiawei Chen
[v1] Fri, 20 Aug 2021 18:05:39 UTC (1,638 KB)
[v2] Fri, 12 Nov 2021 23:40:37 UTC (1,638 KB)
