Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding

Zhao, Heng; Zhou, Joey Tianyi; Ong, Yew-Soon

arXiv:2108.00205 (cs)

[Submitted on 31 Jul 2021]

Abstract: Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features. Such a formulation does not treat each word of the query on par when modeling language-to-vision attention, and is therefore prone to neglecting words that matter less for the sentence embedding but are critical for visual grounding. In this paper we propose Word2Pix, a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns the correspondence between textual and visual features via word-to-pixel attention. The embedding of each word in the query sentence attends to visual pixels individually, rather than through a single holistic sentence embedding. In this way, each word is given an equal opportunity to adjust the language-to-vision attention towards the referent target through multiple stacked transformer decoder layers. We conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, where the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely end-to-end training and real-time inference speed, intact.
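
The word-to-pixel attention described in the abstract can be illustrated with a minimal sketch. It assumes standard single-head scaled dot-product cross attention, with each word embedding acting as a query and each flattened pixel feature acting as both key and value. The function names, the toy inputs, and the omission of learned projections are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_embeddings, pixel_features):
    """Single-head scaled dot-product cross attention (hypothetical helper).

    Each word embedding is a query; every pixel feature serves as both key
    and value. Learned query/key/value projections are omitted for brevity,
    which is an assumption of this sketch.
    Returns (attended, weights): one attended vector and one attention map
    over pixels per word.
    """
    dim = len(pixel_features[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for word in word_embeddings:
        # Similarity of this word to every pixel.
        scores = [scale * sum(w * p for w, p in zip(word, pixel))
                  for pixel in pixel_features]
        attn = softmax(scores)
        weights.append(attn)
        # Attention-weighted sum of pixel features for this word.
        attended.append([
            sum(a * pixel[d] for a, pixel in zip(attn, pixel_features))
            for d in range(dim)
        ])
    return attended, weights


# Toy input: 2 words, a 2x2 feature map flattened to 4 pixels, dimension 3.
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pixels = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [1.0, 1.0, 1.0]]
attended, weights = word_to_pixel_attention(words, pixels)
```

Because every word keeps its own attention map over the pixels, a word that would contribute little to a pooled sentence embedding can still steer attention toward the referent, which is the contrast with holistic sentence embeddings that the abstract draws.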

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2108.00205 [cs.CV]
	(or arXiv:2108.00205v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.00205

Submission history

From: Heng Zhao
[v1] Sat, 31 Jul 2021 10:20:15 UTC (12,089 KB)
