LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Liu, Zhijian; Stent, Simon; Li, Jie; Gideon, John; Han, Song

arXiv:2108.11950 (cs)

[Submitted on 26 Aug 2021]

Abstract: Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex, which takes advantage of low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art "vision+language" pre-training approach on the task of PASCAL VOC image classification.
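The two training signals named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the loss formulas, so the symmetric InfoNCE form, the temperature value, the cross-entropy form of the localization term, and the names `contrastive_loss` and `localization_loss` are all assumptions made for illustration. Embeddings and attention maps are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, caption_embs, temperature=0.1):
    """Assumed symmetric InfoNCE loss over a batch of matched pairs.

    Row k of image_embs is taken to match row k of caption_embs; every
    other caption in the batch acts as a negative, and vice versa.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    caps = [l2_normalize(v) for v in caption_embs]
    n = len(imgs)
    # Cosine similarity between every image and every caption.
    logits = [
        [sum(a * b for a, b in zip(i, c)) / temperature for c in caps]
        for i in imgs
    ]
    # Image-to-caption direction: the correct caption is on the diagonal.
    i2c = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Caption-to-image direction: same logits, read column-wise.
    cols = [[logits[r][k] for r in range(n)] for k in range(n)]
    c2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2c + c2i)


def localization_loss(attention, trace_mask, eps=1e-8):
    """Assumed cross-entropy between a rendered trace and an attention map.

    attention:  flattened attention over image locations, summing to 1.
    trace_mask: flattened rendered mouse trace (non-negative values).
    """
    total = sum(trace_mask)
    if total == 0:
        return 0.0  # no trace for this caption segment, so no penalty
    target = [t / total for t in trace_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))
```

Under these assumptions, a batch whose captions line up with their images yields a much smaller contrastive loss than the same batch with captions swapped, and an attention map concentrated where the mouse trace was drawn yields a smaller localization loss than a uniform one. How the two terms are weighted against each other is not stated in the abstract.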

Comments:	ICCV 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2108.11950 [cs.CV]
	(or arXiv:2108.11950v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.11950

Submission history

From: Zhijian Liu
[v1] Thu, 26 Aug 2021 17:59:07 UTC (8,387 KB)
