Learning to Prompt for Vision-Language Models

Zhou, Kaiyang; Yang, Jingkang; Loy, Chen Change; Liu, Ziwei

doi:10.1007/s11263-022-01653-1

arXiv:2109.01134 (cs)

[Submitted on 2 Sep 2021 (v1), last revised 6 Oct 2022 (this version, v6)]

Abstract: Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming -- one needs to spend a significant amount of time tuning the wording, since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
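
The difference between the unified and class-specific variants described in the abstract may be easier to see in code. The following is a minimal sketch in pure Python, not the authors' implementation. It covers only the forward pass: context vectors are prepended to a class token embedding, the prompt goes through a frozen text encoder, and class probabilities come from a softmax over cosine similarities with the image feature. The mean-pooling `toy_text_encoder`, the sizes `EMBED_DIM` and `N_CTX`, the Gaussian initialization, the temperature of 0.01, and the class names are all assumptions chosen for illustration. Training, which would update only the context vectors while every other parameter stays fixed, is omitted.

```python
import math
import random

EMBED_DIM = 8   # toy embedding width (assumption), far smaller than a real model's
N_CTX = 4       # number of learnable context vectors per prompt (assumption)


def random_vectors(rng, count, dim, std):
    """Draw `count` vectors of length `dim` from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(count)]


def toy_text_encoder(token_vectors):
    """Hypothetical frozen text encoder: mean-pools the token vectors."""
    n = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / n for d in range(EMBED_DIM)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unified_prompts(context, class_tokens):
    """Unified context: one shared context precedes every class token."""
    return [context + [tok] for tok in class_tokens]


def class_specific_prompts(contexts, class_tokens):
    """Class-specific context: each class token gets its own context."""
    return [ctx + [tok] for ctx, tok in zip(contexts, class_tokens)]


def predict(image_feature, prompts, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities, one per class."""
    logits = [cosine(toy_text_encoder(p), image_feature) / temperature
              for p in prompts]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
class_names = ["cat", "dog", "car"]  # illustrative labels only

# Frozen quantities: class token embeddings and the image encoder's output.
class_tokens = random_vectors(rng, len(class_names), EMBED_DIM, std=1.0)
image_feature = random_vectors(rng, 1, EMBED_DIM, std=1.0)[0]

# Learnable quantities: the context vectors (random initialization assumed).
shared_context = random_vectors(rng, N_CTX, EMBED_DIM, std=0.02)
per_class_contexts = [random_vectors(rng, N_CTX, EMBED_DIM, std=0.02)
                      for _ in class_names]

probs_unified = predict(image_feature,
                        unified_prompts(shared_context, class_tokens))
probs_specific = predict(image_feature,
                         class_specific_prompts(per_class_contexts, class_tokens))

# Learnable parameter counts for the two variants in this toy setup.
n_params_unified = N_CTX * EMBED_DIM
n_params_specific = len(class_names) * N_CTX * EMBED_DIM
```

In this toy setup the unified variant learns `N_CTX * EMBED_DIM` values in total, while the class-specific variant learns that many per class, so its parameter count grows linearly with the number of classes.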

Comments:	International Journal of Computer Vision (IJCV), 2022. Update: Adds results on the DOSCO (DOmain Shift in COntext) benchmark
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2109.01134 [cs.CV]
	(or arXiv:2109.01134v6 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.01134
Related DOI:	https://doi.org/10.1007/s11263-022-01653-1

Submission history

From: Kaiyang Zhou
[v1] Thu, 2 Sep 2021 17:57:31 UTC (1,891 KB)
[v2] Tue, 21 Sep 2021 10:18:43 UTC (1,890 KB)
[v3] Sun, 6 Feb 2022 12:10:40 UTC (2,040 KB)
[v4] Sat, 30 Jul 2022 14:07:52 UTC (1,888 KB)
[v5] Fri, 12 Aug 2022 08:12:06 UTC (1,888 KB)
[v6] Thu, 6 Oct 2022 11:36:09 UTC (1,890 KB)
