Caption Generation on Scenes with Seen and Unseen Object Categories

Demirel, Berkan; Cinbis, Ramazan Gokberk

doi:10.1016/j.imavis.2022.104515

arXiv:2108.06165 (cs)

[Submitted on 13 Aug 2021 (v1), last revised 1 Jul 2022 (this version, v2)]

Abstract: Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.
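As a rough illustration of two ideas named in the abstract, the sketch below represents a class by its similarities to the seen classes and turns a list of detections into a sentence with a fixed template. It is not the authors' implementation: the function names, the use of cosine similarity, the toy word vectors, the score threshold, and the sentence template are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_representation(class_vec, seen_vecs):
    """Hypothetical helper: describe a class by its similarity to each seen class."""
    return [cosine(class_vec, seen) for seen in seen_vecs]

def fill_template(detections, threshold=0.5):
    """Hypothetical helper: turn (class_name, score) detections into one sentence.

    Detections scoring below `threshold` (an assumed cutoff) are ignored.
    """
    names = [name for name, score in detections if score >= threshold]
    if not names:
        return "A scene with no recognized objects."
    if len(names) == 1:
        return f"A scene with a {names[0]}."
    return "A scene with a " + ", a ".join(names[:-1]) + f" and a {names[-1]}."

# Toy 2-D word vectors for two seen classes (made-up values).
seen = [[1.0, 0.0], [0.0, 1.0]]
# An unseen class described only through its similarities to the seen ones.
unseen_repr = similarity_representation([1.0, 1.0], seen)

caption = fill_template([("dog", 0.9), ("zebra", 0.7), ("cat", 0.2)])
```

Here the unseen class never needs visual training examples: it is characterized only relative to the seen classes, and the caption is assembled from whatever the detector reports, seen or unseen.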

Comments:	Accepted for Publication at Image and Vision Computing (IMAVIS)
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2108.06165 [cs.CV]
	(or arXiv:2108.06165v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.06165
Related DOI:	https://doi.org/10.1016/j.imavis.2022.104515

Submission history

From: Berkan Demirel
[v1] Fri, 13 Aug 2021 10:43:20 UTC (12,094 KB)
[v2] Fri, 1 Jul 2022 11:47:46 UTC (14,299 KB)
