CPT: Colorful prompt tuning for pre-trained vision-language models
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …
Open-vocabulary object detection using captions
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. In particular, learning more object …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed a resurgence of knowledge engineering, characterized by the
rapid growth of knowledge graphs. However, most existing knowledge graphs are …
Consensus-aware visual-semantic embedding for image-text matching
Image-text matching plays a central role in bridging vision and language. Most existing
approaches only rely on the image-text instance pair to learn their representations, thereby …
Robust referring video object segmentation with cyclic structural consensus
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to
segment an object in a video based on a linguistic expression. Most existing R-VOS …
Motion-appearance co-memory networks for video question answering
Video Question Answering (QA) is an important task in understanding video
temporal structure. We observe that there are three unique attributes of video QA compared …
Counterfactual contrastive learning for weakly-supervised vision-language grounding
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
More grounded image captioning by distilling image-text matching model
Visual attention not only improves the performance of image captioners, but also serves as a
visual interpretation to qualitatively measure the caption rationality and model transparency …