Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv, 2024 - arxiv.org

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Open-vocabulary object detection using captions

A Zareian, KD Rosa, DH Hu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. Particularly, learning more object …
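
For orientation, the core recipe behind caption-supervised open-vocabulary detection can be sketched in a few lines: region features are scored against text embeddings of arbitrary class names, so categories never seen with box annotations remain addressable. The snippet below is a hedged illustration of that general idea, not Zareian et al.'s exact architecture; the shapes, function names, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats, class_name_embs, temperature=0.01):
    """Score region proposals against an open vocabulary of class names.

    region_feats:    (R, D) visual features, one per region proposal.
    class_name_embs: (C, D) text embeddings of class names, which may
                     include categories with no box-level supervision.
    Returns an (R, C) probability distribution over the vocabulary.
    """
    r = F.normalize(region_feats, dim=1)       # unit-norm region features
    t = F.normalize(class_name_embs, dim=1)    # unit-norm text features
    logits = (r @ t.t()) / temperature         # cosine similarity, sharpened
    return logits.softmax(dim=1)

# Toy usage: 5 regions scored against a 10-word vocabulary.
probs = classify_regions(torch.randn(5, 256), torch.randn(10, 256))
```

Because classification is just similarity against text embeddings, growing the vocabulary amounts to appending rows to class_name_embs rather than retraining the detector.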

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed a resurgence of knowledge engineering, featured by the
fast growth of knowledge graphs. However, most existing knowledge graphs are …

Consensus-aware visual-semantic embedding for image-text matching

H Wang, Y Zhang, Z Ji, Y Pang, L Ma - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
Image-text matching plays a central role in bridging vision and language. Most existing
approaches rely only on the image-text instance pair to learn their representations, thereby …
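
As a point of reference for what such instance-level matching typically looks like, the sketch below implements a standard hinge-based triplet loss over a batch similarity matrix (in the spirit of VSE-style embedding methods). It is an illustrative baseline, not the consensus-aware model of this paper; tensor names and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_matching_loss(img_emb, txt_emb, margin=0.2):
    """Hinge loss over all image-text pairs in a batch.

    img_emb, txt_emb: (B, D) embeddings; pairs sharing an index match.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()              # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)             # matched-pair scores
    # Penalize mismatched captions (per row) and mismatched images (per column).
    cost_txt = (margin + scores - pos).clamp(min=0)
    cost_img = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# Toy usage with random 512-d embeddings for a batch of 8 pairs.
loss = triplet_matching_loss(torch.randn(8, 512), torch.randn(8, 512))
```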

Robust referring video object segmentation with cyclic structural consensus

X Li, J Wang, X Xu, X Li, B Raj… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to
segment an object in a video based on a linguistic expression. Most existing R-VOS …

Motion-appearance co-memory networks for video question answering

J Gao, R Ge, K Chen, R Nevatia - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
Video Question Answering (QA) is an important task in understanding video
temporal structure. We observe that there are three unique attributes of video QA compared …

Counterfactual contrastive learning for weakly-supervised vision-language grounding

Z Zhang, Z Zhao, Z Lin, X He - Advances in Neural …, 2020 - proceedings.neurips.cc
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
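
To make the contrastive setup concrete, here is a generic InfoNCE-style objective that pulls a sentence query toward its matching candidate moment or region and pushes it away from the remaining (e.g., counterfactual) candidates. This is a hedged sketch of the general mechanism only, not the paper's specific counterfactual sample construction; all names and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def grounding_contrastive_loss(query_emb, cand_embs, pos_idx, tau=0.07):
    """InfoNCE over candidate moments/regions for one sentence query.

    query_emb: (D,)   embedding of the sentence query.
    cand_embs: (N, D) embeddings of candidate moments/regions; every
               candidate except cand_embs[pos_idx] serves as a negative.
    """
    q = F.normalize(query_emb, dim=0)
    c = F.normalize(cand_embs, dim=1)
    logits = (c @ q) / tau                      # (N,) query-candidate similarities
    target = torch.tensor([pos_idx])            # index of the positive candidate
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy usage: 6 candidate moments, the first being the weak positive.
loss = grounding_contrastive_loss(torch.randn(512), torch.randn(6, 512), pos_idx=0)
```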

More grounded image captioning by distilling image-text matching model

Y Zhou, M Wang, D Liu, Z Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Visual attention not only improves the performance of image captioners, but also serves as a
visual interpretation to qualitatively measure the caption rationality and model transparency …