CPT: Colorful prompt tuning for pre-trained vision-language models
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …
Open-vocabulary object detection using captions
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. In particular, learning more object …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed a resurgence of knowledge engineering, characterized by the
rapid growth of knowledge graphs. However, most existing knowledge graphs are …
Consensus-aware visual-semantic embedding for image-text matching
Image-text matching plays a central role in bridging vision and language. Most existing
approaches only rely on the image-text instance pair to learn their representations, thereby …
Robust referring video object segmentation with cyclic structural consensus
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to
segment an object in a video based on a linguistic expression. Most existing R-VOS …
Motion-appearance co-memory networks for video question answering
Video Question Answering (QA) is an important task in understanding video
temporal structure. We observe that there are three unique attributes of video QA compared …
Counterfactual contrastive learning for weakly-supervised vision-language grounding
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
More grounded image captioning by distilling image-text matching model
Visual attention not only improves the performance of image captioners, but also serves as a
visual interpretation to qualitatively measure the caption rationality and model transparency …