Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …
Prompting large language models with answer heuristics for knowledge-based visual question answering
Abstract Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
MERLOT: Multimodal neural script knowledge models
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
Multi-grained vision language pre-training: Aligning texts with visual concepts
Most existing methods in vision language pre-training rely on object-centric features
extracted through object detection and make fine-grained alignments between the extracted …