A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Re-Imagen: Retrieval-augmented text-to-image generator

W Chen, H Hu, C Saharia, WW Cohen - arXiv preprint arXiv:2209.14491, 2022 - arxiv.org
Research on text-to-image generation has witnessed significant progress in generating
diverse and photo-realistic images, driven by diffusion and auto-regressive models trained …

PromptCap: Prompt-guided task-aware image captioning

Y Hu, H Hua, Z Yang, W Shi, NA Smith… - arXiv preprint arXiv …, 2022 - arxiv.org
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …

K-LITE: Learning transferable visual models with external knowledge

S Shen, C Li, X Hu, Y Xie, J Yang… - Advances in …, 2022 - proceedings.neurips.cc
The new generation of state-of-the-art computer vision systems is trained from natural
language supervision, ranging from simple object category names to descriptive captions …

Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net
Recent advances in Multimodal Large Language Models (MLLMs) have used Visual
Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

PromptCap: Prompt-guided image captioning for VQA with GPT-3

Y Hu, H Hua, Z Yang, W Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …

Open-domain visual entity recognition: Towards recognizing millions of Wikipedia entities

H Hu, Y Luan, Y Chen, U Khandelwal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …