A comprehensive overview of large language models
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
LLaVA-OneVision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
Re-Imagen: Retrieval-augmented text-to-image generator
Research on text-to-image generation has witnessed significant progress in generating
diverse and photo-realistic images, driven by diffusion and auto-regressive models trained …
PromptCap: Prompt-guided task-aware image captioning
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …
K-LITE: Learning transferable visual models with external knowledge
The new generation of state-of-the-art computer vision systems is trained from natural
language supervision, ranging from simple object category names to descriptive captions …
Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions
Recent Multimodal Large Language Models (MLLMs) have utilized
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …
UniIR: Training and benchmarking universal multimodal information retrievers
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …
PromptCap: Prompt-guided image captioning for VQA with GPT-3
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …
Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …