Survey of hallucination in natural language generation
Natural Language Generation (NLG) has improved exponentially in recent years thanks to
the development of sequence-to-sequence deep learning technologies such as Transformer …
Language is not all you need: Aligning perception with language models
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …
Flamingo: a visual language model for few-shot learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …
From images to textual prompts: Zero-shot visual question answering with frozen large language models
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
Distilling large vision-language model with out-of-distribution generalizability
Large vision-language models have achieved outstanding performance, but their size and
computational requirements make their deployment on resource-constrained devices and …
Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-
scale language-image pre-trained models. The method introduces two core techniques …
Language models are general-purpose interfaces
Foundation models have received much attention due to their effectiveness across a broad
range of downstream applications. Though there is a big convergence in terms of …
Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training
Visual question answering (VQA) is a hallmark of vision and language reasoning and a
challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a …