The Llama 3 herd of models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …
MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning
Large language models have shown their remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …
Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …
Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …
Probing the 3d awareness of visual foundation models
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …
VeCLIP: Improving CLIP training via visual-enriched captions
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …
LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model
The revolutionary capabilities of large language models (LLMs) have paved the way for
multimodal large language models (MLLMs) and fostered diverse applications across …
Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …
The Neglected Tails in Vision-Language Models
Vision-language models (VLMs) excel in zero-shot recognition but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …
ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …