Combating misinformation in the age of LLMs: Opportunities and challenges
Misinformation such as fake news and rumors is a serious threat for information ecosystems
and public trust. The emergence of large language models (LLMs) has great potential to …
Generative multimodal models are in-context learners
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …
Monkey: Image resolution and text label are important things for large multi-modal models
Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …
Eagle: Exploring the design space for multimodal LLMs with mixture of encoders
The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …
The (r)evolution of multimodal large language models: A survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures building on the common …
Distilling vision-language models on millions of videos
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models but there …
Building and better understanding vision-language models: insights and future directions
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …
SemiVL: semi-supervised semantic segmentation with vision-language guidance
In semi-supervised semantic segmentation, a model is trained with a limited number of
labeled images along with a large corpus of unlabeled images to reduce the high annotation …
HRVDA: High-resolution visual document assistant
Leveraging vast training data, multimodal large language models (MLLMs) have
demonstrated formidable general visual comprehension capabilities and achieved …