Combating misinformation in the age of LLMs: Opportunities and challenges

C Chen, K Shu - AI Magazine, 2024 - Wiley Online Library

Misinformation such as fake news and rumors is a serious threat for information ecosystems
and public trust. The emergence of large language models (LLMs) has great potential to …

Cited by 125

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

Cited by 197

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Cited by 209

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

M Shi, F Liu, S Wang, S Liao, S Radhakrishnan… - arxiv preprint arxiv …, 2024 - arxiv.org

The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …

Cited by 42

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arxiv preprint arxiv …, 2024 - arxiv.org

Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Cited by 43

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com

Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end we explore video-first architectures building on the common …

Cited by 17

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com

The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models but there …

Cited by 13

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - … on Responsibly Building …, 2024 - openreview.net

The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

Cited by 28

SemiVL: semi-supervised semantic segmentation with vision-language guidance

L Hoyer, DJ Tan, MF Naeem, L Van Gool… - European Conference on …, 2024 - Springer

In semi-supervised semantic segmentation, a model is trained with a limited number of
labeled images along with a large corpus of unlabeled images to reduce the high annotation …

Cited by 16

HRVDA: High-resolution visual document assistant

C Liu, K Yin, H Cao, X Jiang, X Li… - Proceedings of the …, 2024 - openaccess.thecvf.com

Leveraging vast training data, multimodal large language models (MLLMs) have
demonstrated formidable general visual comprehension capabilities and achieved …

Cited by 16
