Google Akademik

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Kaydet Alıntı yap Alıntılanma sayısı: 136 İlgili makaleler 2 sürümün hepsi

[Free GPT-4]

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Kaydet Alıntı yap Alıntılanma sayısı: 197 İlgili makaleler 7 sürümün hepsi Kütüphane Araması HTML olarak görüntüle

[Free GPT-4]

[PDF] arxiv.org

Mmbench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European conference on …, 2024 - Springer

Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

Kaydet Alıntı yap Alıntılanma sayısı: 726 İlgili makaleler 3 sürümün hepsi

[Free GPT-4]

[PDF] mlr.press

Blip-2: Bootstrap** language-image pre-training with frozen image encoders and large language models

J Li, D Li, S Savarese, S Hoi - International conference on …, 2023 - proceedings.mlr.press

The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …

Kaydet Alıntı yap Alıntılanma sayısı: 4815 İlgili makaleler 7 sürümün hepsi HTML olarak görüntüle

[Free GPT-4]

[PDF] arxiv.org

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer

In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Kaydet Alıntı yap Alıntılanma sayısı: 180 İlgili makaleler 2 sürümün hepsi

[Free GPT-4]

[PDF] arxiv.org

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arxiv preprint arxiv …, 2023 - arxiv.org

Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …

Kaydet Alıntı yap Alıntılanma sayısı: 1067 İlgili makaleler 6 sürümün hepsi HTML olarak görüntüle

[Free GPT-4]

[PDF] arxiv.org

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arxiv preprint arxiv …, 2023 - arxiv.org

Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …

Kaydet Alıntı yap Alıntılanma sayısı: 739 İlgili makaleler 6 sürümün hepsi HTML olarak görüntüle

[Free GPT-4]

[PDF] arxiv.org

Next-gpt: Any-to-any multimodal llm

S Wu, H Fei, L Qu, W Ji, TS Chua - arxiv preprint arxiv:2309.05519, 2023 - arxiv.org

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Kaydet Alıntı yap Alıntılanma sayısı: 483 İlgili makaleler 4 sürümün hepsi HTML olarak görüntüle

[Free GPT-4]

[PDF] thecvf.com

Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …

Kaydet Alıntı yap Alıntılanma sayısı: 315 İlgili makaleler 6 sürümün hepsi HTML olarak görüntüle

[Free GPT-4]

[PDF] arxiv.org

Mm-vet: Evaluating large multimodal models for integrated capabilities

W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu… - arxiv preprint arxiv …, 2023 - arxiv.org

We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …

Kaydet Alıntı yap Alıntılanma sayısı: 493 İlgili makaleler 3 sürümün hepsi HTML olarak görüntüle

Uyarı oluştur

Alıntı yap

Gelişmiş arama

Kitaplığım'a kaydedildi

Nocaps: Novel object captioning at scale

Foundation Models Defining a New Era in Vision: a Survey and Outlook

Vision-language pre-training: Basics, recent advances, and future trends

Mmbench: Is your multi-modal model an all-around player?

Blip-2: Bootstrap** language-image pre-training with frozen image encoders and large language models

MM1: methods, analysis and insights from multimodal LLM pre-training

A survey on multimodal large language models

Evaluating object hallucination in large vision-language models

Next-gpt: Any-to-any multimodal llm

Scaling language-image pre-training via masking

Mm-vet: Evaluating large multimodal models for integrated capabilities