Google Академик

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Сачувај Цитирај 1248 пута наведен Сродни чланци Све верзије (12)

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Сачувај Цитирај 229 пута наведен Сродни чланци Све верзије (6) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Сачувај Цитирај 1897 пута наведен Сродни чланци Све верзије (10) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] nowpublishers.com

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Сачувај Цитирај 231 пута наведен Сродни чланци Све верзије (7) Претрага библиотека HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Сачувај Цитирај 392 пута наведен Сродни чланци Све верзије (4)

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com

The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …

Сачувај Цитирај 185 пута наведен Сродни чланци Све верзије (8) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer

In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Сачувај Цитирај 203 пута наведен Сродни чланци Све верзије (7)

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Glamm: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com

Abstract Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Сачувај Цитирај 167 пута наведен Сродни чланци Све верзије (9) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Mini-gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arxiv preprint arxiv …, 2024 - arxiv.org

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

Сачувај Цитирај 211 пута наведен Сродни чланци Све верзије (5) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Llama-vid: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - European Conference on Computer Vision, 2024 - Springer

In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …

Сачувај Цитирај 221 пута наведен Сродни чланци Све верзије (7)

Направи обавештење

Цитирај

Напредна претрага

Сачувано у мојој библиотеци

Lisa: Reasoning segmentation via large language model

A Survey of Multimodel Large Language Models

Mm-llms: Recent advances in multimodal large language models

Improved baselines with visual instruction tuning

Multimodal foundation models: From specialists to general-purpose assistants

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

MM1: methods, analysis and insights from multimodal LLM pre-training

Glamm: Pixel grounding large multimodal model

Mini-gemini: Mining the potential of multi-modality vision language models

Llama-vid: An image is worth 2 tokens in large language models