A survey on hallucination in large vision-language models

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for their practical implementation potential. However, …

Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions

S Atakishiyev, M Salameh, H Yao, R Goebel - IEEE Access, 2024 - ieeexplore.ieee.org
Autonomous driving has achieved significant milestones in research and development over
the last two decades. There is increasing interest in the field as the deployment of …

Gemini: a family of highly capable multimodal models

G Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - Forty-first International …, 2024 - openreview.net
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides,
they mostly fall prey to the limitation of input-side-only multimodal understanding, without the …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment

B Zhu, B Lin, M Ning, Y Yan, J Cui, HF Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-language (VL) pretraining has achieved remarkable improvements in multiple
downstream tasks. However, the current VL pretraining framework is hard to extend to …

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

L Chen, H Zhao, T Liu, S Bai, J Lin, C Zhou… - … on Computer Vision, 2024 - Springer
In this study, we identify the inefficient attention phenomena in Large Vision-Language
Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video …