- Academic Search

A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Cited by 151; 7 versions

A survey on hallucination in large vision-language models

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …

Cited by 140; 2 versions

Video-llava: Learning united visual representation by alignment before projection

B Lin, Y Ye, B Zhu, J Cui, M Ning, P Jin… - arXiv preprint arXiv …, 2023 - arxiv.org

The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Cited by 428; 3 versions

Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - … on Computer Vision, 2024 - Springer

The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …

Cited by 105; 2 versions

Pointllm: Empowering large language models to understand point clouds

R Xu, X Wang, T Wang, Y Chen, J Pang… - European Conference on …, 2024 - Springer

The unprecedented advancements in Large Language Models (LLMs) have shown a
profound impact on natural language processing but are yet to fully embrace the realm of 3D …

Cited by 122; 3 versions

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 103; 3 versions

Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models

F Li, R Zhang, H Zhang, Y Zhang, B Li, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …

Cited by 100; 2 versions

One-peace: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org

In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Cited by 121; 3 versions

Onellm: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Cited by 95; 3 versions

Brave: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer

Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

Cited by 27; 2 versions
