- Academic Search

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …

Cited by 159 · All 2 versions

[PDF] mdpi.com

From large language models to large multimodal models: A literature review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com

With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

Cited by 20 · All 2 versions

[PDF] thecvf.com

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Cited by 609 · All 7 versions

[PDF] arxiv.org

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by 403 · All 4 versions

[PDF] neurips.cc

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng… - Advances in neural …, 2025 - proceedings.neurips.cc

We present Visual AutoRegressive modeling (VAR), a new generation paradigm
that redefines the autoregressive learning on images as coarse-to-fine "next-scale …

Cited by 168 · All 5 versions

[PDF] neurips.cc

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang… - Advances in …, 2025 - proceedings.neurips.cc

We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation of text-to …

Cited by 102 · All 5 versions

[PDF] arxiv.org

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 135 · All 5 versions

[PDF] arxiv.org

Are we on the right way for evaluating large vision-language models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

Cited by 175 · All 4 versions

[PDF] arxiv.org

MoE-LLaVA: Mixture of experts for large vision-language models

B Lin, Z Tang, Y Ye, J Cui, B Zhu, P Jin, J Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs)
effectively improves downstream task performances. However, existing scaling methods …

Cited by 185 · All 3 versions

[PDF] arxiv.org

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Cited by 140 · All 3 versions

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

A survey on hallucination in large vision-language models
