MM-LLMs: Recent advances in multimodal large language models
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
A survey on hallucination in large vision-language models
Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …
ShareGPT4V: Improving large multi-modal models with better captions
Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
Panda-70M: Captioning 70M videos with multiple cross-modality teachers
The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs, high-quality video-text data is much …
Chat-UniVi: Unified visual representation empowers large language models with image and video understanding
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …
Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …
InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …
MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …
GeoChat: Grounded large vision-language model for remote sensing
Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains, allowing users to hold a dialogue about given visual …