ColPali: Efficient document retrieval with vision language models

M Faysse, H Sibille, T Wu, B Omrani… - The Thirteenth …, 2024 - openreview.net
Documents are visually rich structures that convey information through text, but also figures,
page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual …

MINT-1T: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens

A Awadalla, L Xue, O Lo, M Shu, H Lee… - The Thirty-eighth …, 2024 - openreview.net
Multimodal interleaved datasets featuring free-form interleaved sequences of images and
text are crucial for training frontier large multimodal models (LMMs). Despite the rapid …

POINTS: Improving your vision-language model with affordable strategies

Y Liu, Z Zhao, Z Zhuang, L Tian, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, vision-language models have made significant strides, excelling in tasks like
optical character recognition and geometric problem-solving. However, several critical …

Task Vectors are Cross-Modal

G Luo, T Darrell, A Bar - arXiv preprint arXiv:2410.22330, 2024 - arxiv.org
We investigate the internal representations of vision-and-language models (VLMs) and how
they encode task representations. We consider tasks specified through examples or …

Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

E Vivoli, I Campaioli, M Nardoni, N Biondi… - … on Document Analysis …, 2024 - Springer
Comics, as a medium, uniquely combine text and images in styles often distinct from real-
world visuals. For the past three decades, computational research on comics has evolved …

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

A Mishra, R Noh, H Fu, M Li, M Kim - arXiv preprint arXiv:2502.14780, 2025 - arxiv.org
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern
smartphones with powerful cameras become primary interfaces for human-computer …

Retrospective Learning from Interactions

Z Chen, MO Gul, Y Chen, G Geng, A Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-turn interactions between large language models (LLMs) and users naturally include
implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the …