DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Apollo: An Exploration of Video Understanding in Large Multimodal Models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

MiniMax-01: Scaling Foundation Models with Lightning Attention

A Li, B Gong, B Yang, B Shan, C Liu, C Zhu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are
comparable to top-tier models while offering superior capabilities in processing longer …

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks to address the highly heterogeneous daily use cases of end users. Our …

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

GI Winata, F Hudi, PA Irawan, D Anugraha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

H Wang, B Guo, Y Zeng, M Chen, Y Ding… - ACM Transactions on …, 2022 - dl.acm.org
The intelligent dialogue system, which aims to communicate with humans harmoniously in
natural language, holds great promise for advancing human-machine interaction …

TVBench: Redesigning Video-Language Evaluation

D Cores, M Dorkenwald, M Mucientes… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have demonstrated impressive performance when integrated with
vision models, even enabling video understanding. However, evaluating these video models …

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

W Ren, H Yang, J Min, C Wei, W Chen - arXiv preprint arXiv:2412.00927, 2024 - arxiv.org
Current large multimodal models (LMMs) face significant challenges in processing and
comprehending long-duration or high-resolution videos, mainly due to the lack of …

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

W Hong, Y Cheng, Z Yang, W Wang, L Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
In recent years, vision language models (VLMs) have made significant advancements in
video understanding. However, a crucial capability, fine-grained motion comprehension, …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLMs) have demonstrated remarkable visual understanding capabilities …