LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Z Wan, Z Wu, C Liu, J Huang, Z Zhu, P Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Q Wu, H Zhao, M Saxon, T Bui, WY Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) are an exciting emerging class of language models (LMs)
that have merged classic LM capabilities with those of image processing systems. However …

Multilingual needle in a haystack: Investigating long-context behavior of multilingual large language models

A Hengle, P Bajpai, S Dan, T Chakraborty - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large language models (LLMs) demonstrate remarkable abilities in responding
to queries in diverse languages, their ability to handle long multilingual contexts is …

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

J Ge, Z Chen, J Lin, J Zhu, X Liu, J Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) have shown promising capabilities in handling various
multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving …

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

W Ren, H Yang, J Min, C Wei, W Chen - arXiv preprint arXiv:2412.00927, 2024 - arxiv.org
Current large multimodal models (LMMs) face significant challenges in processing and
comprehending long-duration or high-resolution videos, which is mainly due to the lack of …

MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models

W Wang, S Jain, P Kantor, J Feldman, L Gallos… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose MMLU-SR, a novel dataset designed to measure the true comprehension
abilities of Large Language Models (LLMs) by challenging their performance in question …

MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

EL Epstein, K Yao, J Li, X Bai, H Palangi - arXiv preprint arXiv:2409.18216, 2024 - arxiv.org
Evaluating instruction following capabilities for multimodal, multi-turn dialogue is
challenging. With potentially multiple instructions in the input model context, the task is time …

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

M Suri, P Mathur, F Dernoncourt, K Goswami… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding information from a collection of multiple documents, particularly those with
visually rich elements, is important for document-grounded question answering. This paper …

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

A Almorsi, M Ahmed, W Gomaa - arXiv preprint arXiv:2501.06625, 2025 - arxiv.org
Large Language Models (LLMs) have shown remarkable capabilities in code generation
tasks, yet they face significant limitations in handling complex, long-context programming …