AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

K Gong, K Feng, B Li, Y Wang, M Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

R Luo, TE Lin, H Zhang, Y Wu, X Liu, M Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in omnimodal learning have been achieved in understanding and
generation across images, text, and speech, though mainly within proprietary models …

Baichuan-Omni-1.5 Technical Report

Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal
understanding capabilities but also provides end-to-end audio generation capabilities. To …

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

J Hong, S Yan, J Cai, X Jiang, Y Hu, W Xie - arXiv preprint arXiv …, 2025 - arxiv.org
In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video
understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast …

From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality

S Jiang, J Liang, M Liu, B Qin - arXiv preprint arXiv:2412.11694, 2024 - arxiv.org
From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM, which
extends the range of general modalities, this evolution aims to achieve understanding and …

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

C Li, Q Chen, Z Li, F Tao, Y Zhang - arXiv preprint arXiv:2411.09105, 2024 - arxiv.org
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based tasks …

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

Z Ma, Z Chen, Y Wang, ES Chng, X Chen - arXiv preprint arXiv …, 2025 - arxiv.org
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in
tasks involving audio perception and understanding, such as speech recognition and audio …