AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Recent advancements in omnimodal learning have been achieved in understanding and
generation across images, text, and speech, though mainly within proprietary models …
Baichuan-Omni-1.5 Technical Report
Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal
understanding capabilities but also provides end-to-end audio generation capabilities. To …
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video
understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast …
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality
From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM, which
extends the range of general modalities, this evolution aims to achieve understanding and …
VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based tasks …
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in
tasks involving audio perception and understanding, such as speech recognition and audio …