Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset and
benchmark to evaluate the long video understanding capabilities of modern vision and …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

VALOR: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Video Question Answering (VideoQA) aims to answer natural language questions based
on the given videos. It has attracted increasing attention with recent research trends in joint …

Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception

J Gao, M Chen, C Xu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
With only video-level event labels, this paper targets the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …

CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

Q Ye, Z Yu, R Shao, X Xie, P Torr, X Cao - European Conference on …, 2024 - Springer
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …