Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
EgoSchema: A diagnostic benchmark for very long-form video language understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
OneLLM: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …
Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
VALOR: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
A survey of multimodal large language model from a data-centric perspective
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …
Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception
With only video-level event labels, this paper targets the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …
CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …