Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

A comprehensive review of recent deep learning techniques for human activity recognition

VT Le, K Tran-Trung, VT Hoang - Computational Intelligence …, 2022 - Wiley Online Library
Human action recognition is an important field in computer vision that has attracted
remarkable attention from researchers. This survey aims to provide a comprehensive …

SlowFast networks for video recognition

C Feichtenhofer, H Fan, J Malik… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway,
operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating …

ForgeryNet: A versatile benchmark for comprehensive forgery analysis

Y He, B Gan, S Chen, Y Zhou, G Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com
The rapid progress of photorealistic synthesis techniques has reached a critical point
where the boundary between real and manipulated images starts to blur. Thus …

Audiovisual SlowFast networks for video recognition

F Xiao, YJ Lee, K Grauman, J Malik… - arXiv preprint arXiv …, 2020 - arxiv.org
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual
perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a …

SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos

A Deliege, A Cioppa, S Giancola… - Proceedings of the …, 2021 - openaccess.thecvf.com
Understanding broadcast videos is a challenging task in computer vision, as it requires
generic reasoning capabilities to appreciate the content offered by the video editing. In this …

Learning spatio-temporal representation with local and global diffusion

Z Qiu, T Yao, CW Ngo, X Tian… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Convolutional Neural Networks (CNN) have been regarded as a powerful class of
models for visual recognition problems. Nevertheless, the convolutional filters in these …

TSP: Temporally-sensitive pretraining of video encoders for localization tasks

H Alwassel, S Giancola… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Due to the large memory footprint of untrimmed videos, current state-of-the-art video
localization methods operate atop precomputed video clip features. These features are …

Audio visual scene-aware dialog

H Alamri, V Cartillier, A Das, J Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural
response to a question about a scene, given video and audio of the scene and the history of …