Unloc: A unified framework for video localization tasks

S Yan, X Xiong, A Nagrani, A Arnab… - Proceedings of the …, 2023 - openaccess.thecvf.com
While large-scale image-text pretrained models such as CLIP have been used for multiple
video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos …

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

S Luo, C Yan, C Hu, H Zhao - Advances in Neural …, 2024 - proceedings.neurips.cc
The Video-to-Audio (V2A) model has recently gained attention for its practical
application in generating audio directly from silent videos, particularly in video/film …

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, I Noble, D Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com
One of the main challenges of multimodal learning is the need to combine heterogeneous
modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

Soundingactions: Learning how actions sound from narrated egocentric videos

C Chen, K Ashutosh, R Girdhar… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose a novel self-supervised embedding to learn how actions sound from narrated in-
the-wild egocentric videos. Whereas existing methods rely on curated data with known …

Self-supervised audio-visual soundscape stylization

T Li, R Wang, PY Huang, A Owens… - … on Computer Vision, 2024 - Springer
Speech sounds convey a great deal of information about the scenes, resulting in a variety of
effects ranging from reverberation to additional ambient sounds. In this paper, we …

Action2sound: Ambient-aware generation of action sounds from egocentric videos

C Chen, P Peng, A Baid, Z Xue, WN Hsu… - … on Computer Vision, 2024 - Springer
Generating realistic audio for human actions is important for many applications, such as
creating sound effects for films or virtual reality games. Existing approaches implicitly …

Learning spatial features from audio-visual correspondence in egocentric videos

S Majumder, Z Al-Halah… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We propose a self-supervised method for learning representations based on spatial audio-
visual correspondences in egocentric videos. Our method uses a masked auto-encoding …

Vision+X: A survey on multimodal learning in the light of data

Y Zhu, Y Wu, N Sebe, Y Yan - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
We are perceiving and communicating with the world in a multisensory manner, where
different information sources are sophisticatedly processed and interpreted by separate …

Computer audition: From task-specific machine learning to foundation models

A Triantafyllopoulos, I Tsangko, A Gebhard… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models (FMs) are increasingly spearheading recent advances on a variety of
tasks that fall under the purview of computer audition--the use of machines to understand …