Unloc: A unified framework for video localization tasks

S Yan, X Xiong, A Nagrani, A Arnab… - Proceedings of the …, 2023 - openaccess.thecvf.com
While large-scale image-text pretrained models such as CLIP have been used for multiple
video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos …

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

S Luo, C Yan, C Hu, H Zhao - Advances in Neural …, 2024 - proceedings.neurips.cc
The Video-to-Audio (V2A) model has recently gained attention for its practical
application in generating audio directly from silent videos, particularly in video/film …

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, I Noble, D Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com
One of the main challenges of multimodal learning is the need to combine heterogeneous
modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

Soundingactions: Learning how actions sound from narrated egocentric videos

C Chen, K Ashutosh, R Girdhar… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose a novel self-supervised embedding to learn how actions sound from narrated in-
the-wild egocentric videos. Whereas existing methods rely on curated data with known …

Self-supervised audio-visual soundscape stylization

T Li, R Wang, PY Huang, A Owens… - … on Computer Vision, 2024 - Springer
Speech sounds convey a great deal of information about the scenes, resulting in a variety of
effects ranging from reverberation to additional ambient sounds. In this paper, we …

Action2sound: Ambient-aware generation of action sounds from egocentric videos

C Chen, P Peng, A Baid, Z Xue, WN Hsu… - … on Computer Vision, 2024 - Springer
Generating realistic audio for human actions is important for many applications, such as
creating sound effects for films or virtual reality games. Existing approaches implicitly …

Learning spatial features from audio-visual correspondence in egocentric videos

S Majumder, Z Al-Halah… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We propose a self-supervised method for learning representations based on spatial audio-
visual correspondences in egocentric videos. Our method uses a masked auto-encoding …

Vision+X: A survey on multimodal learning in the light of data

Y Zhu, Y Wu, N Sebe, Y Yan - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
We are perceiving and communicating with the world in a multisensory manner, where
different information sources are sophisticatedly processed and interpreted by separate …

Computer audition: From task-specific machine learning to foundation models

A Triantafyllopoulos, I Tsangko, A Gebhard… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models (FMs) are increasingly spearheading recent advances on a variety of
tasks that fall under the purview of computer audition--the use of machines to understand …