Google Scholar

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org

Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Cited by 68

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org

Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Cited by 37

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Cited by 78

STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

K Shimada, A Politis, P Sudarsanam… - Advances in neural …, 2023 - proceedings.neurips.cc

While direction of arrival (DOA) of sound events is generally estimated from multichannel
audio data recorded in a microphone array, sound events usually derive from visually …

Cited by 40

A survey on segment anything model (SAM): Vision foundation model meets prompt engineering

C Zhang, FD Puspitasari, S Zheng, C Li, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org

Segment anything model (SAM) developed by Meta AI Research has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …

Cited by 67

Multimodal variational auto-encoder based audio-visual segmentation

Y Mao, J Zhang, M Xiang… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …

Cited by 34

Learning audio-visual source localization via false negative aware contrastive learning

W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …

Cited by 43

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing
objects within image frames and ensure the maps faithfully adhere to the given …

Cited by 47

Multi-modal instruction tuned LLMs with fine-grained visual perception

J He, Y Wang, L Wang, H Lu, JY He… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal Large Language Models (MLLMs) leverage Large Language Models as
a cognitive framework for diverse visual-language tasks. Recent efforts have been made to …

Cited by 15

Achieving cross modal generalization with multimodal unified representation

Y Xia, H Huang, J Zhu, Z Zhao - Advances in Neural …, 2023 - proceedings.neurips.cc

This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …

Cited by 25

Audio–visual segmentation

Learning in audio-visual context: A review, analysis, and new perspective