- Academic Search

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org

Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Cited by 68, 2 versions

[PDF] arxiv.org

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org

Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Cited by 37, 3 versions

[PDF] thecvf.com

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Cited by 78, 5 versions

[PDF] neurips.cc

STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

K Shimada, A Politis, P Sudarsanam… - Advances in neural …, 2023 - proceedings.neurips.cc

While direction of arrival (DOA) of sound events is generally estimated from multichannel
audio data recorded in a microphone array, sound events usually derive from visually …

Cited by 40, 7 versions

[PDF] arxiv.org

A survey on Segment Anything Model (SAM): Vision foundation model meets prompt engineering

C Zhang, FD Puspitasari, S Zheng, C Li, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org

Segment anything model (SAM) developed by Meta AI Research has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …

Cited by 67, 5 versions

[PDF] thecvf.com

Multimodal variational auto-encoder based audio-visual segmentation

Y Mao, J Zhang, M Xiang… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …

Cited by 34, 5 versions

[PDF] thecvf.com

Learning audio-visual source localization via false negative aware contrastive learning

W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …

Cited by 43, 7 versions

[PDF] arxiv.org

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing
objects within image frames and ensure the maps faithfully adhere to the given …

Cited by 47, 4 versions

[PDF] thecvf.com

Multi-modal instruction tuned LLMs with fine-grained visual perception

J He, Y Wang, L Wang, H Lu, JY He… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal Large Language Models (MLLMs) leverage Large Language Models as
a cognitive framework for diverse visual-language tasks. Recent efforts have been made to …

Cited by 15, 7 versions

[PDF] neurips.cc

Achieving cross modal generalization with multimodal unified representation

Y Xia, H Huang, J Zhu, Z Zhao - Advances in Neural …, 2023 - proceedings.neurips.cc

This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …

Cited by 24, 5 versions
