Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
Audio-visual segmentation
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Multimodal variational auto-encoder based audio-visual segmentation
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …
Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception
With only video-level event labels, this paper targets the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …
Positive sample propagation along the audio-visual event line
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
Audio-visual generalised zero-shot learning with cross-modal attention and language
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …
Contrastive positive sample propagation along the audio-visual event line
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
Cross-modal background suppression for audio-visual event localization
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …