Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Self-supervised multimodal learning: A survey
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …
Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
K Shimada, A Politis, P Sudarsanam… - Advances in neural …, 2023 - proceedings.neurips.cc
While direction of arrival (DOA) of sound events is generally estimated from multichannel
audio data recorded in a microphone array, sound events usually derive from visually …
A survey on segment anything model (SAM): Vision foundation model meets prompt engineering
Segment anything model (SAM) developed by Meta AI Research has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …
Multimodal variational auto-encoder based audio-visual segmentation
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
Learning audio-visual source localization via false negative aware contrastive learning
Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …
CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing
objects within image frames and ensure the maps faithfully adhere to the given …
Multi-modal instruction tuned LLMs with fine-grained visual perception
Multimodal Large Language Models (MLLMs) leverage Large Language Models as
a cognitive framework for diverse visual-language tasks. Recent efforts have been made to …
Achieving cross modal generalization with multimodal unified representation
Y Xia, H Huang, J Zhu, Z Zhao - Advances in Neural …, 2023 - proceedings.neurips.cc
This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …