Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Self-supervised multimodal learning: A survey
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …
Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
While direction of arrival (DOA) of sound events is generally estimated from multichannel
audio data recorded in a microphone array, sound events usually derive from visually …
A survey on segment anything model (SAM): Vision foundation model meets prompt engineering
Segment anything model (SAM) developed by Meta AI Research has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …
Multimodal variational auto-encoder based audio-visual segmentation
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
Learning audio-visual source localization via false negative aware contrastive learning
Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …
CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing
objects within image frames and ensure the maps faithfully adhere to the given …
Multi-modal instruction tuned LLMs with fine-grained visual perception
Multimodal Large Language Models (MLLMs) leverage Large Language Models as
a cognitive framework for diverse visual-language tasks. Recent efforts have been made to …
Achieving cross modal generalization with multimodal unified representation
This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …