Google Scholar

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org

With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Cited by 24; 3 versions

[PDF] springer.com

Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer

Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Cited by 61; 11 versions

[PDF] arxiv.org

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Cited by 537; 9 versions

[PDF] arxiv.org

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Cited by 162; 8 versions

[PDF] arxiv.org

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org

In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Cited by 124; 3 versions

[PDF] arxiv.org

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

Cited by 42; 8 versions

[PDF] arxiv.org

Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss

Y Xin, D Yang, Y Zou - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org

In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and
audio, the semantic information contained in the text is only similar to certain frames within …

Cited by 36; 7 versions

[PDF] arxiv.org

Audio retrieval with WavText5K and CLAP training

S Deshmukh, B Elizalde, H Wang - arXiv preprint arXiv:2209.14275, 2022 - arxiv.org

Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a
database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant …

Cited by 53; 6 versions

[PDF] neurips.cc

Modality-independent teachers meet weakly-supervised audio-visual event parser

YH Lai, YC Chen, F Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc

Audio-visual learning has been a major pillar of multi-modal machine learning, where the
community mostly focused on its modality-aligned setting, i.e., the audio …

Cited by 9; 6 versions

[PDF] arxiv.org

FLAP: Fast language-audio pre-training

CF Yeh, PY Huang, V Sharma, SW Li… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org

We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that
efficiently and effectively learns aligned audio and language representations through …

Cited by 10; 3 versions

On metric learning for audio-text cross-modal retrieval

Cross-modal retrieval: a systematic review of methods and future directions