WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - IEEE …, 2025 - ieeexplore.ieee.org
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …

M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

D Niizumi, D Takeuchi, Y Ohishi, N Harada… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio
and exhibits promising performance in several classification tasks. However, conventional …
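For readers unfamiliar with the zero-shot inference these CLAP entries refer to, below is a minimal sketch of how a CLAP-style model is typically used for zero-shot audio classification: the audio clip and one text prompt per candidate class are embedded in the shared space and compared by cosine similarity. The `embed_audio` and `embed_text` encoder calls are hypothetical placeholders for whichever checkpoint is loaded, not an API from these papers.

```python
# Minimal sketch of zero-shot audio classification with a CLAP-style model.
# `embed_audio` and `embed_text` are hypothetical placeholders standing in for
# the joint audio/text encoders of whatever CLAP checkpoint is actually used.
import torch
import torch.nn.functional as F

def zero_shot_classify(waveform: torch.Tensor,
                       class_names: list[str],
                       embed_audio,
                       embed_text) -> dict[str, float]:
    # Encode the audio clip and one textual prompt per candidate class
    # into the shared embedding space learned by contrastive pretraining.
    audio_emb = F.normalize(embed_audio(waveform), dim=-1)       # (1, D)
    prompts = [f"This is a sound of {name}." for name in class_names]
    text_emb = F.normalize(embed_text(prompts), dim=-1)          # (C, D)

    # Cosine similarity between the audio and each class prompt,
    # converted into a probability distribution over the classes.
    probs = (audio_emb @ text_emb.T).softmax(dim=-1).squeeze(0)  # (C,)
    return dict(zip(class_names, probs.tolist()))
```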

T-CLAP: Temporal-enhanced contrastive language-audio pretraining

Y Yuan, Z Chen, X Liu, H Liu, X Xu, D Jia… - 2024 IEEE 34th …, 2024 - ieeexplore.ieee.org
Contrastive language-audio pretraining (CLAP) has been developed to align the
representations of audio and language, achieving remarkable performance in retrieval and …

Multi-Sentence Grounding for Long-term Instructional Video

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - European Conference on …, 2024 - Springer
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-
scale instructional dataset and construct a high-quality video-text dataset with multiple …

STA-V2A: Video-to-audio generation with semantic and temporal alignment

Y Ren, C Li, M Xu, W Liang, Y Gu, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual and auditory perception are two crucial ways humans experience the world. Text-to-
video generation has made remarkable progress over the past year, but the absence of …

Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions

Y Yuan, D Jia, X Zhuang, Y Chen, Z Liu, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models have shown significant achievements in audio generation tasks.
However, existing models struggle with complex and detailed prompts, leading to potential …

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Y Tang, D Shimada, J Bi, M Feng, H Hua… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities in natural
language and multimodal domains. By fine-tuning multimodal LLMs with temporal …

SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

J Li, S Tao, Y Yan, X Gu, H Xu, X Zheng, Y Lyu… - arXiv preprint arXiv …, 2024 - arxiv.org
Endeavors have been made to explore Large Language Models for video analysis (Video-
LLMs), particularly in understanding and interpreting long videos. However, existing Video …

Audio Captioning via Generative Pair-to-Pair Retrieval with Refined Knowledge Base

C Changin, L Sungjun, R Wonjong - arXiv preprint arXiv:2410.10913, 2024 - arxiv.org
Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs.
However, adapting LLMs to learn audio concepts requires massive training data and …