- Academic Search

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Cited by: 154; 3 versions

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Cited by: 12; 4 versions

Towards audio language modeling-an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org

Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

Cited by: 24; 2 versions

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Z Du, J Wang, Q Chen, Y Chu, Z Gao, Z Li, K Hu… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …

Cited by: 43

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org

Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

Cited by: 87; 2 versions

WavTokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling

S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org

Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

Cited by: 22; 3 versions

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …

Cited by: 19; 4 versions

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by: 130; 4 versions

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

Cited by: 12; 2 versions

Multi-modal and multi-agent systems meet rationality: A survey

B Jiang, Y Xie, X Wang, WJ Su, CJ Taylor… - ICML 2024 Workshop …, 2024 - openreview.net

Rationality is characterized by logical thinking and decision-making that align with evidence
and logical rules. This quality is essential for effective problem-solving, as it ensures that …

Cited by: 12; 2 versions
