Google Scholar

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer

We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

Cited by 32, all 2 versions

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Cited by 5, all 2 versions

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Cited by 2, all 2 versions

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Cited by 7, all 2 versions

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

Cited by 20, all 2 versions

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Z Zheng, P Peng, Z Ma, X Chen, E Choi… - arXiv preprint arXiv …, 2024 - arxiv.org

Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret
our surroundings based on sound. In this paper we present BAT, which combines the spatial …

Cited by 10, all 3 versions

Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study

P Chen, S Sun, C Shan, Q Yang, L Xie - arXiv preprint arXiv:2406.18862, 2024 - arxiv.org

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown
impressive performance across various speech-related tasks, especially in Automatic …

Cited by 2, all 4 versions

3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization

Y Chen, S Zheng, H Wang, L Cheng, T Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker
verification and diarization. It is designed for the needs of academic researchers and …

Cited by 4, all 2 versions

Task Arithmetic for Language Expansion in Speech Translation

YF Cheng, H Futami, Y Kashiwagi, E Tsunoo… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advances in large language models (LLMs) have gained interest in speech-text
multimodal foundation models, achieving strong performance on instruction-based speech …

All 4 versions

Improving Audio Explanations using Audio Language Models

A Akman, Q Sun, BW Schuller - IEEE Signal Processing Letters, 2025 - ieeexplore.ieee.org

Foundation models are widely utilised for their strong representational capabilities, driven by
training on extensive datasets with self-supervised learning. The increasing complexity of …


LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Givt: Generative infinite-vocabulary transformers
