Knowledge editing for large language models: A survey
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …
Foundation models defining a new era in vision: A survey and outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
High-fidelity audio compression with improved RVQGAN
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …
AudioLDM: Text-to-audio generation with latent diffusion models
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …
Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
Any-to-any generation via composable diffusion
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …
High fidelity neural audio compression
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural
networks. It consists of a streaming encoder-decoder architecture with quantized latent …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
Perception test: A diagnostic benchmark for multimodal video models
We propose a novel multimodal video benchmark-the Perception Test-to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, BEiT-3, or …