Text-free image-to-speech synthesis using learned segmental units

WN Hsu, D Harwath, C Song, J Glass - arXiv preprint arXiv:2012.15454, 2020 - arxiv.org
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …

Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition

J Ni, L Wang, H Gao, K Qian, Y Zhang, S Chang… - arXiv preprint arXiv …, 2022 - arxiv.org
An unsupervised text-to-speech synthesis (TTS) system learns to generate speech
waveforms corresponding to any written sentence in a language by observing: 1) a …

SightSpeak Object detection and speech generation for visually challenged people.

P Likhitha, AR Naik, KN Chari, S Dessai… - 2024 15th …, 2024 - ieeexplore.ieee.org
This novel approach is to enhance accessibility for visually impaired individuals by
integrating object detection and speech generation using YOLOv5 model on the COCO …

Cross Lingual Style Transfer Using Multiscale Loss Function for Soliga: A Low Resource Tribal Language

A Dasare, BL Reddy, ASC Koushik, BS Raj… - … Conference on Speech …, 2023 - Springer
Voice conversion is the art of mimicking different speaker voices and styles. In this paper, we
present a cross-lingual speaker style adaptation based on a multi-scale loss function, using …

Unsupervised speech technology for low-resource languages

H Gao - 2024 - ideals.illinois.edu
Deep neural network based speech processing systems have found widespread
applications in daily life, being employed for tasks such as automatic speech recognition …

Direct speech-reply generation from text-dialogue context

K Fujita, Y Ijima, H Sugiyama - 2022 Asia-Pacific Signal and …, 2022 - ieeexplore.ieee.org
Natural speech-dialogue generation has been achieved with cascade systems combining
automatic speech recognition, text-dialogue, and text-to-speech models. However, it is still …

A Survey of Image Captioning Techniques

옥수빈, 이대호 - Journal of KIISE, 2023 - dbpia.co.kr
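Image captioning, which has drawn attention alongside advances in deep learning, combines
techniques from computer vision, which identifies the content of an image, and natural language processing, which renders that content as sentences. This …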

Lexical emergence from context: exploring unsupervised learning approaches on large multimodal language corpora

WN Havard - 2021 - theses.hal.science
In recent years, deep learning methods allowed the creation of neural models that are able
to process several modalities at once. Neural models of Visually Grounded Speech (VGS) …

The emergence of the lexicon in context: the contribution of unsupervised methods on large multimodal data corpora

MJL SCHWARTZ, MO SCHARENBORG, ML PRÉVOT… - afcp-parole.org
Abstract: In recent years, deep learning methods have made it possible to create
neural models capable of processing several modalities at once. The models …