Speech emotion recognition using transfer learning: Integration of advanced speaker embeddings and image recognition models

M Jakubec, E Lieskovska, R Jarina, M Spisiak… - Applied Sciences, 2024 - mdpi.com
Automatic Speech Emotion Recognition (SER) plays a vital role in making human–computer
interactions more natural and effective. A significant challenge in SER development is the …

Enhancing text generation from knowledge graphs with cross-structure attention distillation

X Shi, Z Xia, P Cheng, Y Li - Engineering Applications of Artificial …, 2024 - Elsevier
Existing large-scale pre-trained language models (PLMs) can effectively enhance the
knowledge-graph-to-text (KG-to-text) generation by processing the linearized version of a …

Representation Purification for End-to-End Speech Translation

C Zhang, Y Zhou, R Zhao, Y Chen, X Shi - arXiv preprint arXiv:2412.04266, 2024 - arxiv.org
Speech-to-text translation (ST) is a cross-modal task that involves converting spoken
language into text in a different language. Previous research primarily focused on …

Enhancing multimodal translation: Achieving consistency among visual information, source language and target language

X Shi, X Yang, P Cheng, Y Zhou, J Liu - Neurocomputing, 2025 - Elsevier
Multimodal machine translation refers to the task of using information from images, videos,
etc., to assist in text translation. Numerous studies have demonstrated that incorporating …

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

C Xu, EW Sun - arXiv preprint arXiv:2407.14212, 2024 - arxiv.org
An increasing number of Chinese people are troubled by different degrees of visual
impairment, which has made the modal conversion between a single image or video frame …