TeachText: CrossModal text-video retrieval through generalized distillation

I Croitoru, SV Bogolin, M Leordeanu, H **… - Artificial Intelligence, 2025 - Elsevier
In recent years, considerable progress on the task of text-video retrieval has been achieved
by leveraging large-scale pretraining on visual and audio datasets to construct powerful …

Enabling Perspective-Aware Ai with Contextual Scene Graph Generation

D Platnick, M Alirezaie, H Rahnama - Information, 2024 - mdpi.com
This paper advances contextual image understanding within perspective-aware Ai (PAi), an
emerging paradigm in human–computer interaction that enables users to perceive and …

Aligning images and text with semantic role labels for fine-grained cross-modal understanding

A Bhattacharyya, C Mauceri, M Palmer… - Proceedings of the …, 2022 - aclanthology.org
As vision processing and natural language processing continue to advance, there is
increasing interest in multimodal applications, such as image retrieval, caption generation …

NeSy4VRD: A Multifaceted Resource for Neurosymbolic AI Research using Knowledge Graphs in Visual Relationship Detection

D Herron, E Jiménez-Ruiz, G Tarroni… - arXiv preprint arXiv …, 2023 - arxiv.org
NeSy4VRD is a multifaceted resource designed to support the development of
neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the images …

Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation

Y Lin, D Liu, Y Xu, H Suo, M Li - 2024 IEEE 14th International …, 2024 - ieeexplore.ieee.org
Generating novel voices in speech synthesis is a challenging task with potential for creating
versatile voices that are needed in entertainment and research. One of the primary obstacles …

Enhanced Dense Image Captioning Based On Transformers

T Goswami, S Potu, KP Reddy… - 2024 8th …, 2024 - ieeexplore.ieee.org
The paper introduces a pioneering work that explores the fusion of computer vision and
natural language processing for narrative generation. We propose an innovative …

Enhance the message passing of key nodes in scene graph generation

H Qiu, Y Sun, X Luo - Proceedings of the 5th International Conference …, 2024 - dl.acm.org
Scene graph generation is an important approach in the field of visual scene understanding.
Several current studies have aimed at how to extract more robust relational features …

Multi-view Attention Networks for Visual Question Answering

M Li, Z Bai, J Deng - 2024 6th International Conference on …, 2024 - ieeexplore.ieee.org
Visual question answering (VQA) is a typical multimodal task that necessitates a
combination of computer vision and natural language processing expertise. The …

Navigating Multimodal Complexity: Advances in Model Design, Dataset Creation, and Evaluation Techniques

PGJ Vickers - 2024 - etheses.whiterose.ac.uk
Ibn Sina, a philosopher of 11th-century Persia, wrote of a 'Floating Man'. This man is floating
through a void, without the use of his sight or touch or any of the senses which make us …

A simple technical report about the Foundational Few-Shot Object Detection Challenge

Q Chen, J Ge, W **, L Yu - neeharperi.com
Abstract A technical report on our using method on the Foundational Few Shot Object Detection …