Explaining transformer-based image captioning models: An empirical analysis
Image Captioning is the task of translating an input image into a textual description. As such,
it connects Vision and Language in a generative fashion, with applications that range from …
Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval
Image-text matching is gaining a leading role among tasks involving the joint understanding
of vision and language. In literature, this task is often used as a pre-training objective to …
Deep residual weight-sharing attention network with low-rank attention for visual question answering
The attention-based networks have become prevailing recently in visual question answering
(VQA) due to their high performances. However, the extensive memory consumption of …
LOIS: looking out of instance semantics for visual question answering
S Zhang, Y Chen, Y Sun, F Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual question answering (VQA) has been intensively studied as a multimodal task,
requiring efforts to bridge vision and language for correct answer inference. Recent attempts …
Why are you traveling? Inferring trip profiles from online reviews and domain-knowledge
This paper addresses the task of inferring trip profiles (TPs), which consists of determining
the profile of travelers engaged in a particular trip given a set of possible categories. TPs …
Learning to select: A fully attentive approach for novel object captioning
Image captioning models have lately shown impressive results when applied to standard
datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger …
Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline
N Messina, L Vadicamo, L Maltese… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in deep learning have significantly enhanced content-based retrieval
methods, notably through models like CLIP that map images and texts into a shared …
Is CLIP the main roadblock for fine-grained open-world perception?
Modern applications increasingly demand flexible computer vision models that adapt to
novel concepts not encountered during training. This necessity is pivotal in emerging …
Trasformare Visione e Linguaggio con Attenzione [Transforming Vision and Language with Attention]
M Stefanini - 2023 - iris.unimore.it
Attention mechanism and Transformer-based architectures have recently revolutionized the
artificial intelligence landscape in almost every field. Ever since their first introduction, they …
Cross-modal information interaction reasoning network for image-text retrieval
魏钰琦, **宁 - Journal of Computer Engineering & …, 2023 - search.ebscohost.com
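To address the mismatch in semantic feature complexity between the image and text modalities
in cross-modal retrieval, the paper proposes an image-text matching method that combines local
fine-grained alignment with global feature reasoning. Image and text features are first fed into an adaptive cross-attention network …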