Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing

D Yao, Z Li, B Li, C Zhang, H Ma - Expert Systems with Applications, 2024 - Elsevier
Existing cross-modal hash retrieval methods can simultaneously enhance retrieval speed
and reduce storage space. However, these methods face a major challenge in determining …

VISIONE at video browser showdown 2023

G Amato, P Bolettieri, F Carrara, F Falchi… - … on multimedia modeling, 2023 - Springer
In this paper, we present the fourth release of VISIONE, a tool for fast and effective video
search on a large-scale dataset. It includes several search functionalities like text search …

Text-to-motion retrieval: Towards joint understanding of human motion data and natural language

N Messina, J Sedmidubsky, F Falchi… - Proceedings of the 46th …, 2023 - dl.acm.org
Due to recent advances in pose-estimation methods, human motion can be extracted from a
common video in the form of 3D skeleton sequences. Despite wonderful application …

Towards Retrieval-Augmented Architectures for Image Captioning

S Sarto, M Cornia, L Baraldi, A Nicolosi… - ACM Transactions on …, 2024 - dl.acm.org
The objective of image captioning models is to bridge the gap between the visual and
linguistic modalities by generating natural language descriptions that accurately reflect the …

VISIONE: a large-scale video retrieval system with advanced search functionalities

G Amato, P Bolettieri, F Carrara, F Falchi… - Proceedings of the …, 2023 - dl.acm.org
VISIONE is a large-scale video retrieval system that integrates multiple search
functionalities, including free text search, spatial color and object search, visual and …

Image–Text Matching Model Based on CLIP Bimodal Encoding

Y Zhu, H Xu, A Du, B Wang - Applied Sciences, 2024 - mdpi.com
Image–text matching is a fundamental task in the multimodal research field, connecting
computer vision and natural language processing by aligning visual content with …

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

R Tao, M Zhu, H Cao, H Ren - Sensors, 2024 - mdpi.com
Fine-grained representation is fundamental to species classification based on deep
learning, and in this context, cross-modal contrastive learning is an effective method. The …

VISIONE for newbies: an easier-to-use video retrieval system

G Amato, P Bolettieri, F Carrara, F Falchi… - Proceedings of the 20th …, 2023 - dl.acm.org
This paper presents a revised version of the VISIONE video retrieval system, which offers a
wide range of search functionalities, including free text search, spatial color and object …

Cascaded transformer-based networks for Wikipedia large-scale image-caption matching

N Messina, DA Coccomini, A Esuli, F Falchi - Multimedia Tools and …, 2024 - Springer
With the increasing importance of multimedia and multilingual data in online encyclopedias,
novel methods are needed to fill domain gaps and automatically connect different modalities …

Evaluating Performance and Trends in Interactive Video Retrieval: Insights from the 12th VBS Competition

L Vadicamo, R Arnold, W Bailer, F Carrara… - IEEE …, 2024 - ieeexplore.ieee.org
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS)
competition, a well-established international benchmarking campaign for interactive video …