Content-based and knowledge-enriched representations for classification across modalities: a survey

N Pittaras, G Giannakopoulos, P Stamatopoulos… - ACM Computing …, 2023 - dl.acm.org
This survey documents representation approaches for classification across different
modalities, from purely content-based methods to techniques utilizing external sources of …

CLIP-TD: CLIP targeted distillation for vision-language tasks

Z Wang, N Codella, YC Chen, L Zhou, J Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
Contrastive language-image pretraining (CLIP) links vision and language modalities into a
unified embedding space, yielding tremendous potential for vision-language (VL) tasks …

Camera on-boarding for person re-identification using hypothesis transfer learning

SM Ahmed, AR Lejbolle, R Panda… - Proceedings of the …, 2020 - openaccess.thecvf.com
Most of the existing approaches for person re-identification consider a static setting where
the number of cameras in the network is fixed. An interesting direction, which has received …

Improving visual question answering by combining scene-text information

H Sharma, AS Jalal - Multimedia Tools and Applications, 2022 - Springer
The text present in natural scenes contains semantic information about its surrounding
environment. For example, the majority of questions asked by blind people related to images …

Learning to respond with stickers: A framework of unifying multi-modality in multi-turn dialog

S Gao, X Chen, C Liu, L Liu, D Zhao… - Proceedings of the Web …, 2020 - dl.acm.org
Stickers with vivid and engaging expressions are becoming increasingly popular in online
messaging apps, and some works are dedicated to automatically selecting sticker responses by …

Multimodal adaptive distillation for leveraging unimodal encoders for vision-language tasks

Z Wang, N Codella, YC Chen, L Zhou, X Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully
curated vision-language datasets. While these datasets reach an order of 10 million …

Transferring domain-agnostic knowledge in video question answering

T Wu, N Garcia, M Otani, C Chu, Y Nakashima… - arXiv preprint arXiv …, 2021 - arxiv.org
Video question answering (VideoQA) is designed to answer a given question based on a
relevant video clip. The current available large-scale datasets have made it possible to …

Vision to language: Methods, metrics and datasets

N Sharif, U Nadeem, SAA Shah, M Bennamoun… - … Paradigms: Advances in …, 2020 - Springer
Alan Turing's pioneering vision from the 1950s of machines capable of thinking like
humans is still what Artificial Intelligence (AI) and Deep Learning research aspires to …

Learning to respond with your favorite stickers: A framework of unifying multi-modality and user preference in multi-turn dialog

S Gao, X Chen, L Liu, D Zhao, R Yan - ACM Transactions on Information …, 2021 - dl.acm.org
Stickers with vivid and engaging expressions are becoming increasingly popular in online
messaging apps, and some works are dedicated to automatically selecting sticker responses by …

Decoupled box proposal and featurization with ultrafine-grained semantic labels improve image captioning and visual question answering

S Changpinyo, B Pang, P Sharma, R Soricut - arXiv preprint arXiv …, 2019 - arxiv.org
Object detection plays an important role in current solutions to vision and language tasks
like image captioning and visual question answering. However, popular models like Faster …