Content-based and knowledge-enriched representations for classification across modalities: a survey
This survey documents representation approaches for classification across different
modalities, from purely content-based methods to techniques utilizing external sources of …
CLIP-TD: CLIP targeted distillation for vision-language tasks
Contrastive language-image pretraining (CLIP) links vision and language modalities into a
unified embedding space, yielding tremendous potential for vision-language (VL) tasks …
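The entry above rests on CLIP's core idea: embedding images and texts in one shared space and pulling matched pairs together with a symmetric contrastive loss. The sketch below illustrates that alignment objective only, not the paper's distillation method; the function name, dimensions, and random tensors standing in for real encoder outputs are all illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both modalities so dot products are cosine similarities
    # in the shared embedding space.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: align images to their texts and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random tensors stand in for image- and text-encoder outputs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb))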
Camera on-boarding for person re-identification using hypothesis transfer learning
Most of the existing approaches for person re-identification consider a static setting where
the number of cameras in the network is fixed. An interesting direction, which has received …
Improving visual question answering by combining scene-text information
The text present in natural scenes contains semantic information about its surrounding
environment. For example, the majority of questions asked by blind people related to images …
Learning to respond with stickers: A framework of unifying multi-modality in multi-turn dialog
Stickers with vivid and engaging expressions are becoming increasingly popular in online
messaging apps, and some works are dedicated to automatically selecting sticker responses by …
Multimodal adaptive distillation for leveraging unimodal encoders for vision-language tasks
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully
curated vision-language datasets. While these datasets reach an order of 10 million …
Transferring domain-agnostic knowledge in video question answering
Video question answering (VideoQA) is designed to answer a given question based on a
relevant video clip. The current available large-scale datasets have made it possible to …
Vision to language: Methods, metrics and datasets
Alan Turing's pioneering 1950s vision of machines capable of thinking like humans is still
what Artificial Intelligence (AI) and Deep Learning research aspires to …
Learning to respond with your favorite stickers: A framework of unifying multi-modality and user preference in multi-turn dialog
Stickers with vivid and engaging expressions are becoming increasingly popular in online
messaging apps, and some works are dedicated to automatically selecting sticker responses by …
Decoupled box proposal and featurization with ultrafine-grained semantic labels improve image captioning and visual question answering
Object detection plays an important role in current solutions to vision and language tasks
like image captioning and visual question answering. However, popular models like Faster …