Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), also known as natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Graph neural networks for visual question answering: a systematic review
Recently, visual question answering (VQA) has gained considerable interest within the
computer vision and natural language processing (NLP) research areas. The VQA task …
Multimodal relation extraction with efficient graph alignment
Relation extraction (RE) is a fundamental process in constructing knowledge graphs.
However, previous methods on relation extraction suffer sharp performance decline in short …
VLG-Net: Video-language graph matching network for video grounding
Grounding language queries in videos aims at identifying the time interval (or moment)
semantically relevant to a language query. The solution to this challenging task demands …
Sentiment interaction and multi-graph perception with graph convolutional networks for aspect-based sentiment analysis
Graph convolutional networks have been successfully applied to aspect-based sentiment
analysis, due to their ability to flexibly capture syntactic information and word dependencies …
Multimodal dialogue response generation
Responding with images has been recognized as an important capability for an intelligent
conversational agent. Yet existing works only focus on exploring the multimodal dialogue …
Low-fidelity video encoder optimization for temporal action localization
Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline:
by first optimizing a video encoder on a large action classification dataset (i.e., source …
Exploring sparse spatial relation in graph inference for text-based VQA
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding
redundant relational inference. To be specific, a large number of detected objects and …
Visual question answering using deep learning: A survey and performance analysis
The Visual Question Answering (VQA) task combines challenges for processing
data with both Visual and Linguistic processing, to answer basic 'common sense' questions …
Image difference captioning with instance-level fine-grained feature representation
The task of image difference captioning aims at locating changed objects in similar image
pairs and describing the difference with natural language. The key challenges of this task …