Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Graph neural networks for visual question answering: a systematic review

AA Yusuf, C Feng, X Mao, R Ally Duma… - Multimedia Tools and …, 2024 - Springer
Recently, visual question answering (VQA) has gained considerable interest within the
computer vision and natural language processing (NLP) research areas. The VQA task …

Multimodal relation extraction with efficient graph alignment

C Zheng, J Feng, Z Fu, Y Cai, Q Li, T Wang - Proceedings of the 29th …, 2021 - dl.acm.org
Relation extraction (RE) is a fundamental process in constructing knowledge graphs.
However, previous methods on relation extraction suffer sharp performance decline in short …

Vlg-net: Video-language graph matching network for video grounding

M Soldan, M Xu, S Qu, J Tegner… - Proceedings of the …, 2021 - openaccess.thecvf.com
Grounding language queries in videos aims at identifying the time interval (or moment)
semantically relevant to a language query. The solution to this challenging task demands …

Sentiment interaction and multi-graph perception with graph convolutional networks for aspect-based sentiment analysis

Q Lu, X Sun, R Sutcliffe, Y Xing, H Zhang - Knowledge-Based Systems, 2022 - Elsevier
Graph convolutional networks have been successfully applied to aspect-based sentiment
analysis, due to their ability to flexibly capture syntactic information and word dependencies …

Multimodal dialogue response generation

Q Sun, Y Wang, C Xu, K Zheng, Y Yang, H Hu… - arXiv preprint arXiv …, 2021 - arxiv.org
Responding with images has been recognized as an important capability for an intelligent
conversational agent. Yet existing works only focus on exploring the multimodal dialogue …

Low-fidelity video encoder optimization for temporal action localization

M Xu, JM Perez Rua, X Zhu… - Advances in Neural …, 2021 - proceedings.neurips.cc
Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline:
by first optimizing a video encoder on a large action classification dataset (i.e., source …

Exploring sparse spatial relation in graph inference for text-based vqa

S Zhou, D Guo, J Li, X Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding
redundant relational inference. To be specific, a large number of detected objects and …

Visual question answering using deep learning: A survey and performance analysis

Y Srivastava, V Murali, SR Dubey… - Computer Vision and …, 2021 - Springer
The Visual Question Answering (VQA) task combines challenges for processing
data with both Visual and Linguistic processing, to answer basic 'common sense' questions …

Image difference captioning with instance-level fine-grained feature representation

Q Huang, Y Liang, J Wei, Y Cai, H Liang… - IEEE transactions on …, 2021 - ieeexplore.ieee.org
The task of image difference captioning aims at locating changed objects in similar image
pairs and describing the difference with natural language. The key challenges of this task …