Vision-based traffic accident detection and anticipation: A survey

J Fang, J Qiao, J Xue, Z Li - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org
Traffic accident detection and anticipation is an obstinate road safety problem and
painstaking efforts have been devoted. With the rapid growth of video data, Vision-based …

LightGT: A light graph transformer for multimedia recommendation

Y Wei, W Liu, F Liu, X Wang, L Nie… - Proceedings of the 46th …, 2023 - dl.acm.org
Multimedia recommendation methods aim to discover the user preference on the multi-modal
information to enhance the collaborative filtering (CF) based recommender system …

WINNER: Weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding

M Li, H Wang, W Zhang, J Miao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a
language query. Existing techniques achieve such alignment by exploiting dense boundary …

Gradient-regulated meta-prompt learning for generalizable vision-language models

J Li, M Gao, L Wei, S Tang, W Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training
models to adapt to downstream tasks in a parameter- and data-efficient way, by …

Dual learning with dynamic knowledge distillation for partially relevant video retrieval

J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed containing much …

Unified generative and discriminative training for multi-modal large language models

W Chow, J Li, Q Yu, K Pan, H Fei… - Advances in …, 2025 - proceedings.neurips.cc
In recent times, Vision-Language Models (VLMs) have been trained under two
predominant paradigms. Generative training has enabled Multimodal Large Language …

Online distillation-enhanced multi-modal transformer for sequential recommendation

W Ji, X Liu, A Zhang, Y Wei, Y Ni, X Wang - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Multi-modal recommendation systems, which integrate diverse types of information, have
gained widespread attention in recent years. However, compared to traditional collaborative …

Not all inputs are valid: Towards open-set video moment retrieval using language

X Fang, W Fang, D Liu, X Qu, J Dong, P Zhou… - Proceedings of the …, 2024 - dl.acm.org
Video Moment Retrieval (VMR) targets to retrieve the specific moment corresponding to a
sentence query from an untrimmed video. Although recent respectable works have made …

Redundancy-aware transformer for video question answering

Y Li, X Yang, A Zhang, C Feng, X Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically,
the current video encoders tend to holistically embed all video clues at different granularities …