VLAP: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou, MC Lin… - CoRR, 2023 - openreview.net
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA
model addresses both efficient frame sampling and effective cross-modal alignment in a …

ViLA: Efficient video-language alignment for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou… - … on Computer Vision, 2024 - Springer
We propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model
addresses both efficient frame sampling and effective cross-modal alignment in a unified …

Auxiliary modality learning with generalized curriculum distillation

Y Shen, X Wang, P Gao, M Lin - International Conference on …, 2023 - proceedings.mlr.press
Driven by the need from real-world applications, Auxiliary Modality Learning (AML) offers the
possibility to utilize more information from auxiliary data in training, while only requiring data …

Learning-Based Autonomous Driving With Enhanced Data Efficiency and Policy Training

Y Shen - 2023 - search.proquest.com
Autonomous vehicles are capable of sensing the environment and moving around with little
to no human intervention, enhancing efficiency and safety. Self-driving cars, for instance, will …