Turbo: Informativity-driven acceleration plug-in for vision-language large models

C Ju, H Wang, H Cheng, X Chen, Z Zhai… - … on Computer Vision, 2024 - Springer
Vision-Language Large Models (VLMs) have recently become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e. …

Video-guided foley sound generation with multimodal controls

Z Chen, P Seetharaman, B Russell, O Nieto… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating sound effects for videos often requires creating artistic sounds that diverge
significantly from real-life sources, as well as flexible control over the sound design. To address this …

Denoiser: Rethinking the robustness for open-vocabulary action recognition

H Cheng, C Ju, H Wang, J Liu, M Chen, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action
Recognition (OVAR) has recently gained increasing attention with the development of vision …

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-
training (CLIP) has made significant strides, becoming the foundation for various downstream …

Contrast-Unity for Partially-Supervised Temporal Sentence Grounding

H Wang, C Ju, W Lin, C Ma, S Xiao, Y Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Temporal sentence grounding aims to detect the timestamps of events described by a natural
language query within given untrimmed videos. The existing fully-supervised setting achieves …

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

H Wang, Z Yu, G Spadaro, C Ju, V Quétu… - arXiv preprint arXiv …, 2025 - arxiv.org
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable
effectiveness for multi-modal tasks due to their ability to generate and understand cross …