Turbo: Informativity-driven acceleration plug-in for vision-language large models

C Ju, H Wang, H Cheng, X Chen, Z Zhai… - … on Computer Vision, 2024 - Springer
Vision-Language Large Models (VLMs) have recently become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e. …

Video-guided foley sound generation with multimodal controls

Z Chen, P Seetharaman, B Russell, O Nieto… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating sound effects for videos often requires creating artistic sounds that diverge
significantly from real-life sources, as well as flexible control over the sound design. To address this …

Denoiser: Rethinking the robustness for open-vocabulary action recognition

H Cheng, C Ju, H Wang, J Liu, M Chen, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action
Recognition (OVAR) has recently gained increasing attention with the development of vision …

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-
training (CLIP) has made significant strides, becoming the foundation for various downstream …

Contrast-Unity for Partially-Supervised Temporal Sentence Grounding

H Wang, C Ju, W Lin, C Ma, S Xiao, Y Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Temporal sentence grounding aims to detect the timestamps of events described by a natural
language query within given untrimmed videos. The existing fully-supervised setting achieves …

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

H Wang, Z Yu, G Spadaro, C Ju, V Quétu… - arXiv preprint arXiv …, 2025 - arxiv.org
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable
effectiveness for multi-modal tasks due to their ability to generate and understand cross …