ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
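The snippet names the core mechanism: each modality is bound to a shared space by aligning it with paired image embeddings. Below is a minimal sketch of that binding step, assuming hypothetical stand-in linear encoders and a standard symmetric InfoNCE loss; the paper's actual backbones and training recipe are not reproduced here.

```python
import torch
import torch.nn.functional as F

dim = 128
image_enc = torch.nn.Linear(2048, dim)   # stand-in for a vision backbone
audio_enc = torch.nn.Linear(512, dim)    # stand-in for an audio backbone

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature        # pairwise cosine similarities
    targets = torch.arange(a.size(0))       # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Paired (image, audio) features from the same clips bind audio to the
# image space; repeating this per modality yields one joint space.
img = image_enc(torch.randn(8, 2048))
aud = audio_enc(torch.randn(8, 512))
loss = info_nce(img, aud)
loss.backward()
```

Binding every modality to the same image anchor is what lets unseen modality pairs, such as audio and text, align without ever being trained together.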

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent studies have shown promising results in utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundation models,
while other modalities such as audio and subtitles in videos have not received sufficient …

OneTracker: Unifying visual object tracking with foundation models and efficient tuning

L Hong, S Yan, R Zhang, W Li, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual object tracking aims to localize the target object in each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …
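To make that task statement concrete, here is a toy, self-contained sketch of the tracking interface, assuming a naive sliding-window matcher and a hypothetical encode_patch feature; real trackers (including OneTracker) use learned backbones, search regions, and far more efficient matching.

```python
import torch
import torch.nn.functional as F

# The target is specified once by its appearance in the first frame,
# then localized in every subsequent frame.
def encode_patch(patch):
    # Hypothetical feature: a unit-normalized flattened patch.
    return F.normalize(patch.flatten().float(), dim=0)

def track(frames, init_box):
    x, y, w, h = init_box
    template = encode_patch(frames[0][y:y+h, x:x+w])
    boxes = [init_box]
    for frame in frames[1:]:
        best, best_box = -1.0, boxes[-1]
        # Exhaustive sliding-window search (naive; for illustration only).
        for yy in range(0, frame.shape[0] - h + 1, 4):
            for xx in range(0, frame.shape[1] - w + 1, 4):
                score = template @ encode_patch(frame[yy:yy+h, xx:xx+w])
                if score > best:
                    best, best_box = score, (xx, yy, w, h)
        boxes.append(best_box)
    return boxes

frames = torch.rand(3, 64, 64)          # three toy grayscale frames
print(track(frames, (10, 10, 16, 16)))  # one box per frame
```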

ST-LLM: Large language models are effective temporal learners

R Liu, C Li, H Tang, Y Ge, Y Shan, G Li - European Conference on …, 2024 - Springer
Large Language Models (LLMs) have showcased impressive capabilities in text
comprehension and generation, prompting research efforts towards video LLMs to facilitate …

Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations

L Yuan, Y Chen, G Cui, H Gao, F Zou… - Advances in …, 2023 - proceedings.neurips.cc
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Mainstream Video-Language Pre-training models consist of three parts: a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
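A minimal skeleton of that conventional three-part design, for illustration only: the module sizes, stand-in encoders, and BERT-sized vocabulary are assumptions, and this is the baseline architecture the abstract describes, not All-in-one's unified backbone.

```python
import torch
import torch.nn as nn

class ThreePartVLM(nn.Module):
    """Separate video and text encoders feeding a cross-modal fusion Transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_encoder = nn.Linear(1024, dim)    # stand-in video backbone
        self.text_encoder = nn.Embedding(30522, dim) # stand-in text encoder
        fusion_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)

    def forward(self, video_feats, token_ids):
        v = self.video_encoder(video_feats)           # (B, Tv, dim)
        t = self.text_encoder(token_ids)              # (B, Tt, dim)
        return self.fusion(torch.cat([v, t], dim=1))  # joint video-text tokens

model = ThreePartVLM()
out = model(torch.randn(2, 8, 1024), torch.randint(0, 30522, (2, 12)))
print(out.shape)  # torch.Size([2, 20, 256])
```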

AGIQA-3K: An open database for AI-generated image quality assessment

C Li, Z Zhang, H Wu, W Sun, X Min… - … on Circuits and …, 2023 - ieeexplore.ieee.org
With the rapid advancement of text-to-image generative models, AI-generated images
(AGIs) have been widely applied to entertainment, education, social media, etc. However …