ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
Unmasked teacher: Towards training-efficient video foundation models
Abstract Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …
Learning video representations from large language models
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
OneTracker: Unifying visual object tracking with foundation models and efficient tuning
Visual object tracking aims to localize the target object of each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …
ST-LLM: Large language models are effective temporal learners
Abstract Large Language Models (LLMs) have showcased impressive capabilities in text
comprehension and generation, prompting research efforts towards video LLMs to facilitate …
Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …
All in one: Exploring unified video-language pre-training
Abstract Mainstream Video-Language Pre-training models consist of three parts, a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
AGIQA-3K: An open database for AI-generated image quality assessment
With the rapid advancements of the text-to-image generative model, AI-generated images
(AGIs) have been widely applied to entertainment, education, social media, etc. However …