Parameter-efficient fine-tuning for large models: A comprehensive survey

Z Han, C Gao, J Liu, J Zhang, SQ Zhang - arXiv preprint arXiv:2403.14608, 2024 - arxiv.org
Large models represent a groundbreaking advancement in multiple application fields,
enabling remarkable achievements across various tasks. However, their unprecedented …

Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing Surveys, 2024 - dl.acm.org
Fine-tuning visual models has been widely shown to yield promising performance on many
downstream visual tasks. With the rapid development of pre-trained visual foundation …

Segment anything in high quality

L Ke, M Ye, M Danelljan, YW Tai… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
The recent Segment Anything Model (SAM) represents a big leap in scaling up
segmentation models, allowing for powerful zero-shot capabilities and flexible prompting …

SimDA: Simple diffusion adapter for efficient video generation

Z Xing, Q Dai, H Hu, Z Wu… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
The recent wave of AI-generated content has witnessed great development and success in
Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way to build a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

W Wu, X Wang, H Luo, J Wang… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023 - openaccess.thecvf.com
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability on various visual tasks. Transferring knowledge …

DreamVideo: Composing your dream videos with customized subject and motion

Y Wei, S Zhang, Z Qing, H Yuan, Z Liu… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
Customized generation using diffusion models has made impressive progress in image
generation but remains unsatisfactory in the challenging video generation task as it requires …

LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment

B Zhu, B Lin, M Ning, Y Yan, J Cui, HF Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-language (VL) pretraining has achieved remarkable improvements in multiple
downstream tasks. However, the current VL pretraining framework is hard to extend to …

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models, but there …