Parameter-efficient fine-tuning for large models: A comprehensive survey
Large models represent a groundbreaking advancement in multiple application fields,
enabling remarkable achievements across various tasks. However, their unprecedented …
Visual tuning
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the rapid development of pre-trained visual foundation …
Segment anything in high quality
The recent Segment Anything Model (SAM) represents a big leap in scaling up
segmentation models, allowing for powerful zero-shot capabilities and flexible prompting …
SimDA: Simple diffusion adapter for efficient video generation
The recent wave of AI-generated content has witnessed great development and success
in Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …
Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
ONE-PEACE: Exploring one general representation model toward unlimited modalities
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …
Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability on various visual tasks. Transferring knowledge …
DreamVideo: Composing your dream videos with customized subject and motion
Customized generation using diffusion models has made impressive progress in image
generation but remains unsatisfactory in the challenging video generation task as it requires …
LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment
Video-language (VL) pretraining has achieved remarkable improvements on multiple
downstream tasks. However, the current VL pretraining framework is hard to extend to …
Distilling vision-language models on millions of videos
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models, but there …