ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
Unmasked teacher: Towards training-efficient video foundation models
Abstract Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …
Learning video representations from large language models
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
OneTracker: Unifying visual object tracking with foundation models and efficient tuning
Visual object tracking aims to localize the target object of each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …
ST-LLM: Large language models are effective temporal learners
Abstract Large Language Models (LLMs) have showcased impressive capabilities in text
comprehension and generation, prompting research efforts towards video LLMs to facilitate …
Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …
All in one: Exploring unified video-language pre-training
Abstract Mainstream Video-Language Pre-training models consist of three parts, a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
AGIQA-3K: An open database for AI-generated image quality assessment
With the rapid advancements of the text-to-image generative model, AI-generated images
(AGIs) have been widely applied to entertainment, education, social media, etc. However …