InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD
Abstract The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in comprehending fine …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
AIGIQA-20K: A large database for AI-generated image quality assessment
With the rapid advancements in AI-Generated Content (AIGC), AI-Generated Images (AIGIs)
have been widely applied in entertainment, education, and social media. However, due to the …
NaturalBench: Evaluating vision-language models on natural adversarial samples
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …
What If We Recaption Billions of Web Images with LLaMA-3?
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that
semantically aligning and enriching textual descriptions of these pairs can significantly …
MotionClone: Training-free motion cloning for controllable video generation
Motion-based controllable video generation offers the potential for creating captivating
visual content. Existing methods typically necessitate model training to encode particular …
LoTLIP: Improving language-image pre-training for long text understanding
In this work, we empirically confirm that the key reason for this issue is that the
training images are usually paired with short captions, leaving certain tokens easily …
PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of
information. As the idiom "A picture is worth a thousand words" implies, representing a …
E5-V: Universal embeddings with multimodal large language models
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …
FineCLIPER: Multi-modal fine-grained CLIP for dynamic facial expression recognition with adapters
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human
behavior. However, current methods exhibit limited performance mainly due to the …