LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …
In-Context LoRA for Diffusion Transformers
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Understanding people's social interactions in complex real-world scenarios often relies on
intricate mental reasoning. To truly understand how and why people interact with one …
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis
Recent advances in diffusion models have demonstrated exceptional capabilities in image
and video generation, further improving the effectiveness of 4D synthesis. Existing 4D …
ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction
Autoregressive models (ARMs) and diffusion models (DMs) represent two leading
paradigms in generative modeling, each excelling in distinct areas: ARMs in global context …
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
The advancement of Large Vision Language Models (LVLMs) has significantly improved
multimodal understanding, yet challenges remain in video reasoning tasks due to the …
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
We introduce Orthus, an autoregressive (AR) transformer that excels in generating images
given textual prompts, answering questions based on visual inputs, and even crafting …
When Attention Sink Emerges in Language Models: An Empirical View
Language Models (LMs) assign significant attention to the first token, even if it is not
semantically important, which is known as attention sink. This phenomenon has been widely …
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Story visualization, the task of creating visual narratives from textual descriptions, has seen
progress with text-to-image generation models. However, these models often lack effective …