LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Z Wan, Z Wu, C Liu, J Huang, Z Zhu, P Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …

In-Context LoRA for Diffusion Transformers

L Huang, W Wang, ZF Wu, Y Shi, H Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …

MuMA-ToM: Multi-Modal Multi-Agent Theory of Mind

H Shi, S Ye, X Fang, C Jin, L Isik, YL Kuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding people's social interactions in complex real-world scenarios often relies on
intricate mental reasoning. To truly understand how and why people interact with one …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis

B Zeng, L Yang, S Li, J Liu, Z Zhang, J Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in diffusion models have demonstrated exceptional capabilities in image
and video generation, further improving the effectiveness of 4D synthesis. Existing 4D …

ACDC: Autoregressive Coherent Multimodal Generation Using Diffusion Correction

H Chung, D Lee, JC Ye - arXiv preprint arXiv:2410.04721, 2024 - arxiv.org
Autoregressive models (ARMs) and diffusion models (DMs) represent two leading
paradigms in generative modeling, each excelling in distinct areas: ARMs in global context …

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

S Han, W Huang, H Shi, L Zhuo, X Su, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The advancement of Large Vision Language Models (LVLMs) has significantly improved
multimodal understanding, yet challenges remain in video reasoning tasks due to the …

Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

S Kou, J Jin, C Liu, Y Ma, J Jia, Q Chen, P Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Orthus, an autoregressive (AR) transformer that excels in generating images
given textual prompts, answering questions based on visual inputs, and even crafting …

When Attention Sink Emerges in Language Models: An Empirical View

X Gu, T Pang, C Du, Q Liu, F Zhang, C Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Language Models (LMs) assign significant attention to the first token, even if it is not
semantically important, which is known as attention sink. This phenomenon has been widely …

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

J Wu, C Tang, J Wang, Y Zeng, X Li, Y Tong - arXiv preprint arXiv …, 2024 - arxiv.org
Story visualization, the task of creating visual narratives from textual descriptions, has seen
progress with text-to-image generation models. However, these models often lack effective …