The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

VideoMaker: Zero-shot customized video generation with the inherent force of video diffusion models

T Wu, Y Zhang, X Cun, Z Qi, J Pu, H Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-shot customized video generation has gained significant attention due to its substantial
application potential. Existing methods rely on additional models to extract and inject …

CorrCLIP: Reconstructing correlations in CLIP with off-the-shelf foundation models for open-vocabulary semantic segmentation

D Zhang, F Liu, Q Tang - arXiv preprint arXiv:2411.10086, 2024 - arxiv.org
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel
without relying on a predefined set of categories. Contrastive Language-Image Pre-training …

CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling

J Zhang, X Qu, T Zhu, Y Cheng - arXiv preprint arXiv:2409.19291, 2024 - arxiv.org
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone
in multimodal intelligence. However, recent studies have identified that the information loss …

Decentralized Diffusion Models

D McAllister, M Tancik, J Song, A Kanazawa - arXiv preprint arXiv …, 2025 - arxiv.org
Large-scale AI model training divides work across thousands of GPUs, then synchronizes
gradients across them at each step. This incurs a significant network burden that only …

QR-DETR: Query Routing for Detection Transformer

T Senthivel, NS Vu - … of the Asian Conference on Computer …, 2024 - openaccess.thecvf.com
Detection Transformer (DETR) predicts object bounding boxes and classes from learned
object queries. However, DETR exhibits three major flaws: (1) Only a subset of object queries …

WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

J Ma, Y Niu, S Huang, G Han, SF Chang - arXiv preprint arXiv:2405.18405, 2024 - arxiv.org
Language has been useful in extending the vision encoder to data from diverse distributions
without empirical discovery in training domains. However, as the image description is mostly …

Towards maintainable machine learning development through continual and modular learning

O Ostapenko - 2024 - papyrus.bib.umontreal.ca
As machine learning models grow in size and complexity, their maintainability becomes a
critical concern, especially when they are increasingly deployed in dynamic, real-world …

Novel Techniques in Addressing Label Bias & Noise in Low-Quality Real-World Data

J Ma - 2024 - search.proquest.com
Data serves as the foundation in building effective deep learning algorithms, yet the process
of annotation and curation to maintain high data quality is time-intensive. The challenges …