Guiding instruction-based image editing via multimodal large language models

TJ Fu, W Hu, X Du, WY Wang, Y Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction-based image editing improves the controllability and flexibility of image
manipulation via natural commands without elaborate descriptions or regional masks …

Towards semantic equivalence of tokenization in multimodal LLM

S Wu, H Fei, X Li, J Ji, H Zhang, TS Chua… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in
processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization …

DCCF: Deep comprehensible color filter learning framework for high-resolution image harmonization

B Xue, S Ran, Q Chen, R Jia, B Zhao… - European conference on …, 2022 - Springer
Image color harmonization algorithm aims to automatically match the color distribution of
foreground and background images captured in different conditions. Previous deep learning …

Towards generic image manipulation detection with weakly-supervised self-consistency learning

Y Zhai, T Luan, D Doermann… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
As advanced image manipulation techniques emerge, detecting the manipulation becomes
increasingly important. Despite the success of recent learning-based approaches for image …

Text-to-image cross-modal generation: A systematic review

M Żelaszczyk, J Mańdziuk - arXiv preprint arXiv:2401.11631, 2024 - arxiv.org
We review research on generating visual data from text from the angle of "cross-modal
generation." This point of view allows us to draw parallels between various methods geared …

Auto-encoding morph-tokens for multimodal LLM

K Pan, S Tang, J Li, Z Fan, W Chow, S Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation
(visual output) presents an ongoing challenge. This is due to a conflicting objective: for …

A review of multi-modal learning from the text-guided visual processing viewpoint

U Ullah, JS Lee, CH An, H Lee, SY Park, RH Baek… - Sensors, 2022 - mdpi.com
For decades, co-relating different data domains to attain the maximum potential of machines
has driven research, especially in neural networks. Similarly, text and visual data (images …

Language-guided global image editing via cross-modal cyclic mechanism

W Jiang, N Xu, J Wang, C Gao, J Shi… - Proceedings of the …, 2021 - openaccess.thecvf.com
Editing an image automatically via a linguistic request can significantly save laborious
manual work and is friendly to photography novice. In this paper, we focus on the task of …

A regionally indicated visual grounding network for remote sensing images

R Hang, S Xu, Q Liu - IEEE Transactions on Geoscience and …, 2024 - ieeexplore.ieee.org
Visual grounding (VG) is essential to promote the human-computer interaction in object
detection tasks. Most of the current VG methods mainly focus on grounding the target objects …

LS-GAN: iterative language-based image manipulation via long and short term consistency reasoning

G Cong, L Li, Z Liu, Y Tu, W Qin, S Zhang… - Proceedings of the 30th …, 2022 - dl.acm.org
Iterative language-based image manipulation aims to edit images step by step according to
user's linguistic instructions. The existing methods mostly focus on aligning the attributes …