Eyes closed, safety on: Protecting multimodal LLMs via image-to-text transformation

Y Gou, K Chen, Z Liu, L Hong, H Xu, Z Li… - … on Computer Vision, 2024 - Springer
Multimodal large language models (MLLMs) have shown impressive reasoning abilities.
However, they are also more vulnerable to jailbreak attacks than their LLM predecessors …

Automated evaluation of large vision-language models on self-driving corner cases

K Chen, Y Li, W Zhang, Y Liu, P Li, R Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have received widespread attention for advancing
interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted …

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

X He, Q Huang, Z Zhang, Z Lin, Z Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual
effects in human-machine interaction. While previous works mostly generate structural …

Diffusion models for intelligent transportation systems: A survey

M Peng, K Chen, X Guo, Q Zhang, H Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Intelligent Transportation Systems (ITS) are vital in modern traffic management and
optimization, significantly enhancing traffic efficiency and safety. Recently, diffusion models …

DCAFuse: Dual-Branch Diffusion-CNN Complementary Feature Aggregation Network for Multi-Modality Image Fusion

X Lu, Y Jiang, H Hong, Q Sun, C Zhuo - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Multi-modality image fusion (MMIF) aims to integrate the complementary features of source
images into the fused image, including target saliency and texture specifics. Recently, image …

LayoutEnc: Leveraging Enhanced Layout Representations for Transformer-based Complex Scene Synthesis

X Cui, Q Sun, M Wang, L Li, W Zhou, H Li - ACM Transactions on …, 2025 - dl.acm.org
In complex scene synthesis, the effective representation of layouts is paramount. This paper
introduces LayoutEnc, an advanced approach specifically designed to enhance layout …

OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

B Li, X Jin, J Wang, Y Shi, Y Sun, X Wang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent diffusion models have demonstrated remarkable performance in both 3D scene
generation and perception tasks. Nevertheless, existing methods typically separate these …

Training-free point cloud recognition based on geometric and semantic information fusion

Y Chen, D Huang, Z Liao, X Cheng, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The trend of employing training-free methods for point cloud recognition is becoming
increasingly popular due to its significant reduction in computational resources and time …

BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network

Z Zhang, Z Xu, W Yang, Q Liao, JH Xue - arXiv preprint arXiv:2405.17037, 2024 - arxiv.org
Existing 3D occupancy networks demand significant hardware resources, hindering the
deployment of edge devices. Binarized Neural Networks (BNN) offer substantially reduced …