Eyes closed, safety on: Protecting multimodal LLMs via image-to-text transformation
Multimodal large language models (MLLMs) have shown impressive reasoning abilities.
However, they are also more vulnerable to jailbreak attacks than their LLM predecessors …
Automated evaluation of large vision-language models on self-driving corner cases
Large Vision-Language Models (LVLMs) have received widespread attention for advancing
the interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted …
Emova: Empowering language models to see, hear and speak with vivid emotions
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
Co-speech gestures if presented in the lively form of videos can achieve superior visual
effects in human-machine interaction. While previous works mostly generate structural …
Diffusion models for intelligent transportation systems: A survey
Intelligent Transportation Systems (ITS) are vital in modern traffic management and
optimization, significantly enhancing traffic efficiency and safety. Recently, diffusion models …
DCAFuse: Dual-Branch Diffusion-CNN Complementary Feature Aggregation Network for Multi-Modality Image Fusion
Multi-modality image fusion (MMIF) aims to integrate the complementary features of source
images into the fused image, including target saliency and texture specifics. Recently, image …
LayoutEnc: Leveraging Enhanced Layout Representations for Transformer-based Complex Scene Synthesis
X Cui, Q Sun, M Wang, L Li, W Zhou, H Li - ACM Transactions on …, 2025 - dl.acm.org
In complex scene synthesis, the effective representation of layouts is paramount. This paper
introduces LayoutEnc, an advanced approach specifically designed to enhance layout …
OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation
Recent diffusion models have demonstrated remarkable performance in both 3D scene
generation and perception tasks. Nevertheless, existing methods typically separate these …
Training-free point cloud recognition based on geometric and semantic information fusion
Y Chen, D Huang, Z Liao, X Cheng, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The trend of employing training-free methods for point cloud recognition is becoming
increasingly popular due to its significant reduction in computational resources and time …
BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network
Existing 3D occupancy networks demand significant hardware resources, hindering the
deployment of edge devices. Binarized Neural Networks (BNN) offer substantially reduced …