A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

J Chen, X Li, X Ye, C Li, Z Fan, H Zhao - arxiv preprint arxiv:2404.04363, 2024 - arxiv.org
With the success of 2D diffusion models, 2D AIGC content has already transformed our lives.
Recently, this success has been extended to 3D AIGC, with state-of-the-art methods …

The AI Agent Index

S Casper, L Bailey, R Hunter, C Ezell, E Cabalé… - arxiv preprint arxiv …, 2025 - arxiv.org
Leading AI developers and startups are increasingly deploying agentic AI systems that can
plan and execute complex tasks with limited human involvement. However, there is currently …

MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control

J Lee, D Hahm, JS Choi, WB Knox, K Lee - arxiv preprint arxiv …, 2024 - arxiv.org
Autonomous agents powered by large language models (LLMs) show promising potential in
assistive tasks across various domains, including mobile device control. As these agents …

AI Cyber Risk Benchmark: Automated Exploitation Capabilities

D Ristea, V Mavroudis, C Hicks - arxiv preprint arxiv:2410.21939, 2024 - arxiv.org
We introduce a new benchmark for assessing AI models' capabilities and risks in automated
software exploitation, focusing on their ability to detect and exploit vulnerabilities in real …

G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems

S Wang, G Zhang, M Yu, G Wan, F Meng, C Guo… - arxiv preprint arxiv …, 2025 - arxiv.org
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated
remarkable capabilities in various complex tasks, ranging from collaborative problem …

The Science of Evaluating Foundation Models

J Yuan, J Zhang, A Wen, X Hu - arxiv preprint arxiv:2502.09670, 2025 - arxiv.org
The emergent phenomena of large foundation models have revolutionized natural language
processing. However, evaluating these models presents significant challenges due to their …

SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

R Sun, J Chang, H Pearce, C Xiao, B Li, Q Wu… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal foundation models (MFMs) represent a significant advancement in artificial
intelligence, combining diverse data modalities to enhance learning and understanding …

The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

K Zhou, C Liu, X Zhao, S Jangam, J Srinivasa… - arxiv preprint arxiv …, 2025 - arxiv.org
The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1,
has led to significant improvements in complex reasoning over non-reasoning large …

Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers

D Beaglehole, A Radhakrishnan, E Boix-Adserà… - arxiv preprint arxiv …, 2025 - arxiv.org
A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is
difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know …