A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control

J Lee, D Hahm, JS Choi, WB Knox, K Lee - arxiv preprint arxiv …, 2024 - arxiv.org
Autonomous agents powered by large language models (LLMs) show promising potential in
assistive tasks across various domains, including mobile device control. As these agents …

AI Cyber Risk Benchmark: Automated Exploitation Capabilities

D Ristea, V Mavroudis, C Hicks - arxiv preprint arxiv:2410.21939, 2024 - arxiv.org
We introduce a new benchmark for assessing AI models' capabilities and risks in automated
software exploitation, focusing on their ability to detect and exploit vulnerabilities in real …

SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

R Sun, J Chang, H Pearce, C Xiao, B Li, Q Wu… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal foundation models (MFMs) represent a significant advancement in artificial
intelligence, combining diverse data modalities to enhance learning and understanding …

The AI Agent Index

S Casper, L Bailey, R Hunter, C Ezell, E Cabalé… - arxiv preprint arxiv …, 2025 - arxiv.org
Leading AI developers and startups are increasingly deploying agentic AI systems that can
plan and execute complex tasks with limited human involvement. However, there is currently …

Benchmarking OpenAI o1 in Cyber Security

D Ristea, V Mavroudis, C Hicks - arxiv preprint arxiv:2410.21939, 2024 - researchgate.net
We evaluate OpenAI's o1-preview and o1-mini models, benchmarking their performance
against the earlier GPT-4o model. Our evaluation focuses on their ability to detect …

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

A Maurya - 2024 - search.proquest.com
The exponential growth of data-intensive scientific simulations and deep learning workloads
presents significant challenges for high-performance computing (HPC) systems. These …