From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Future events as backdoor triggers: Investigating temporal vulnerabilities in LLMs

S Price, A Panickssery, S Bowman… - arXiv preprint arXiv …, 2024 - arxiv.org
Backdoors are hidden behaviors that are only triggered once an AI system has been
deployed. Bad actors looking to create successful backdoors must design them to avoid …

Enhancing logical reasoning in large language models through graph-based synthetic data

J Zhou, A Ghaddar, G Zhang, L Ma, Y Hu, S Pal… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite recent advances in training and prompting strategies for Large Language Models
(LLMs), these models continue to face challenges with complex logical reasoning tasks that …

Graph Reasoning with LLMs (GReaL)

A Tsitsulin, B Perozzi, B Fatemi… - Proceedings of the 30th …, 2024 - dl.acm.org
Graphs are a powerful tool for representing and analyzing complex relationships in real-
world applications. Large Language Models (LLMs) have demonstrated impressive …

ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

D Handa, P Dolin, S Kumbhar, TC Son… - arXiv preprint arXiv …, 2024 - arxiv.org
Reasoning about Actions and Change (RAC) has historically played a pivotal role in solving
foundational AI problems, such as the frame problem. It has driven advancements in AI …

ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

Y Park, C Yoon, J Park, D Lee, M Jeong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have significantly impacted many aspects of our lives.
However, assessing and ensuring their chronological knowledge remains challenging …

Perceive the Passage of Time: A Systematic Evaluation of Large Language Model in Temporal Relativity

S Chen, Y Zheng, S Li, Q Cheng… - Proceedings of the 31st …, 2025 - aclanthology.org
Temporal perception is crucial for Large Language Models (LLMs) to effectively understand
the world. However, current benchmarks primarily focus on temporal reasoning, falling short …

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

C Li, Q Chen, Z Li, F Tao, Y Zhang - arXiv preprint arXiv:2411.09105, 2024 - arxiv.org
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based tasks …

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

D Herel, V Bartek, T Mikolov - arXiv preprint arXiv:2409.13338, 2024 - arxiv.org
Who is the US President? The answer changes depending on when the question is asked.
While large language models (LLMs) are evaluated on various reasoning tasks, they often …

Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

Y Zhu, X Bai, K Chen, Y Xiang, M Zhang - arXiv preprint arXiv:2412.13540, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across diverse tasks. Despite great success, recent studies show that LVLMs encounter …