RULER: What's the Real Context Size of Your Long-Context Language Models?

CP Hsieh, S Sun, S Kriman, S Acharya… - arXiv preprint arXiv …, 2024 - arxiv.org
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of
information (the "needle") from long distractor texts (the "haystack"), has been widely …

Leave no document behind: Benchmarking long-context llms with extended multi-doc qa

M Wang, L Chen, F Cheng, S Liao… - Proceedings of the …, 2024 - aclanthology.org
Long-context modeling capabilities of Large Language Models (LLMs) have garnered
widespread attention, leading to the emergence of LLMs with ultra-context windows …

Foundational autoraters: Taming large language models for better automatic evaluation

T Vu, K Krishna, S Alzubi, C Tar, M Faruqui… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …

How to train long-context language models (effectively)

T Gao, A Wettig, H Yen, D Chen - arXiv preprint arXiv:2410.02660, 2024 - arxiv.org
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to
make effective use of long-context information. We first establish a reliable evaluation …

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

H Wang, Q Liu, C Du, T Zhu, C Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Extending context window sizes allows large language models (LLMs) to process longer
sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has …

Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models

H Lian, J Chen, W Huang, Y Xiong, W Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, large language models (LLMs) have revolutionized Natural Language
Processing (NLP). Pretrained LLMs, due to limited training context size, struggle with …

Large Language Models Can Self-Improve in Long-context Reasoning

S Li, C Yang, Z Cheng, L Liu, M Yu, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved substantial progress in processing long
contexts but still struggle with long-context reasoning. Existing approaches typically involve …

LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

X Ye, F Yin, Y He, J Zhang, H Yen, T Gao… - arXiv preprint arXiv …, 2025 - arxiv.org
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus
on long-context recall, requiring models to produce short responses based on a few critical …

The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input

A Jacovi, A Wang, C Alberti, C Tao, J Lipovetz… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce FACTS Grounding, an online leaderboard and associated benchmark that
evaluates language models' ability to generate text that is factually accurate with respect to …

Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches

A Pratapa, T Mitamura - arXiv preprint arXiv:2502.06617, 2025 - arxiv.org
Automatically summarizing large text collections is a valuable tool for document research,
with applications in journalism, academic research, legal work, and many other fields. In this …