RULER: What's the Real Context Size of Your Long-Context Language Models?
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of
information (the "needle") from long distractor texts (the "haystack"), has been widely …
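The NIAH setup described in this abstract can be sketched in a few lines: plant a single needle sentence at some depth inside repeated distractor text, then ask the model to retrieve it. This is a minimal illustrative sketch; the filler text, needle wording, and retrieval question are placeholders, not the benchmark's actual data.

```python
def build_niah_prompt(needle: str, haystack_sentences: list[str], depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)
    of the distractor text, then append the retrieval question."""
    pos = int(depth * len(haystack_sentences))
    sentences = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
    context = " ".join(sentences)
    return (
        f"{context}\n\n"
        "What is the magic number mentioned in the text above? "
        "Answer with the number only."
    )

# Repeat a filler sentence to pad the context toward the target length,
# then bury the needle halfway through it.
filler = ["The sky was clear and the grass was green."] * 200
needle = "The magic number is 42."
prompt = build_niah_prompt(needle, filler, depth=0.5)
```

Sweeping `depth` from 0.0 to 1.0 while growing the haystack is what produces the familiar NIAH heatmap of retrieval accuracy by context length and needle position.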
Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA
Long-context modeling capabilities of Large Language Models (LLMs) have garnered
widespread attention, leading to the emergence of LLMs with ultra-context windows …
Foundational autoraters: Taming large language models for better automatic evaluation
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …
How to train long-context language models (effectively)
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to
make effective use of long-context information. We first establish a reliable evaluation …
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Extending context window sizes allows large language models (LLMs) to process longer
sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has …
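The precision failure this title points at can be illustrated without any model code: bfloat16 keeps only 7 mantissa bits, so integer position indices above 256 stop being exactly representable, and distinct RoPE positions collapse to the same value. The sketch below simulates bfloat16 by truncating the low 16 bits of the float32 pattern (real hardware rounds rather than truncates, so this is an approximation, not the paper's analysis).

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate a bfloat16 cast by keeping only the top 16 bits
    of the float32 bit pattern (truncation, not round-to-nearest)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Small position indices survive the cast exactly...
assert to_bfloat16(100.0) == 100.0
# ...but large ones collide: positions 32768 and 32769 become
# indistinguishable, so their rotary angles would be identical.
print(to_bfloat16(32768.0))  # 32768.0
print(to_bfloat16(32769.0))  # 32768.0 as well
```

Since RoPE angles are functions of the position index, any two positions that collapse to the same bfloat16 value get identical rotations, which is exactly the kind of long-context degradation the title describes.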
Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models
Recently, large language models (LLMs) have revolutionized Natural Language
Processing (NLP). Pretrained LLMs, due to limited training context size, struggle with …
Large Language Models Can Self-Improve in Long-context Reasoning
Large language models (LLMs) have achieved substantial progress in processing long
contexts but still struggle with long-context reasoning. Existing approaches typically involve …
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus
on long-context recall, requiring models to produce short responses based on a few critical …
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
We introduce FACTS Grounding, an online leaderboard and associated benchmark that
evaluates language models' ability to generate text that is factually accurate with respect to …
Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
Automatically summarizing large text collections is a valuable tool for document research,
with applications in journalism, academic research, legal work, and many other fields. In this …