Identifying and mitigating the security risks of generative AI
Every major technical invention resurfaces the dual-use dilemma—the new technology has
the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such …
LongBench: A bilingual, multitask benchmark for long context understanding
Although large language models (LLMs) demonstrate impressive performance for many
language tasks, most of them can only handle texts a few thousand tokens long, limiting their …
Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles
Previous research in multi-document news summarization has typically concentrated on
collating information that all sources agree upon. However, to our knowledge, the …
L-Eval: Instituting standardized evaluation for long context language models
Recently, there has been growing interest in extending the context length of large language
models (LLMs), aiming to effectively process long inputs of one turn or conversations with …
Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training
While large language models (LLMs) are equipped with longer text input capabilities than
before, they are struggling to seek correct information in long contexts. The “lost in the …
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
Automatic evaluation is an integral aspect of dialogue system research. The traditional
reference-based NLG metrics are generally found to be unsuitable for dialogue assessment …
Never lost in the middle: Improving large language models via attention strengthening question answering
While large language models (LLMs) are equipped with longer text input capabilities than
before, they are struggling to seek correct information in long contexts. The "lost in the …