Aya dataset: An open-access collection for multilingual instruction tuning

S Singh, F Vargus, D Dsouza, BF Karlsson… - arxiv preprint arxiv …, 2024 - arxiv.org
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many
recent achievements in the space of natural language processing (NLP) can be attributed to …

Grag: Graph retrieval-augmented generation

Y Hu, Z Lei, Z Zhang, B Pan, C Ling, L Zhao - arxiv preprint arxiv …, 2024 - arxiv.org
Naive Retrieval-Augmented Generation (RAG) focuses on individual documents during
retrieval and, as a result, falls short in handling networked documents which are very …

Docfinqa: A long-context financial reasoning dataset

V Reddy, R Koncel-Kedziorski, VD Lai… - arxiv preprint arxiv …, 2024 - arxiv.org
For large language models (LLMs) to be effective in the financial domain--where each
decision can have a significant impact--it is necessary to investigate realistic tasks and data …

Anchor-based large language models

J Pang, F Ye, DF Wong, X He, W Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Large language models (LLMs) predominantly employ decoder-only transformer
architectures, necessitating the retention of keys/values information for historical tokens to …

Qlarify: Bridging scholarly abstracts and papers with recursively expandable summaries

R Fok, JC Chang, T August, AX Zhang… - arxiv preprint arxiv …, 2023 - talaugust.github.io
As scientific literature has grown exponentially, researchers often rely on paper triaging
strategies such as browsing abstracts before deciding to delve into a paper's full text …

TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution

D Li, X Hu, Z Sun, B Hu, S Ye, Z Shan… - Proceedings of the …, 2024 - aclanthology.org
Document assistant chatbots are empowered with extensive capabilities by Large Language
Models (LLMs) and have exhibited significant advancements. However, these systems may …

Docxchain: A powerful open-source toolchain for document parsing and beyond

C Yao - arxiv preprint arxiv:2310.12430, 2023 - arxiv.org
In this report, we introduce DocXChain, a powerful open-source toolchain for document
parsing, which is designed and developed to automatically convert the rich information …

WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

X Xie, H Yan, L Yin, Y Liu, J Ding, M Liao, Y Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal document understanding is a challenging task to process and comprehend large
amounts of textual and visual information. Recent advances in Large Language Models …

M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework

YK Chia, L Cheng, HP Chan, C Liu, M Song… - arxiv preprint arxiv …, 2024 - arxiv.org
The ability to understand and answer questions over documents can be useful in many
business and practical applications. However, documents often contain lengthy and diverse …

Fragrel: Exploiting fragment-level relations in the external memory of large language models

X Yue, L Zhu, Y Yang - arxiv preprint arxiv:2406.03092, 2024 - arxiv.org
To process contexts with unlimited length using Large Language Models (LLMs), recent
studies explore hierarchically managing the long text. Only several text fragments are taken …