The Pile: An 800GB dataset of diverse text for language modeling

L Gao, S Biderman, S Black, L Golding… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent work has demonstrated that increased training dataset diversity improves general
cross-domain knowledge and downstream generalization capability for large-scale …

Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models

N Sengupta, SK Sahu, B Jia, S Katipomu, H Li… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and
instruction-tuned open generative large language models (LLMs). The models are based on …

Datasheet for the Pile

S Biderman, K Bicheno, L Gao - arXiv preprint arXiv:2201.07311, 2022 - arxiv.org
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by
EleutherAI for use in large-scale language modeling. The Pile is comprised of 22 different …

On the features of translationese

V Volansky, N Ordan, S Wintner - Digital Scholarship in the …, 2015 - academic.oup.com
Much research in translation studies indicates that translated texts are ontologically different
from original non-translated ones. Translated texts, in any language, can be considered a …

Translationese and its dialects

M Koppel, N Ordan - Proceedings of the 49th annual meeting of …, 2011 - aclanthology.org
While it has often been observed that the product of translation is somehow different than
non-translated text, scholars have emphasized two distinct bases for such differences. Some …

The Routledge handbook of translation and cognition

F Alves, AL Jakobsen - 2021 - api.taylorfrancis.com
With a strong focus on interdisciplinarity, the handbook surveys concepts and methods in
neighbouring disciplines that are concerned with cognition and how they relate to …

Language models for machine translation: Original vs. translated texts

G Lembersky, N Ordan, S Wintner - Computational Linguistics, 2012 - direct.mit.edu
We investigate the differences between language models compiled from original target-
language texts and those compiled from texts manually translated to the target language …

Automatic detection of translated text and its impact on machine translation

D Kurokawa, C Goutte, P Isabelle - Proceedings of Machine …, 2009 - aclanthology.org
We investigate the possibility of automatically detecting whether a piece of text is an original
or a translation. On a large parallel English-French corpus where reference information is …

Contrastive analysis and native language identification

SMJ Wong, M Dras - Proceedings of the Australasian Language …, 2009 - aclanthology.org
Attempts to profile authors based on their characteristics, including native language, have
drawn attention in recent years, via several approaches using machine learning with simple …

Not…Until across European Languages: A Parallel Corpus Study

H de Swart, J Tellings, B Wälchli - Languages, 2022 - mdpi.com
We present a parallel corpus study on the expression of the temporal construction 'not…until'
in a sample of European languages. We use data from the Europarl corpus and create …