ScandEval: A benchmark for Scandinavian natural language processing

DS Nielsen - arXiv preprint arXiv:2304.00906, 2023 - arxiv.org
This paper introduces a Scandinavian benchmarking platform, ScandEval, which can
benchmark any pretrained model on four different tasks in the Scandinavian languages. The …

Position: measure dataset diversity, don't just claim it

D Zhao, JTA Andrews, O Papakyriakopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract
and disputed social constructs. Dataset curators frequently employ value-laden terms such …

Rogpt2: Romanian gpt2 for text generation

MA Niculescu, S Ruseti… - 2021 IEEE 33rd …, 2021 - ieeexplore.ieee.org
Text generation is one of the most important and challenging tasks in NLP, where models
have shown a significant performance increase in recent years. However, most generative …

IndicXNLI: Evaluating multilingual inference for Indian languages

D Aggarwal, V Gupta, A Kunchukuttan - arXiv preprint arXiv:2204.08776, 2022 - arxiv.org
While Indic NLP has made rapid advances recently in terms of the availability of corpora and
pre-trained models, benchmark datasets on standard NLU tasks are limited. To this end, we …

Measuring diversity in datasets

D Zhao, JTA Andrews, O Papakyriakopoulos… - International …, 2024 - openreview.net
Machine learning (ML) datasets, often perceived as "neutral," inherently encapsulate
abstract and disputed social constructs. Dataset curators frequently employ value-laden …

Beyond lexical boundaries: Llm-generated text detection for romanian digital libraries

M Nitu, M Dascalu - Future Internet, 2024 - mdpi.com
Machine-generated content reshapes the landscape of digital information; hence, ensuring
the authenticity of texts within digital libraries has become a paramount concern. This work …

This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish

L Augustyniak, K Tagowski… - Advances in …, 2022 - proceedings.neurips.cc
The availability of compute and data to train larger and larger language models increases
the demand for robust methods of benchmarking the true progress of LM training. Recent …

"Vorbești Românește?" A Recipe to Train Powerful Romanian LLMs with English Instructions

M Masala, DC Ilie-Ablachim, A Dima… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, Large Language Models (LLMs) have achieved almost human-like
performance on various tasks. While some LLMs have been trained on multilingual data …

Prompt optimization via adversarial in-context learning

D Long, Y Zhao, H Brown, Y Xie, J Zhao… - Proceedings of the …, 2024 - aclanthology.org
We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompts
for in-context learning (ICL). Inspired by adversarial learning, adv-ICL is implemented as a …

Distilling the knowledge of Romanian BERTs using multiple teachers

AM Avram, D Catrina, DC Cercel, M Dascălu… - arXiv preprint arXiv …, 2021 - arxiv.org
Running large-scale pre-trained language models in computationally constrained
environments remains a challenging problem yet to be addressed, while transfer learning …