Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - ar…
… language models that interact with humans is aligning
their behavior to be useful and unharmful for their human users. This is usually achieved by …

Legilimens: Practical and Unified Content Moderation for Large Language Model Services

J Wu, J Deng, S Pang, Y Chen, J Xu, X Li… - Proceedings of the 2024 …, 2024 - dl.acm.org
Given the societal impact of unsafe content generated by large language models (LLMs),
ensuring that LLM services comply with safety standards is a crucial concern for LLM service …

Reproducibility in computational linguistics: Is source code enough?

M Arvan, L Pina, N Parde - … of the 2022 Conference on Empirical …, 2022 - aclanthology.org
The availability of source code has been put forward as one of the most critical factors for
improving the reproducibility of scientific research. This work studies trends in source code …

Constructing highly inductive contexts for dialogue safety through controllable reverse generation

Z Zhang, J Cheng, H Sun, J Deng, F Mi, Y Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained language models can easily produce toxic or biased content, which is
prohibitive for practical use. In order to detect such toxic generations, existing methods rely …

It Couldn't Help But Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning

B Madureira, D Schlangen - arXiv preprint arXiv:2405.01139, 2024 - arxiv.org
Active participation in a conversation is key to building common ground, since
understanding is jointly tailored by producers and recipients. Overhearers are deprived of …

Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering

Y Wolf, N Wies, D Shteyman, B Rothberg… - arXiv preprint arXiv …, 2024 - arxiv.org
Language model alignment has become an important component of AI safety, allowing safe
interactions between humans and language models, by enhancing desired behaviors and …

[BOOK][B] Finding and Fixing Undesirable Behaviors in Pretrained Language Models

E Perez - 2022 - search.proquest.com
Natural Language Processing (NLP) promises to deliver tools for a variety of
impactful applications, ranging from automatic summarization to question-answering …

[BOOK][B] Attribute Representation in Neural Language Models

D Yu - 2022 - search.proquest.com
Neural models, including neural language models and encoder-decoder models, are the
backbone of current natural language processing (NLP) research. Large pre-trained models …

Machine Learning and Open Science: On Risks and Challenges

M Arvan - 2024 - search.proquest.com
Recent years have witnessed substantial growth in Machine Learning (ML) and Natural
Language Processing (NLP), largely fueled by the accessibility and openness of data and …