Explainable generative AI (GenXAI): A survey, conceptualization, and research agenda

J Schneider - Artificial Intelligence Review, 2024 - Springer
Generative AI (GenAI) represents a shift from AI's ability to “recognize” to its ability to
“generate” solutions for a wide range of tasks. As generated solutions and applications grow …

Identifying and mitigating vulnerabilities in LLM-integrated applications

F Jiang - 2024 - search.proquest.com
Large language models (LLMs) are increasingly deployed as the backend for various
applications, including code completion tools and AI-powered search engines. Unlike …

Explainability for large language models: A survey

H Zhao, H Chen, F Yang, N Liu, H Deng, H Cai… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear and this lack of …

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu… - NeurIPS, 2023 - blogs.qub.ac.uk
Abstract Generative Pre-trained Transformer (GPT) models have exhibited exciting progress
in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the …

Robust prompt optimization for defending language models against jailbreaking attacks

A Zhou, B Li, H Wang - arXiv preprint arXiv:2401.17263, 2024 - arxiv.org
Despite advances in AI alignment, large language models (LLMs) remain vulnerable to
adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce …

Exploring the limits of domain-adaptive training for detoxifying large-scale language models

B Wang, W Ping, C Xiao, P Xu… - Advances in …, 2022 - proceedings.neurips.cc
Pre-trained language models (LMs) are shown to easily generate toxic language. In this
work, we systematically explore domain-adaptive training to reduce the toxicity of language …

An LLM can fool itself: A prompt-based adversarial attack

X Xu, K Kong, N Liu, L Cui, D Wang, J Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
The wide-ranging applications of large language models (LLMs), especially in safety-critical
domains, necessitate the proper evaluation of the LLM's adversarial robustness. This paper …

Exposing the Achilles' heel of textual hate speech classifiers using indistinguishable adversarial examples

S Aggarwal, DK Vishwakarma - Expert Systems with Applications, 2024 - Elsevier
The accessibility of online hate speech has increased significantly, making it crucial for
social-media companies to prioritize efforts to curb its spread. Although deep learning …

Transferable adversarial distribution learning: Query-efficient adversarial attack against large language models

H Dong, J Dong, S Wan, S Yuan, Z Guan - Computers & Security, 2023 - Elsevier
It is a challenging task to fool a text classifier based on deep neural networks under the
black-box setting where the target model can only be queried. Among the existing black-box …

Adversarial attacks and defenses for large language models (LLMs): methods, frameworks & challenges

P Kumar - International Journal of Multimedia Information …, 2024 - Springer
Large language models (LLMs) have exhibited remarkable efficacy and proficiency in a
wide array of NLP endeavors. Nevertheless, concerns are growing rapidly regarding the …