Explainable ai: A review of machine learning interpretability methods

P Linardatos, V Papastefanopoulos, S Kotsiantis - Entropy, 2020 - mdpi.com
Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption,
with machine learning systems demonstrating superhuman performance in a significant …

Ai alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arxiv preprint arxiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arxiv preprint arxiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

Radar: Robust ai-text detection via adversarial learning

X Hu, PY Chen, TY Ho - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Recent advances in large language models (LLMs) and the intensifying popularity of
ChatGPT-like applications have blurred the boundary of high-quality text generation …

A survey of safety and trustworthiness of large language models through the lens of verification and validation

X Huang, W Ruan, W Huang, G **, Y Dong… - Artificial Intelligence …, 2024 - Springer
Large language models (LLMs) have exploded a new heatwave of AI for their ability to
engage end-users in human-level conversations with detailed and articulate answers across …

Trustworthy ai: A computational perspective

H Liu, Y Wang, W Fan, X Liu, Y Li, S Jain, Y Liu… - ACM Transactions on …, 2022 - dl.acm.org
In the past few decades, artificial intelligence (AI) technology has experienced swift
developments, changing everyone's daily life and profoundly altering the course of human …

Adversarial attacks and defenses in images, graphs and text: A review

H Xu, Y Ma, HC Liu, D Deb, H Liu, JL Tang… - International journal of …, 2020 - Springer
Deep neural networks (DNN) have achieved unprecedented success in numerous machine
learning tasks in various domains. However, the existence of adversarial examples raises …

Universal adversarial triggers for attacking and analyzing NLP

E Wallace, S Feng, N Kandpal, M Gardner… - arxiv preprint arxiv …, 2019 - arxiv.org
Adversarial examples highlight model vulnerabilities and are useful for evaluation and
interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens …

Adversarial sensor attack on lidar-based perception in autonomous driving

Y Cao, C **ao, B Cyr, Y Zhou, W Park… - Proceedings of the …, 2019 - dl.acm.org
In Autonomous Vehicles (AVs), one fundamental pillar is perception, which leverages
sensors like cameras and LiDARs (Light Detection and Ranging) to understand the driving …

Adversarial attacks on deep-learning models in natural language processing: A survey

WE Zhang, QZ Sheng, A Alhazmi, C Li - ACM Transactions on Intelligent …, 2020 - dl.acm.org
With the development of high computational devices, deep neural networks (DNNs), in
recent years, have gained significant popularity in many Artificial Intelligence (AI) …