Toward Robust Unlearning for LLMs

Secure And Trustworthy Large Language Models (ICLR 2024)


Recent rapid advances in AI enabled by large language models (LLMs) have raised widespread concerns about their potential for malicious use. While traditional open-source software has long-established mechanisms for combating such adversarial behavior, systems built on large neural networks are nontrivial to interpret, let alone intervene on, for safe use. Various alignment methods have been proposed to steer model responses toward a desired output distribution. However, these techniques are superficial and can be undone entirely with supervised fine-tuning. These vulnerabilities necessitate new approaches such as machine unlearning, in which the model's underlying representations of the targeted concepts are corrupted or forgotten. We introduce state-of-the-art methods for robustly unlearning chosen concepts from LLMs, such that performance on those concepts cannot be recovered by white-box fine-tuning. We demonstrate our results on the MMLU benchmark, showing that we can decrease accuracy on a forget set of concepts to chance levels while maintaining accuracy on the retain set.

More Information:

Poster + Figures


A Lapis Labs Project © 2024