You are given a Google Colab notebook and access to a pretrained, aligned GPT-2 language model that has been tuned to avoid generating a small list of "harmful" target words (e.g., ["bomb", "attack", ...]). You are asked to read the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models" and implement, in the notebook, a function that uses the paper's method to learn a universal adversarial text prefix that jailbreaks the model.
Implement and describe an approach that:
Outline the full procedure you would implement in Colab, including key modeling choices, loss functions, optimization details, and any tricks needed to make training stable and efficient. Focus on the methodology rather than code-level details, and assume the goal is to understand and evaluate model robustness, not to deploy the attack in practice.
Login required