Implement universal adversarial attack on GPT-2
Company: Scale AI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
# Robustness Evaluation: Universal Adversarial Prompts for GPT-2
You are in a Machine Learning Engineer interview. Explain how you would build a controlled, offline evaluation inspired by universal and transferable adversarial attacks on language models, using a small model such as GPT-2 as the test system.
The goal is to assess robustness and understand failure modes, not to create or deploy jailbreak strings.
### Constraints & Assumptions
- Keep the answer at the level of research methodology and evaluation design.
- Do not provide a ready-to-use harmful prompt, jailbreak suffix, or deployment recipe.
- Assume the evaluation runs in an isolated environment with benign or synthetic test prompts.
- Include safety controls, documentation, and limitations.
- You may describe differentiable optimization concepts, but avoid operational details that would directly enable misuse.
### Clarifying Questions to Ask
- Is the interviewer looking for a conceptual algorithm, an evaluation harness, or production safety practices?
- What model access is available: logits, gradients, embeddings, or only black-box outputs?
- What behaviors are considered failures in this evaluation?
- Are we measuring within-model robustness, transfer to other models, or both?
### What a Strong Answer Covers
- A clear threat model and evaluation objective.
- A benign dataset of prompts, each paired with a label or an expected behavior (for example, whether a refusal is expected).
- A high-level universal perturbation idea: find a single prompt prefix or suffix that degrades behavior across many inputs.
- Differentiable relaxation or surrogate optimization at a conceptual level, followed by safe discretization and filtering.
- Evaluation metrics such as attack success rate, false positives, transferability, diversity of failures, and robustness after mitigation.
- Controls such as held-out prompts, random baselines, manual review, audit logs, and no release of harmful strings.
- Discussion of mitigations: adversarial training, prompt filtering, output classifiers, red-team evaluation, and monitoring.
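The metrics and baseline bullets above can be made concrete with a small scoring sketch. It assumes each evaluation run has already been reduced to one boolean per held-out prompt (`True` when a failure detector fired), and it only scores outcomes; it contains no search or optimization step. The names `failure_rate`, `bootstrap_ci`, and `summarize` are hypothetical helpers, not part of any library.

```python
import random
from typing import Dict, Sequence, Tuple


def failure_rate(failures: Sequence[bool]) -> float:
    """Fraction of prompts on which the (assumed) failure detector fired."""
    if not failures:
        raise ValueError("need at least one outcome")
    return sum(failures) / len(failures)


def bootstrap_ci(failures: Sequence[bool], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the failure rate."""
    rng = random.Random(seed)
    n = len(failures)
    # Resample prompts with replacement and recompute the rate each time.
    rates = sorted(
        sum(rng.choices(failures, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def summarize(clean: Sequence[bool], random_baseline: Sequence[bool],
              candidate: Sequence[bool]) -> Dict[str, object]:
    """Compare a candidate perturbation against clean and random-string baselines."""
    clean_rate = failure_rate(clean)
    random_rate = failure_rate(random_baseline)
    candidate_rate = failure_rate(candidate)
    return {
        "clean_rate": clean_rate,
        "random_rate": random_rate,
        "candidate_rate": candidate_rate,
        "candidate_ci": bootstrap_ci(candidate),
        # Lift over a same-length random string is the credible effect size;
        # lift over clean prompts alone overstates the result.
        "lift_over_random": candidate_rate - random_rate,
        "lift_over_clean": candidate_rate - clean_rate,
    }
```

Reporting the lift over a random baseline together with an interval, rather than a single success rate, is one way to keep a small held-out set from overstating the effect.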
### Follow-up Questions
- How would your approach change if you only had black-box API access?
- How would you prevent overfitting the universal string to the training prompts?
- What baselines would make the evaluation credible?
- What artifacts would you withhold from a public report for safety reasons?
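For the overfitting and withholding questions above, one possible sketch is a deterministic prompt split, a generalization-gap check, and an audit record that identifies a candidate only by its digest. All function names and the record fields are illustrative assumptions; the 20% held-out fraction is an arbitrary default.

```python
import hashlib
import json
from typing import Sequence


def assign_split(prompt_id: str, held_out_fraction: float = 0.2) -> str:
    """Deterministically assign a prompt to 'search' or 'held_out' by hashing its ID."""
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "search"


def generalization_gap(search_failures: Sequence[bool],
                       held_out_failures: Sequence[bool]) -> float:
    """Failure rate on search prompts minus failure rate on held-out prompts."""
    search_rate = sum(search_failures) / len(search_failures)
    held_out_rate = sum(held_out_failures) / len(held_out_failures)
    return search_rate - held_out_rate


def audit_record(candidate: str, search_rate: float, held_out_rate: float) -> str:
    """Build a JSON log line that stores a SHA-256 digest instead of the raw string."""
    fingerprint = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    return json.dumps(
        {
            "candidate_sha256": fingerprint,
            "search_rate": search_rate,
            "held_out_rate": held_out_rate,
        },
        sort_keys=True,
    )
```

A large positive gap suggests the candidate memorized the search prompts rather than generalizing. Hashing the prompt ID keeps the split stable across reruns without storing a separate assignment file. The digest lets reviewers confirm which candidate a result refers to, but it is not access control on its own: the raw strings still need restricted storage.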
Quick Answer: This machine learning robustness question asks how to safely evaluate universal adversarial prompts against GPT-2. A strong answer covers threat modeling, differentiable optimization concepts, transferability, metrics, safety controls, and mitigation strategies, without deployment-ready jailbreak details.