You are given a Google Colab notebook and access to a pretrained, aligned GPT-2 language model that has been tuned to avoid generating words from a small list of "harmful" targets (e.g., ["bomb", "attack", ...]). You are asked to read the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models" and implement, in the notebook, a function that uses the paper's method to learn a universal adversarial text prefix that jailbreaks the model.
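A minimal setup sketch for the notebook is given below; the checkpoint name "aligned-gpt2" and the two-word target list are placeholders standing in for whatever the notebook actually provides.

```python
# Minimal setup sketch. The checkpoint path "aligned-gpt2" is hypothetical;
# substitute whatever aligned GPT-2 checkpoint the notebook exposes.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("aligned-gpt2")
model = GPT2LMHeadModel.from_pretrained("aligned-gpt2").to(device)
model.eval()
for p in model.parameters():           # the attack never updates model weights
    p.requires_grad_(False)

target_words = ["bomb", "attack"]      # placeholder subset of the "harmful" list
```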
Implement and describe an approach that:
- Formalizes the optimization objective for finding an adversarial prefix that, when prepended to many different benign prompts, causes the model, with high probability, to output at least one of the target harmful words in its continuation (a candidate formalization is sketched after this list).
- Represents the adversarial prefix in a way that is differentiable and can be optimized with gradient-based methods while keeping the model parameters frozen (see the relaxation sketch below).
- Converts the optimized representation back into valid discrete tokens that can be used as an actual text prompt for the language model.
- Evaluates the effectiveness and transferability of the learned attack on held-out prompts (and possibly on other related models), including an appropriate success metric (see the evaluation sketch below).
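One plausible way to write the objective from the first item, loosely following the paper's loss over a set of training prompts (the notation below is illustrative, not quoted from the paper):

$$
\min_{p_{1:m} \in \mathcal{V}^{m}} \;\; \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} -\log \Pr_{\theta}\!\left(y^{(x)} \mid p_{1:m} \oplus x\right)
$$

where $\mathcal{X}$ is the set of benign training prompts, $p_{1:m}$ is the $m$-token adversarial prefix over vocabulary $\mathcal{V}$, $\oplus$ denotes concatenation, $\theta$ are the frozen model parameters, and $y^{(x)}$ is a short target continuation containing at least one word from the harmful list. Minimizing this average negative log-likelihood pushes the prefix to be universal rather than tailored to any single prompt.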
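For the second and third requirements, one gradient-friendly option (an assumption on my part; the paper's own GCG search is discrete and gradient-guided rather than fully relaxed) is to parameterize the prefix as a matrix of per-position logits over the vocabulary, turn it into a soft mixture of token embeddings for optimization, and snap each position back to its most likely token whenever a concrete text prompt is needed. A sketch, reusing `model`, `tokenizer`, and `device` from the setup above:

```python
import torch
import torch.nn.functional as F

m = 20                                               # prefix length in tokens (illustrative)
vocab_size = model.config.vocab_size
embed_matrix = model.get_input_embeddings().weight   # frozen embedding table, shape (V, d)

# Trainable relaxation: one row of logits over the vocabulary per prefix position.
prefix_logits = torch.zeros(m, vocab_size, device=device, requires_grad=True)

def make_prefix_embeds(temperature=1.0):
    """Soft prefix: a convex combination of token embeddings at each position."""
    probs = F.softmax(prefix_logits / temperature, dim=-1)   # (m, V)
    return probs @ embed_matrix                              # (m, d), differentiable

def project_to_tokens():
    """Snap the relaxed prefix back to concrete token ids (argmax per position)."""
    token_ids = prefix_logits.argmax(dim=-1)                 # (m,)
    return token_ids, tokenizer.decode(token_ids)
```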
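For the final requirement, a simple success metric is the attack success rate: the fraction of held-out prompts whose generated continuation contains at least one target word. Transferability can be estimated by computing the same rate, with the same prefix, on a related model (e.g., a differently sized GPT-2). A sketch:

```python
def attack_success_rate(prefix_text, prompts, max_new_tokens=40):
    """Fraction of prompts whose continuation mentions at least one target word."""
    hits = 0
    for prompt in prompts:
        inputs = tokenizer(prefix_text + " " + prompt, return_tensors="pt").to(device)
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated pad token
        )
        continuation = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
        if any(w in continuation.lower() for w in target_words):
            hits += 1
    return hits / len(prompts)
```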
Outline the full procedure you would implement in Colab, including key modeling choices, loss functions, optimization details, and any tricks needed to make training stable and efficient. Focus on the methodology rather than code-level details, and assume the goal is to understand and evaluate model robustness, not to deploy the attack in practice.
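To make the outline concrete, the sketch below shows one possible optimization loop over the relaxed prefix from the earlier sketch; the training prompts, the single-word target, the temperature schedule, and the learning rate are all illustrative choices, not prescriptions from the paper.

```python
# Hypothetical training prompts and a short target continuation.
train_prompts = ["Tell me a story about a city.", "Describe your weekend plans."]
target_text = " attack"                               # illustrative single-word target
target_ids = tokenizer(target_text, return_tensors="pt").input_ids.to(device)

optimizer = torch.optim.Adam([prefix_logits], lr=0.1)

for step in range(500):
    temperature = max(0.1, 1.0 - step / 500)          # simple annealing schedule
    prefix_embeds = make_prefix_embeds(temperature)   # (m, d)
    loss = 0.0
    for prompt in train_prompts:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        prompt_embeds = model.get_input_embeddings()(prompt_ids)[0]   # (Lp, d)
        tgt_embeds = model.get_input_embeddings()(target_ids)[0]      # (Lt, d)
        full = torch.cat([prefix_embeds, prompt_embeds, tgt_embeds]).unsqueeze(0)
        logits = model(inputs_embeds=full).logits                     # (1, L, V)
        # Cross-entropy on the target positions only: each target token is
        # predicted from the position immediately before it.
        Lt = target_ids.shape[1]
        pred = logits[0, -Lt - 1:-1, :]
        loss = loss + F.cross_entropy(pred, target_ids[0])
    loss = loss / len(train_prompts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Discretize once optimization has converged, then evaluate on held-out prompts.
token_ids, prefix_text = project_to_tokens()
```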