Scenario
You are designing a computer-use agent that can complete user tasks on a standard desktop environment by observing the screen and issuing actions (mouse/keyboard). Examples: “Find my last invoice in Gmail and download it”, “Book a flight with these constraints”, “Open a spreadsheet, add a pivot table, and export a PDF”.
Requirements
- Inputs (observations): screen pixels (and optionally an accessibility tree / DOM, when available), plus the user's natural-language instruction.
- Outputs (actions): mouse move/click/drag, scroll, key presses, and short text input (a minimal interface and agent-loop sketch follows this list).
- Must support multi-step planning, error recovery, and working across many websites/apps.
- Provide a design covering the full lifecycle:
  - Pretraining (what data, objective, and model components)
  - Post-training / supervised finetuning (what demonstrations, labeling strategy)
  - RL stage (what reward, what algorithm family, how to stabilize training)
  - Inference (latency, context/memory, safety, monitoring)
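To pin down the interface these requirements imply, here is a minimal sketch in Python of one possible action/observation schema and the outer observe–decide–act loop with simple error recovery. Every name in it (`ActionType`, `Observation`, `propose_action`, `env.execute`, and so on) is a hypothetical illustration under these assumptions, not a prescribed API.

```python
# Minimal sketch of the observation/action interface and the outer agent loop.
# All names here are hypothetical; this shows shape, not a prescribed API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    MOUSE_MOVE = "mouse_move"
    CLICK = "click"
    DRAG = "drag"
    SCROLL = "scroll"
    KEY_PRESS = "key_press"
    TYPE_TEXT = "type_text"
    DONE = "done"  # the policy signals task completion itself


@dataclass
class Action:
    type: ActionType
    x: Optional[int] = None     # screen coordinates for mouse actions
    y: Optional[int] = None
    text: Optional[str] = None  # payload for key_press / type_text
    scroll_dy: int = 0


@dataclass
class Observation:
    screenshot_png: bytes                      # raw screen pixels
    accessibility_tree: Optional[str] = None   # optional a11y/DOM dump
    instruction: str = ""                      # user's natural-language task


def run_episode(env, policy, max_steps: int = 50) -> bool:
    """Observe -> decide -> act loop with simple error recovery.

    `env` and `policy` are hypothetical stand-ins for the desktop
    environment and the trained model, respectively.
    """
    history: list[Action] = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy.propose_action(obs, history)  # model forward pass
        if action.type is ActionType.DONE:
            return True
        try:
            env.execute(action)
        except Exception:
            # Error recovery: re-observe and let the policy replan rather
            # than blindly repeating the failed action.
            continue
        history.append(action)
    return False
```

One design choice worth noting in this sketch: the explicit `DONE` action makes task completion a model decision rather than a hard-coded heuristic, which matters for the multi-step planning requirement above.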
Constraints (assume)
- Latency target: ~1–2 seconds per action decision.
- Must be robust to UI changes.
- Must minimize unsafe actions (e.g., sending emails, purchasing) and require confirmation for high-risk steps (a confirmation-gate sketch follows this list).
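One way to operationalize the last constraint is a confirmation gate that intercepts proposed actions the policy tags as high-risk and asks the user before executing them. The risk taxonomy, intent tags, and function names below are assumptions for illustration, and `env.execute` is the same hypothetical executor as in the earlier sketch.

```python
# Sketch of a confirmation gate for high-risk actions. The risk taxonomy
# and all names are illustrative assumptions, not a fixed specification.
from enum import Enum


class Risk(Enum):
    LOW = 0
    HIGH = 1  # irreversible / externally visible side effects


# Hypothetical intent tags the policy attaches to its proposed action.
HIGH_RISK_INTENTS = {"send_email", "submit_payment", "delete_file"}


def classify(intent: str) -> Risk:
    return Risk.HIGH if intent in HIGH_RISK_INTENTS else Risk.LOW


def gated_execute(env, action, intent: str, confirm) -> bool:
    """Execute `action` only if it is low-risk or the user confirms.

    `confirm` is a callback (e.g., a UI dialog) returning True/False.
    """
    if classify(intent) is Risk.HIGH and not confirm(intent):
        return False  # blocked: surface this to the agent so it can replan
    env.execute(action)
    return True
```

A blocked action should be logged and fed back into the agent's context, so the policy replans around the refusal instead of silently stalling.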
Deliver a high-level architecture plus key modeling/training choices, data pipelines, and evaluation/metrics.