# Design a computer-use agent end-to-end
- **Company:** Amazon
- **Role:** Machine Learning Engineer
- **Category:** ML System Design
- **Difficulty:** Medium
- **Interview Round:** Onsite
## Scenario
You are designing a **computer-use agent** that can complete user tasks on a standard desktop environment by observing the screen and issuing actions (mouse/keyboard). Examples: “Find my last invoice in Gmail and download it”, “Book a flight with these constraints”, “Open a spreadsheet, add a pivot table, and export a PDF”.
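The observe-and-act cycle described above can be sketched as a minimal control loop. Everything here is illustrative: `Observation`, `Action`, `run_episode`, and the `env`/`policy` interfaces are hypothetical names, not a real API.

```python
# Illustrative agent loop (all names are hypothetical, not a real API).
from dataclasses import dataclass, field


@dataclass
class Observation:
    screenshot: bytes          # raw screen pixels
    accessibility_tree: str    # optional serialized AX tree / DOM


@dataclass
class Action:
    kind: str                  # e.g. "click", "type", "scroll", "key", "done"
    args: dict = field(default_factory=dict)


def run_episode(policy, env, instruction: str, max_steps: int = 50):
    """Observe the screen, ask the policy for the next action, execute, repeat."""
    history = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy.act(instruction, obs, history)
        history.append((obs, action))
        if action.kind == "done":
            break
        env.execute(action)
    return history
```

The `max_steps` cap is one simple way to bound runaway episodes; real systems would also track per-step latency and abort on repeated failures.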
## Requirements
- **Inputs (observations):** screen pixels (and optionally accessibility tree / DOM if available), plus the user’s natural-language instruction.
- **Outputs (actions):** mouse move/click/drag, scroll, key presses, and short text input.
- Must support **multi-step planning**, error recovery, and working across many websites/apps.
- Provide a design covering the full lifecycle:
1. **Pretraining** (what data, objective, and model components)
2. **Post-training / supervised finetuning** (what demonstrations, labeling strategy)
3. **RL stage** (what reward, what algorithm family, how to stabilize training)
4. **Inference** (latency, context/memory, safety, monitoring)
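The action space in the requirements can be pinned down early as a small structured schema that the policy must emit and the executor validates. A sketch, assuming JSON-style action dicts; the field names and `validate_action` helper are assumptions, not a given interface:

```python
# Hypothetical action schema covering the required output space
# (mouse move/click/drag, scroll, key presses, short text input).
ACTION_SCHEMA = {
    "mouse_move": {"x", "y"},
    "click":      {"x", "y", "button"},
    "drag":       {"x1", "y1", "x2", "y2"},
    "scroll":     {"dx", "dy"},
    "key_press":  {"key"},
    "type_text":  {"text"},
}


def validate_action(action: dict) -> bool:
    """Check that an action names a known type and carries exactly its fields."""
    kind = action.get("type")
    if kind not in ACTION_SCHEMA:
        return False
    fields = set(action) - {"type"}
    return fields == ACTION_SCHEMA[kind]
```

Validating at this boundary keeps malformed model outputs from ever reaching the OS-level executor, which simplifies both error recovery and safety auditing.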
## Constraints (assume)
- Latency target: ~1–2 seconds per action decision.
- Must be robust to UI changes.
- Must minimize unsafe actions (e.g., sending emails, purchasing) and require confirmation for high-risk steps.
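The confirmation requirement for high-risk steps can be enforced with a gate between the policy and the executor. A heuristic sketch, assuming the agent can describe the UI target it is about to act on; the keyword list and `requires_confirmation` name are assumptions, and production systems would use a learned risk classifier rather than string matching:

```python
# Hypothetical risk gate: actions whose type or UI target suggests an
# irreversible side effect (sending, purchasing, deleting) require
# explicit user confirmation before execution.
HIGH_RISK_KEYWORDS = ("send", "purchase", "buy", "pay", "delete", "submit")


def requires_confirmation(action_type: str, target_description: str) -> bool:
    """Flag actions likely to have irreversible side effects."""
    text = f"{action_type} {target_description}".lower()
    return any(kw in text for kw in HIGH_RISK_KEYWORDS)
```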
## Deliverables
Deliver a high-level architecture plus key modeling/training choices, data pipelines, and evaluation/metrics.
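For the evaluation piece, aggregate metrics can be computed directly from episode logs. A minimal sketch; the log format (`success`/`steps` fields) and `summarize` helper are assumptions:

```python
def summarize(episodes):
    """Aggregate task success rate and mean steps per episode.

    Each episode is assumed to be a dict: {"success": bool, "steps": int}.
    """
    n = len(episodes)
    success_rate = sum(e["success"] for e in episodes) / n
    mean_steps = sum(e["steps"] for e in episodes) / n
    return {"success_rate": success_rate, "mean_steps": mean_steps}
```

Task success rate is the headline metric; steps-to-completion (and, in practice, per-step latency and the rate of blocked unsafe actions) round out the picture.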
## Quick Answer
This question evaluates your ability to design an end-to-end multimodal interactive ML system: perception from pixels and accessibility trees, sequential decision-making and planning, action-policy design, robustness to UI changes, and safety-aware behavior.