Scenario
You are designing a computer-use agent that can complete user tasks on a standard desktop environment by observing the screen and issuing actions (mouse/keyboard). Examples: “Find my last invoice in Gmail and download it”, “Book a flight with these constraints”, “Open a spreadsheet, add a pivot table, and export a PDF”.
Requirements
- Inputs (observations): screen pixels (and optionally an accessibility tree / DOM, when available), plus the user's natural-language instruction.
- Outputs (actions): mouse move/click/drag, scroll, key presses, and short text input (a minimal interface and agent-loop sketch follows this list).
- Must support multi-step planning, error recovery, and working across many websites/apps.
- Provide a design covering the full lifecycle:
  - Pretraining (what data, objective, and model components)
  - Post-training / supervised finetuning (what demonstrations, labeling strategy)
  - RL stage (what reward, what algorithm family, how to stabilize training)
  - Inference (latency, context/memory, safety, monitoring)
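To pin down the interface these requirements imply, here is a minimal sketch in Python of one possible action/observation schema and the outer observe–decide–act loop with simple error recovery. Every name in it (`ActionType`, `Observation`, `propose_action`, `env.execute`, and so on) is a hypothetical illustration under these assumptions, not a prescribed API.

```python
# Minimal sketch of the observation/action interface and the outer agent loop.
# All names here are hypothetical; this shows shape, not a prescribed API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    MOUSE_MOVE = "mouse_move"
    CLICK = "click"
    DRAG = "drag"
    SCROLL = "scroll"
    KEY_PRESS = "key_press"
    TYPE_TEXT = "type_text"
    DONE = "done"  # the policy signals task completion itself


@dataclass
class Action:
    type: ActionType
    x: Optional[int] = None     # screen coordinates for mouse actions
    y: Optional[int] = None
    text: Optional[str] = None  # payload for key_press / type_text
    scroll_dy: int = 0


@dataclass
class Observation:
    screenshot_png: bytes                      # raw screen pixels
    accessibility_tree: Optional[str] = None   # optional a11y/DOM dump
    instruction: str = ""                      # user's natural-language task


def run_episode(env, policy, max_steps: int = 50) -> bool:
    """Observe -> decide -> act loop with simple error recovery.

    `env` and `policy` are hypothetical stand-ins for the desktop
    environment and the trained model, respectively.
    """
    history: list[Action] = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy.propose_action(obs, history)  # model forward pass
        if action.type is ActionType.DONE:
            return True
        try:
            env.execute(action)
        except Exception:
            # Error recovery: re-observe and let the policy replan rather
            # than blindly repeating the failed action.
            continue
        history.append(action)
    return False
```

One design choice worth noting in this sketch: the explicit `DONE` action makes task completion a model decision rather than a hard-coded heuristic, which matters for the multi-step planning requirement above.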
Constraints (assume)
- Latency target: ~1–2 seconds per action decision.
- Must be robust to UI changes.
- Must minimize unsafe actions (e.g., sending emails, purchasing) and require confirmation for high-risk steps (a confirmation-gate sketch follows this list).
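One way to operationalize the last constraint is a confirmation gate that intercepts proposed actions the policy tags as high-risk and asks the user before executing them. The risk taxonomy, intent tags, and function names below are assumptions for illustration, and `env.execute` is the same hypothetical executor as in the earlier sketch.

```python
# Sketch of a confirmation gate for high-risk actions. The risk taxonomy
# and all names are illustrative assumptions, not a fixed specification.
from enum import Enum


class Risk(Enum):
    LOW = 0
    HIGH = 1  # irreversible / externally visible side effects


# Hypothetical intent tags the policy attaches to its proposed action.
HIGH_RISK_INTENTS = {"send_email", "submit_payment", "delete_file"}


def classify(intent: str) -> Risk:
    return Risk.HIGH if intent in HIGH_RISK_INTENTS else Risk.LOW


def gated_execute(env, action, intent: str, confirm) -> bool:
    """Execute `action` only if it is low-risk or the user confirms.

    `confirm` is a callback (e.g., a UI dialog) returning True/False.
    """
    if classify(intent) is Risk.HIGH and not confirm(intent):
        return False  # blocked: surface this to the agent so it can replan
    env.execute(action)
    return True
```

A blocked action should be logged and fed back into the agent's context, so the policy replans around the refusal instead of silently stalling.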
Deliver a high-level architecture plus key modeling/training choices, data pipelines, and evaluation/metrics.