This question evaluates competency in designing end-to-end multimodal interactive ML systems, including perception from pixels and accessibility trees, sequential decision-making and planning, action policy design, robustness to UI changes, and safety-aware behavior.
You are designing a computer-use agent that can complete user tasks on a standard desktop environment by observing the screen and issuing actions (mouse/keyboard). Examples: “Find my last invoice in Gmail and download it”, “Book a flight with these constraints”, “Open a spreadsheet, add a pivot table, and export a PDF”.
Deliver a high-level architecture, along with the key modeling and training choices, data pipelines, and evaluation metrics.
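
As a starting point for the architecture discussion, the observe-then-act loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed design: `Observation`, `Action`, `Environment`, `run_episode`, the action kinds, and the `max_steps` budget are all hypothetical names chosen for this example, and the policy is left as an arbitrary callable so that any model (or a scripted stub) can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol, Tuple


@dataclass
class Observation:
    """What the agent sees at one step (hypothetical schema)."""
    screenshot: bytes   # raw pixels, e.g. encoded image bytes
    a11y_tree: dict     # accessibility tree, empty if unavailable


@dataclass
class Action:
    """One low-level action (hypothetical schema)."""
    kind: str                   # assumed kinds: "click", "type", "key", "done"
    x: Optional[int] = None     # screen coordinates for pointer actions
    y: Optional[int] = None
    text: Optional[str] = None  # payload for "type" or "key"


class Environment(Protocol):
    """Anything that can be observed and acted on, e.g. a sandboxed desktop."""
    def observe(self) -> Observation: ...
    def act(self, action: Action) -> None: ...


History = List[Tuple[Observation, Action]]
Policy = Callable[[str, History, Observation], Action]


def run_episode(task: str, env: Environment, policy: Policy,
                max_steps: int = 20) -> History:
    """Run the observe -> decide -> act loop until "done" or the step budget."""
    history: History = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy(task, history, obs)
        history.append((obs, action))
        if action.kind == "done":
            break  # the policy declares the task finished; nothing to execute
        env.act(action)
    return history
```

Keeping the policy behind a plain callable and the desktop behind a small protocol separates perception and decision-making from execution, which makes it straightforward to swap in a recorded or simulated environment for evaluation. The step budget is one simple guard against runaway episodes; a fuller answer would also cover confirmation gates for irreversible actions.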