This question evaluates a candidate's competency in designing end-to-end ML-driven systems that handle large-scale, low-latency OS snapshot ingestion, binary anomaly detection, decisioning and action orchestration with multi-tenant security, auditability, and failure rollback.
You are designing a cloud service that ingests operating system (OS) snapshots from client machines, runs an AI-based anomaly detector that outputs either Normal or Abnormal, and triggers automated actions for Abnormal cases. Clients must be able to query the history of warnings and actions.
Assume a multi-tenant environment with per-tenant isolation and compliance needs. Snapshots may arrive as a continuous stream or batched. End-to-end detection should be near real-time for operational usefulness.
Design the detection service, covering data collection, model serving, decisioning, action orchestration, audit and history storage, access controls, and failure rollback. Discuss latency targets, throughput planning, false-positive trade-offs, and safeguards for automated actions.
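As one possible starting point for the decisioning, safeguard, audit, and rollback parts of an answer, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the names `DecisionEngine`, `AuditRecord`, `decide`, `history`, and `rollback`, the action labels, and the threshold and cap values are all hypothetical. It assumes the detector exposes a score in [0, 1] behind its Normal/Abnormal output, and it uses a simple per-tenant count as the cap on automated actions where a real service would more likely use a time-windowed rate limit and durable storage.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuditRecord:
    """One immutable-by-convention entry in the warning/action history."""
    tenant_id: str
    machine_id: str
    verdict: str      # "Normal" or "Abnormal"
    score: float      # assumed detector score in [0, 1]
    action: str       # "none", "alert_only", "quarantine", "rollback_quarantine"
    timestamp: float


@dataclass
class DecisionEngine:
    # Hypothetical illustration values, not recommendations.
    abnormal_threshold: float = 0.5       # score >= this is reported as Abnormal
    auto_action_threshold: float = 0.9    # score >= this may trigger an automated action
    max_auto_actions_per_tenant: int = 2  # blast-radius cap on automated actions
    audit_log: List[AuditRecord] = field(default_factory=list)
    _auto_counts: Dict[str, int] = field(default_factory=dict)

    def decide(self, tenant_id: str, machine_id: str, score: float,
               now: Optional[float] = None) -> AuditRecord:
        """Map a score to a verdict and action, and append an audit record."""
        ts = time.time() if now is None else now
        verdict = "Abnormal" if score >= self.abnormal_threshold else "Normal"
        action = "none"
        if verdict == "Abnormal":
            used = self._auto_counts.get(tenant_id, 0)
            if (score >= self.auto_action_threshold
                    and used < self.max_auto_actions_per_tenant):
                action = "quarantine"
                self._auto_counts[tenant_id] = used + 1
            else:
                # Low confidence or cap reached: warn a human instead of acting.
                action = "alert_only"
        record = AuditRecord(tenant_id, machine_id, verdict, score, action, ts)
        self.audit_log.append(record)
        return record

    def rollback(self, record: AuditRecord,
                 now: Optional[float] = None) -> AuditRecord:
        """Record a compensating entry for an automated action and free its cap slot."""
        if record.action != "quarantine":
            raise ValueError("only automated actions can be rolled back")
        ts = time.time() if now is None else now
        undo = AuditRecord(record.tenant_id, record.machine_id, record.verdict,
                           record.score, "rollback_quarantine", ts)
        self.audit_log.append(undo)
        self._auto_counts[record.tenant_id] -= 1
        return undo

    def history(self, tenant_id: str) -> List[AuditRecord]:
        """Return only the caller's tenant records (per-tenant isolation)."""
        return [r for r in self.audit_log if r.tenant_id == tenant_id]
```

The two thresholds are where the false-positive trade-off shows up in this sketch: lowering `abnormal_threshold` raises recall at the cost of more warnings, while keeping `auto_action_threshold` high and capping automated actions per tenant limits the damage a false positive can do. A rollback is logged as a new record, not as a deletion, so the history stays append-only for audit purposes.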