Design an AI-Driven OS Snapshot Anomaly Detection Service
Context
You are designing a cloud service that ingests operating system (OS) snapshots from client machines, runs an AI-based anomaly detector that outputs either Normal or Abnormal, and triggers automated actions for Abnormal cases. Clients must be able to query the history of warnings and actions.
Assume a multi-tenant environment with per-tenant isolation and compliance needs. Snapshots may arrive as a continuous stream or batched. End-to-end detection should be near real-time for operational usefulness.
Requirements
- Functional
  - Data collection: ingest OS snapshots reliably and at scale.
  - Model serving: run an AI anomaly detector (binary classification).
  - Decisioning: translate model outputs into actions with policies.
  - Action orchestration: perform shutdown/quarantine, send emails, with retries and rollbacks.
  - Audit and history storage: immutable log of detections, actions, and operator overrides; queryable by clients.
  - Access controls: strong authn/authz for agents, services, and users; per-tenant data isolation.
  - Failure rollback: safe defaults, kill switches, undo for actions, and disaster recovery.
- Non-functional
  - Low end-to-end latency (seconds) for high-priority events.
  - High throughput and elasticity.
  - Low false positive rate with safeguards against destructive automated actions.
  - Observability, compliance-grade audit trails, and multi-region resilience.
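To make the decisioning and safeguard requirements concrete, here is a minimal sketch of how a detector verdict might be mapped to an action. Everything beyond the Normal/Abnormal verdict is an assumption made for illustration: the confidence `score`, the `Policy` fields, the thresholds, and the names `decide`, `Verdict`, and `Action` are hypothetical and not part of the problem statement.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    NORMAL = "Normal"
    ABNORMAL = "Abnormal"


class Action(Enum):
    NONE = "none"
    EMAIL = "email"            # non-destructive: notify only
    QUARANTINE = "quarantine"  # destructive: isolate the machine


@dataclass(frozen=True)
class Policy:
    # All fields and defaults are illustrative assumptions.
    kill_switch: bool = False          # when True, never take a destructive action
    quarantine_min_score: float = 0.9  # assumed confidence needed to quarantine
    max_quarantines_per_hour: int = 5  # per-tenant cap on destructive actions


def decide(verdict: Verdict, score: float,
           quarantines_last_hour: int, policy: Policy) -> Action:
    """Map a verdict to an action, falling back to the least destructive option."""
    if verdict is Verdict.NORMAL:
        return Action.NONE
    destructive_allowed = (
        not policy.kill_switch
        and score >= policy.quarantine_min_score
        and quarantines_last_hour < policy.max_quarantines_per_hour
    )
    return Action.QUARANTINE if destructive_allowed else Action.EMAIL
```

The design choice in this sketch is that every safeguard (kill switch, confidence threshold, rate cap) can only downgrade an action to a notification, never escalate it, so a misconfigured or overloaded policy fails toward the safe default.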
Deliverable
Design the detection service, covering data collection, model serving, decisioning, action orchestration, audit and history storage, access controls, and failure rollback. Discuss latency targets, throughput planning, false-positive trade-offs, and safeguards for automated actions.
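For the throughput-planning part of the discussion, a back-of-envelope calculation is usually enough to start. The sketch below shows one possible shape for it; the function name `capacity_plan` and every input value (fleet size, snapshot interval and size, peak factor, per-replica inference rate) are hypothetical assumptions, since the problem statement gives no numbers.

```python
import math


def capacity_plan(machines: int, interval_s: float, snapshot_kib: float,
                  peak_factor: float, per_replica_rps: float) -> dict:
    """Rough sizing from assumed fleet and workload parameters."""
    avg_rps = machines / interval_s   # snapshots per second on average
    peak_rps = avg_rps * peak_factor  # headroom for bursts and batch uploads
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "peak_ingest_mib_s": peak_rps * snapshot_kib / 1024,
        "model_replicas": math.ceil(peak_rps / per_replica_rps),
    }


# Hypothetical inputs: 600,000 machines, one 64 KiB snapshot per minute,
# a 3x peak-to-average ratio, and 500 inferences/s per model replica.
plan = capacity_plan(600_000, 60, 64, 3, 500)
```

With these assumed inputs the sketch gives 30,000 snapshots/s at peak, 1,875 MiB/s of peak ingest, and 60 model replicas before adding any redundancy for multi-region resilience.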