Design anomaly detection and response platform

Q: Design anomaly detection and response platform

This question evaluates a candidate's ability to design end-to-end ML-driven systems: large-scale, low-latency OS snapshot ingestion, binary anomaly detection, decisioning, and action orchestration, with multi-tenant security, auditability, and failure rollback.

Question

Design an AI-Driven OS Snapshot Anomaly Detection Service

Context

You are designing a cloud service that ingests operating system (OS) snapshots from client machines, runs an AI-based anomaly detector that outputs either Normal or Abnormal, and triggers automated actions for Abnormal cases. Clients must be able to query the history of warnings and actions.

Assume a multi-tenant environment with per-tenant isolation and compliance requirements. Snapshots may arrive as a continuous stream or in batches. End-to-end detection should be near real-time to be operationally useful.

Requirements

Functional
1. Data collection: ingest OS snapshots reliably and at scale.
2. Model serving: run an AI anomaly detector (binary classification).
3. Decisioning: translate model outputs into actions with policies.
4. Action orchestration: perform shutdown or quarantine and send emails, with retries and rollbacks.
5. Audit and history storage: immutable log of detections, actions, and operator overrides; queryable by clients.
6. Access controls: strong authn/authz for agents, services, and users; per-tenant data isolation.
7. Failure rollback: safe defaults, kill switches, undo for actions, and disaster recovery.
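
One way to approach the immutable, client-queryable audit log in item 5 is a hash chain, where each entry commits to the hash of the one before it so that later edits become detectable. The sketch below is illustrative only: the `AuditLog` class, its field names, and the in-memory list are assumptions made for the example, not part of the question, and a real service would persist entries in durable, access-controlled storage.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Append-only log; each entry stores the hash of the previous entry."""
    entries: list = field(default_factory=list)

    def append(self, tenant_id: str, kind: str, payload: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {
            "seq": len(self.entries),
            "tenant_id": tenant_id,
            "kind": kind,  # e.g. "detection", "action", "override"
            "payload": payload,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check that the chain links are intact."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True

    def history(self, tenant_id: str) -> list:
        """Return only the entries that belong to one tenant."""
        return [e for e in self.entries if e["tenant_id"] == tenant_id]
```

Filtering by `tenant_id` in `history` stands in for per-tenant isolation here; in practice that check would be enforced by the authorization layer rather than by the caller passing the right identifier.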

Non-functional
- Low end-to-end latency (seconds) for high-priority events.
- High throughput and elasticity.
- Low false positive rate with safeguards against destructive automated actions.
- Observability, compliance-grade audit trails, and multi-region resilience.
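
The decisioning and safeguard requirements above can be sketched as a small policy function. This is a minimal illustration under stated assumptions: the question only says the detector outputs Normal or Abnormal, so the confidence `score` in [0, 1], the threshold values, and the `Policy`, `Action`, and `decide` names are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY = "notify"
    QUARANTINE = "quarantine"
    SHUTDOWN = "shutdown"


@dataclass(frozen=True)
class Policy:
    # All thresholds are assumed values for illustration, not recommendations.
    notify_threshold: float = 0.5
    quarantine_threshold: float = 0.9
    shutdown_threshold: float = 0.99
    allow_shutdown: bool = False  # per-tenant opt-in for the most destructive action
    kill_switch: bool = False     # when set, automated actions degrade to notify-only


def decide(score: float, policy: Policy) -> Action:
    """Map an assumed anomaly score to an action, applying safeguards first."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < policy.notify_threshold:
        return Action.NONE
    if policy.kill_switch:
        # Safe default: still tell a human, but take no automated action.
        return Action.NOTIFY
    if score >= policy.shutdown_threshold and policy.allow_shutdown:
        return Action.SHUTDOWN
    if score >= policy.quarantine_threshold:
        return Action.QUARANTINE
    return Action.NOTIFY
```

Keeping the policy separate from the model lets operators tighten thresholds or flip the kill switch without redeploying the detector, which is one way to limit the damage from false positives.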

Deliverable

Design the detection service covering: data collection, model serving, decisioning, action orchestration, audit and history storage, access controls, and failure rollback. Discuss latency targets, throughput planning, false-positive trade-offs, and safeguards for automated actions.
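
For the throughput and false-positive discussion, a back-of-envelope calculation is often a useful starting point. The sketch below is only an illustration: the fleet size, snapshot interval, false positive rate, per-replica capacity, and headroom are hypothetical inputs chosen for the example, and none of them come from the question.

```python
import math


def snapshots_per_second(machines: int, interval_s: float) -> float:
    """Average ingest rate if every machine sends one snapshot per interval."""
    return machines / interval_s


def false_alarms_per_day(snapshot_rate: float, false_positive_rate: float) -> float:
    """Expected false Abnormal verdicts per day, treating nearly all traffic as normal."""
    return snapshot_rate * false_positive_rate * 86_400


def replicas_needed(snapshot_rate: float, per_replica_rps: float, headroom: float = 0.5) -> int:
    """Model-serving replicas needed when each runs at (1 - headroom) of its capacity."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return math.ceil(snapshot_rate / (per_replica_rps * (1.0 - headroom)))


if __name__ == "__main__":
    # Hypothetical inputs: 1,000,000 machines, one snapshot per minute,
    # a 0.1% false positive rate, and 500 inferences/s per replica.
    rate = snapshots_per_second(1_000_000, 60)
    print(f"ingest rate: {rate:,.0f} snapshots/s")
    print(f"false alarms: {false_alarms_per_day(rate, 0.001):,.0f} per day")
    print(f"replicas: {replicas_needed(rate, 500)}")
```

With these assumed inputs the ingest rate is roughly 16,700 snapshots per second, and even a 0.1% false positive rate yields about 1.44 million false alarms per day, which is why destructive actions generally need additional gating beyond a single model verdict.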
