Design autonomous cloud monitoring and remediation

This is an ML System Design interview question from Google for Software Engineer roles.

Question

Design an AI-Assisted Monitoring and Auto-Remediation Service

Context

Design a service that monitors cloud applications across multiple providers, collects telemetry (metrics, logs, traces, events), invokes an AI-based analyzer to detect incidents, and automatically takes actions such as shutting down or network-isolating instances. The system must work at scale and be resilient to noisy alerts.

Requirements

Functional

Data ingestion
- Support metrics, logs, traces, events; streaming and near real-time.
- Schema/versioning, tenant isolation, and backpressure handling.
Model serving
- Real-time scoring; model registry/versioning; feature store.
Rule and AI fusion
- Combine deterministic rules with ML outputs to decide severity and actions.
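
One way to picture the fusion step is a small decision function that takes a rule result and an ML anomaly score and returns a severity plus an action. The sketch below is only an illustration: the `Signal` fields, the thresholds, and the action names (`auto_isolate`, `page_human`, `open_ticket`) are assumptions, not part of the question.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Signal:
    rule_fired: bool         # a deterministic rule matched (hypothetical input)
    rule_severity: Severity  # severity the rule assigns when it fires
    anomaly_score: float     # ML analyzer output, assumed calibrated to [0, 1]


def fuse(signal: Signal, warn_at: float = 0.6, crit_at: float = 0.9) -> Tuple[Severity, str]:
    """Return (severity, action). Thresholds are illustrative, not tuned."""
    if signal.anomaly_score >= crit_at:
        ml_sev = Severity.CRITICAL
    elif signal.anomaly_score >= warn_at:
        ml_sev = Severity.WARNING
    else:
        ml_sev = Severity.INFO

    rule_sev = signal.rule_severity if signal.rule_fired else Severity.INFO
    severity = max(rule_sev, ml_sev)

    if severity == Severity.CRITICAL:
        # Act automatically only when rule and model agree; otherwise ask a human.
        action = "auto_isolate" if rule_sev == ml_sev else "page_human"
    elif severity == Severity.WARNING:
        action = "open_ticket"
    else:
        action = "none"
    return severity, action
```

Requiring agreement before a destructive action is one possible policy; a candidate could equally argue for weighted voting or per-runbook policies.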
Action orchestration
- Execute runbooks: e.g., instance shutdown, quarantine via network policies, restart, scale-out.
- Idempotency, retries, connectors to major cloud providers.
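
Idempotency and retries can be sketched as an executor that records the result of each action under an idempotency key and retries only on transient errors. This is a minimal sketch under stated assumptions: `ActionExecutor` is a hypothetical name, the in-memory dictionary stands in for a durable store, and `ConnectionError` stands in for whatever transient failures a real cloud connector raises.

```python
import time
from typing import Callable, Dict, Optional


class ActionExecutor:
    """Runs each remediation at most once per idempotency key, with bounded retries."""

    def __init__(self, max_attempts: int = 3, base_delay_s: float = 0.0):
        self.max_attempts = max_attempts
        self.base_delay_s = base_delay_s
        self._results: Dict[str, str] = {}  # in-memory stand-in for a durable store

    def execute(self, key: str, action: Callable[[], str]) -> str:
        # A repeated request with the same key returns the recorded result
        # instead of running the action again.
        if key in self._results:
            return self._results[key]

        last_error: Optional[Exception] = None
        for attempt in range(self.max_attempts):
            try:
                result = action()
                self._results[key] = result
                return result
            except ConnectionError as err:  # assumed to be transient
                last_error = err
                if attempt + 1 < self.max_attempts:
                    # Exponential backoff between attempts.
                    time.sleep(self.base_delay_s * (2 ** attempt))
        raise RuntimeError(
            f"action {key!r} failed after {self.max_attempts} attempts"
        ) from last_error
```

A key such as `"<incident id>:<action>:<instance id>"` (a hypothetical convention) lets a redelivered incident event reuse the earlier result rather than shutting the instance down twice.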
Safety checks
- Human-in-the-loop where needed, blast-radius limits, budgets, kill switches, canaries.
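
Blast-radius limits and a kill switch can be illustrated with a guard that every automated action must pass before it runs. The sketch assumes an absolute budget of concurrently isolated instances; `BlastRadiusGuard` and its fields are hypothetical names chosen for this example.

```python
from typing import Set


class BlastRadiusGuard:
    """Caps how many distinct instances automation may isolate at once."""

    def __init__(self, max_isolated: int):
        self.max_isolated = max_isolated
        self.kill_switch = False          # operators flip this to stop all automation
        self._isolated: Set[str] = set()

    def allow(self, instance_id: str) -> bool:
        if self.kill_switch:
            return False
        if instance_id in self._isolated:
            return True                   # already counted against the budget
        if len(self._isolated) >= self.max_isolated:
            return False                  # budget exhausted; escalate to a human
        self._isolated.add(instance_id)
        return True

    def release(self, instance_id: str) -> None:
        """Return budget when an instance is restored or the isolation expires."""
        self._isolated.discard(instance_id)
```

In a real design the budget would likely be scoped per tenant, region, or service rather than global.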
Audit logs
- Append-only, tamper-evident logging of telemetry-derived incidents, decisions, and actions.
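
Tamper evidence is commonly achieved by chaining hashes, so that each entry commits to the one before it. The following is a minimal in-memory sketch of that idea using SHA-256; the `AuditLog` class and the record fields are assumptions, and a production system would also need durable, write-once storage and externally anchored checkpoints.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(prev_hash: str, record: Dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[Dict] = []

    def append(self, record: Dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev_hash, record)
        self._entries.append({"record": record, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Verification only detects tampering; it does not prevent it, which is why the chain head is usually published or stored somewhere the writer cannot modify.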
Rollback
- Automatic or manual rollback with state capture and time-bound isolation.

Non-Functional

Multi-cloud support (e.g., AWS, Azure, GCP; on-prem optional).
Scale and performance (define SLOs/latency, horizontal scaling, capacity planning).
Noisy alert reduction (deduplication, rate limiting, correlation, adaptive thresholds).
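
Deduplication, the simplest of the noise-reduction techniques listed, can be sketched as a suppression window keyed by tenant, resource, and alert type. The class name, key fields, and the 300-second default are assumptions made for illustration; correlation and adaptive thresholds would sit on top of something like this.

```python
from typing import Dict, Tuple


class AlertDeduplicator:
    """Suppresses repeats of the same alert within a fixed time window."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._last_emitted: Dict[Tuple[str, str, str], float] = {}

    def should_emit(self, tenant: str, resource: str, alert_type: str, now: float) -> bool:
        key = (tenant, resource, alert_type)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.window_s:
            return False              # duplicate inside the window: drop it
        self._last_emitted[key] = now  # first sighting, or the window has elapsed
        return True
```

Passing `now` explicitly keeps the logic deterministic and testable; a live service would supply a monotonic clock reading.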

Deliverables

Architecture with key components and data flow.
Choices/trade-offs for ingestion, storage, model serving, fusion, orchestration.
Safety and governance mechanisms.
Plan for multi-cloud integration, scaling, and noisy alert handling.