PracHub

Design autonomous cloud monitoring and remediation

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to design scalable, resilient ML-enabled monitoring and automated remediation systems, testing competencies in distributed systems architecture, observability (metrics, logs, traces), real-time model serving, action orchestration, and safety/governance.



Company: Google

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen

Design a monitoring service for cloud applications that collects telemetry, invokes an AI-based analyzer, and automatically takes actions such as shutting down or network-isolating instances. Define data ingestion, model serving, rule and AI fusion, action orchestration, safety checks, audit logs, and rollback. Address multi-cloud support, scale, and noisy alerts.


Sep 6, 2025, 12:00 AM

Design an AI-Assisted Monitoring and Auto-Remediation Service

Context

Design a service that monitors cloud applications across multiple providers, collects telemetry (metrics, logs, traces, events), invokes an AI-based analyzer to detect incidents, and automatically takes actions such as shutting down or network-isolating instances. The system must work at scale and be resilient to noisy alerts.

Requirements

Functional

  1. Data ingestion
    • Support metrics, logs, traces, events; streaming and near real-time.
    • Schema/versioning, tenant isolation, and backpressure handling.
  2. Model serving
    • Real-time scoring; model registry/versioning; feature store.
  3. Rule and AI fusion
    • Combine deterministic rules with ML outputs to decide severity and actions.
  4. Action orchestration
    • Execute runbooks: e.g., instance shutdown, quarantine via network policies, restart, scale-out.
    • Idempotency, retries, connectors to major cloud providers.
  5. Safety checks
    • Human-in-the-loop where needed, blast-radius limits, budgets, kill switches, canaries.
  6. Audit logs
    • Append-only, tamper-evident logging of telemetry-derived incidents, decisions, and actions.
  7. Rollback
    • Automatic or manual rollback with state capture and time-bound isolation.
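To make requirements 3 to 5 concrete, here is a minimal sketch of how deterministic rule hits and a model score might be fused into a proposed action, and how a safety gate could enforce a blast-radius budget and a kill switch before anything destructive runs. The rule names, thresholds, `Signal` fields, and the `SafetyGate` class are all hypothetical and chosen only for illustration; a real design would tune them per tenant and per runbook.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAGE_HUMAN = "page_human"
    ISOLATE = "isolate"
    SHUTDOWN = "shutdown"


@dataclass(frozen=True)
class Signal:
    instance_id: str
    rule_hits: frozenset      # names of deterministic rules that fired
    anomaly_score: float      # hypothetical model output in [0, 1]


# Hypothetical rule names and thresholds, for illustration only.
CRITICAL_RULES = frozenset({"known_malware_hash", "credential_exfiltration"})
HIGH_SCORE = 0.9
MEDIUM_SCORE = 0.6


def fuse(signal: Signal) -> Action:
    """Combine rule hits and the model score into a proposed action."""
    critical = bool(signal.rule_hits & CRITICAL_RULES)
    if critical and signal.anomaly_score >= HIGH_SCORE:
        return Action.SHUTDOWN          # rules and model agree strongly
    if critical or signal.anomaly_score >= HIGH_SCORE:
        return Action.ISOLATE           # one strong signal: reversible action
    if signal.rule_hits or signal.anomaly_score >= MEDIUM_SCORE:
        return Action.PAGE_HUMAN        # weak evidence: human-in-the-loop
    return Action.NONE


class SafetyGate:
    """Downgrades destructive actions once a blast-radius budget is spent."""

    def __init__(self, max_destructive: int, kill_switch: bool = False):
        self.max_destructive = max_destructive
        self.kill_switch = kill_switch
        self.acted_on = set()           # instances already isolated or shut down

    def approve(self, instance_id: str, proposed: Action) -> Action:
        if proposed in (Action.NONE, Action.PAGE_HUMAN):
            return proposed
        if self.kill_switch:
            return Action.PAGE_HUMAN    # automation disabled globally
        if instance_id in self.acted_on:
            return proposed             # idempotent retry, consumes no budget
        if len(self.acted_on) >= self.max_destructive:
            return Action.PAGE_HUMAN    # budget exhausted: escalate instead
        self.acted_on.add(instance_id)
        return proposed


gate = SafetyGate(max_destructive=1)
first = Signal("i-1", frozenset({"known_malware_hash"}), 0.95)
second = Signal("i-2", frozenset(), 0.95)
print(gate.approve(first.instance_id, fuse(first)))
print(gate.approve(second.instance_id, fuse(second)))
```

In this sketch the second instance is escalated to a human rather than isolated, because the budget of one destructive action has already been spent. Keeping the gate separate from the fusion logic is one way to let safety limits change without retraining or redeploying the analyzer.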

Non-Functional

  • Multi-cloud support (e.g., AWS, Azure, GCP; on-prem optional).
  • Scale and performance (define SLOs/latency, horizontal scaling, capacity planning).
  • Noisy alert reduction (deduplication, rate limiting, correlation, adaptive thresholds).
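As a rough illustration of the deduplication and rate-limiting ideas in the last bullet, the sketch below suppresses repeats of the same alert fingerprint within a quiet window and caps how many distinct alerts a tenant can emit per time window. The class name `AlertSuppressor`, its parameters, and the notion of a precomputed `fingerprint` string are assumptions made for this example; correlation and adaptive thresholds are out of scope here.

```python
from collections import defaultdict, deque


class AlertSuppressor:
    """Per-tenant deduplication plus a sliding-window rate limit."""

    def __init__(self, dedup_window_s: float, max_per_window: int,
                 rate_window_s: float):
        self.dedup_window_s = dedup_window_s
        self.max_per_window = max_per_window
        self.rate_window_s = rate_window_s
        self._last_seen = {}                 # (tenant, fingerprint) -> last timestamp
        self._recent = defaultdict(deque)    # tenant -> timestamps of emitted alerts

    def should_emit(self, tenant: str, fingerprint: str, now: float) -> bool:
        key = (tenant, fingerprint)
        last = self._last_seen.get(key)
        # Record every sighting, so an alert that keeps firing stays suppressed
        # until it has been quiet for a full dedup window.
        self._last_seen[key] = now
        if last is not None and now - last < self.dedup_window_s:
            return False                     # duplicate of a recent alert

        recent = self._recent[tenant]
        while recent and now - recent[0] >= self.rate_window_s:
            recent.popleft()                 # drop emissions outside the window
        if len(recent) >= self.max_per_window:
            return False                     # tenant is over its alert budget
        recent.append(now)
        return True


suppressor = AlertSuppressor(dedup_window_s=60, max_per_window=2, rate_window_s=10)
events = [("t1", "disk_full", 0), ("t1", "disk_full", 1),
          ("t1", "cpu_high", 2), ("t1", "oom", 3)]
for tenant, fingerprint, ts in events:
    print(fingerprint, ts, suppressor.should_emit(tenant, fingerprint, ts))
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the logic deterministic and easy to test. In a distributed deployment the two dictionaries would likely live in a shared store keyed by tenant.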

Deliverables

  • Architecture with key components and data flow.
  • Choices/trade-offs for ingestion, storage, model serving, fusion, orchestration.
  • Safety and governance mechanisms.
  • Plan for multi-cloud integration, scaling, and noisy alert handling.
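One governance mechanism that fits the audit-log requirement is a hash chain, where each entry's hash covers the previous entry's hash, so any later modification of a stored record is detectable on verification. The following is a minimal in-memory sketch under that assumption; the `AuditLog` class and the record fields are hypothetical, and a production system would also need durable write-once storage and periodic external anchoring of the latest hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log whose entries are linked by SHA-256 hashes."""

    def __init__(self):
        self._entries = []  # list of (record, entry_hash) tuples

    def append(self, record: dict) -> str:
        prev = self._entries[-1][1] if self._entries else GENESIS
        entry_hash = _digest(prev, record)
        self._entries.append((dict(record), entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; False if any record or link was altered."""
        prev = GENESIS
        for record, entry_hash in self._entries:
            if _digest(prev, record) != entry_hash:
                return False
            prev = entry_hash
        return True


log = AuditLog()
log.append({"incident": "inc-1", "decision": "isolate", "instance": "i-1"})
log.append({"incident": "inc-1", "action": "network_policy_applied"})
print(log.verify())
```

Because each hash depends on everything before it, rewriting an early decision would require recomputing every later hash, which is what makes tampering evident when the latest hash is published somewhere the writer cannot change.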

