This question evaluates the ability to design a scalable, multi-tenant metrics monitoring system, testing competencies in system architecture, ingestion and aggregation pipelines, time-series storage, alerting, capacity planning, and operational observability. It falls under the System Design domain and targets practical, architecture-level application rather than low-level coding. Interviewers commonly ask it to probe reasoning about trade-offs in cloud-native environments (collection models, sharding and replication, high-cardinality handling, retention and backfill, and monitoring the monitoring) while assessing judgment on reliability, latency, and tenant isolation.
You are designing a metrics monitoring system for large-scale, cloud-native microservices running across multiple regions and clusters. Services are ephemeral (containers/autoscaling), and the platform is multi-tenant (infra teams, ML/feature teams, product services). Assume on the order of tens of thousands of hosts and hundreds of thousands of service instances, with strict SLOs for data freshness and alerting.
Make minimal, explicit assumptions as needed and call out trade-offs and guardrails.
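A good answer usually opens with a back-of-envelope capacity estimate derived from the stated scale. The sketch below illustrates that step; every constant in it (instance count, series per instance, scrape interval, bytes per sample) is an illustrative assumption, not a figure given in the question.

```python
# Back-of-envelope ingest and storage estimate.
# All constants are illustrative assumptions chosen to match the
# question's rough scale ("hundreds of thousands of service instances").
INSTANCES = 200_000            # assumed service instances
SERIES_PER_INSTANCE = 100      # assumed active time series per instance
SCRAPE_INTERVAL_S = 15         # assumed collection interval in seconds

active_series = INSTANCES * SERIES_PER_INSTANCE
samples_per_sec = active_series / SCRAPE_INTERVAL_S

BYTES_PER_SAMPLE_RAW = 16.0        # timestamp + float64 value, uncompressed
BYTES_PER_SAMPLE_COMPRESSED = 1.5  # assumed Gorilla-style compression ratio

raw_mb_per_sec = samples_per_sec * BYTES_PER_SAMPLE_RAW / 1e6
compressed_tb_per_day = (
    samples_per_sec * BYTES_PER_SAMPLE_COMPRESSED * 86_400 / 1e12
)

print(f"active series:      {active_series:,}")
print(f"samples/sec:        {samples_per_sec:,.0f}")
print(f"raw ingest:         {raw_mb_per_sec:.1f} MB/s")
print(f"compressed storage: {compressed_tb_per_day:.2f} TB/day")
```

Numbers like these motivate the later design choices: roughly a million-plus samples per second justifies sharding the write path by tenant and series hash, and the compressed daily volume frames the retention and downsampling discussion.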