Design a Real-Time Monitoring System
Company: DoorDash
Role: Software Engineer
Category: System Design
Difficulty: easy
Interview Round: Onsite
Design a real-time monitoring system for a large production environment.
The system should:
- collect time-series metrics such as CPU, memory, request count, latency, and error rate from agents running on about 100,000 hosts,
- support near-real-time dashboards for engineers,
- evaluate alert rules and send notifications quickly when thresholds are breached,
- store high-resolution recent data and lower-resolution historical data for long-term retention,
- remain reliable during traffic spikes and partial infrastructure failures.
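A quick sizing pass helps anchor the rest of the discussion. The sketch below is a back-of-envelope estimate only: the host count comes from the prompt, while the series per host, the reporting interval, and the bytes per sample are illustrative assumptions that a candidate would confirm with the interviewer.

```python
# Back-of-envelope ingest sizing.
# Only HOSTS comes from the prompt; every other constant is an assumption.
HOSTS = 100_000
SERIES_PER_HOST = 100      # assumption: distinct metric series reported per host
REPORT_INTERVAL_S = 10     # assumption: each series is reported every 10 seconds
BYTES_PER_SAMPLE = 16      # assumption: 8-byte timestamp + 8-byte float, uncompressed


def samples_per_second(hosts: int, series_per_host: int, interval_s: int) -> float:
    """Steady-state samples arriving at the ingestion tier per second."""
    return hosts * series_per_host / interval_s


def raw_bytes_per_day(rate: float, bytes_per_sample: int) -> float:
    """Uncompressed bytes written per day at the given sample rate."""
    return rate * bytes_per_sample * 86_400


rate = samples_per_second(HOSTS, SERIES_PER_HOST, REPORT_INTERVAL_S)
daily = raw_bytes_per_day(rate, BYTES_PER_SAMPLE)
print(f"{rate:,.0f} samples/s, {daily / 1e12:.2f} TB/day before compression")
```

Under these assumptions the ingestion tier sees about one million samples per second, which is the figure that drives partitioning, buffering, and storage choices later in the design.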
Discuss the requirements, APIs or data model, ingestion pipeline, storage design, query path, alerting architecture, scaling strategy, retention policy, and fault tolerance.
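For the retention policy, one common approach is to roll high-resolution points up into fixed windows and keep only the aggregates for older data. The following is a minimal sketch of that idea, not a prescribed design: the `Rollup` type, the `downsample` function, and the 60-second window are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Rollup(NamedTuple):
    """Aggregate for one window of one series (hypothetical schema)."""
    window_start: int   # Unix seconds, aligned to the window size
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def downsample(points: Iterable[Tuple[int, float]], window_s: int) -> List[Rollup]:
    """Group (timestamp, value) points into aligned windows and aggregate each."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return [
        Rollup(start, len(vals), sum(vals), min(vals), max(vals))
        for start, vals in sorted(buckets.items())
    ]


# Five raw points collapse into two 60-second rollups.
raw = [(0, 10.0), (10, 20.0), (50, 90.0), (60, 30.0), (70, 50.0)]
rollups = downsample(raw, window_s=60)
for r in rollups:
    print(r.window_start, r.count, r.minimum, r.maximum, r.mean)
```

Storing the count and sum rather than only the mean lets coarser rollups be computed from finer ones without rereading raw data, and keeping the maximum preserves spikes that an average would hide.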
Quick Answer: This question tests system design and distributed-systems skills: scalable time-series ingestion, storage, querying, alerting, retention policy, and fault tolerance when monitoring a large production fleet. Category: System Design. Domain: observability and time-series data.
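On the alerting side, a threshold rule is often paired with a hold duration so that a single noisy sample does not page anyone. The sketch below shows one way to model that state machine for a single series. The `ThresholdRule` class, its field names, and the 5% error-rate threshold are hypothetical, and a real evaluator would also need deduplication, routing, and handling for missing data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Fires once a value has stayed above `threshold` for `for_seconds` (hypothetical model)."""
    name: str
    threshold: float
    for_seconds: int
    breach_started: Optional[int] = None  # timestamp of the first sample in the current breach
    firing: bool = False

    def observe(self, ts: int, value: float) -> Optional[str]:
        """Feed one sample; return "FIRING" or "RESOLVED" on a state change, else None."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = ts
            if not self.firing and ts - self.breach_started >= self.for_seconds:
                self.firing = True
                return "FIRING"
            return None
        # Value is back at or below the threshold: reset the breach window.
        self.breach_started = None
        if self.firing:
            self.firing = False
            return "RESOLVED"
        return None


rule = ThresholdRule("high_error_rate", threshold=0.05, for_seconds=60)
samples = [(0, 0.01), (30, 0.08), (60, 0.09), (90, 0.07), (120, 0.02)]
events = [(ts, rule.observe(ts, value)) for ts, value in samples]
transitions = [(ts, event) for ts, event in events if event]
print(transitions)
```

Here the breach starts at t=30, the rule fires at t=90 once the breach has lasted 60 seconds, and it resolves at t=120 when the error rate drops back under the threshold.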