Design a Distributed Metrics Counter
Company: Stripe
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Onsite
Design a distributed metrics counter service for internal product telemetry.
The service should support:
- `increment(metric_name, delta, tags, timestamp)`
- `get_count(metric_name, tag_filters, start_time, end_time, granularity)`
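To make the two calls concrete, here is a minimal single-process sketch in Python. The class name `InMemoryCounterStore`, integer epoch-second timestamps, string-to-string tags, exact-match tag filters, and the half-open interval `[start_time, end_time)` are all assumptions for illustration; the prompt does not specify any of them.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Tuple

# Assumed granularity names; the prompt asks for second, minute, and hour.
GRANULARITY_SECONDS = {"second": 1, "minute": 60, "hour": 3600}

TagKey = Tuple[Tuple[str, str], ...]


class InMemoryCounterStore:
    """Hypothetical stand-in for the distributed service (one process, no durability)."""

    def __init__(self) -> None:
        # (metric_name, sorted tag pairs, epoch second) -> summed delta
        self._cells: Dict[Tuple[str, TagKey, int], int] = defaultdict(int)

    def increment(self, metric_name: str, delta: int,
                  tags: Mapping[str, str], timestamp: int) -> None:
        # Sorting the tag pairs makes the key independent of dict ordering.
        tag_key = tuple(sorted(tags.items()))
        self._cells[(metric_name, tag_key, int(timestamp))] += delta

    def get_count(self, metric_name: str, tag_filters: Mapping[str, str],
                  start_time: int, end_time: int,
                  granularity: str) -> List[Tuple[int, int]]:
        """Return (bucket_start, count) pairs for start_time <= ts < end_time."""
        step = GRANULARITY_SECONDS[granularity]
        buckets: Dict[int, int] = defaultdict(int)
        for (name, tag_key, ts), value in self._cells.items():
            if name != metric_name or not (start_time <= ts < end_time):
                continue
            tags = dict(tag_key)
            # A series matches when every filter tag is present with that value.
            if all(tags.get(k) == v for k, v in tag_filters.items()):
                buckets[ts - ts % step] += value
        return sorted(buckets.items())
```

The full scan in `get_count` is only acceptable in a sketch; a real read path would index by metric and time range rather than iterate over every cell.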
Requirements:
- ingest millions of counter updates per second from many services;
- provide near-real-time visibility, with recent data queryable within a few seconds;
- support aggregation by second, minute, and hour;
- remain available during machine or zone failures;
- store data durably for historical queries;
- handle hot metrics, retries, and high-cardinality tags;
- explain tradeoffs between exact and approximate counting.
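The hot-metric and retry requirements interact, so a small sketch may help. It assumes each update carries a client-generated `event_id`, which is not part of the API above, and uses hypothetical names (`shard_for`, `Shard`, `read_total`). One hot metric is spread across `fanout` shards, and each shard drops events it has already applied.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def shard_for(metric_name: str, event_id: str,
              num_shards: int, fanout: int) -> int:
    """Pick one of `fanout` consecutive shards for this metric.

    The sub-shard depends only on event_id, so a retry of the same event is
    routed to the same shard and can be deduplicated there.
    """
    return (_hash(metric_name) + _hash(event_id) % fanout) % num_shards


class Shard:
    def __init__(self) -> None:
        self.counts: Dict[Tuple[str, int], int] = defaultdict(int)
        # Unbounded in this sketch; a real shard would expire IDs after
        # an assumed retry window.
        self.seen: Set[str] = set()

    def apply(self, event_id: str, metric_name: str,
              delta: int, timestamp: int) -> bool:
        """Apply the update once; return False for a duplicate delivery."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.counts[(metric_name, timestamp)] += delta
        return True


def read_total(shards: List[Shard], metric_name: str, timestamp: int) -> int:
    # Reads pay for write fan-out by summing the partial counts.
    # .get avoids creating empty entries in the defaultdicts.
    return sum(s.counts.get((metric_name, timestamp), 0) for s in shards)
```

The tradeoff shown here is that a larger `fanout` relieves the write hot spot but makes every read of that metric touch more shards.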
Describe the API, data model, write path, read path, sharding, storage choices, rollups, failure recovery, and scaling strategy.
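For the rollup part of the answer, one possible sketch is below. It assumes per-second counts are held as a mapping from epoch second to count, and the function name `rollup` is hypothetical. Because counter sums are associative, hourly buckets can be built from minute buckets instead of rescanning raw seconds.

```python
from collections import defaultdict
from typing import Dict


def rollup(counts: Dict[int, int], step_seconds: int) -> Dict[int, int]:
    """Fold fine-grained buckets into coarser ones aligned to step_seconds."""
    out: Dict[int, int] = defaultdict(int)
    for ts, value in counts.items():
        # Align each timestamp down to the start of its coarser bucket.
        out[ts - ts % step_seconds] += value
    return dict(out)


# Usage: build minute buckets from seconds, then hour buckets from minutes.
per_second = {3600: 1, 3601: 2, 3660: 4, 7199: 8, 7200: 16}
per_minute = rollup(per_second, 60)
per_hour = rollup(per_minute, 3600)
```

Late-arriving updates are not handled here; a design answer would need to say whether closed buckets are re-aggregated or adjusted incrementally.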
Quick Answer: This question tests whether a candidate can design a scalable, highly available distributed system for real-time telemetry ingestion and aggregation. It covers sharding, replication, durable storage, rollups, consistency tradeoffs, and exact versus approximate counting.
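One way to illustrate the exact-versus-approximate tradeoff is client-side sampling, sketched below. This is only one of several possible approximations, and the class name `SampledCounter` and the `sample_rate` parameter are hypothetical. Each update is kept with probability `sample_rate` and the kept sum is scaled by `1 / sample_rate`, which gives an unbiased estimate at a fraction of the write volume.

```python
import random
from typing import Optional


class SampledCounter:
    """Approximate counter: keeps a random subset of updates and rescales."""

    def __init__(self, sample_rate: float,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()
        self._sampled_sum = 0

    def increment(self, delta: int = 1) -> None:
        # At sample_rate 1.0 every update is kept, so the count is exact.
        if self.sample_rate >= 1.0 or self._rng.random() < self.sample_rate:
            self._sampled_sum += delta

    def estimate(self) -> float:
        # Dividing by the sampling probability corrects for dropped updates.
        return self._sampled_sum / self.sample_rate
```

The relative error shrinks as the number of updates grows, so sampling suits high-volume metrics better than rare events, where exact counting is usually cheap enough anyway.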