Design Event Email System
Company: StubHub
Role: Software Engineer
Category: System Design
Difficulty: Medium
Interview Round: Onsite
##### Question
Design an event system that can manage at least **1,000,000 concurrent events** and send notification emails to every registered participant of each event (e.g., reminders, schedule changes/cancellations, post-event follow-ups). Walk through the full design and address the following:
1. **Core APIs** — create/update/cancel an event, register a participant, and schedule or immediately trigger emails for an event (include idempotency on writes).
2. **High-level architecture** — the services and message-bus/queue topology that turn an event trigger into per-recipient emails.
3. **Data model and storage strategy** — schemas, indexes, partitioning/sharding, and retention; relational vs. NoSQL trade-offs.
4. **End-to-end email delivery pipeline** — producers, queues, consumers, fan-out, batching, retries, idempotency keys, deduplication, and bounce/complaint handling via an ESP.
5. **Capacity estimates** — QPS, fan-out per event, and peak email throughput, with the math behind them.
6. **Scalability** — partitioning, horizontal autoscaling of workers, and backpressure.
7. **Scheduling strategies** — time-based reminders vs. event-driven triggers, a distributed scheduler, and how to avoid thundering-herd spikes.
8. **Rate limiting and quotas** — across tenants, ESP providers, and recipient domains; provider-quota adherence.
9. **Delivery semantics** — at-least-once vs. exactly-once, dead-letter queues, and reprocessing.
10. **Failure handling and reliability** — retries with backoff, the transactional outbox pattern, and how the system recovers from a regional outage or provider throttling.
11. **Observability** — metrics, SLOs, tracing, and alerting.
12. **Abuse prevention and compliance** — unsubscribe/suppression lists, double opt-in, CAN-SPAM/GDPR.
13. **Multi-region availability and disaster recovery**, with the consistency trade-offs involved.
14. **Cost considerations** — a rough estimate and a managed-services-vs-self-hosting discussion.
Walk through one concrete failure scenario (e.g., a regional outage or provider throttling) and explain how the system degrades and recovers.
Quick Answer: This StubHub software engineer onsite system design question asks for an event system that manages at least 1,000,000 concurrent events and emails every registered participant of each event. It tests scale estimation, data modeling, an idempotent event-driven email delivery pipeline built on an ESP, rate limiting across tenants, providers, and recipient domains, scheduling, observability, compliance, and multi-region disaster recovery.
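One way to make the idempotency and deduplication part of the question concrete is a deterministic per-recipient key. The sketch below is a minimal illustration, not a prescribed answer. All names (`idempotency_key`, `InMemoryDedupStore`, `send_once`) and the key fields (event ID, recipient, notification kind, content version) are assumptions made for this example. The in-memory set stands in for whatever shared store a real design would use, and lowercasing the whole address is a simplifying assumption.

```python
import hashlib
from typing import Callable


def idempotency_key(event_id: str, recipient: str, kind: str, version: int) -> str:
    """Derive a stable key for one (event, recipient, notification kind, version) send."""
    # Assumption: addresses are compared case-insensitively and without outer whitespace.
    normalized = recipient.strip().lower()
    # The unit-separator character keeps field boundaries unambiguous.
    raw = "\x1f".join([event_id, normalized, kind, str(version)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Hypothetical stand-in for a shared store with an atomic set-if-absent operation."""

    def __init__(self) -> None:
        self._seen = set()

    def claim(self, key: str) -> bool:
        # Returns True only for the first caller that claims this key.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def release(self, key: str) -> None:
        # Lets a later retry claim the key again after a failed send.
        self._seen.discard(key)


def send_once(store: InMemoryDedupStore, key: str, send: Callable[[], None]) -> bool:
    """Call send() unless the key was already claimed; return True if send() ran."""
    if not store.claim(key):
        return False  # Duplicate delivery of the same queue message: skip it.
    try:
        send()
    except Exception:
        store.release(key)  # Free the key so a retry with backoff can try again.
        raise
    return True
```

With at-least-once queues, a worker may receive the same message twice. Under these assumptions, the second delivery maps to the same key and is skipped, while a schedule change that bumps `version` produces a new key and a new email.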