Walk me through the most impactful infrastructure project on your resume. We will deep dive:
(1) Problem statement, constraints, stakeholders, and success metrics;
(2) High-level architecture, key components, data model, and critical algorithms/APIs;
(3) Scale numbers (QPS, latency, storage), SLOs, and capacity planning;
(4) The most serious incident/failure: what happened, how you mitigated it, and postmortem learnings;
(5) Quantified results and what you would change if given three more months.
Quick Answer: This question evaluates ownership, technical leadership, system design, and operational reliability by requiring an end-to-end deep dive into an infrastructure project covering problem definition, architecture, scaling (QPS, latency, storage), SLOs, incident response, and impact metrics.
Solution
# Example Deep Dive: Real-time Notification Fanout Platform
Below is a model walkthrough you can adapt to your own project. It is intentionally specific, with architecture and numbers, to demonstrate the depth expected.
## 1) Problem, Constraints, Stakeholders, Success Metrics
Problem statement
- Multiple teams were hand-rolling push/email/in-app notifications, causing inconsistent latency, duplicated effort, and reliability issues. We needed a multi-tenant, low-latency fanout platform to:
  - Accept event-based triggers (e.g., "user X commented"),
  - Resolve recipients (e.g., followers or targeted cohorts),
  - Personalize and deliver across channels (mobile push, in-app, email),
  - Provide observability, rate limiting, and compliance.
Constraints
- Low latency: internal processing p95 < 300 ms; end-to-end p99 < 1.5 s for push/in-app.
- Throughput: handle bursts 10x baseline; backpressure and graceful degradation.
- Reliability: 99.95% monthly availability; at-least-once delivery with idempotency at endpoints.
- Privacy/compliance: GDPR/CCPA deletions, user opt-outs, channel preferences.
- Multi-tenant isolation: quotas, circuit breakers, fair scheduling.
- Cost efficiency: targets on CPU and GB-hours per million deliveries; storage capped with tiering.
Stakeholders
- Product teams emitting triggers (producers) and consuming delivery metrics.
- Mobile/web clients (delivery endpoints) and infra/SRE teams.
- Legal/Privacy for data retention and opt-out enforcement.
Success metrics (KPIs)
- Reliability: success rate > 99.95%; no more than 0.05% dropped after retries.
- Latency: internal processing p95 < 300 ms; end-to-end p99 < 1.5 s (excluding third-party variance when applicable).
- Developer experience: onboarding time < 1 week; API error rate < 0.1%.
- Cost: 30% lower infra cost vs. legacy per million deliveries.
- Adoption: migrate N teams; deprecate M legacy pipelines.
## 2) High-level Architecture, Components, Data Model, Algorithms/APIs
High-level architecture (event-driven, multi-tenant)
- Ingestion gateway (HTTP/gRPC): AuthN/Z, schema validation, per-tenant quotas.
- Durable log (Kafka/Pulsar): decouple producers/consumers, support replays.
- Orchestrator/workflow service: routes events, resolves recipients, applies business rules.
- Fanout workers: shard by tenant + user hash; batch and rate-limit deliveries.
- Personalization/templating service: fills templates with user/context data; caches templates.
- Delivery adapters: APNs, FCM, email (SMTP provider), web in-app socket service.
- State stores:
  - Idempotency/dedup cache (Redis with write-through to a persistent store) keyed by (tenant_id, message_id).
  - Recipient graph/profile store (Cassandra/Bigtable) for lookups.
  - Preferences/suppression lists (Cassandra) with a hot cache (Redis).
- Observability: tracing (OpenTelemetry), metrics (Prometheus), logs (ELK), dead-letter queues.
- Control plane: feature flags, circuit breakers, per-tenant quotas, rollout/canary.
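As a concrete illustration of the idempotency/dedup cache in the list above, here is a minimal sketch using Redis `SET NX` with a TTL. The key layout and helper names are hypothetical, not the production implementation.

```python
import redis  # assumes the redis-py client is available

r = redis.Redis(host="localhost", port=6379, db=0)

def already_processed(tenant_id: str, message_id: str, ttl_s: int = 3600) -> bool:
    """Return True if this (tenant_id, message_id) was already accepted.

    SET NX atomically claims the key the first time it is seen; the TTL
    is aligned with the retry window so late duplicates are still dropped.
    """
    key = f"dedup:{tenant_id}:{message_id}"
    # set(..., nx=True) returns True only when the key did not exist yet
    claimed = r.set(key, "1", nx=True, ex=ttl_s)
    return not claimed

# Usage: drop duplicates at the front of the fanout pipeline.
if already_processed("tenant-42", "msg-123"):
    pass  # skip this message; it was already ingested
```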
Data model (logical)
- NotificationRequest: { request_id, tenant_id, actor_id, template_id, recipients_query, channel_prefs, dedupe_key, priority, ttl, metadata }
- NotificationInstance: { instance_id, request_id, tenant_id, user_id, channel, payload, send_state, retries, delivered_at }
- Suppression/Preferences: { user_id, channel, opt_out_reason, last_updated }
- DeliveryReceipt: { instance_id, provider_id, status, latency_ms, error_code }
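A minimal sketch of the logical data model as Python dataclasses; field names follow the list above, while the types and defaults are illustrative assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NotificationRequest:
    request_id: str
    tenant_id: str
    actor_id: str
    template_id: str
    recipients_query: str          # e.g., "followers_of:actor" or a segment reference
    channel_prefs: list[str]       # e.g., ["push", "in_app", "email"]
    dedupe_key: str
    priority: int = 0
    ttl: int = 3600                # seconds
    metadata: dict = field(default_factory=dict)

@dataclass
class NotificationInstance:
    instance_id: str
    request_id: str
    tenant_id: str
    user_id: str
    channel: str                   # "push" | "in_app" | "email"
    payload: dict
    send_state: str = "pending"    # pending | sent | failed | suppressed
    retries: int = 0
    delivered_at: Optional[float] = None  # epoch seconds

@dataclass
class DeliveryReceipt:
    instance_id: str
    provider_id: str
    status: str
    latency_ms: int
    error_code: Optional[str] = None
```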
Critical algorithms and patterns
- Idempotency/dedup: Upsert by (tenant_id, dedupe_key); if a record exists and has not expired, drop the duplicate. TTL aligns with the retry window.
- Sharding: Consistent hashing by (tenant_id, user_id) to distribute fanout. Keeps per-user order when needed.
- Batching: Group up to N instances per channel/provider; adaptive batch size based on observed throttling and latency.
- Rate limiting/backpressure: Token bucket per tenant, per provider, and global. Dynamic refill based on provider responses.
- Retry with jitter: Exponential backoff, bounded (e.g., base 250 ms, max 5 min), full jitter to avoid thundering herds.
- Recipient resolution: Compile recipients_query into a predefined target set (followers, a segment); paginate large sets into batches.
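The retry and rate-limiting patterns above fit in a few lines. In the sketch below, the backoff constants mirror the bullets (250 ms base, 5 min cap); the bucket sizes and usage are illustrative assumptions.

```python
import random
import time

def backoff_with_full_jitter(attempt: int, base_s: float = 0.25, cap_s: float = 300.0) -> float:
    """Exponential backoff with full jitter: sleep a uniform amount
    between 0 and min(cap, base * 2**attempt) to avoid thundering herds."""
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

class TokenBucket:
    """Simple token bucket: refill_rate tokens/s up to a burst capacity."""
    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Usage: one bucket per (tenant, provider); shed or delay when empty.
tenant_bucket = TokenBucket(refill_rate=500, capacity=5000)
for attempt in range(5):
    if tenant_bucket.try_acquire():
        break  # proceed with the delivery call
    time.sleep(backoff_with_full_jitter(attempt))
```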
Public APIs (simplified)
- POST /v1/notify: accepts NotificationRequest; returns request_id.
- GET /v1/status/{request_id}: aggregate status across instances.
- POST /v1/cancel: cancel pending instances by request_id/dedupe_key.
- GET /v1/receipts?request_id=...: delivery receipts (paginated).
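A hypothetical client call against the simplified API above; the host, auth header, and payload fields are placeholders rather than a documented contract.

```python
import requests

BASE = "https://notify.internal.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer <service-token>"}

# Submit a notification request; the body mirrors NotificationRequest.
resp = requests.post(f"{BASE}/v1/notify", headers=HEADERS, json={
    "tenant_id": "tenant-42",
    "actor_id": "user-7",
    "template_id": "comment_created",
    "recipients_query": "followers_of:user-7",
    "channel_prefs": ["push", "in_app"],
    "dedupe_key": "comment:9001",
    "ttl": 3600,
})
request_id = resp.json()["request_id"]

# Poll aggregate status across all fanned-out instances.
status = requests.get(f"{BASE}/v1/status/{request_id}", headers=HEADERS).json()
print(status)
```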
## 3) Scale, SLOs, Capacity Planning
Representative scale
- Ingest QPS: baseline 8k, peak 40k (events/s).
- Fanout factor: median 20, p95 300; peak deliveries ≈ 40k × 300 = 12M instances/s at burst (rare); typical peak ≈ 1.2M/s.
- Payload size: avg 0.8 KB (push/in-app), 2.5 KB (email post-templating).
- Storage: 30-day retention of receipts and audit logs.
SLOs and error budget
- Availability SLO: 99.95% monthly. Error budget per 30-day month = 0.0005 × 30 × 24 × 60 ≈ 21.6 minutes.
- Latency SLOs: internal pipeline p95 < 300 ms; per-channel adapters expose p99 budgets (APNs/FCM variability considered). Separate SLOs per channel to avoid conflation.
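The error-budget arithmetic is easy to script; this sketch reproduces the 21.6-minute figure above and adds a simple burn-rate check (the window and alerting use are illustrative).

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per month for a given availability SLO."""
    return (1 - slo) * days * 24 * 60

budget = monthly_error_budget_minutes(0.9995)   # ≈ 21.6 minutes

def burn_rate(bad_minutes: float, window_minutes: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    allowed_fraction = 1 - slo
    return (bad_minutes / window_minutes) / allowed_fraction

# Example: 14 bad minutes inside a 1-hour window burns ~467x faster than
# sustainable; page when short-window burn rates stay this high.
print(budget, burn_rate(14, 60, 0.9995))
```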
Capacity planning (back-of-envelope)
- Worker concurrency
  - Suppose average per-instance processing time (excluding external provider calls) W = 2 ms.
  - To sustain R deliveries/s, the number of concurrent slots needed is C ≈ R × W.
  - Example: R = 1.2M/s → C = 1.2e6 × 0.002 = 2,400 concurrent slots. With 2× headroom, provision ~4,800 slots. On 48-core boxes running ~50 effective worker threads each, that is roughly 96 boxes (about two racks of 50), or the autoscaled equivalent.
- Kafka partitions
  - Target 10–20 MB/s per partition. If ingress throughput = 40k events/s × 1 KB ≈ 40 MB/s, that needs 2–4 partitions; add headroom → 8–12 partitions. For fanout topics (instances), 12M/s × 200 B keys ≈ 2.4 GB/s → 120–240 partitions. Start at 256 partitions to simplify growth.
- Storage sizing
  - Daily receipts: assume 300M deliveries/day on average; at ~120 B per receipt row → 36 GB/day; with indices and replicas, ~150 GB/day. 30-day hot tier ≈ 4.5 TB; move older data to a cold tier.
- Little's Law sanity check
  - L = λ × W. With internal W = 200 ms end-to-end and λ = 40k req/s, concurrency in the system is L ≈ 8,000 requests; size queues and gateways accordingly.
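The back-of-envelope numbers above can be kept honest with a tiny script; the constants come from the bullets, and everything else is an assumption to adjust for your own system.

```python
# Back-of-envelope capacity planning; inputs mirror the bullets above.
peak_deliveries_per_s = 1.2e6      # typical peak fanout
per_instance_time_s = 0.002        # W = 2 ms, excluding provider calls
headroom = 2.0

concurrent_slots = peak_deliveries_per_s * per_instance_time_s * headroom  # ≈ 4,800

ingress_mb_s = 40_000 * 1.0 / 1000           # 40k events/s × 1 KB ≈ 40 MB/s
partitions_ingress = ingress_mb_s / 15       # target ~10–20 MB/s per partition

fanout_gb_s = 12e6 * 200 / 1e9               # 12M/s × 200 B ≈ 2.4 GB/s
partitions_fanout = fanout_gb_s * 1000 / 15  # ≈ 160; round up to 256

receipts_per_day = 300e6
hot_storage_tb = receipts_per_day * 120 * 4 * 30 / 1e12  # ~4x for indices/replicas, 30 days

# Little's Law: concurrent requests in the internal pipeline.
little_l = 40_000 * 0.2                      # λ × W ≈ 8,000

print(concurrent_slots, partitions_ingress, partitions_fanout, hot_storage_tb, little_l)
```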
Guardrails
- Per-tenant quotas + circuit breakers (drop to DLQ if sustained overload).
- Global kill switch for channels and per-template rollback.
- Canary deployments with automatic rollback on error-rate/latency regressions.
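A per-tenant circuit breaker can be as simple as tracking a rolling error rate and tripping to the DLQ path while it is open; this is an illustrative sketch, not the control-plane implementation.

```python
import time
from collections import deque

class TenantCircuitBreaker:
    """Trips when the recent error rate exceeds a threshold; routes traffic
    to the DLQ path while open, then probes again after a cooldown."""
    def __init__(self, window: int = 200, threshold: float = 0.5, cooldown_s: float = 30.0):
        self.results = deque(maxlen=window)  # rolling window of True/False outcomes
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def record(self, success: bool) -> None:
        self.results.append(success)
        if len(self.results) == self.results.maxlen:
            error_rate = 1 - sum(self.results) / len(self.results)
            if error_rate > self.threshold:
                self.opened_at = time.monotonic()

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooldown_s:
            self.opened_at = None   # half-open: let traffic probe again
            self.results.clear()
            return True
        return False                # open: send to the DLQ instead
```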
## 4) Major Incident, Mitigation, Postmortem
Incident summary
- Symptom: p99 latency spiked from 1.2 s to 9 s, success rate dropped to 98.7% for ~14 minutes during a peak. Delivery backlog ballooned.
- Trigger: A misconfigured template targeted “all active users” instead of a cohort, generating a 100× spike. Our rate limiter protected providers, but the orchestrator retried aggressively, and the dedupe logic broke down under cross-tenant key reuse.
Root causes
- Insufficient per-template quotas; only per-tenant and global were enforced.
- Dedupe key schema: {template_id, user_id} without tenant_id → cross-tenant collisions caused unintended drops/retries.
- Retry policy lacked dynamic adaptation to sustained overload; backoff caps too low.
Mitigation (during incident)
- Activated circuit breaker for the offending template via control plane.
- Increased backoff and enabled global slow mode (reduce batch sizes, raise jitter).
- Provisioned additional fanout workers and expanded Kafka consumer group temporarily to drain safe backlog.
Postmortem actions
- Schema fix: dedupe key now {tenant_id, template_id, user_id}; added validation to reject ambiguous keys.
- Quotas: introduced per-template quotas and budget-based scheduling (tokens issued per minute per template).
- Adaptive retry: provider-aware backoff using feedback signals (429/5xx → exponential with larger cap; success → shrink).
- Preflight guard: dry-run validator for targeting queries estimating fanout cardinality; block if projected > threshold without explicit override.
- Chaos tests and load drills quarterly; runbooks with one-click template disable.
- Added SLO burn alerts tied to error budget consumption, not just static thresholds.
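Two of the fixes above are easy to illustrate: the tenant-scoped dedupe key and the preflight fanout guard. The function names and the threshold below are hypothetical.

```python
def dedupe_key(tenant_id: str, template_id: str, user_id: str) -> str:
    """Post-incident key schema: tenant_id is mandatory, so identical
    template/user pairs in different tenants can no longer collide."""
    if not tenant_id:
        raise ValueError("dedupe key requires tenant_id")
    return f"{tenant_id}:{template_id}:{user_id}"

MAX_FANOUT_WITHOUT_OVERRIDE = 1_000_000  # illustrative threshold

def preflight_check(estimated_recipients: int, override: bool = False) -> None:
    """Dry-run guard: block sends whose projected fanout exceeds the
    threshold unless an explicit override was supplied."""
    if estimated_recipients > MAX_FANOUT_WITHOUT_OVERRIDE and not override:
        raise RuntimeError(
            f"Projected fanout {estimated_recipients} exceeds limit; "
            "require an explicit override or narrower targeting."
        )
```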
Impact
- Duration: 14 minutes; consumed ~65% of the monthly error budget. No data loss; a small percentage of notifications was delivered late, outside the SLA.
## 5) Results and Next Iterations
Quantified results
- Latency: internal p95 from 420 ms → 180 ms (−57%); end-to-end p99 improved ~28% after batching and rate-limit tuning.
- Reliability: 99.96% availability over 6 months; retries reduced by 35% with adaptive policies.
- Cost: 38% lower compute per million deliveries via batching, zero-copy serialization, and right-sizing JVM heaps.
- Adoption: 14 teams migrated; decommissioned 9 legacy pipelines. Developer onboarding time averaged 4.5 days.
- Product impact: Some surfaces saw +12–18% CTR vs. legacy delivery due to timeliness and channel selection.
What I’d do with three more months
- Multi-armed bandit for channel selection (push vs. email vs. in-app) optimizing open/click given user context; guarded by per-user fatigue caps.
- Self-serve targeting DSL with static analysis and cardinality estimates, plus a UI showing projected blast radius, cost, and SLO risk.
- Multi-region active-active with topic-level replication (cluster linking) and per-tenant failover playbooks; test RTO/RPO regularly.
- End-to-end exactly-once for the in-app channel using idempotent writes and deduped read models; maintain at-least-once for push/email.
- Full protobuf/Avro schema registry with backward/forward compatibility tests in CI; canary consumer verification before rollout.
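As an indication of what the channel-selection bandit mentioned above might look like, here is a Beta-Bernoulli Thompson sampling sketch; the channel list, reward definition, and fatigue handling are assumptions, not a committed design.

```python
import random

class ChannelBandit:
    """Thompson sampling over channels with Beta(1, 1) priors; reward is
    1 if the notification was opened/clicked, else 0."""
    def __init__(self, channels=("push", "in_app", "email")):
        self.alpha = {c: 1.0 for c in channels}  # successes + 1
        self.beta = {c: 1.0 for c in channels}   # failures + 1

    def choose(self, eligible):
        # Sample a plausible open rate per eligible channel and pick the best.
        samples = {c: random.betavariate(self.alpha[c], self.beta[c]) for c in eligible}
        return max(samples, key=samples.get)

    def update(self, channel, opened: bool):
        if opened:
            self.alpha[channel] += 1
        else:
            self.beta[channel] += 1

# Usage: fatigue caps and opt-outs filter the eligible set before the bandit runs.
bandit = ChannelBandit()
channel = bandit.choose(eligible=["push", "in_app"])
bandit.update(channel, opened=True)
```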
How to adapt this template to your own project
- Swap the domain (e.g., feature flag service, rate limiter, ingestion pipeline) but keep the narrative: problem → design → scale → incident → impact.
- Provide concrete numbers, APIs, and data models from your experience.
- Tie choices to constraints (SLOs, cost, privacy). Show tradeoffs and what you’d improve next.