How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Technical Screen rounds at DoorDash.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at DoorDash during technical interviews.

Design Real-Time Driver Pay Aggregation | DoorDash Interview Question

Q: Design Real-Time Driver Pay Aggregation

This question evaluates competence in designing scalable, fault-tolerant streaming systems and implementing correct time-windowed aggregation logic, testing skills in distributed systems, stream processing, and data correctness; it belongs to the system design domain and streaming/real-time data engineering and spans both conceptual architectural thinking and practical application-level concerns. It is commonly asked to probe reasoning about correctness under at-least-once delivery and messy inputs (out-of-order, duplicate, late, or malformed events), operational resilience, and trade-offs between throughput, latency, and auditability in production-scale pipelines.

Design a production system that ingests a high-volume stream of delivery-order events and continuously computes each driver's (Dasher's) daily pay.

The core pay logic is simple in isolation: given a driver's sequence of order events for a day, sum the pay over the time they were actively working. The twist is a peak-period multiplier — during designated peak windows, the pay rate is multiplied (e.g., $\times 2$ ), so any work interval that straddles a peak boundary must be split at the boundary and each sub-interval paid at its correct rate.

The interview starts from this computation and deliberately expands into a full streaming/system-design discussion: how do you run this correctly and reliably at DoorDash scale, with messy input, huge volume, and real production failure modes?

Constraints & Assumptions

Input: JSON order events published by upstream services. Each event carries identifiers and a timestamp; the exact schema is open and should be agreed with the interviewer.
Delivery semantics: at-least-once. Events may arrive duplicated , out of order , or late ; some may be malformed .
Scale: a full day's raw events for all drivers may not fit in memory on a single machine.
Availability: the system must keep producing correct daily totals across consumer crashes and service outages .
Peak windows: a set of [start, end) time ranges per day with an associated multiplier; intervals are half-open so boundaries don't double-count.
Correctness: pay numbers feed real payouts — auditability and reproducibility matter more than shaving the last millisecond of latency.

Part 1 — Core pay computation with peak-hour splitting

Given one driver's ordered work intervals for a day and a set of peak windows, compute the driver's total pay. A single interval may span zero, one, or several peak boundaries; each resulting sub-interval is paid at base rate or the peak-multiplied rate.

Part 2 — End-to-end streaming architecture at scale

Design the production system that runs Part 1 continuously over a high-volume event stream. Cover the publisher/consumer flow, how events become work intervals, where state and results live, and the read path for "what is driver X's pay so far today?"

Part 3 — Correctness under messy input

Handle at-least-once delivery (duplicates), out-of-order and late events, and malformed events — without corrupting daily totals.

Part 4 — Scale, failure recovery, and upstream safety

Address the parts that break under real production load:

Large data: a day's raw events exceed memory. How do you keep streaming state bounded, and how do you recompute history (backfill/reconciliation) when data is too large to sort in memory?
Failure recovery: a consumer or the service crashes mid-day. How do you resume without losing or double-counting pay?
Upstream APIs: if processing must call an upstream service in production, how do you protect both systems?

Clarifying Questions to Ask

What exactly is in an order event (fields, event types, schema versioning), and what defines a "work interval" — order accept→deliver, or active dash time?
How is pay calculated at base rate (per-minute, per-order, per-mile, or a blend), and is the peak multiplier applied to the whole pay or only the time-based component?
How are peak windows defined and distributed — static config, or a separate event stream — and can they change retroactively for a day already in progress?
What freshness does the read path need (sub-second "pay so far today" vs. an end-of-day finalized number), and who consumes it (driver app, payouts batch, support)?
What is the daily event volume and per-driver order rate, and what is the late-event distribution (how late can events realistically arrive)?
When a finalized daily total later proves wrong, is in-place correction acceptable, or must we emit immutable adjustments for audit?

What a Strong Answer Covers

A correct, well-reasoned Part 1 algorithm with an explicit half-open-boundary convention and the obvious edge cases handled.
A clear two-stage pipeline (lifecycle reconstruction → aggregation) with a justified partition-key choice per stage, not a hand-wave.
Idempotency tied to a concrete key, plus a deliberate event-time + watermark + grace-period story for late/out-of-order data.
A dead-letter / quarantine path for malformed events and invalid state transitions, kept off the hot aggregation path.
A bounded-memory streaming-state design plus a separate batch/replay path for backfill and reconciliation (streaming output vs. recomputed truth).
Crash recovery that is explicit about offset-commit vs. durable-state ordering, so replay neither loses nor double-counts pay.
Upstream-protection patterns (retry/backoff, circuit breaker, rate limit, backpressure) with a reason to avoid synchronous per-event calls.
Observability: the specific metrics that would catch a silent payout bug (consumer lag, dup rate, invalid-event rate, late-event volume, reconciliation deltas).
Honest trade-offs: streaming freshness vs. batch reproducibility, immutability/replay vs. in-place mutation.

Follow-up Questions

Peak windows for today are corrected at 5pm (a window's multiplier changes). How does the already-accumulated pay for affected drivers get fixed, and what does the user-visible number do in the meantime?
A bug shipped this morning over-counted peak pay for some drivers. Walk through detecting it, recomputing the correct totals from the durable log, and reconciling without paying anyone twice.
The driver app needs "pay so far today" with sub-second latency for millions of drivers. How does the serving/read path differ from the write path, and how do you keep it cheap?
A single hot driver (or hot order key) creates a partition skew that lags one consumer badly. How do you detect and mitigate it without breaking per-key ordering guarantees?

Design a production system that ingests a high-volume stream of delivery-order events and continuously computes each driver's (Dasher's) daily pay.

Constraints & Assumptions

Input: JSON order events published by upstream services. Each event carries identifiers and a timestamp; the exact schema is open and should be agreed with the interviewer.
Delivery semantics: at-least-once. Events may arrive duplicated , out of order , or late ; some may be malformed .
Scale: a full day's raw events for all drivers may not fit in memory on a single machine.
Availability: the system must keep producing correct daily totals across consumer crashes and service outages .
Peak windows: a set of [start, end) time ranges per day with an associated multiplier; intervals are half-open so boundaries don't double-count.
Correctness: pay numbers feed real payouts — auditability and reproducibility matter more than shaving the last millisecond of latency.

Part 1 — Core pay computation with peak-hour splitting

Part 2 — End-to-end streaming architecture at scale

Part 3 — Correctness under messy input

Handle at-least-once delivery (duplicates), out-of-order and late events, and malformed events — without corrupting daily totals.

Part 4 — Scale, failure recovery, and upstream safety

Address the parts that break under real production load:

Large data: a day's raw events exceed memory. How do you keep streaming state bounded, and how do you recompute history (backfill/reconciliation) when data is too large to sort in memory?
Failure recovery: a consumer or the service crashes mid-day. How do you resume without losing or double-counting pay?
Upstream APIs: if processing must call an upstream service in production, how do you protect both systems?

Clarifying Questions to Ask

What exactly is in an order event (fields, event types, schema versioning), and what defines a "work interval" — order accept→deliver, or active dash time?
How is pay calculated at base rate (per-minute, per-order, per-mile, or a blend), and is the peak multiplier applied to the whole pay or only the time-based component?
How are peak windows defined and distributed — static config, or a separate event stream — and can they change retroactively for a day already in progress?
What freshness does the read path need (sub-second "pay so far today" vs. an end-of-day finalized number), and who consumes it (driver app, payouts batch, support)?
What is the daily event volume and per-driver order rate, and what is the late-event distribution (how late can events realistically arrive)?
When a finalized daily total later proves wrong, is in-place correction acceptable, or must we emit immutable adjustments for audit?

What a Strong Answer Covers

A correct, well-reasoned Part 1 algorithm with an explicit half-open-boundary convention and the obvious edge cases handled.
A clear two-stage pipeline (lifecycle reconstruction → aggregation) with a justified partition-key choice per stage, not a hand-wave.
Idempotency tied to a concrete key, plus a deliberate event-time + watermark + grace-period story for late/out-of-order data.
A dead-letter / quarantine path for malformed events and invalid state transitions, kept off the hot aggregation path.
A bounded-memory streaming-state design plus a separate batch/replay path for backfill and reconciliation (streaming output vs. recomputed truth).
Crash recovery that is explicit about offset-commit vs. durable-state ordering, so replay neither loses nor double-counts pay.
Upstream-protection patterns (retry/backoff, circuit breaker, rate limit, backpressure) with a reason to avoid synchronous per-event calls.
Observability: the specific metrics that would catch a silent payout bug (consumer lag, dup rate, invalid-event rate, late-event volume, reconciliation deltas).
Honest trade-offs: streaming freshness vs. batch reproducibility, immutability/replay vs. in-place mutation.

Follow-up Questions

Peak windows for today are corrected at 5pm (a window's multiplier changes). How does the already-accumulated pay for affected drivers get fixed, and what does the user-visible number do in the meantime?
A bug shipped this morning over-counted peak pay for some drivers. Walk through detecting it, recomputing the correct totals from the durable log, and reconciling without paying anyone twice.
The driver app needs "pay so far today" with sub-second latency for millions of drivers. How does the serving/read path differ from the write path, and how do you keep it cheap?
A single hot driver (or hot order key) creates a partition skew that lags one consumer badly. How do you detect and mitigate it without breaking per-key ordering guarantees?

Design Real-Time Driver Pay Aggregation

Quick Overview

Constraints & Assumptions

Part 1 — Core pay computation with peak-hour splitting

Part 2 — End-to-end streaming architecture at scale

Part 3 — Correctness under messy input

Part 4 — Scale, failure recovery, and upstream safety

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design Real-Time Driver Pay Aggregation

Quick Overview

Constraints & Assumptions

Part 1 — Core pay computation with peak-hour splitting

Part 2 — End-to-end streaming architecture at scale

Part 3 — Correctness under messy input

Part 4 — Scale, failure recovery, and upstream safety

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP