PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/DoorDash

Design Real-Time Driver Pay Aggregation

Last updated: Jun 17, 2026

Quick Overview

This question evaluates competence in designing scalable, fault-tolerant streaming systems and implementing correct time-windowed aggregation logic, testing skills in distributed systems, stream processing, and data correctness; it belongs to the system design domain and streaming/real-time data engineering and spans both conceptual architectural thinking and practical application-level concerns. It is commonly asked to probe reasoning about correctness under at-least-once delivery and messy inputs (out-of-order, duplicate, late, or malformed events), operational resilience, and trade-offs between throughput, latency, and auditability in production-scale pipelines.

  • hard
  • DoorDash
  • System Design
  • Software Engineer

Design Real-Time Driver Pay Aggregation

Company: DoorDash

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a production system that ingests a high-volume stream of delivery-order events and continuously computes each driver's (Dasher's) **daily pay**. The core pay logic is simple in isolation: given a driver's sequence of order events for a day, sum the pay over the time they were actively working. The twist is a **peak-period multiplier** — during designated peak windows, the pay rate is multiplied (e.g., $\times 2$), so any work interval that straddles a peak boundary must be **split at the boundary** and each sub-interval paid at its correct rate. The interview starts from this computation and deliberately expands into a full streaming/system-design discussion: how do you run this correctly and reliably at DoorDash scale, with messy input, huge volume, and real production failure modes? ### Constraints & Assumptions - **Input:** JSON order events published by upstream services. Each event carries identifiers and a timestamp; the exact schema is open and should be agreed with the interviewer. - **Delivery semantics:** at-least-once. Events may arrive **duplicated**, **out of order**, or **late**; some may be **malformed**. - **Scale:** a full day's raw events for all drivers may **not fit in memory** on a single machine. - **Availability:** the system must keep producing correct daily totals across **consumer crashes and service outages**. - **Peak windows:** a set of `[start, end)` time ranges per day with an associated multiplier; intervals are half-open so boundaries don't double-count. - **Correctness:** pay numbers feed real payouts — auditability and reproducibility matter more than shaving the last millisecond of latency. ### Part 1 — Core pay computation with peak-hour splitting Given one driver's ordered work intervals for a day and a set of peak windows, compute the driver's total pay. A single interval may span zero, one, or several peak boundaries; each resulting sub-interval is paid at base rate or the peak-multiplied rate. ```hint Where to start Reduce both the work intervals and the peak windows to points on a single timeline. Think about a **sweep** over sorted boundary points, or **interval intersection** between each work interval and each peak window. ``` ```hint Splitting cleanly Literally slicing each work interval at every peak boundary it crosses gets fiddly fast — multiple crossings, and double-counting at half-open boundaries. Before you write that, ask what the pay for one interval actually *depends on*. Is there a quantity you can measure per interval that captures the whole peak effect, so the boundary bookkeeping drops out? ``` ```hint Edge cases Zero-length intervals, intervals fully inside or fully outside a peak window, back-to-back peak windows, and an interval that exactly touches a boundary. Decide the half-open convention up front and apply it everywhere. ``` ### Part 2 — End-to-end streaming architecture at scale Design the production system that runs Part 1 continuously over a high-volume event stream. Cover the publisher/consumer flow, how events become work intervals, where state and results live, and the read path for "what is driver X's pay so far today?" ```hint Pipeline shape Separate **lifecycle reconstruction** (turning raw order events into completed, paid intervals) from **aggregation** (summing intervals into a per-driver, per-day total). The two stages often want different partition keys. ``` ```hint Partitioning Partition by `order_id` to assemble an order's lifecycle on one consumer; partition by `driver_id` to keep a driver's running daily total local and lock-free. Name the trade-off rather than picking blindly. ``` ### Part 3 — Correctness under messy input Handle at-least-once delivery (duplicates), out-of-order and late events, and malformed events — without corrupting daily totals. ```hint Idempotency Make every aggregate update **idempotent** using a stable key (an `event_id` or a deterministic composite). Dedup *before* the number moves, not after. ``` ```hint Late data Event-time processing with **watermarks** and a bounded grace period lets you admit late events correctly. Decide what happens to events that arrive after the window finalizes — silent drop, or an explicit **adjustment/correction** record. ``` ### Part 4 — Scale, failure recovery, and upstream safety Address the parts that break under real production load: - **Large data:** a day's raw events exceed memory. How do you keep streaming state bounded, and how do you recompute history (backfill/reconciliation) when data is too large to sort in memory? - **Failure recovery:** a consumer or the service crashes mid-day. How do you resume without losing or double-counting pay? - **Upstream APIs:** if processing must call an upstream service in production, how do you protect both systems? ```hint Bounded state vs. replay Keep only the **active keyed state** in memory; archive raw events to durable object storage so any day can be **replayed**. For oversized recomputation, reach for **external sort** or a distributed sort/batch job keyed by `(driver, day)` and event time. ``` ```hint Don't double-count on recovery The relationship between **offset commits** and **durable state** is what makes replay safe. Combine checkpointing with idempotent writes so re-reading the log after a crash converges to the same totals. ``` ```hint Protecting upstream calls The standard production toolkit: **retries with backoff + jitter**, **circuit breakers**, **rate limiting**, **bounded queues / backpressure**. Prefer cached reference data or async enrichment over a synchronous upstream call per event. ``` ### Clarifying Questions to Ask - What exactly is in an order event (fields, event types, schema versioning), and what defines a "work interval" — order accept→deliver, or active dash time? - How is pay calculated at base rate (per-minute, per-order, per-mile, or a blend), and is the peak multiplier applied to the whole pay or only the time-based component? - How are peak windows defined and distributed — static config, or a separate event stream — and can they change retroactively for a day already in progress? - What freshness does the read path need (sub-second "pay so far today" vs. an end-of-day finalized number), and who consumes it (driver app, payouts batch, support)? - What is the daily event volume and per-driver order rate, and what is the late-event distribution (how late can events realistically arrive)? - When a finalized daily total later proves wrong, is in-place correction acceptable, or must we emit immutable adjustments for audit? ### What a Strong Answer Covers - A correct, well-reasoned Part 1 algorithm with an explicit half-open-boundary convention and the obvious edge cases handled. - A clear two-stage pipeline (lifecycle reconstruction → aggregation) with a **justified** partition-key choice per stage, not a hand-wave. - Idempotency tied to a concrete key, plus a deliberate event-time + watermark + grace-period story for late/out-of-order data. - A dead-letter / quarantine path for malformed events and invalid state transitions, kept off the hot aggregation path. - A bounded-memory streaming-state design plus a separate batch/replay path for backfill and **reconciliation** (streaming output vs. recomputed truth). - Crash recovery that is explicit about offset-commit vs. durable-state ordering, so replay neither loses nor double-counts pay. - Upstream-protection patterns (retry/backoff, circuit breaker, rate limit, backpressure) with a reason to avoid synchronous per-event calls. - Observability: the specific metrics that would catch a silent payout bug (consumer lag, dup rate, invalid-event rate, late-event volume, reconciliation deltas). - Honest trade-offs: streaming freshness vs. batch reproducibility, immutability/replay vs. in-place mutation. ### Follow-up Questions - Peak windows for today are corrected at 5pm (a window's multiplier changes). How does the already-accumulated pay for affected drivers get fixed, and what does the user-visible number do in the meantime? - A bug shipped this morning over-counted peak pay for some drivers. Walk through detecting it, recomputing the correct totals from the durable log, and reconciling without paying anyone twice. - The driver app needs "pay so far today" with sub-second latency for millions of drivers. How does the serving/read path differ from the write path, and how do you keep it cheap? - A single hot driver (or hot order key) creates a partition skew that lags one consumer badly. How do you detect and mitigate it without breaking per-key ordering guarantees?

Quick Answer: This question evaluates competence in designing scalable, fault-tolerant streaming systems and implementing correct time-windowed aggregation logic, testing skills in distributed systems, stream processing, and data correctness; it belongs to the system design domain and streaming/real-time data engineering and spans both conceptual architectural thinking and practical application-level concerns. It is commonly asked to probe reasoning about correctness under at-least-once delivery and messy inputs (out-of-order, duplicate, late, or malformed events), operational resilience, and trade-offs between throughput, latency, and auditability in production-scale pipelines.

Related Interview Questions

  • Design a Food Rating System - DoorDash (medium)
  • Design a resilient bootstrap API - DoorDash (medium)
  • Design Food Ratings and Driver Payouts - DoorDash (medium)
  • Design personalized restaurant search and recommendations - DoorDash (medium)
  • Design a Customer Review Page - DoorDash (medium)
DoorDash logo
DoorDash
Feb 6, 2026, 12:00 AM
Software Engineer
Technical Screen
System Design
22
0

Design a production system that ingests a high-volume stream of delivery-order events and continuously computes each driver's (Dasher's) daily pay.

The core pay logic is simple in isolation: given a driver's sequence of order events for a day, sum the pay over the time they were actively working. The twist is a peak-period multiplier — during designated peak windows, the pay rate is multiplied (e.g., ×2\times 2×2), so any work interval that straddles a peak boundary must be split at the boundary and each sub-interval paid at its correct rate.

The interview starts from this computation and deliberately expands into a full streaming/system-design discussion: how do you run this correctly and reliably at DoorDash scale, with messy input, huge volume, and real production failure modes?

Constraints & Assumptions

  • Input: JSON order events published by upstream services. Each event carries identifiers and a timestamp; the exact schema is open and should be agreed with the interviewer.
  • Delivery semantics: at-least-once. Events may arrive duplicated , out of order , or late ; some may be malformed .
  • Scale: a full day's raw events for all drivers may not fit in memory on a single machine.
  • Availability: the system must keep producing correct daily totals across consumer crashes and service outages .
  • Peak windows: a set of [start, end) time ranges per day with an associated multiplier; intervals are half-open so boundaries don't double-count.
  • Correctness: pay numbers feed real payouts — auditability and reproducibility matter more than shaving the last millisecond of latency.

Part 1 — Core pay computation with peak-hour splitting

Given one driver's ordered work intervals for a day and a set of peak windows, compute the driver's total pay. A single interval may span zero, one, or several peak boundaries; each resulting sub-interval is paid at base rate or the peak-multiplied rate.

Part 2 — End-to-end streaming architecture at scale

Design the production system that runs Part 1 continuously over a high-volume event stream. Cover the publisher/consumer flow, how events become work intervals, where state and results live, and the read path for "what is driver X's pay so far today?"

Part 3 — Correctness under messy input

Handle at-least-once delivery (duplicates), out-of-order and late events, and malformed events — without corrupting daily totals.

Part 4 — Scale, failure recovery, and upstream safety

Address the parts that break under real production load:

  • Large data: a day's raw events exceed memory. How do you keep streaming state bounded, and how do you recompute history (backfill/reconciliation) when data is too large to sort in memory?
  • Failure recovery: a consumer or the service crashes mid-day. How do you resume without losing or double-counting pay?
  • Upstream APIs: if processing must call an upstream service in production, how do you protect both systems?

Clarifying Questions to Ask

  • What exactly is in an order event (fields, event types, schema versioning), and what defines a "work interval" — order accept→deliver, or active dash time?
  • How is pay calculated at base rate (per-minute, per-order, per-mile, or a blend), and is the peak multiplier applied to the whole pay or only the time-based component?
  • How are peak windows defined and distributed — static config, or a separate event stream — and can they change retroactively for a day already in progress?
  • What freshness does the read path need (sub-second "pay so far today" vs. an end-of-day finalized number), and who consumes it (driver app, payouts batch, support)?
  • What is the daily event volume and per-driver order rate, and what is the late-event distribution (how late can events realistically arrive)?
  • When a finalized daily total later proves wrong, is in-place correction acceptable, or must we emit immutable adjustments for audit?

What a Strong Answer Covers

  • A correct, well-reasoned Part 1 algorithm with an explicit half-open-boundary convention and the obvious edge cases handled.
  • A clear two-stage pipeline (lifecycle reconstruction → aggregation) with a justified partition-key choice per stage, not a hand-wave.
  • Idempotency tied to a concrete key, plus a deliberate event-time + watermark + grace-period story for late/out-of-order data.
  • A dead-letter / quarantine path for malformed events and invalid state transitions, kept off the hot aggregation path.
  • A bounded-memory streaming-state design plus a separate batch/replay path for backfill and reconciliation (streaming output vs. recomputed truth).
  • Crash recovery that is explicit about offset-commit vs. durable-state ordering, so replay neither loses nor double-counts pay.
  • Upstream-protection patterns (retry/backoff, circuit breaker, rate limit, backpressure) with a reason to avoid synchronous per-event calls.
  • Observability: the specific metrics that would catch a silent payout bug (consumer lag, dup rate, invalid-event rate, late-event volume, reconciliation deltas).
  • Honest trade-offs: streaming freshness vs. batch reproducibility, immutability/replay vs. in-place mutation.

Follow-up Questions

  • Peak windows for today are corrected at 5pm (a window's multiplier changes). How does the already-accumulated pay for affected drivers get fixed, and what does the user-visible number do in the meantime?
  • A bug shipped this morning over-counted peak pay for some drivers. Walk through detecting it, recomputing the correct totals from the durable log, and reconciling without paying anyone twice.
  • The driver app needs "pay so far today" with sub-second latency for millions of drivers. How does the serving/read path differ from the write path, and how do you keep it cheap?
  • A single hot driver (or hot order key) creates a partition skew that lags one consumer badly. How do you detect and mitigate it without breaking per-key ordering guarantees?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More DoorDash•More Software Engineer•DoorDash Software Engineer•DoorDash System Design•Software Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.