
Design a Book Price Aggregator

Last updated: Jun 21, 2026

Quick Overview

This question evaluates skills in distributed systems design, fault tolerance, scalability, integration with external services, transactional consistency across asynchronous operations, and reliability patterns such as caching, request coalescing, and failure isolation.

  • medium
  • Databricks
  • System Design
  • Software Engineer

Company: Databricks

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a **book purchasing marketplace** where your service acts as an intermediary between customers and **hundreds of partner bookstores**. Your service holds no inventory of its own; for every request it queries the partner stores live, places the order on whichever store wins, and runs the customer's payment itself.

A customer submits a single purchase request containing:

- An **ISBN** (the book to buy)
- A **bid price** — the maximum price the customer is willing to pay
- A **payment method**

On each request, your system must **fan out asynchronous requests to hundreds of partner bookstores** to discover which stores have the book in stock and at what price, then apply the following rules:

1. If at least one bookstore has inventory and the **lowest available price is $\le$ the customer's bid price**, place an order with that bookstore (and charge the customer).
2. If bookstores have inventory but the **lowest available price is higher than the bid price**, return that lowest price to the customer (no order placed).
3. If **no bookstore has inventory**, notify the customer that the book is unavailable.

Walk through the architecture end to end, then go deep on the parts below. Treat each part as a distinct design concern; a strong session covers the read/aggregation path, the failure model, the scaling levers, and the money-correctness path with roughly equal rigor.

### Constraints & Assumptions

State your own numbers explicitly and design against them; the interviewer cares more about *consistent* reasoning than any specific figure. Reasonable anchors to assume unless told otherwise:

- **Scale:** hundreds of partner stores (assume ~500); a peak on the order of hundreds-to-thousands of purchase requests/sec. Note that naive fan-out multiplies these — a request rate times a store count of outbound partner calls is the load you must drive down.
- **Latency:** a customer cannot wait on the slowest of 500 stores. Assume a hard global deadline for a synchronous answer (state the target you pick, e.g. on the order of ~1 s) and design to act on partial results.
- **Workload skew:** ISBN popularity is heavily skewed (a small set of bestsellers dominates traffic) — call out whether you exploit this.
- **Partner reality:** individual stores will time out, return 5xx, rate-limit you, or return malformed data; price and stock change over time; some stores support idempotency keys on orders and some do not.
- **Money:** placing an order and charging the customer are **two separate external side effects** with **no shared transaction** across your DB, the payment provider, and the bookstore.
- **Quantity:** assume one book per request (qty 1) unless you call out multi-line carts as an extension.

### Clarifying Questions to Ask

A strong candidate scopes the problem before drawing boxes. Good questions to raise with the interviewer:

- Is the customer answer **synchronous** (block ~1 s for a result) or **asynchronous** (accept now, notify later)? How does the API shape change if both must be supported?
- Is the quoted "lowest price" a **hard contract** (we are bound to honor whatever we quote) or **best-effort**? This decides how fresh quotes must be and whether price is re-confirmed at order time.
- What is the acceptable **global latency budget**, and is it acceptable to answer from a *partial* search of stores when not all respond in time?
- What are our **contractual rate limits / quotas** with partner stores (i.e. how hard can we hammer them before we get banned)?
- Is it ever required to **reserve** scarce inventory before payment, or is querying live price/stock at decision time sufficient?
- Single book per request, or do we need multi-item carts?

---

### Part 1 — End-to-end architecture & API

Lay out the major components and the request lifecycle from "customer submits a request" to "terminal status returned." Define the request/response contract, including how the customer learns the outcome (`ORDER_PLACED` / price-too-high / unavailable / pending) and how a client retry is handled.

```hint Where to start
Separate the system into two paths with very different requirements: a cheap, idempotent, latency-bounded **read/aggregation path** (getting a stale quote is recoverable) and an expensive **write/money path** with real side effects (getting it wrong double-charges a customer). Keeping them separate lets you be aggressive on read latency and conservative on write correctness.
```

```hint API contract
Think about a `POST` that creates a durable "purchase intent," plus a way for the client to poll or be notified. What should the synchronous response carry so the UI doesn't *lie* about how many stores were actually searched (consider a coverage/metadata field)? How does a client-supplied **idempotency key** on the `POST` prevent a retry from starting a second workflow?
```

#### What This Part Should Cover

- **Component inventory & lifecycle:** the major services named (gateway/intent → quote orchestrator → partner adapters → purchase workflow → notification) and a coherent request flow through them, not a flat box-and-arrow dump.
- **Two-path separation surfaced at the API:** the read/aggregation path and the write/money path appear as distinct concerns in both the components and the contract.
- **Contract completeness:** the request/response carries every terminal status, a re-bid price for the price-too-high case, and a coverage/metadata field so the UI can't overstate search completeness.
- **Retry semantics:** a client-supplied idempotency key collapses a re-POST onto the same intent rather than starting a second workflow; sync-vs-async outcome delivery (poll/notify) is addressed.

### Part 2 — Fan-out, deadlines & result aggregation

Describe how you query many unreliable stores in parallel and produce a decision under a hard global deadline. How do per-call timeouts relate to the global deadline? When do you stop collecting responses? How do you aggregate the responses into the order / price-too-high / unavailable decision, and how do you avoid claiming "unavailable" when stores simply didn't answer in time?

```hint Deadline math
Make each **per-partner timeout strictly shorter than the global deadline** (e.g. partner timeout a few hundred ms vs. a ~1 s global deadline) so a single slow store can never consume the whole budget. Collect until the first of: deadline reached, all candidates returned, or an early-stop condition.
```

```hint Aggregation
Think about what state you must keep as responses trickle in to turn many `{in_stock, price}` quotes into one decision, and what coverage counters (queried / responded / timed-out / failed) you need alongside it. Then reason carefully about *when* you're allowed to stop: does "lowest price wins" let you act on the first acceptable price, or does that risk missing a cheaper store still in flight? Make the early-stop-vs-wait tradeoff explicit.
```

```hint Partial-result honesty
Carry a **coverage** indicator (e.g. "lowest among 27 of 32 stores") so "unavailable" can be qualified as "unavailable among stores that answered in time," and so background jobs can keep searching without overriding a decision already returned to the user.
```

#### What This Part Should Cover

- **Deadline discipline:** per-call timeouts strictly inside the global deadline, with explicit reasoning for why one slow store can never burn the whole budget.
- **Stop condition:** a clear rule for when to stop collecting (deadline / all returned / justified early-stop), and an explicit early-stop-vs-wait tradeoff for "true minimum price."
- **Aggregation state:** a running-min over valid, in-stock quotes plus coverage counters, and the mapping from aggregate outcome to the order / price-too-high / unavailable decision.
- **Partial-result honesty:** "unavailable" qualified as "unavailable among responders," so incomplete coverage is never reported as a definitive no.

### Part 3 — Tolerating downstream failures (timeouts, circuit breakers, bulkheads)

With hundreds of partners, several are unhealthy at any moment. Detail your layered defenses: per-call timeouts, when (if ever) to retry, and how circuit breakers and bulkheads keep one bad partner from degrading the whole request or exhausting your resources.

```hint Circuit breaker
Track per-partner timeout rate, error rate, 429s, and latency percentiles. Consider the standard **closed → open → half-open** lifecycle: when a partner is *open*, skip it instantly (treat as "no data") instead of paying a timeout on every call — this protects both your latency budget and the struggling partner.
```

```hint Isolation
Use **bulkheads** — per-partner (or per-group) worker/connection pools — so a hanging store fills only its own pool and can't starve the threads needed to call healthy stores. Consider adaptive concurrency / backpressure (e.g. AIMD-style) when a partner degrades.
```

```hint Retries
Only retry **idempotent, transient** quote reads, at most once with jitter, and only if the latency budget allows — never blind-retry into a store that is already slow.
```

#### What This Part Should Cover

- **Layered defenses:** per-call timeout, breaker, and bulkhead presented as distinct layers (call → partner → whole-system), not a single mechanism.
- **Breaker semantics:** the closed → open → half-open lifecycle, the signals that trip it (timeout/error/429/latency), and the win of skipping an open partner instantly instead of eating a timeout.
- **Isolation & backpressure:** per-partner pools so a hang is contained, plus adaptive concurrency / graceful degradation when a partner or internal queue saturates.
- **Disciplined retries:** retries confined to idempotent, transient quote reads, bounded and jittered, never blind-retrying into an already-slow store.

### Part 4 — Scaling: caching, TTLs, coalescing & thundering-herd protection

The naive fan-out (requests × stores) is not survivable and would get you rate-limited or banned. Explain how you drive that number down. Cover your cache key and TTL policy (including negative results and partner errors), request coalescing, thundering-herd protection on cache expiry, and how you decide *which* stores to query.

```hint Biggest lever
**Partner selection** — don't query all 500. Rank candidates per ISBN by historical availability, price competitiveness, latency/reliability, geography, and current breaker state; query the top ~N and expand only if needed (bounded by the deadline). This alone can cut fan-out by an order of magnitude.
```

```hint Cache + coalescing
Key the cache on `(ISBN, partner_id)` with short TTLs (and special-case **negative/out-of-stock** results with very short TTLs). Use **request coalescing / single-flight**: the first request for an ISBN becomes the leader that runs the fan-out; concurrent requests attach to the in-flight search and share its result.
```

```hint Thundering herd
When a hot ISBN's entry expires, prevent a stampede with single-flight on the refresh, **stale-while-revalidate** (serve stale while refreshing in the background), and **TTL jitter** (randomize expirations so a batch doesn't all expire on the same tick). Exploit the heavy ISBN skew — even a short TTL on the hot head removes most calls.
```

#### What This Part Should Cover

- **Quantified problem framing:** the requests × stores fan-out stated as a number, then driven down with named levers (the candidate should know which lever dominates).
- **Cache & TTL policy:** an `(ISBN, partner_id)` key, freshness-vs-load TTL reasoning, and explicit special-casing of negative results and partner-error signals.
- **Coalescing & herd defenses:** single-flight on concurrent same-ISBN requests, plus stale-while-revalidate and TTL jitter to survive hot-key expiry.
- **Partner selection & quota respect:** ranking candidates per ISBN instead of querying all 500, and multi-layer rate limiting that honors partner contractual quotas.

### Part 5 — Money correctness: order-vs-charge consistency, crash recovery & duplicate prevention

This is the hardest part. Placing the order (call to the store) and charging (call to the payment provider) are two independent external side effects with no shared transaction. Decide whether to **order first or charge first**, justify it, and explain how you recover from a mid-flight crash and prevent **duplicate charges or duplicate orders**.

```hint Model the workflow
There's no single transaction spanning your DB, the payment provider, and the bookstore — so think about what structure lets a worker that dies between any two external calls wake up and know exactly where it was. What would you have to persist, and *when* relative to each external call, for recovery to be unambiguous? And what property must each external step have so that re-running it after a crash is safe?
```

```hint Ordering of money operations
Don't reach for a single all-or-nothing charge. Enumerate the failure state you most want to make impossible (the customer's money is gone but no book is coming) and walk each ordering of the money and order steps against it: which sequence makes that state unreachable rather than merely rare? Many payment providers expose a two-step "hold then settle" primitive — consider how splitting the charge around the order step changes which failures are recoverable, and which failures release funds cleanly versus needing a slow reversal.
```

```hint Duplicate prevention
Duplicates come from two sources: clients/queues/workers retrying, and crashes landing between "did the side effect" and "recorded that I did it." Address both. For retries, think about what makes a re-sent external call collapse onto the original instead of creating a second one — and what your DB schema can guarantee so two concurrent workers can't both advance the same step. Then consider the harder case: a partner store that offers no way to dedupe an order. With no exactly-once call available, how do you find out whether your earlier attempt actually landed before you act again?
```

#### What This Part Should Cover

- **Workflow durability:** the purchase modeled as a persisted saga / state machine, with each transition written before the next external call so a dead worker resumes from a known state.
- **Justified ordering:** an explicit order-vs-charge decision that makes "money gone, no book" unreachable by construction — typically authorize → order → capture, with charge-first and order-first rejected for stated reasons.
- **Idempotency & DB invariants:** deterministic idempotency keys on each external call plus DB uniqueness / compare-and-swap transitions so retries and concurrent workers can't double-charge or double-order.
- **Crash recovery & the non-idempotent store:** an outbox + reconciliation sweeper, and a concrete plan to discover whether an order landed (lookup by reference id) when the store offers no idempotency key.

### What a Strong Answer Covers

These dimensions span all parts — the interviewer is listening for them across the whole session, not inside any single Part:

- **Requirements & scope:** functional rules restated cleanly, plus explicit non-functional targets (latency budget, fault tolerance, money correctness, partner protection) and a rough capacity estimate that frames the fan-out problem.
- **Clean decomposition:** a deliberate split between a latency-bounded, cache-heavy **read/aggregation path** and a strictly-correct **write/money path**, sustained consistently across every part.
- **Tradeoffs named out loud:** what strains first, and the freshness-vs-load and latency-vs-completeness dials, with the candidate choosing a point rather than hand-waving.
- **Observability:** the metrics and dashboards you would page on (fan-out size, cache/coalescing hit rate, per-partner health, breaker transitions, % partial results, payment/order success rates, stuck-intent reconciliation).

### Follow-up Questions

- How does the design change at **100x request volume**, or when the partner count grows from hundreds to tens of thousands of stores — what breaks first, and which lever do you pull?
- If the quoted price becomes a **hard contractual obligation** (you must honor whatever you quote), how do caching and the order step change?
- A partner store does **not** support order idempotency keys and your worker crashes after dispatching an order but before recording it — how do you avoid a duplicate order without a guaranteed-exactly-once call?
- Suppose inventory for some titles is **genuinely scarce** and must be reserved before payment — how does the consistency model and the order-vs-charge decision change?

May 6, 2026, 12:00 AM

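
The three decision rules in the problem statement reduce to a small pure function over whatever quotes came back. The following is a minimal sketch, not a prescribed implementation: the `Quote` and `Decision` records, the status strings, and the choice of integer cents are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Quote:
    # Hypothetical shape of one partner response.
    store_id: str
    in_stock: bool
    price_cents: int


@dataclass(frozen=True)
class Decision:
    # Hypothetical action names: "PLACE_ORDER", "PRICE_TOO_HIGH", "UNAVAILABLE".
    action: str
    store_id: Optional[str] = None
    lowest_price_cents: Optional[int] = None


def decide(quotes: Sequence[Quote], bid_cents: int) -> Decision:
    """Apply rules 1-3 to the quotes that were collected."""
    available = [q for q in quotes if q.in_stock]
    if not available:
        # Rule 3: nobody who answered has the book.
        return Decision("UNAVAILABLE")
    best = min(available, key=lambda q: q.price_cents)
    if best.price_cents <= bid_cents:
        # Rule 1: lowest price is within the bid, so order from that store.
        return Decision("PLACE_ORDER", best.store_id, best.price_cents)
    # Rule 2: report the lowest price so the customer can re-bid.
    return Decision("PRICE_TOO_HIGH", None, best.price_cents)
```

Keeping this function free of I/O makes it easy to unit-test separately from the fan-out and payment code.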

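
To make the fan-out multiplication in the Scale assumption concrete, the arithmetic can be written out. In this sketch, 1,000 requests/sec and 500 stores come from the stated anchors; the 30-store shortlist and the 90% cache hit rate are illustrative assumptions, not figures from the question.

```python
def outbound_calls_per_sec(requests_per_sec: int, stores_per_request: int,
                           cache_hit_pct: int = 0) -> int:
    """Partner calls/sec that survive caching; integer percent keeps the math exact."""
    if not 0 <= cache_hit_pct <= 100:
        raise ValueError("cache_hit_pct must be between 0 and 100")
    return requests_per_sec * stores_per_request * (100 - cache_hit_pct) // 100


# Naive: every request queries every store.
naive = outbound_calls_per_sec(1_000, 500)                     # 500000
# Assumed levers: query a ranked top 30, and 90% of lookups hit the cache.
reduced = outbound_calls_per_sec(1_000, 30, cache_hit_pct=90)  # 3000
```

Under these assumed numbers, partner selection and caching together cut outbound load by more than two orders of magnitude.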

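
For Part 1, one possible shape for the purchase intent and its retry handling is sketched below. Everything here is hypothetical: the field names, the `IntentStore` class, and the in-memory dict, which stands in for a durable table with a unique index on the idempotency key.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Coverage:
    # Lets the UI say "lowest among N of M stores" instead of overstating the search.
    queried: int = 0
    responded: int = 0
    timed_out: int = 0
    failed: int = 0


@dataclass
class PurchaseIntent:
    intent_id: str
    isbn: str
    bid_cents: int
    status: str = "PENDING"  # PENDING | ORDER_PLACED | PRICE_TOO_HIGH | UNAVAILABLE
    lowest_price_cents: Optional[int] = None  # re-bid price for PRICE_TOO_HIGH
    coverage: Coverage = field(default_factory=Coverage)


class IntentStore:
    """In-memory stand-in for a durable intents table (assumption)."""

    def __init__(self) -> None:
        self._by_key: Dict[str, PurchaseIntent] = {}

    def create_or_get(self, idempotency_key: str, isbn: str,
                      bid_cents: int) -> Tuple[PurchaseIntent, bool]:
        """Return (intent, created). A retried POST maps onto the original intent."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            if (existing.isbn, existing.bid_cents) != (isbn, bid_cents):
                # Same key with a different payload is a client bug, not a retry.
                raise ValueError("idempotency key reused with a different request")
            return existing, False
        intent = PurchaseIntent(str(uuid.uuid4()), isbn, bid_cents)
        self._by_key[idempotency_key] = intent
        return intent, True
```

A handler built on this would start the workflow only when `created` is true and otherwise return the existing intent's current status.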

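
For Part 2, the deadline and aggregation logic can be sketched with `asyncio`. This is one possible arrangement under stated assumptions: partners are modelled as hypothetical zero-argument async callables returning `(in_stock, price_cents)`, there is no early stop, and the `UNAVAILABLE_AMONG_RESPONDERS` label is invented here to illustrate partial-result honesty.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional, Tuple

Quote = Tuple[bool, int]  # (in_stock, price_cents); hypothetical partner response


@dataclass
class Aggregate:
    queried: int = 0
    responded: int = 0
    timed_out: int = 0
    failed: int = 0
    best_store: Optional[str] = None
    best_price_cents: Optional[int] = None


async def fan_out(partners: Dict[str, Callable[[], Awaitable[Quote]]],
                  per_call_timeout: float, global_deadline: float) -> Aggregate:
    agg = Aggregate(queried=len(partners))

    async def call(store_id: str, fetch: Callable[[], Awaitable[Quote]]) -> None:
        try:
            # Per-call timeout is strictly shorter than the global deadline.
            in_stock, price = await asyncio.wait_for(fetch(), per_call_timeout)
        except asyncio.TimeoutError:
            agg.timed_out += 1
            return
        except Exception:
            agg.failed += 1  # 5xx, malformed payload, etc.
            return
        agg.responded += 1
        # Running minimum over valid, in-stock quotes.
        if in_stock and (agg.best_price_cents is None or price < agg.best_price_cents):
            agg.best_store, agg.best_price_cents = store_id, price

    tasks = [asyncio.create_task(call(s, f)) for s, f in partners.items()]
    if not tasks:
        return agg
    # The global deadline is a backstop; anything still pending is abandoned.
    _, pending = await asyncio.wait(tasks, timeout=global_deadline)
    for task in pending:
        task.cancel()
        agg.timed_out += 1
    await asyncio.gather(*pending, return_exceptions=True)
    return agg


def outcome(agg: Aggregate, bid_cents: int) -> str:
    if agg.best_price_cents is None:
        complete = agg.responded == agg.queried
        return "UNAVAILABLE" if complete else "UNAVAILABLE_AMONG_RESPONDERS"
    return "PLACE_ORDER" if agg.best_price_cents <= bid_cents else "PRICE_TOO_HIGH"
```

Waiting for all responders before choosing the minimum trades latency for a true lowest price; an early-stop variant would return sooner but could miss a cheaper store still in flight.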

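
For Part 3, the closed → open → half-open lifecycle can be sketched as a small per-partner object. This is a deliberately simplified illustration: it trips on consecutive failures only, whereas the question also lists 429s and latency percentiles as signals. The threshold, cooldown, and injectable `clock` parameter are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Per-partner breaker: closed -> open -> half-open -> closed (or open again)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        """False means: skip this partner instantly and treat it as 'no data'."""
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and not self._probe_in_flight:
            self._probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        self._failures += 1
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            # Trip the breaker, or re-open after a failed probe, and restart the cooldown.
            self._opened_at = self._clock()
```

Passing the clock in keeps the state transitions testable without sleeping; a production version would also need thread or task safety, which this sketch omits.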

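
For Part 4, request coalescing and TTL jitter can each be sketched in a few lines. The `SingleFlight` class and `jittered_ttl` helper below are hypothetical names, and the sketch covers a single process only; coalescing across multiple service instances would need a shared lock or lease, which is not shown.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Dict


class SingleFlight:
    """Concurrent callers with the same key share one in-flight call."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Task[Any]"] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # First caller becomes the leader and starts the real work.
            task = asyncio.create_task(fn())
            self._in_flight[key] = task
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one caller's cancellation from killing the shared task.
        return await asyncio.shield(task)


def jittered_ttl(base_s: float, spread: float = 0.2,
                 rng: Callable[[], float] = random.random) -> float:
    """Randomize a TTL within +/- spread so hot keys do not expire on the same tick."""
    return base_s * (1.0 + spread * (2.0 * rng() - 1.0))
```

With the key set to the ISBN, fifty simultaneous requests for one bestseller would trigger a single partner fan-out, and each cached `(ISBN, partner_id)` entry would be stored with its own `jittered_ttl(...)` value.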

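
For Part 5, the two database-level ideas (compare-and-swap transitions and deterministic idempotency keys) can be sketched with SQLite standing in for the real store. The table layout, state names, and helper functions are assumptions for illustration; the authorize → order → capture sequence follows the ordering the question describes as typical.

```python
import hashlib
import sqlite3
from typing import Optional

# Hypothetical persisted states, each named for the last completed step.
NEXT_ACTION = {
    "CREATED": "authorize_payment",
    "AUTHORIZED": "place_order",
    "ORDERED": "capture_payment",
}


def idempotency_key(intent_id: str, step: str) -> str:
    """Same intent and step always yield the same key, so a re-sent call can dedupe."""
    return hashlib.sha256(f"{intent_id}:{step}".encode()).hexdigest()[:32]


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE intents (intent_id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    return db


def advance(db: sqlite3.Connection, intent_id: str, expected: str, new: str) -> bool:
    """Compare-and-swap: succeeds only if the intent is still in `expected`."""
    cur = db.execute(
        "UPDATE intents SET state = ? WHERE intent_id = ? AND state = ?",
        (new, intent_id, expected))
    db.commit()
    return cur.rowcount == 1


def next_action(state: str) -> Optional[str]:
    """What a recovering worker should (re)run for an intent found in `state`."""
    return NEXT_ACTION.get(state)
```

A recovering worker would read the persisted state, re-run `next_action(state)` with the same `idempotency_key`, and then call `advance`. If two workers race, only one `advance` returns true. For a store that accepts no idempotency key, the re-run step would first have to look the order up by reference id, which this sketch does not model.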

