Design Ad Frequency and Order Tracking
Company: Netflix
Role: Software Engineer
Category: System Design
Interview Round: Onsite
Design two components of an ads platform. This is a two-part system-design question: **Part 1** is a real-time frequency-capping service used during ad decisioning, and **Part 2** is the data model and tracking flow for a *direct-sold* demand-side order. Treat them as related but independent deliverables.
### Constraints & Assumptions
These apply across both parts unless a bullet is marked for one part; further part-specific constraints are noted under each part.
- Treat this as Netflix-scale ad serving: tens of thousands of eligibility checks per second, each scoring **many** candidate ads, so a check is effectively a **batched** multi-key lookup, not a single read.
- The eligibility-check budget is a small, fixed slice of the decisioning latency budget (single-digit milliseconds).
- Impressions are emitted at very high write volume; events can arrive **out of order, duplicated, or delayed**.
- User identifiers reaching either part are already **hashed/tokenized**; raw PII is out of scope, but retention and privacy rules still apply.
- Configuration is authored by humans (campaign managers, sales/ops) and changes while campaigns are live; ad decisions and reporting must be **reproducible** against the config that was in effect.
- **Part 2 only:** exactly one order type matters — **direct-sold demand** (reserved/guaranteed). Open-auction bid-request / bid-response handling is explicitly **out of scope as the core workflow**; do not center the design on it.
### Clarifying Questions to Ask
- What is the impression / eligibility-check QPS, the read-latency budget, and the maximum number of candidate ads scored per request?
- Are caps strictly enforced (never exceed $N$) or best-effort (a small over-serve is acceptable for revenue)? Does the answer differ by campaign type?
- Are time windows fixed (e.g. per calendar day) or rolling (last 24h), and do we need both?
- What is the cap-key cardinality — how many distinct `(user, cap)` pairs must we keep counters for, and what is the retention window?
- For Part 2, what reporting freshness is required (near-real-time dashboards vs. authoritative daily billing), and which system is the source of truth for spend?
- Is the deployment single- or multi-region, and is a user pinned to a home region?
### Part 1 — Frequency-Capping Service
Build a service used during ad decisioning to prevent a user from seeing an ad entity too many times. A cap may apply at several levels of the ad hierarchy — **creative, line item, campaign, advertiser, or global** — and any single ad-decisioning request must respect *all* applicable caps simultaneously. A cap is expressed as "at most $N$ impressions for a given user within a fixed or rolling time window."
The service must support:
- **Low-latency eligibility checks** during candidate selection (many candidate ads are scored for the same user in one request).
- **Exposure recording** after an impression is actually served.
- **High write volume** at impression scale.
- **Duplicate / replayed event handling** so a single impression never double-counts.
- **Cap configuration changes** (a campaign manager edits a cap mid-flight).
- **Privacy-safe user identifiers.**
Define the API surface, the counter storage model, the read (eligibility) and write (exposure) paths, how you keep them consistent under concurrency, and how the system degrades when the counter store is slow or unavailable.
```hint Where to start
Separate the two paths: a batched, read-only **eligibility check** on the hot decisioning path, and an asynchronous **exposure record** after the impression is confirmed served. Decide up front whether you can tolerate slight over-serving in exchange for keeping the read path off any write lock.
```
```hint Counter key & storage model
What goes into the counter key so one key covers exactly one $(\text{user}, \text{cap}, \text{window})$? A fixed window and a rolling window have very different storage shapes — how would you expire a fixed-window key cheaply, and how would you answer a rolling-window query *without* keeping one entry per impression? Name the precision-vs-cost tradeoff between the options you'd consider. Then think about *where* the keys live: what property of your key layout lets a batched, multi-key read for one user stay on a single shard?
```
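One possible key layout for the questions in this hint, as a minimal sketch rather than a reference answer. The `fc:` prefix, the `Cap` fields, the one-hour bucket size, and the braces around the user token are all assumptions made for illustration. The braces follow the Redis Cluster hash-tag convention, where only the text inside `{}` is hashed, so every key for one user maps to the same slot; another store would need its own co-location mechanism.

```python
from dataclasses import dataclass

HOUR = 3600  # assumed bucket size for rolling windows


@dataclass(frozen=True)
class Cap:
    cap_id: str    # hypothetical id, e.g. "campaign:42"
    window_s: int  # window length in seconds


def fixed_key(user: str, cap: Cap, now: int) -> str:
    """One key per (user, cap, window); the window id is its aligned start."""
    window_start = now - (now % cap.window_s)
    return f"fc:{{{user}}}:{cap.cap_id}:{window_start}"


def fixed_ttl(cap: Cap, now: int) -> int:
    """Seconds until the current fixed window ends; usable as the key TTL."""
    window_start = now - (now % cap.window_s)
    return window_start + cap.window_s - now


def rolling_keys(user: str, cap: Cap, now: int, bucket_s: int = HOUR) -> list[str]:
    """Keys of every bucket that overlaps the trailing window ending at `now`.

    Summing these buckets can include up to one bucket of impressions that
    are slightly older than the window, so the count errs toward blocking.
    """
    newest = now - (now % bucket_s)
    window_open = now - cap.window_s
    oldest = window_open - (window_open % bucket_s)
    return [
        f"fc:{{{user}}}:{cap.cap_id}:b{start}"
        for start in range(oldest, newest + 1, bucket_s)
    ]
```

A 24-hour rolling cap with hourly buckets reads at most 25 small counters per user and cap instead of one entry per impression. Shrinking the bucket tightens precision at the cost of more keys per read.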
```hint Consistency & idempotency
Two decisions for the same user can race on the same cap. What's the cheapest way to record exposures, and exactly how badly can it over-serve? What stricter mechanism eliminates that, and what does it cost you on the hot path? Decide which caps deserve which. Separately: an impression event can be retried, replayed, or arrive twice — what makes `recordExposure` produce the same counter no matter how many times it's delivered? And since a manager can edit a cap while a campaign runs, how do you store caps so an edit doesn't silently rewrite what was true for past decisions?
```
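The two write modes this hint contrasts can be sketched with an in-memory stand-in for the counter store. The class and method names are hypothetical. A real deployment would need the check and the increment in `check_and_increment` to run atomically inside the store (for example as a server-side script or transaction) and would bound the dedupe set with a TTL; this single-threaded sketch only shows the logic.

```python
class ExposureRecorder:
    """In-memory stand-in for a counter store (illustrative only)."""

    def __init__(self) -> None:
        self.counters: dict[str, int] = {}
        self.seen: set[str] = set()  # event ids that have already been counted

    def record_exposure(self, event_id: str, counter_keys: list[str]) -> bool:
        """Best-effort path: increment after serving, deduped by event id.

        Returns False when the event was already counted, so retries and
        replays leave the counters unchanged.
        """
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        for key in counter_keys:
            self.counters[key] = self.counters.get(key, 0) + 1
        return True

    def check_and_increment(self, event_id: str, limits: dict[str, int]) -> bool:
        """Strict path: admit only if every cap has headroom, then count.

        `limits` maps counter key to its cap N. A replay of an admitted
        event returns True without counting it a second time.
        """
        if event_id in self.seen:
            return True
        if any(self.counters.get(key, 0) >= n for key, n in limits.items()):
            return False
        self.seen.add(event_id)
        for key in limits:
            self.counters[key] = self.counters.get(key, 0) + 1
        return True
```

With the best-effort path, the over-serve is bounded by the number of decisions that read the counter before earlier exposures land. The strict path removes that gap but puts a write on the decisioning path.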
```hint Failure behavior
The counter store *will* occasionally be slow or unavailable mid-decision. You can't both guarantee the cap and guarantee the ad serves — so which way does a given campaign fall when the counter is unreadable, and what kind of campaign would want each direction? Is this one global switch, or a per-policy decision?
```
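A per-policy fallback could look roughly like the sketch below. `OnFailure`, `CounterStoreUnavailable`, and `is_eligible` are hypothetical names, and the reader callable stands in for whatever client call fetches the counter.

```python
from enum import Enum
from typing import Callable


class OnFailure(Enum):
    FAIL_OPEN = "serve"    # keep delivering; the cap may be exceeded
    FAIL_CLOSED = "block"  # protect the cap; the ad may go unserved


class CounterStoreUnavailable(Exception):
    """Raised by the (hypothetical) counter client on timeout or outage."""


def is_eligible(read_count: Callable[[], int], limit: int, policy: OnFailure) -> bool:
    """Apply the cap when the counter is readable, else fall back per policy."""
    try:
        count = read_count()
    except CounterStoreUnavailable:
        return policy is OnFailure.FAIL_OPEN
    return count < limit
```

Storing the policy on each cap rather than as one global switch lets a strictly capped campaign fail closed while a delivery-sensitive one fails open during the same outage.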
#### Constraints & Assumptions for this Part
- The eligibility check is on the critical decisioning path: it is **read-only** and **batched** across all candidate ads for one user.
- The exposure-recording path is write-heavy and tolerant of being asynchronous, but it must be **idempotent** under retries and duplicate delivery.
- Counters are a soft business control, not a financial ledger — a small, bounded over-serve is acceptable unless a cap is explicitly marked strict.
#### What This Part Should Cover
- A clean separation of the read (eligibility) and write (exposure) paths, with the write kept off the decisioning latency budget.
- A concrete counter key scheme that covers all cap scopes and correctly distinguishes fixed vs. rolling windows, including the precision/cost tradeoff for rolling windows.
- An explicit consistency/idempotency strategy: dedupe by event id, and a reasoned choice between async-increment (cheap, can over-serve) and atomic check-and-increment (strict, costs throughput) per cap.
- A sharding/partitioning plan keyed by hashed user so a batched check stays single-shard, plus a cache strategy and versioning for cap config.
- A stated, policy-driven fallback (fail-open vs. fail-closed) when the counter store is unavailable.
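To make the batched read path concrete, here is a small sketch under stated assumptions: `multi_get` is a hypothetical callable standing in for one multi-key read against the user's shard, and `candidates` maps each ad id to the `(counter_key, limit)` pairs for every cap that applies to it. The SHA-256 routing is one way to pick a shard from an already tokenized user id, not a prescribed scheme.

```python
import hashlib
from typing import Callable, Mapping


def shard_for(user_token: str, num_shards: int) -> int:
    """Route by user token so all of one user's counters share a shard."""
    digest = hashlib.sha256(user_token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def check_batch(
    candidates: Mapping[str, list[tuple[str, int]]],
    multi_get: Callable[[list[str]], Mapping[str, int]],
) -> dict[str, bool]:
    """Return ad_id -> eligible, using a single batched read for all caps."""
    # Deduplicate keys: many candidate ads share advertiser or global caps.
    keys = sorted({key for caps in candidates.values() for key, _ in caps})
    counts = multi_get(keys)  # one round trip; missing keys mean count 0
    return {
        ad_id: all(counts.get(key, 0) < limit for key, limit in caps)
        for ad_id, caps in candidates.items()
    }
```

An ad is eligible only if every applicable cap has headroom, which matches the requirement that a request respects all caps simultaneously.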
### Part 2 — Direct-Sold Demand Order Tracking Model
Design the **data model and tracking flow** for a demand-side advertising platform (DSP) to manage and report on a single **direct-sold demand** order — i.e. contracted, reserved demand, *not* open-auction bidding.
Your design must cover the full lifecycle of one order and include: the **order**, **line items / flights**, **creatives**, **budgets**, **pacing**, **targeting**, **frequency caps**, **inventory or placement configuration**, **delivery events**, and **reporting aggregates**. Show the entities and their key fields, how they relate, how configuration changes are tracked over time, and how raw delivery events roll up into reports.
```hint Where to start
Lay out the entity hierarchy first: **advertiser → order → line item → creative-assignment → creative**, with targeting / budget / pacing / frequency-cap configs hanging off the line item (that's where most delivery decisions are made). Then split the world into two layers: mutable **configuration** (versioned) and immutable **delivery events** (append-only facts).
```
```hint Versioning & reporting
Because budgets, targeting, pacing, and caps change mid-flight, make config **versioned with effective-time ranges** rather than overwriting — historical reports and audits must reproduce "what was true then." Build reporting **aggregates** (by time bucket × order × line item × creative × placement × geo/device) derived from the immutable event facts so you can backfill after late-arriving events or bugs.
```
#### Constraints & Assumptions for this Part
- Orders are created by sales/ops, then trafficked; line items must be **independently pausable and versioned**.
- Reporting must be **reproducible and backfillable** — late, duplicated, and corrected events are normal.
- Goals/billing can be impression-based, click-based, or another contracted goal; the model should not hard-code a single billing model.
#### What This Part Should Cover
- A normalized entity model (order → line item → creative-assignment → creative) with the key fields per table.
- Config layers (targeting / budget / pacing / frequency-cap / placement) that are **versioned with effective times** so any past decision can be reconstructed.
- An immutable, idempotent delivery-event fact stream and the reporting aggregates derived from it, designed to be rebuilt/backfilled on late or corrected events.
- An order/line-item **lifecycle state machine** with audited transitions, and an explicit acknowledgment that auction bidding is out of scope.
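The lifecycle state machine in the last bullet might be sketched as follows. The state names and allowed transitions are assumptions chosen for illustration; the question itself does not prescribe them.

```python
from dataclasses import dataclass, field

# Hypothetical states and transitions for a line item.
ALLOWED: dict[str, set[str]] = {
    "DRAFT": {"APPROVED", "CANCELLED"},
    "APPROVED": {"ACTIVE", "CANCELLED"},
    "ACTIVE": {"PAUSED", "COMPLETED", "CANCELLED"},
    "PAUSED": {"ACTIVE", "CANCELLED"},
    "COMPLETED": set(),
    "CANCELLED": set(),
}


@dataclass
class LineItem:
    line_item_id: str
    state: str = "DRAFT"
    # Each entry: (timestamp, actor, from_state, to_state)
    audit_log: list[tuple[int, str, str, str]] = field(default_factory=list)

    def transition(self, new_state: str, actor: str, at: int) -> None:
        """Move to `new_state` if allowed, recording who did it and when."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.audit_log.append((at, actor, self.state, new_state))
        self.state = new_state
```

Because each line item carries its own state and audit log, one line item can be paused without touching its siblings or the parent order.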
### What a Strong Answer Covers
These dimensions span both parts and are what tie a "two separate designs" answer into one coherent platform:
- **Shared config and event substrate:** the order's frequency-cap config (Part 2) is exactly what the runtime capping service (Part 1) enforces, and both consume the same immutable impression events — one for live counting, one for reporting.
- **Privacy-safe identifiers and retention:** hashed/tokenized keys everywhere, with a retention bound that also caps the counter keyspace.
- **Observability:** blocked rate, over-cap rate, counter-store latency, stale-config rate, and dropped/duplicate-event rates — surfaced for both correctness and SLA alerting.
- **End-to-end data flow:** config authoring → serving → event emission → streaming (near-real-time) + batch (authoritative) reporting.
### Follow-up Questions
- If two ad-decisioning requests for the same user race on the *same* cap, how do you bound the over-serve, and what does it cost you to eliminate it entirely?
- How do you support a **rolling** 24-hour cap without storing one row per impression per user — what's the precision/cost tradeoff of your bucketing?
- A campaign manager lowers a cap from 5 to 2 mid-flight. What happens to a user already at 4 impressions, and how does your versioning make reporting reproducible?
- In Part 2, a billing event arrives 3 days late after the daily report was already published. How does your event/aggregate design correct the numbers without rewriting history?
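For the late-event follow-up, one workable approach is to rebuild aggregates from the immutable facts and publish the difference as a restatement, leaving the original report row untouched. The sketch below assumes events are dicts with hypothetical `event_id`, `event_ts` (epoch seconds), and `line_item_id` fields, and aggregates by UTC day only.

```python
from collections import Counter

DAY = 86400


def aggregate(events: list[dict]) -> Counter:
    """Rebuild (day, line_item) counts from raw facts, deduped by event id.

    Bucketing uses event time, not arrival time, so a late event lands in
    the day it actually happened.
    """
    seen: set[str] = set()
    totals: Counter = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        totals[(event["event_ts"] // DAY, event["line_item_id"])] += 1
    return totals


def restatement(published: dict, rebuilt: dict) -> dict:
    """Deltas to append as correction rows; unchanged buckets are omitted."""
    return {
        key: rebuilt.get(key, 0) - published.get(key, 0)
        for key in set(published) | set(rebuilt)
        if rebuilt.get(key, 0) != published.get(key, 0)
    }
```

Appending the delta as a new, dated correction row keeps the published number auditable while the current total reflects the late event.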
Quick Answer: This question tests whether you can design two low-latency, high-throughput ad-platform components: a real-time frequency-capping service and a data model for tracking a direct-sold order. It focuses on distributed state management, idempotency, privacy-safe identifiers, concurrency control, and scalability.