Design a Revenue Ranking Platform
Company: Credit Karma
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
Design a machine learning recommendation and ranking system for a consumer finance marketplace such as Credit Karma. The product shows each user a set of eligible financial offers (credit cards, personal loans, insurance products, etc.), and the business goal is to **maximize revenue while preserving user experience, regulatory compliance, and long-term user trust**.
This is a broad, panel-style design discussion. Work through the modeling strategy, the ranking objective, the system architecture, the data infrastructure, large-scale serving, and the operational lifecycle. Each `### Part` below is a distinct discussion area; the interviewer will probe several of them in depth.
### Constraints & Assumptions
- **Scale:** hundreds of millions of users; thousands of distinct models (per product line, market, partner, segment, and experiment); the offer catalog ranges from a handful to many thousands depending on the surface.
- **Sparsity:** conversion/approval positive rate is roughly **0.1%–0.5%**.
- **Label delay:** approval and revenue labels can land **days to weeks** after the originating impression.
- **Regulatory:** consumer-finance offers are subject to eligibility, fair-lending, and disclosure constraints; the ranker must respect hard eligibility/compliance filters, not just soft penalties.
- **Latency:** online ranking must return within a low-latency budget suitable for a page render (interactive product surface).
- **Objective:** maximize expected revenue subject to user-experience, compliance, and long-term-trust guardrails — not raw short-term revenue.
### Clarifying Questions to Ask
- What is the exact business objective, and which guardrail metrics are hard constraints vs. soft penalties (e.g. approval-rate floor, fairness, satisfaction)?
- What is the revenue model per product — pay-per-click, pay-per-approved-application, revenue share, or a mix — and does it vary by partner?
- What is the end-to-end latency budget, and the size of the eligible candidate set on a typical surface?
- How are conversions attributed (impression vs. click vs. application vs. approval time), and what is the realistic label-maturity window?
- What compliance and fair-lending constraints must the ranking respect, and are any offers legally mandated to appear or be filtered?
- What infrastructure already exists (feature store, model registry, serving framework, experimentation platform)?
---
### Part 1 — Funnel modeling strategy
A conversion in this marketplace passes through several stages: a user must **click** an offer, **apply** for it, and then be **approved** by the lender. Revenue is realized only after approval (and sometimes only after the product is used).
Should you build a single model or separate models for the funnel stages (CTR, application rate, approval rate, expected value)? Discuss one model vs. per-stage models vs. a multi-task model, and what you would recommend for a large marketplace.
```hint Map the stages to labels
Each funnel stage produces a different label, arrives on a different timeline, and has a different base rate. Reasoning stage-by-stage about *where each label comes from* and *how dense it is* tells you which stages can share signal and which need their own training data, sampling, and calibration — which is the real input to the one-model-vs-per-stage decision.
```
```hint Sharing vs. isolation
Weigh a shared-bottom / multi-task or mixture-of-experts model (better representation sharing, helps sparse stages) against fully separate models (easier to debug, calibrate, and own per stage). Consider a hybrid: shared embeddings, per-stage calibrated heads.
```
#### What This Part Should Cover
- **Label-to-stage reasoning:** ties each stage (CTR / apply / approve / value) to its label source, arrival latency, and base rate before choosing an architecture.
- **Architecture tradeoff:** articulates representation sharing (helps sparse downstream stages) vs. per-stage debuggability/ownership/calibration, rather than a generic "use a big model."
- **A concrete recommendation:** lands on a defensible choice (e.g. shared embeddings + per-stage calibrated heads) with the reasoning *why*, not a list of options.
---
### Part 2 — Delayed and sparse conversion labels
Approval and revenue labels can arrive **days or weeks** after the impression, and the positive rate at the conversion/approval stage is **extremely sparse (roughly 0.1%–0.5%)**.
How would you handle the delayed labels so you do not poison training with false negatives? How would you model the extreme class imbalance?
```hint Don't mislabel "not yet" as negative
A recent impression with no conversion may simply be censored, not negative. Think about label-maturity windows, delayed-feedback / survival models for censored examples, and using denser proxy labels (clicks, application starts) for fresh data.
```
```hint Imbalance toolkit
For 0.1%–0.5% positives consider class-weighted cross-entropy or focal loss, negative subsampling (with correction — see Part 3), representation sharing from denser tasks, and explicit calibration on an unbiased holdout. Pick evaluation metrics robust to imbalance (PR-AUC, log loss, ECE), not raw accuracy.
```
#### What This Part Should Cover
- **Censoring vs. negatives:** recognizes that recent non-conversions are censored, and proposes a concrete mechanism (maturity windows, arrival curves, or a delayed-feedback/survival model) rather than treating "no label yet" as a 0.
- **Freshness vs. correctness:** balances staying fresh (dense proxy labels) against waiting for downstream labels to mature.
- **Imbalance handling:** names loss design and/or sampling and pairs it with imbalance-robust evaluation (PR-AUC, log loss, ECE), not accuracy.
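A minimal sketch of the maturity-window idea, assuming a fixed 30-day attribution window (a hypothetical value; the real window would come from observed label-arrival curves). The function name `label_impression` and its arguments are illustrative. An impression only becomes a negative once the window has fully elapsed; until then it is censored and is either excluded from approval-stage training or handed to a delayed-feedback model.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed attribution window; tune from the empirical label-arrival curve.
MATURITY_WINDOW = timedelta(days=30)


def label_impression(
    impression_ts: datetime,
    approval_ts: Optional[datetime],
    as_of: datetime,
    window: timedelta = MATURITY_WINDOW,
) -> Optional[int]:
    """Return 1 (positive), 0 (matured negative), or None (censored)."""
    # An approval only counts if it is already visible at training time.
    if approval_ts is not None and approval_ts <= as_of:
        # Approvals outside the attribution window are not credited.
        return 1 if approval_ts - impression_ts <= window else 0
    # No approval seen yet: negative only if the window has fully elapsed.
    if as_of - impression_ts >= window:
        return 0
    # Too recent to tell: censored, not negative.
    return None
```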
---
### Part 3 — Negative sampling correction
To make training tractable you subsample negatives. This shifts the class prior between training and serving.
If negatives are kept with sampling probability $q$, how should predicted probabilities be corrected at inference time so they are calibrated against the true distribution?
```hint Two ways to undo the shift
Subsampling changes only the class prior, not the per-example evidence — so the bias is systematic and recoverable. There are two complementary families: an *analytical* correction derived from how $q$ moves the odds (work out what subsampling does to the prior odds for a logistic model), and a *learned* post-hoc calibrator fit on a holdout that carries the true class ratio.
```
#### What This Part Should Cover
- **Direction of the bias:** states that subsampling negatives raises the training-set positive rate, so raw predicted probabilities are inflated relative to the true distribution and must be corrected, not ignored.
- **A correct mechanism:** gives the analytical odds/logit adjustment in terms of $q$ (or equivalent importance weighting) and/or a post-hoc calibrator fit on a true-prior holdout.
- **Why it matters here:** connects calibration back to the multiplied stage-probability objective (uncalibrated heads break the product).
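One way to write the analytical correction, assuming all positives are kept, each negative is kept independently with probability $q$, and the model is well fit to the subsampled data: the training odds equal the true odds divided by $q$, so $p = \frac{q\,p'}{q\,p' + (1 - p')}$, or equivalently $\operatorname{logit}(p) = \operatorname{logit}(p') + \ln q$, where $p'$ is the raw model output. The function names below are illustrative, not from any particular library.

```python
import math


def correct_probability(p_sampled: float, q: float) -> float:
    """Map a probability learned on negative-subsampled data to the true prior.

    q is the probability with which each negative was kept (0 < q <= 1).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (q * p_sampled) / (q * p_sampled + (1.0 - p_sampled))


def correct_logit(logit_sampled: float, q: float) -> float:
    """Same correction in logit space: shift the logit by ln(q)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return logit_sampled + math.log(q)
```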
---
### Part 4 — Combining objectives into one ranking score
You now have several signals — CTR, conversion/application rate, approval probability, expected value, plus user-experience and compliance considerations.
How do you combine these into a single final ranking objective? What does the objective function look like, and how do you keep it from over-optimizing short-term revenue?
```hint Value-based score with guardrails
Rank by expected value (product of calibrated stage probabilities × margin) but add explicit penalty/constraint terms for poor user fit, high rejection risk, fatigue, fairness/compliance, and long-term value. Note that *calibration* matters here, not just ranking AUC — uncalibrated heads make the product meaningless. Consider tunable weights validated via online A/B tests.
```
#### What This Part Should Cover
- **A written objective:** an expected-value-with-guardrails form (product of calibrated stage probabilities × margin, minus weighted penalty terms), not a vague "blend the scores."
- **Hard vs. soft constraints:** enforces eligibility/compliance as hard filters and treats user fit / fatigue / fairness / long-term value as penalties.
- **Anti-myopia mechanism:** explains how guardrails and online A/B-validated weights stop short-term revenue from cannibalizing approval rate and trust.
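The expected-value-with-guardrails form can be sketched as follows. This is an illustration under stated assumptions, not a prescribed formula: the `Offer` fields, the penalty terms, and the default weights (`w_reject`, `w_fatigue`, `w_fit`) are all hypothetical, and in practice the weights would be tuned through online A/B tests. Eligibility is applied as a hard filter before any scoring; everything else is a soft penalty.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Offer:
    offer_id: str
    eligible: bool      # result of hard eligibility/compliance filters
    p_click: float      # calibrated P(click)
    p_apply: float      # calibrated P(apply | click)
    p_approve: float    # calibrated P(approve | apply)
    payout: float       # revenue realized on approval
    fatigue: float = 0.0    # assumed repeated-exposure penalty input
    poor_fit: float = 0.0   # assumed user-fit penalty input


def score(offer: Offer, w_reject: float = 50.0,
          w_fatigue: float = 1.0, w_fit: float = 1.0) -> float:
    """Expected revenue minus weighted soft penalties."""
    p_reach_decision = offer.p_click * offer.p_apply
    expected_revenue = p_reach_decision * offer.p_approve * offer.payout
    # Probability the user applies and is then rejected (a trust cost).
    rejection_risk = p_reach_decision * (1.0 - offer.p_approve)
    return (expected_revenue
            - w_reject * rejection_risk
            - w_fatigue * offer.fatigue
            - w_fit * offer.poor_fit)


def rank(offers: List[Offer], **weights: float) -> List[Tuple[str, float]]:
    """Drop ineligible offers first, then sort by score, highest first."""
    scored = [(o.offer_id, score(o, **weights)) for o in offers if o.eligible]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```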
---
### Part 5 — Single-stage vs. two-stage architecture
Should the system retrieve candidates first and then rerank, or score every eligible item directly?
Discuss the tradeoffs of two-stage (retrieval + reranking) vs. single-stage scoring, and give a concrete decision rule for choosing between them.
```hint What forces a retrieval stage
The deciding factors are candidate-set size, per-item scoring cost, latency budget, and how many eligibility/compliance filters must run first. Think about where recall can be capped (retrieval) vs. where precision is won (rerank), and the training–serving mismatch a two-stage system introduces.
```
#### What This Part Should Cover
- **Both directions of the tradeoff:** two-stage cuts latency/cost and enforces eligibility early, but caps recall and adds a retrieval/rerank training–serving mismatch.
- **A concrete decision rule:** ties the choice to candidate count × per-item latency vs. budget (plus personalization/model cost), not a stylistic preference.
- **Recall vs. precision framing:** recall is set in retrieval, precision is won in rerank.
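A back-of-the-envelope version of the decision rule, assuming a simple linear cost model (fixed overhead plus a constant per-item scoring cost, no batching or parallelism). All function and parameter names are hypothetical. If scoring every eligible candidate fits the budget, single-stage is enough; otherwise the same arithmetic gives the largest candidate set the reranker can afford, which is the cap the retrieval stage must hit.

```python
import math


def estimated_latency_ms(num_candidates: int, per_item_ms: float,
                         overhead_ms: float) -> float:
    """Linear cost model: fixed overhead plus per-candidate scoring cost."""
    return overhead_ms + num_candidates * per_item_ms


def choose_architecture(num_candidates: int, per_item_ms: float,
                        overhead_ms: float, budget_ms: float) -> str:
    """Return 'single-stage' if scoring everything fits the latency budget."""
    total = estimated_latency_ms(num_candidates, per_item_ms, overhead_ms)
    return "single-stage" if total <= budget_ms else "two-stage"


def max_rerank_candidates(per_item_ms: float, overhead_ms: float,
                          budget_ms: float) -> int:
    """Largest candidate set the reranker can score within the budget."""
    return max(0, math.floor((budget_ms - overhead_ms) / per_item_ms))
```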
---
### Part 6 — Feature store and offline/online consistency
What should the feature store provide, and how do you guarantee that the features used in training match those served online (avoiding training–serving skew)?
```hint Point-in-time correctness + one transform
Name the core responsibilities (shared definitions, point-in-time-correct offline joins to prevent leakage, low-latency online reads, freshness/lineage/access control). For consistency, focus on a single source of transformation logic, logged feature timestamps, and replaying online requests through the offline path to diff distributions and missing rates.
```
#### What This Part Should Cover
- **Core responsibilities:** shared definitions, point-in-time-correct offline joins (anti-leakage), low-latency online reads, freshness/lineage/access control.
- **A concrete anti-skew mechanism:** a single source of transform logic and/or logging served features for reuse in training, plus replay-and-diff to detect drift.
- **Leakage awareness:** explains *why* point-in-time correctness matters (training must not see values unavailable at inference).
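Point-in-time correctness reduces to one rule: for each training row, use the latest feature value written at or before the impression time, never a later one. A minimal standard-library sketch, assuming each feature's history is stored as a list of `(timestamp, value)` pairs sorted by timestamp (the function name and data layout are illustrative):

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(history: List[Tuple[int, float]],
                        as_of: int) -> Optional[float]:
    """Return the latest value with timestamp <= as_of, or None if none exists.

    `history` must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first entry strictly after as_of.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        # The feature did not exist yet at as_of; do not backfill from the future.
        return None
    return history[idx - 1][1]
```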
---
### Part 7 — Serving hundreds of millions of users and thousands of models
Design the serving path for hundreds of millions of users and **thousands of models**. How do you route a request to the correct model version, manage hot vs. cold models, and decide what stays resident in memory vs. loaded on demand?
```hint Routing + tiered model cache
Think of a router that selects a model by product / market / segment / experiment / partner, plus a tiered cache: hot models pinned in memory, cold models lazily loaded under LRU/LFU with pre-warming before traffic shifts. The core tradeoff is memory cost vs. cold-start tail latency.
```
#### What This Part Should Cover
- **Routing logic:** selects a model version by product / market / segment / experiment arm / partner.
- **Tiered caching:** hot models pinned, cold models lazily loaded under LRU/LFU, with pre-warming before traffic shifts.
- **The core tradeoff named explicitly:** memory cost vs. cold-start (p99) tail latency, with a sensible default (tiered hot/warm/cold).
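The tiered cache can be sketched in a few lines. This is a single-process illustration only, assuming a caller-supplied `loader` function and a routing key such as a `(product, market, segment, experiment_arm, partner)` tuple; the class name and its methods are hypothetical. A production version would also need thread safety, memory-based (not count-based) capacity, and load timeouts.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Iterable


class TieredModelCache:
    """Hot models are pinned; cold models are lazily loaded under LRU."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 pinned_keys: Iterable[Hashable], max_cold: int) -> None:
        self._loader = loader
        # Hot tier: loaded eagerly and never evicted.
        self._pinned = {key: loader(key) for key in pinned_keys}
        # Cold tier: insertion order tracks recency (oldest first).
        self._cold: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._max_cold = max_cold

    def get(self, key: Hashable) -> Any:
        if key in self._pinned:
            return self._pinned[key]
        if key in self._cold:
            self._cold.move_to_end(key)  # mark as most recently used
            return self._cold[key]
        model = self._loader(key)        # cold start: pays the load latency
        self._cold[key] = model
        if len(self._cold) > self._max_cold:
            self._cold.popitem(last=False)  # evict least recently used
        return model

    def prewarm(self, key: Hashable) -> None:
        """Load a model before traffic shifts to it."""
        self.get(key)
```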
---
### Part 8 — MLOps, versioning, and serialization formats
With thousands of models trained per year, how do you manage versioning, reproducibility, dependency isolation, and deployment pipelines? And how would you reason about serialization formats — native framework formats, TorchScript, ONNX, Pickle, Joblib?
```hint Reproducibility surface + format tradeoffs
Reproducibility means versioning code + data snapshot + feature definitions + config + artifact + eval report, with isolated dependencies (containers/lockfiles) and staged CI/CD (offline → shadow → canary → A/B → rollback). For formats, compare on portability, runtime fidelity, latency, and safety — and remember Pickle/Joblib are unsafe for untrusted artifacts.
```
#### What This Part Should Cover
- **Reproducibility surface:** versions code + data snapshot + feature definitions + config + artifact + eval report, with isolated dependencies.
- **Staged delivery:** an offline → shadow → canary → A/B → rollback pipeline with automatic gates.
- **Format tradeoffs:** compares native / TorchScript / ONNX / Pickle / Joblib on portability, runtime fidelity, latency, and **safety** (Pickle/Joblib are unsafe to load from untrusted sources).
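One way to make the reproducibility surface concrete is to derive a model version ID from a manifest of everything that determines the artifact. The sketch below assumes a plain dictionary manifest with hypothetical field names (`code_commit`, `data_snapshot`, and so on); it is not tied to any specific model registry. Hashing a canonical JSON encoding means two training runs share an ID only when every versioned input matches.

```python
import hashlib
import json
from typing import Any, Dict


def model_version_id(manifest: Dict[str, Any]) -> str:
    """Deterministic short ID derived from the full reproducibility manifest."""
    # Canonical encoding: sorted keys and fixed separators, so key order
    # and whitespace do not change the hash.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical manifest; every field name and value here is illustrative.
example_manifest = {
    "code_commit": "abc1234",
    "data_snapshot": "2024-01-01",
    "feature_definitions": "v12",
    "config": {"learning_rate": 0.01, "negative_keep_prob": 0.01},
}
```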
---
### What a Strong Answer Covers
These dimensions span all parts; a strong candidate exhibits them throughout rather than in any single part.
- **Calibration as a through-line:** treats calibration as first-class because stage probabilities are multiplied into the final score — it recurs in Parts 1, 3, and 4.
- **Coherence across the stack:** the modeling choices (Parts 1–4), the architecture (Parts 5, 7), and the infra and operations (Parts 6, 8) tell one consistent story, not eight disconnected mini-answers.
- **Regulatory discipline:** consistently enforces eligibility/fair-lending as hard constraints upstream, not as soft scoring penalties.
- **Tradeoff fluency:** names tradeoffs and failure modes rather than asserting a single "correct" design.
### Follow-up Questions
- A multi-task model's loss decreases, then suddenly becomes **NaN or unstable after several hundred steps**. What are the likely causes, and how would you debug it?
- The model looks strong **offline but performs poorly online**. How do you investigate the gap, and where do you look first for training–serving skew?
- If online feature distributions drift significantly from the training distribution, which components and signals do you inspect, and how do you detect it automatically?
- Walk through how you would run a safe online experiment for a new ranking model — exposure, guardrail metrics, ramp-up, and rollback criteria — given the multi-week label delay.
Quick Answer: This question evaluates a machine learning engineer's ability to design a production-scale recommendation and ranking system that balances revenue optimization against user experience, regulatory compliance, and long-term trust. It probes both conceptual understanding (multi-stage funnel modeling, delayed and sparse conversion labels, label censoring, negative sampling correction, class imbalance, calibration) and practical system design skills (architecture, data infrastructure, large-scale serving, monitoring, and deployment).