What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Onsite rounds at Creditkarma.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Creditkarma during technical interviews.

Design a Revenue Ranking Platform | Creditkarma Interview Question

Q: Design a Revenue Ranking Platform

This question evaluates a machine learning engineer's ability to design production-scale recommendation and ranking systems that balance revenue optimization with user experience, regulatory compliance, and long-term trust. It emphasizes funnel modeling, delayed and sparse conversion labels, negative sampling correction, system architecture, data infrastructure, large-scale serving, and the operational lifecycle. It is commonly asked to assess both conceptual understanding of trade-offs (multi-stage funnel modeling, label censoring) and practical skill in scalable ML system design, including class imbalance, calibration, monitoring, and deployment.

Design a machine learning recommendation and ranking system for a consumer finance marketplace such as Credit Karma. The product shows each user a set of eligible financial offers (credit cards, personal loans, insurance products, etc.), and the business goal is to maximize revenue while preserving user experience, regulatory compliance, and long-term user trust.

This is a broad, panel-style design discussion. Work through the modeling strategy, the ranking objective, the system architecture, the data infrastructure, large-scale serving, and the operational lifecycle. Each Part below is a distinct discussion area; the interviewer will probe several of them in depth.

Constraints & Assumptions

Scale: hundreds of millions of users; thousands of distinct models (per product line, market, partner, segment, and experiment); the offer catalog ranges from a handful to many thousands depending on the surface.
Sparsity: conversion/approval positive rate is roughly 0.1%–0.5%.
Label delay: approval and revenue labels can land days to weeks after the originating impression.
Regulatory: consumer-finance offers are subject to eligibility, fair-lending, and disclosure constraints; the ranker must respect hard eligibility/compliance filters, not just soft penalties.
Latency: online ranking must return within a low-latency budget suitable for a page render (interactive product surface).
Objective: maximize expected revenue subject to user-experience, compliance, and long-term-trust guardrails — not raw short-term revenue.

Clarifying Questions to Ask

What is the exact business objective, and which guardrail metrics are hard constraints vs. soft penalties (e.g. approval-rate floor, fairness, satisfaction)?
What is the revenue model per product — pay-per-click, pay-per-approved-application, revenue share, or a mix — and does it vary by partner?
What is the end-to-end latency budget, and the size of the eligible candidate set on a typical surface?
How are conversions attributed (impression vs. click vs. application vs. approval time), and what is the realistic label-maturity window?
What compliance and fair-lending constraints must the ranking respect, and are any offers legally mandated to appear or be filtered?
What infrastructure already exists (feature store, model registry, serving framework, experimentation platform)?

Part 1 — Funnel modeling strategy

A conversion in this marketplace passes through several stages: a user must click an offer, apply for it, and then be approved by the lender. Revenue is realized only after approval (and sometimes only after the product is used).

Should you build a single model or separate models for the funnel stages (CTR, application rate, approval rate, expected value)? Discuss one model vs. per-stage models vs. a multi-task model, and what you would recommend for a large marketplace.

What This Part Should Cover

Label-to-stage reasoning: ties each stage (CTR / apply / approve / value) to its label source, arrival latency, and base rate before choosing an architecture.
Architecture tradeoff: articulates representation sharing (helps sparse downstream stages) vs. per-stage debuggability/ownership/calibration, rather than a generic "use a big model."
A concrete recommendation: lands on a defensible choice (e.g. shared embeddings + per-stage calibrated heads) with the reasoning why, not a list of options.
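
One way to picture the "shared embeddings + per-stage calibrated heads" option is the minimal pure-Python sketch below. It is only an illustration: the single ReLU trunk layer, the weights, and the stage names (`click`, `apply`, `approve`) are assumptions, and per-head calibration is not shown.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def shared_trunk(features: List[float],
                 trunk_weights: List[List[float]]) -> List[float]:
    """One ReLU layer whose output is shared by every stage head."""
    return [max(0.0, dot(row, features)) for row in trunk_weights]


def predict_stages(features: List[float],
                   trunk_weights: List[List[float]],
                   heads: Dict[str, Tuple[List[float], float]]) -> Dict[str, float]:
    """Return one probability per funnel stage from a shared representation.

    `heads` maps a stage name to (weights, bias); each head is a separate
    logistic output, so it can be calibrated and debugged on its own.
    """
    hidden = shared_trunk(features, trunk_weights)
    return {stage: sigmoid(dot(w, hidden) + b) for stage, (w, b) in heads.items()}
```

The sparse downstream heads (apply, approve) benefit from a trunk that is also trained on the dense click signal, while each head keeps its own output that can be monitored separately.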

Part 2 — Delayed and sparse conversion labels

Approval and revenue labels can arrive days or weeks after the impression, and the positive rate at the conversion/approval stage is extremely sparse (roughly 0.1%–0.5%).

How would you handle the delayed labels so you do not poison training with false negatives? How would you model the extreme class imbalance?

What This Part Should Cover

Censoring vs. negatives: recognizes that recent non-conversions are censored, and proposes a concrete mechanism (maturity windows, arrival curves, or a delayed-feedback/survival model) rather than treating "no label yet" as a 0.
Freshness vs. correctness: balances staying fresh (dense proxy labels) against waiting for downstream labels to mature.
Imbalance handling: names loss design and/or sampling and pairs it with imbalance-robust evaluation (PR-AUC, log loss, ECE), not accuracy.
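
The maturity-window idea can be sketched as a labeling rule that returns a third state for censored rows instead of a 0. This is a minimal sketch under stated assumptions: the function name, its arguments, and the definition of a positive as "approved within the window" are illustrative choices, not part of the question.

```python
from datetime import datetime, timedelta
from typing import Optional


def training_label(impression_time: datetime,
                   approval_time: Optional[datetime],
                   now: datetime,
                   maturity_window: timedelta) -> Optional[int]:
    """Return 1 (positive), 0 (matured negative), or None (censored).

    A row counts as a negative only once the full maturity window has
    elapsed without an approval; younger rows are left out of training.
    """
    if approval_time is not None and approval_time - impression_time <= maturity_window:
        return 1
    if now - impression_time >= maturity_window:
        return 0  # window elapsed with no in-window approval
    return None   # too recent: "no label yet" is not a negative
```

Rows labeled `None` can be held back until they mature, or routed to a delayed-feedback model, while dense proxy labels such as clicks keep the fresher part of the model up to date.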

Part 3 — Negative sampling correction

To make training tractable you subsample negatives. This shifts the class prior between training and serving.

If negatives are kept with sampling probability $q$, how should predicted probabilities be corrected at inference time so they are calibrated against the true distribution?

What This Part Should Cover

Direction of the bias: states that subsampling negatives inflates predicted positives and must be corrected, not ignored.
A correct mechanism: gives the analytical odds/logit adjustment in terms of $q$ (or equivalent importance weighting) and/or a post-hoc calibrator fit on a true-prior holdout.
Why it matters here: connects calibration back to the multiplied stage-probability objective (uncalibrated heads break the product).
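
As a worked sketch of the analytical adjustment, assume all positives are kept and each negative is kept independently with probability $q$. The odds seen in training are then the true odds divided by $q$, so the correction is $p = p' / (p' + (1 - p')/q)$, or equivalently $\text{logit}(p) = \text{logit}(p') + \log q$, where $p'$ is the prediction of the model trained on sampled data. The function names below are illustrative.

```python
import math


def correct_probability(p_sampled: float, q: float) -> float:
    """Map a probability learned on negative-subsampled data back to the true prior.

    Assumes positives were all kept and negatives were kept with probability q.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / q)


def correct_logit(logit_sampled: float, q: float) -> float:
    """Same correction in logit space: shift the logit by log(q)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return logit_sampled + math.log(q)
```

For example, with $q = 0.1$ a sampled-data prediction of 0.5 corresponds to a true probability of $1/11 \approx 0.09$. A post-hoc calibrator fit on a true-prior holdout can then absorb any residual miscalibration.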

Part 4 — Combining objectives into one ranking score

You now have several signals — CTR, conversion/application rate, approval probability, expected value, plus user-experience and compliance considerations.

How do you combine these into a single final ranking objective? What does the objective function look like, and how do you keep it from over-optimizing short-term revenue?

What This Part Should Cover

A written objective: an expected-value-with-guardrails form (product of calibrated stage probabilities × margin, minus weighted penalty terms), not a vague "blend the scores."
Hard vs. soft constraints: enforces eligibility/compliance as hard filters and treats user fit / fatigue / fairness / long-term value as penalties.
Anti-myopia mechanism: explains how guardrails and online A/B-validated weights stop short-term revenue from cannibalizing approval rate and trust.
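
The expected-value-with-guardrails form can be sketched as follows. This is a minimal illustration, assuming the stage probabilities are already calibrated and conditional on the previous stage; the `Offer` fields, penalty names, and weights are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Offer:
    offer_id: str
    eligible: bool          # result of hard eligibility/compliance filters
    p_click: float          # P(click | impression)
    p_apply: float          # P(apply | click)
    p_approve: float        # P(approve | apply)
    payout: float           # revenue or margin if approved
    penalties: Dict[str, float] = field(default_factory=dict)


def score(offer: Offer, weights: Dict[str, float]) -> float:
    """Expected revenue minus weighted soft penalties (e.g. fatigue, poor fit)."""
    expected_revenue = offer.p_click * offer.p_apply * offer.p_approve * offer.payout
    penalty = sum(weights.get(name, 0.0) * value
                  for name, value in offer.penalties.items())
    return expected_revenue - penalty


def rank(offers: List[Offer], weights: Dict[str, float]) -> List[Offer]:
    """Apply hard filters first, then sort the survivors by score."""
    eligible = [o for o in offers if o.eligible]
    return sorted(eligible, key=lambda o: score(o, weights), reverse=True)
```

Ineligible offers never reach the scorer, so no payout can buy its way past a compliance filter; the penalty weights are the knobs that would be tuned and validated through online A/B tests.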

Part 5 — Single-stage vs. two-stage architecture

Should the system retrieve candidates first and then rerank, or score every eligible item directly?

Discuss the tradeoffs of two-stage (retrieval + reranking) vs. single-stage scoring, and give a concrete decision rule for choosing between them.

What This Part Should Cover

Both directions of the tradeoff: two-stage cuts latency/cost and enforces eligibility early, but caps recall and adds a retrieval/rerank training–serving mismatch.
A concrete decision rule: ties the choice to candidate count × per-item latency vs. budget (plus personalization/model cost), not a stylistic preference.
Recall vs. precision framing: recall is set in retrieval, precision is won in rerank.
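
A decision rule of the "candidate count × per-item latency vs. budget" kind can be written down directly. The sketch below assumes a simple linear latency model with a fixed overhead; the function and parameter names are illustrative, and a real estimate would also account for batching, parallelism, and model cost.

```python
import math


def choose_architecture(num_candidates: int,
                        per_item_ms: float,
                        fixed_overhead_ms: float,
                        budget_ms: float) -> str:
    """Pick single-stage scoring if scoring every eligible item fits the budget."""
    estimated_ms = fixed_overhead_ms + num_candidates * per_item_ms
    return "single-stage" if estimated_ms <= budget_ms else "two-stage"


def max_rerank_candidates(per_item_ms: float,
                          fixed_overhead_ms: float,
                          budget_ms: float) -> int:
    """How many candidates the heavy ranker can score within the budget.

    In a two-stage design this bounds the retrieval stage's output size.
    """
    return max(0, math.floor((budget_ms - fixed_overhead_ms) / per_item_ms))
```

With a 50 ms budget, 10 ms of overhead, and 0.5 ms per item, a surface with 20 eligible offers can be scored in one pass, while a surface with 5,000 needs retrieval to cut the set to at most 80 candidates.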

Part 6 — Feature store and offline/online consistency

What should the feature store provide, and how do you guarantee that the features used in training match those served online (avoiding training–serving skew)?

What This Part Should Cover

Core responsibilities: shared definitions, point-in-time-correct offline joins (anti-leakage), low-latency online reads, freshness/lineage/access control.
A concrete anti-skew mechanism: a single source of transform logic and/or logging served features for reuse in training, plus replay-and-diff to detect drift.
Leakage awareness: explains why point-in-time correctness matters (training must not see values unavailable at inference).
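
Point-in-time correctness boils down to looking up, for each training row, the latest feature value written at or before the event time. The sketch below shows that lookup in pure Python; the `(timestamp, value)` history layout and the inclusive boundary are assumptions for illustration, and a production feature store would do this as a batched as-of join.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(history: List[Tuple[float, float]],
                        as_of: float) -> Optional[float]:
    """Return the latest feature value with timestamp <= as_of.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Values written after `as_of` are never returned, which is what keeps
    future information out of the training row.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None
```

Joining on "latest value as of today" instead would leak information that was unavailable at inference time and inflate offline metrics.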

Part 7 — Serving hundreds of millions of users and thousands of models

Design the serving path for hundreds of millions of users and thousands of models. How do you route a request to the correct model version, manage hot vs. cold models, and decide what stays resident in memory vs. loaded on demand?

What This Part Should Cover

Routing logic: selects a model version by product / market / segment / experiment arm / partner.
Tiered caching: hot models pinned, cold models lazily loaded under LRU/LFU, with pre-warming before traffic shifts.
The core tradeoff named explicitly: memory cost vs. cold-start (p99) tail latency, with a sensible default (tiered hot/warm/cold).
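
The pinned-hot plus LRU-for-the-rest idea can be sketched with `collections.OrderedDict`. This is a single-process illustration under stated assumptions: the `loader` callable, the string model keys, and the capacity value are hypothetical, and thread safety, pre-warming, and a separate warm tier are left out.

```python
from collections import OrderedDict
from typing import Any, Callable, Iterable


class ModelCache:
    """Hot models are pinned in memory; other models share a bounded LRU."""

    def __init__(self, loader: Callable[[str], Any], capacity: int,
                 pinned: Iterable[str] = ()) -> None:
        self._loader = loader
        self._capacity = capacity
        # Pinned models are loaded eagerly and never evicted.
        self._pinned = {key: loader(key) for key in pinned}
        self._lru: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Any:
        if key in self._pinned:
            return self._pinned[key]
        if key in self._lru:
            self._lru.move_to_end(key)      # mark as most recently used
            return self._lru[key]
        model = self._loader(key)           # cold start: pays the load latency
        self._lru[key] = model
        if len(self._lru) > self._capacity:
            self._lru.popitem(last=False)   # evict the least recently used
        return model
```

Raising `capacity` or pinning more models trades memory for fewer cold loads on the tail of the latency distribution.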

Part 8 — MLOps, versioning, and serialization formats

With thousands of models trained per year, how do you manage versioning, reproducibility, dependency isolation, and deployment pipelines? And how would you reason about serialization formats — native framework formats, TorchScript, ONNX, Pickle, Joblib?

What This Part Should Cover

Reproducibility surface: versions code + data snapshot + feature definitions + config + artifact + eval report, with isolated dependencies.
Staged delivery: an offline → shadow → canary → A/B → rollback pipeline with automatic gates.
Format tradeoffs: compares native / TorchScript / ONNX / Pickle / Joblib on portability, runtime fidelity, latency, and safety (Pickle/Joblib are unsafe to load from untrusted sources).
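
One way to make the reproducibility surface concrete is to fingerprint a manifest that names every versioned input. The sketch below is illustrative only: the manifest keys and values are hypothetical placeholders, and a real registry would store the manifest alongside the artifact and eval report.

```python
import hashlib
import json
from typing import Any, Dict


def manifest_fingerprint(manifest: Dict[str, Any]) -> str:
    """Return a stable SHA-256 hex digest of a model manifest.

    Keys are sorted before hashing so the same content always yields the
    same fingerprint regardless of dictionary ordering.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: every field that can change model behavior is listed.
example_manifest = {
    "code_commit": "abc123",
    "data_snapshot": "snapshot-2024-01-01",
    "feature_definitions": "features-v7",
    "config": {"learning_rate": 0.01},
    "artifact_format": "onnx",
}
```

Two training runs with the same fingerprint should be comparable; a change in any single field, such as the data snapshot, produces a different fingerprint and therefore a new model version.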

What a Strong Answer Covers

These dimensions span all parts; a strong candidate exhibits them throughout rather than in any single part.

Calibration as a through-line: treats calibration as first-class because stage probabilities are multiplied into the final score — it recurs in Parts 1, 3, and 4.
Coherence across the stack: the modeling choices (Parts 1–4), the architecture (Parts 5, 7), and the infra (Part 6) tell one consistent story, not eight disconnected mini-answers.
Regulatory discipline: consistently enforces eligibility/fair-lending as hard constraints upstream, not as soft scoring penalties.
Tradeoff fluency: names tradeoffs and failure modes rather than asserting a single "correct" design.

Follow-up Questions

A multi-task model's loss decreases, then suddenly becomes NaN or unstable after several hundred steps. What are the likely causes, and how would you debug it?
The model looks strong offline but performs poorly online. How do you investigate the gap, and where do you look first for training–serving skew?
If online feature distributions drift significantly from the training distribution, which components and signals do you inspect, and how do you detect it automatically?
Walk through how you would run a safe online experiment for a new ranking model — exposure, guardrail metrics, ramp-up, and rollback criteria — given the multi-week label delay.

