Design a Revenue Ranking Platform

Last updated: Jun 21, 2026

Quick Overview

This question evaluates a machine learning engineer's ability to design production-scale recommendation and ranking systems that balance revenue optimization against user experience, regulatory compliance, and long-term trust. It covers funnel modeling, delayed and sparse conversion labels, negative sampling correction, system architecture, data infrastructure, large-scale serving, and the operational lifecycle. It is commonly asked to assess both conceptual understanding of the trade-offs in multi-stage funnel modeling and label censoring, and practical skill in scalable ML system design, including class imbalance, calibration, monitoring, and deployment.

Company: Creditkarma

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite

Hints

  • Part 1, map the stages to labels: Each funnel stage produces a different label, arrives on a different timeline, and has a different base rate. Reasoning stage by stage about where each label comes from and how dense it is tells you which stages can share signal and which need their own training data, sampling, and calibration. That is the real input to the one-model-vs-per-stage decision.
  • Part 1, sharing vs. isolation: Weigh a shared-bottom / multi-task or mixture-of-experts model (better representation sharing, helps sparse stages) against fully separate models (easier to debug, calibrate, and own per stage). Consider a hybrid: shared embeddings, per-stage calibrated heads.
  • Part 2, don't mislabel "not yet" as negative: A recent impression with no conversion may simply be censored, not negative. Think about label-maturity windows, delayed-feedback / survival models for censored examples, and using denser proxy labels (clicks, application starts) for fresh data.
  • Part 2, imbalance toolkit: For 0.1%–0.5% positives consider class-weighted cross-entropy or focal loss, negative subsampling (with correction, see Part 3), representation sharing from denser tasks, and explicit calibration on an unbiased holdout. Pick evaluation metrics robust to imbalance (PR-AUC, log loss, ECE), not raw accuracy.
  • Part 3, two ways to undo the shift: Subsampling changes only the class prior, not the per-example evidence, so the bias is systematic and recoverable. There are two complementary families: an analytical correction derived from how $q$ moves the odds (work out what subsampling does to the prior odds for a logistic model), and a learned post-hoc calibrator fit on a holdout that carries the true class ratio.
  • Part 4, value-based score with guardrails: Rank by expected value (product of calibrated stage probabilities × margin) but add explicit penalty/constraint terms for poor user fit, high rejection risk, fatigue, fairness/compliance, and long-term value. Calibration matters here, not just ranking AUC: uncalibrated heads make the product meaningless. Consider tunable weights validated via online A/B tests.
  • Part 5, what forces a retrieval stage: The deciding factors are candidate-set size, per-item scoring cost, latency budget, and how many eligibility/compliance filters must run first. Think about where recall can be capped (retrieval) vs. where precision is won (rerank), and the training–serving mismatch a two-stage system introduces.
  • Part 6, point-in-time correctness + one transform: Name the core responsibilities (shared definitions, point-in-time-correct offline joins to prevent leakage, low-latency online reads, freshness/lineage/access control). For consistency, focus on a single source of transformation logic, logged feature timestamps, and replaying online requests through the offline path to diff distributions and missing rates.
  • Part 7, routing + tiered model cache: Think of a router that selects a model by product / market / segment / experiment / partner, plus a tiered cache: hot models pinned in memory, cold models lazily loaded under LRU/LFU with pre-warming before traffic shifts. The core tradeoff is memory cost vs. cold-start tail latency.
  • Part 8, reproducibility surface + format tradeoffs: Reproducibility means versioning code + data snapshot + feature definitions + config + artifact + eval report, with isolated dependencies (containers/lockfiles) and staged CI/CD (offline → shadow → canary → A/B → rollback). For formats, compare on portability, runtime fidelity, latency, and safety, and remember that Pickle/Joblib are unsafe for untrusted artifacts.

Design a machine learning recommendation and ranking system for a consumer finance marketplace such as Credit Karma. The product shows each user a set of eligible financial offers (credit cards, personal loans, insurance products, etc.), and the business goal is to maximize revenue while preserving user experience, regulatory compliance, and long-term user trust.

This is a broad, panel-style design discussion. Work through the modeling strategy, the ranking objective, the system architecture, the data infrastructure, large-scale serving, and the operational lifecycle. Each Part below is a distinct discussion area; the interviewer will probe several of them in depth.

Constraints & Assumptions

  • Scale: hundreds of millions of users; thousands of distinct models (per product line, market, partner, segment, and experiment); the offer catalog ranges from a handful to many thousands depending on the surface.
  • Sparsity: conversion/approval positive rate is roughly 0.1%–0.5%.
  • Label delay: approval and revenue labels can land days to weeks after the originating impression.
  • Regulatory: consumer-finance offers are subject to eligibility, fair-lending, and disclosure constraints; the ranker must respect hard eligibility/compliance filters, not just soft penalties.
  • Latency: online ranking must return within a low-latency budget suitable for a page render (interactive product surface).
  • Objective: maximize expected revenue subject to user-experience, compliance, and long-term-trust guardrails — not raw short-term revenue.

Clarifying Questions to Ask

  • What is the exact business objective, and which guardrail metrics are hard constraints vs. soft penalties (e.g. approval-rate floor, fairness, satisfaction)?
  • What is the revenue model per product — pay-per-click, pay-per-approved-application, revenue share, or a mix — and does it vary by partner?
  • What is the end-to-end latency budget, and the size of the eligible candidate set on a typical surface?
  • How are conversions attributed (impression vs. click vs. application vs. approval time), and what is the realistic label-maturity window?
  • What compliance and fair-lending constraints must the ranking respect, and are any offers legally mandated to appear or be filtered?
  • What infrastructure already exists (feature store, model registry, serving framework, experimentation platform)?

Part 1 — Funnel modeling strategy

A conversion in this marketplace passes through several stages: a user must click an offer, apply for it, and then be approved by the lender. Revenue is realized only after approval (and sometimes only after the product is used).

Should you build a single model or separate models for the funnel stages (CTR, application rate, approval rate, expected value)? Discuss one model vs. per-stage models vs. a multi-task model, and what you would recommend for a large marketplace.

What This Part Should Cover

  • Label-to-stage reasoning: ties each stage (CTR / apply / approve / value) to its label source, arrival latency, and base rate before choosing an architecture.
  • Architecture tradeoff: articulates representation sharing (helps sparse downstream stages) vs. per-stage debuggability/ownership/calibration, rather than a generic "use a big model."
  • A concrete recommendation: lands on a defensible choice (e.g. shared embeddings + per-stage calibrated heads) with the reasoning why, not a list of options.
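
To make the hybrid option concrete, here is a minimal sketch of a shared representation with one head per funnel stage. It is a toy with random weights and no training loop; the stage names, layer sizes, and the class name `SharedBottomModel` are illustrative assumptions, not a prescribed design.

```python
import math
import random

STAGES = ("click", "apply", "approve")  # hypothetical stage names


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class SharedBottomModel:
    """Toy shared-bottom model: one shared layer, one logistic head per stage."""

    def __init__(self, n_features: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Shared layer: every stage reads the same hidden representation,
        # so sparse stages can borrow signal from dense ones.
        self.shared = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                       for _ in range(n_hidden)]
        # Per-stage heads: separate weights and bias, so each stage can be
        # debugged and calibrated on its own.
        self.heads = {s: ([rng.gauss(0.0, 0.1) for _ in range(n_hidden)], 0.0)
                      for s in STAGES}

    def represent(self, x):
        # ReLU over the shared linear layer.
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.shared]

    def predict(self, x):
        h = self.represent(x)
        return {s: sigmoid(sum(w * hi for w, hi in zip(ws, h)) + b)
                for s, (ws, b) in self.heads.items()}


model = SharedBottomModel(n_features=4, n_hidden=3)
probs = model.predict([1.0, 0.5, -0.2, 0.0])  # one probability per stage
```

Each head's output would still need its own calibration step before the probabilities are multiplied together downstream.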

Part 2 — Delayed and sparse conversion labels

Approval and revenue labels can arrive days or weeks after the impression, and the positive rate at the conversion/approval stage is extremely sparse (roughly 0.1%–0.5%).

How would you handle the delayed labels so you do not poison training with false negatives? How would you model the extreme class imbalance?

What This Part Should Cover

  • Censoring vs. negatives: recognizes that recent non-conversions are censored, and proposes a concrete mechanism (maturity windows, arrival curves, or a delayed-feedback/survival model) rather than treating "no label yet" as a 0.
  • Freshness vs. correctness: balances staying fresh (dense proxy labels) against waiting for downstream labels to mature.
  • Imbalance handling: names loss design and/or sampling and pairs it with imbalance-robust evaluation (PR-AUC, log loss, ECE), not accuracy.
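
The maturity-window idea can be sketched as a labeling function that refuses to call a recent impression negative. The 30-day window and the function name `training_label` are assumptions for illustration; a real window would be read off the observed label-arrival curve.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed value; derive it from the observed label-arrival curve.
MATURITY_WINDOW = timedelta(days=30)


def training_label(impression_at: datetime,
                   approved_at: Optional[datetime],
                   now: datetime) -> Optional[int]:
    """Return 1 (positive), 0 (mature negative), or None (censored)."""
    if approved_at is not None and approved_at - impression_at <= MATURITY_WINDOW:
        return 1
    if now - impression_at >= MATURITY_WINDOW:
        return 0  # the window has closed without an approval
    return None  # too recent to call negative: exclude or model as censored
```

Rows that come back as `None` are the ones a delayed-feedback or survival model would handle, or that a fresher proxy label (click, application start) would cover in the meantime.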

Part 3 — Negative sampling correction

To make training tractable you subsample negatives. This shifts the class prior between training and serving.

If negatives are kept with sampling probability $q$, how should predicted probabilities be corrected at inference time so they are calibrated against the true distribution?

What This Part Should Cover

  • Direction of the bias: states that subsampling negatives inflates predicted positives and must be corrected, not ignored.
  • A correct mechanism: gives the analytical odds/logit adjustment in terms of $q$ (or equivalent importance weighting) and/or a post-hoc calibrator fit on a true-prior holdout.
  • Why it matters here: connects calibration back to the multiplied stage-probability objective (uncalibrated heads break the product).
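
The analytical correction can be worked out directly. Keeping all positives and a fraction $q$ of negatives multiplies the positive-to-negative odds by $1/q$, so if $p'$ is the probability learned on the subsampled data, the true odds are $q \cdot p'/(1-p')$. That gives $p = q\,p' / (q\,p' + 1 - p')$, or equivalently $\text{logit}(p) = \text{logit}(p') + \ln q$. A short sketch follows; the function names are illustrative, and it assumes positives are never dropped.

```python
import math


def correct_probability(p_sampled: float, q: float) -> float:
    """Map a probability learned on subsampled negatives back to the true prior.

    q is the probability that a negative was kept (0 < q <= 1).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return q * p_sampled / (q * p_sampled + (1.0 - p_sampled))


def correct_logit(logit_sampled: float, q: float) -> float:
    """Equivalent correction in logit space: shift the bias by log(q)."""
    return logit_sampled + math.log(q)
```

Since $\ln q \le 0$, the correction always pulls predictions down, which matches the direction of the bias: subsampling negatives inflates predicted positives.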

Part 4 — Combining objectives into one ranking score

You now have several signals — CTR, conversion/application rate, approval probability, expected value, plus user-experience and compliance considerations.

How do you combine these into a single final ranking objective? What does the objective function look like, and how do you keep it from over-optimizing short-term revenue?

What This Part Should Cover

  • A written objective: an expected-value-with-guardrails form (product of calibrated stage probabilities × margin, minus weighted penalty terms), not a vague "blend the scores."
  • Hard vs. soft constraints: enforces eligibility/compliance as hard filters and treats user fit / fatigue / fairness / long-term value as penalties.
  • Anti-myopia mechanism: explains how guardrails and online A/B-validated weights stop short-term revenue from cannibalizing approval rate and trust.
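
One way to write the objective down is as expected value minus weighted penalties, applied only to offers that pass the hard filters. The sketch below assumes three calibrated stage probabilities; the `Offer` fields, penalty names, and weights are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Offer:
    offer_id: str
    eligible: bool               # result of hard eligibility/compliance filters
    p_click: float               # calibrated P(click | impression)
    p_apply: float               # calibrated P(apply | click)
    p_approve: float             # calibrated P(approve | apply)
    margin: float                # revenue realized on approval
    penalties: Dict[str, float]  # e.g. {"fatigue": 0.2, "rejection_risk": 0.5}


# Hypothetical penalty weights; in practice tuned and validated with online A/B tests.
WEIGHTS = {"fatigue": 1.0, "rejection_risk": 2.0}


def score(offer: Offer) -> float:
    expected_value = offer.p_click * offer.p_apply * offer.p_approve * offer.margin
    penalty = sum(WEIGHTS.get(name, 0.0) * value
                  for name, value in offer.penalties.items())
    return expected_value - penalty


def rank(offers: List[Offer]) -> List[Offer]:
    # Hard filter first: an ineligible offer is never ranked, whatever its value.
    eligible = [o for o in offers if o.eligible]
    return sorted(eligible, key=score, reverse=True)
```

Keeping the eligibility check outside `score` means no penalty weight, however it is tuned, can ever trade compliance for revenue.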

Part 5 — Single-stage vs. two-stage architecture

Should the system retrieve candidates first and then rerank, or score every eligible item directly?

Discuss the tradeoffs of two-stage (retrieval + reranking) vs. single-stage scoring, and give a concrete decision rule for choosing between them.

What This Part Should Cover

  • Both directions of the tradeoff: two-stage cuts latency/cost and enforces eligibility early, but caps recall and adds a retrieval/rerank training–serving mismatch.
  • A concrete decision rule: ties the choice to candidate count × per-item latency vs. budget (plus personalization/model cost), not a stylistic preference.
  • Recall vs. precision framing: recall is set in retrieval, precision is won in rerank.
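
The decision rule can be sketched as a back-of-the-envelope latency estimate. The cost model here (fixed overhead plus per-item cost divided by a `parallelism` factor) is a simplifying assumption, and all parameter names are hypothetical.

```python
def estimated_scoring_ms(n_candidates: int, per_item_ms: float,
                         parallelism: int, overhead_ms: float) -> float:
    """Rough latency of scoring every candidate with the full ranker."""
    return overhead_ms + n_candidates * per_item_ms / parallelism


def choose_architecture(n_candidates: int, per_item_ms: float,
                        parallelism: int, overhead_ms: float,
                        budget_ms: float) -> str:
    """Single-stage if full scoring fits the budget, otherwise two-stage."""
    fits = estimated_scoring_ms(n_candidates, per_item_ms,
                                parallelism, overhead_ms) <= budget_ms
    return "single-stage" if fits else "two-stage"


def max_rerank_candidates(per_item_ms: float, parallelism: int,
                          overhead_ms: float, budget_ms: float) -> int:
    """Largest candidate count the reranker can score within the budget."""
    if budget_ms <= overhead_ms:
        return 0
    return int((budget_ms - overhead_ms) * parallelism / per_item_ms)
```

When the answer is "two-stage", `max_rerank_candidates` gives an upper bound on how many items retrieval may pass to the reranker, which is where recall gets capped.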

Part 6 — Feature store and offline/online consistency

What should the feature store provide, and how do you guarantee that the features used in training match those served online (avoiding training–serving skew)?

What This Part Should Cover

  • Core responsibilities: shared definitions, point-in-time-correct offline joins (anti-leakage), low-latency online reads, freshness/lineage/access control.
  • A concrete anti-skew mechanism: a single source of transform logic and/or logging served features for reuse in training, plus replay-and-diff to detect drift.
  • Leakage awareness: explains why point-in-time correctness matters (training must not see values unavailable at inference).
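
Point-in-time correctness comes down to an "as of" lookup: for each training example, use the latest feature value that was already known at the impression time. A minimal standard-library sketch, assuming integer timestamps sorted in ascending order (the function name `as_of` is illustrative):

```python
from bisect import bisect_right
from typing import List, Optional


def as_of(timestamps: List[int], values: List[float], ts: int) -> Optional[float]:
    """Latest feature value written at or before ts, or None if none existed yet.

    timestamps must be sorted ascending and aligned index-by-index with values.
    """
    idx = bisect_right(timestamps, ts)  # number of writes with timestamp <= ts
    return values[idx - 1] if idx > 0 else None


# Hypothetical feature history for one user.
write_times = [100, 200, 300]
utilization = [0.10, 0.20, 0.30]

value_at_impression = as_of(write_times, utilization, 250)
```

A join that simply took the most recent value (0.30 here) would leak information written after the impression at time 250.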

Part 7 — Serving hundreds of millions of users and thousands of models

Design the serving path for hundreds of millions of users and thousands of models. How do you route a request to the correct model version, manage hot vs. cold models, and decide what stays resident in memory vs. loaded on demand?

What This Part Should Cover

  • Routing logic: selects a model version by product / market / segment / experiment arm / partner.
  • Tiered caching: hot models pinned, cold models lazily loaded under LRU/LFU, with pre-warming before traffic shifts.
  • The core tradeoff named explicitly: memory cost vs. cold-start (p99) tail latency, with a sensible default (tiered hot/warm/cold).
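
The routing key and the tiered cache can be sketched together. This is a single-process illustration under stated assumptions: `loader` stands in for whatever fetches and deserializes a model, the routing key is a plain tuple, and the class name `TieredModelCache` is hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Iterable


def route(product: str, market: str, segment: str,
          experiment_arm: str, partner: str) -> tuple:
    """Routing key used to pick a model version."""
    return (product, market, segment, experiment_arm, partner)


class TieredModelCache:
    def __init__(self, loader: Callable[[Hashable], Any],
                 pinned: Iterable[Hashable], capacity: int) -> None:
        self._loader = loader
        # Hot tier: loaded up front (pre-warmed) and never evicted.
        self._pinned = {key: loader(key) for key in pinned}
        # Warm tier: bounded LRU of lazily loaded models.
        self._lru: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._capacity = capacity

    def get(self, key: Hashable) -> Any:
        if key in self._pinned:
            return self._pinned[key]
        if key in self._lru:
            self._lru.move_to_end(key)  # mark as most recently used
            return self._lru[key]
        model = self._loader(key)  # cold start: this is the tail-latency cost
        self._lru[key] = model
        if len(self._lru) > self._capacity:
            self._lru.popitem(last=False)  # evict the least recently used
        return model


loads = []  # records every cold load, for illustration


def fake_loader(key):
    loads.append(key)
    return f"model:{key}"


cache = TieredModelCache(fake_loader, pinned=["hot"], capacity=2)
for key in ["hot", "a", "b", "a", "c", "a", "b"]:
    cache.get(key)
```

Raising `capacity` or pinning more keys trades memory for fewer cold loads, which is the memory-versus-tail-latency tradeoff in miniature.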

Part 8 — MLOps, versioning, and serialization formats

With thousands of models trained per year, how do you manage versioning, reproducibility, dependency isolation, and deployment pipelines? And how would you reason about serialization formats — native framework formats, TorchScript, ONNX, Pickle, Joblib?

What This Part Should Cover

  • Reproducibility surface: versions code + data snapshot + feature definitions + config + artifact + eval report, with isolated dependencies.
  • Staged delivery: an offline → shadow → canary → A/B → rollback pipeline with automatic gates.
  • Format tradeoffs: compares native / TorchScript / ONNX / Pickle / Joblib on portability, runtime fidelity, latency, and safety (Pickle/Joblib are unsafe to load from untrusted sources).
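
The reproducibility surface can be captured as a manifest whose fingerprint changes whenever any versioned input changes. The field names and the example values below are hypothetical; only the standard-library calls are real.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelManifest:
    code_commit: str
    data_snapshot: str
    feature_definitions: str
    config_hash: str
    artifact_sha256: str
    eval_report: str

    def fingerprint(self) -> str:
        """Stable hash over every field; identical inputs give identical output."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values for illustration only.
manifest = ModelManifest(
    code_commit="abc123",
    data_snapshot="snapshot-001",
    feature_definitions="features-v1",
    config_hash="cfg-001",
    artifact_sha256="0" * 64,
    eval_report="eval-001",
)
model_version = manifest.fingerprint()
```

Using the fingerprint as the registry key means two models with the same version string are guaranteed to share code, data, features, config, artifact, and evaluation.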

What a Strong Answer Covers

These dimensions span all parts; a strong candidate exhibits them throughout rather than in any single part.

  • Calibration as a through-line: treats calibration as first-class because stage probabilities are multiplied into the final score — it recurs in Parts 1, 3, and 4.
  • Coherence across the stack: the modeling choices (Parts 1–4), the architecture (Parts 5, 7), and the infra (Part 6) tell one consistent story, not eight disconnected mini-answers.
  • Regulatory discipline: consistently enforces eligibility/fair-lending as hard constraints upstream, not as soft scoring penalties.
  • Tradeoff fluency: names tradeoffs and failure modes rather than asserting a single "correct" design.

Follow-up Questions

  • A multi-task model's loss decreases, then suddenly becomes NaN or unstable after several hundred steps. What are the likely causes, and how would you debug it?
  • The model looks strong offline but performs poorly online. How do you investigate the gap, and where do you look first for training–serving skew?
  • If online feature distributions drift significantly from the training distribution, which components and signals do you inspect, and how do you detect it automatically?
  • Walk through how you would run a safe online experiment for a new ranking model — exposure, guardrail metrics, ramp-up, and rollback criteria — given the multi-week label delay.
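
For the drift follow-up, one common automatic signal is the population stability index (PSI) between a training sample and a live sample of the same feature. PSI is not named in the question; it is brought in here as one possible choice, and the bucket edges and alerting threshold are left as team decisions.

```python
import math
from typing import List, Sequence


def bucket_fractions(values: Sequence[float], edges: Sequence[float],
                     eps: float = 1e-6) -> List[float]:
    """Share of values per bucket; edges are sorted upper bounds of each bucket."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    # Floor at eps so empty buckets do not produce log(0).
    return [max(c / len(values), eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index: 0 when the bucketed distributions match."""
    e = bucket_fractions(expected, edges)
    a = bucket_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(100)]
live = [min(1.0, v + 0.3) for v in train]  # synthetic upward shift
edges = [0.25, 0.5, 0.75]

no_drift = psi(train, train, edges)
drift = psi(train, live, edges)
```

Running this per feature on a schedule, alongside missing-value rates, gives an automated first look at which inputs have moved away from the training distribution.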
