How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Technical Screen rounds at Reddit.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Reddit during technical interviews.

Design Comment Prediction Ranking System | Reddit Interview Question

Q: Design Comment Prediction Ranking System

This question evaluates proficiency in end-to-end machine learning system design for predicting user engagement, specifically the probability of a user commenting given exposure, covering problem formulation, label and feature engineering, model architecture, training and evaluation pipelines, low-latency serving, and operational concerns like monitoring and cold-start. It falls under the ML System Design / Machine Learning Engineering domain and is commonly asked to assess an engineer's ability to build scalable, bias-aware, low-latency ranking signals that integrate into multi-objective recommenders, with a primarily practical system-level focus rather than purely theoretical modeling.

Design an end-to-end machine learning system that powers the following prediction API:

will_user_comment_on_posts(user_id, post_ids) -> scores

Input

One user_id.
A list of post_ids — potentially around 1,000 candidate posts produced by retrieval for a single home-feed request.

Output

One floating-point score per post_id, representing the model's belief that the user would comment on that post if they were exposed to it.
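The input/output contract above can be sketched as a typed Python function. This is only an illustration: string IDs, the CommentModel interface, the ConstantPriorModel placeholder, and the extra model argument are assumptions made for the sketch, not part of the question.

```python
from typing import Dict, List, Protocol


class CommentModel(Protocol):
    """Hypothetical model interface: one probability per (user, post) pair."""

    def predict(self, user_id: str, post_ids: List[str]) -> List[float]: ...


class ConstantPriorModel:
    """Placeholder model returning a fixed prior comment rate (assumed value)."""

    def __init__(self, prior: float = 0.01) -> None:
        self.prior = prior

    def predict(self, user_id: str, post_ids: List[str]) -> List[float]:
        return [self.prior for _ in post_ids]


def will_user_comment_on_posts(
    user_id: str, post_ids: List[str], model: CommentModel
) -> Dict[str, float]:
    """Return one score per distinct post_id for a single user."""
    # De-duplicate candidates while keeping their original order.
    unique_ids = list(dict.fromkeys(post_ids))
    scores = model.predict(user_id, unique_ids)
    if len(scores) != len(unique_ids):
        raise ValueError("model must return exactly one score per post_id")
    return dict(zip(unique_ids, scores))
```

Scoring all candidates in one batched predict call, rather than one call per post, is what makes a tens-of-milliseconds budget for roughly 1,000 posts plausible.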

Context

When a user opens Reddit, they see a personalized home feed. The feed-ranking system first retrieves a candidate set of posts, then orders them using several ranking signals. The output of will_user_comment_on_posts is one component of the overall ranking system — it is combined with other objectives (upvotes, dwell, freshness, diversity, safety) rather than used alone.

Design the full system: problem formulation, label generation, feature engineering, model architecture, training pipeline, offline and online evaluation, serving architecture, monitoring, and operational concerns (cold start, fallbacks, safety).

Constraints & Assumptions

Candidate set: a separate retrieval stage returns ~1,000 post IDs per request; you do not design retrieval here.
Latency: this is on the online ranking path, so a request scoring all candidates should complete in roughly tens of milliseconds (state your target and justify it).
Throughput: every active home-feed view triggers a scoring call over hundreds-to-thousands of candidates — assume a high, peaky request rate.
Output contract: a post_id → score map; ideally a calibrated probability or a stable monotonic engagement score, since it is later blended with other objectives.
Scope: the comment-prediction score is one signal in a multi-objective ranker; you are not designing the final blending policy, but you should note how your score plugs into it.
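To make the output contract concrete, here is a minimal sketch of how a calibrated comment score might plug into a downstream blend. The linear weighted sum, the signal names, and the weights are all hypothetical; the real blending policy is out of scope for this question.

```python
from typing import Dict


def blend_scores(
    signals: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Weighted sum of per-post signals (hypothetical blending policy).

    signals maps a signal name (e.g. "comment") to a post_id -> score map.
    A post missing from a signal contributes 0.0 for that signal.
    """
    post_ids = set()
    for per_post in signals.values():
        post_ids.update(per_post)
    return {
        pid: sum(
            weights.get(name, 0.0) * per_post.get(pid, 0.0)
            for name, per_post in signals.items()
        )
        for pid in post_ids
    }
```

Because the weights multiply the raw scores, a comment model whose scale shifts between versions would silently change the blend. That is one reason the contract asks for a calibrated probability or a stable monotonic score.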

Clarifying Questions to Ask

Is the output consumed as a calibrated probability, or only as a relative score for ordering? Does the downstream blender require calibration across model versions?
What is the attribution window for a "comment" (immediate, 1 hour, 24 hours)? Do replies, top-level comments, and removed/deleted comments all count?
Do we have exploration/randomized-ranking traffic available for unbiased label collection, or only logs from the production ranker?
What is the strict latency and availability SLA for the scoring service, and what is the acceptable fallback behavior if it is breached?
Are there safety/quality constraints — e.g., must we avoid up-ranking outrage-bait — and which guardrail metrics gate a launch?
What are the privacy/policy limits on user and author features (sensitive attributes, retention)?
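The attribution-window question determines how labels are built. A minimal sketch, assuming a 24-hour window, integer Unix timestamps in seconds, and in-memory tuples in place of real impression and comment logs (all illustrative choices):

```python
from typing import Dict, List, Tuple

Impression = Tuple[str, str, int]  # (user_id, post_id, impression_ts)
Comment = Tuple[str, str, int]     # (user_id, post_id, comment_ts)


def label_impressions(
    impressions: List[Impression],
    comments: List[Comment],
    window_s: int = 24 * 3600,  # assumed attribution window
) -> List[int]:
    """Label each impression 1 if the same user commented on the same post
    within window_s seconds after the impression, else 0."""
    comment_times: Dict[Tuple[str, str], List[int]] = {}
    for user_id, post_id, ts in comments:
        comment_times.setdefault((user_id, post_id), []).append(ts)

    labels = []
    for user_id, post_id, shown_ts in impressions:
        times = comment_times.get((user_id, post_id), [])
        positive = any(shown_ts <= t <= shown_ts + window_s for t in times)
        labels.append(1 if positive else 0)
    return labels
```

This sketch credits one comment to every impression inside the window, so repeat impressions of the same post can all become positives. A production pipeline would need an explicit rule here, for example crediting only the most recent impression before the comment.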

What a Strong Answer Covers

Problem framing: identifies the conditional-on-exposure target and why it differs from raw log probabilities; states the output contract (calibrated vs. relative).
Labeling: impression-level positives/negatives, attribution window, handling of repeat impressions, and filtering of bot/spam/removed content.
Bias handling: explicit treatment of exposure/position/selection bias (exploration logging, propensity weighting, position features).
Features: user, post, user×post interaction, and context features, with attention to point-in-time correctness / leakage.
Model: a baseline-to-production progression (GBDT/LR → two-tower or neural ranker), justified by the latency and scale constraints, including the two-stage pre-ranker/ranker split.
Training pipeline: log joins, leakage-free feature construction, time-based splits, quality gates, canary/A-B rollout.
Evaluation: ranking + calibration offline metrics, slice analysis, and online A/B with primary and guardrail metrics.
Serving: the read path, batching, feature stores (online/offline), caching strategy, and freshness tiers.
Operations: cold start (new user/post/community), monitoring & training-serving skew, fallbacks, and safety/privacy.
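As an illustration of the bias-handling point, propensity weighting can be sketched as a self-normalized, clipped inverse-propensity log loss. The function name, the clipping threshold, and the assumption that a logged exposure propensity exists for every impression are hypothetical.

```python
import math
from typing import Sequence


def ipw_log_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    propensities: Sequence[float],
    max_weight: float = 10.0,  # assumed clip to limit variance
    eps: float = 1e-12,
) -> float:
    """Log loss where each impression is weighted by 1 / exposure propensity.

    Weights are clipped at max_weight and the result is divided by the sum
    of weights (self-normalized estimator).
    """
    if not (len(probs) == len(labels) == len(propensities)):
        raise ValueError("inputs must have the same length")
    weighted_loss = 0.0
    weight_sum = 0.0
    for p, y, q in zip(probs, labels, propensities):
        if q <= 0.0:
            raise ValueError("propensity must be positive")
        w = min(1.0 / q, max_weight)
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weighted_loss += w * loss
        weight_sum += w
    return weighted_loss / weight_sum if weight_sum else 0.0
```

Impressions the production ranker rarely shows get larger weights, which pushes the training objective toward the conditional-on-exposure target. Clipping trades some bias for lower variance.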

Follow-up Questions

The score is blended with upvote, dwell, and safety objectives. How do you keep the comment score comparable (calibrated) across daily model retrains so the blend weights remain stable?
Optimizing comment rate can up-rank controversial or outrage-inducing content. How would you detect this and prevent the model from learning that shortcut?
Your training data comes almost entirely from what the current ranker chose to show. Concretely, how much exploration traffic do you need, and how do you bound its cost to user experience?
A popular post's comment-velocity features change minute-to-minute. How do you keep those features fresh online without creating training-serving skew against the batch-computed training features?
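For the first follow-up above, one possible guardrail is to compute a calibration metric on a fixed holdout after every retrain and block promotion when it regresses. A minimal sketch of expected calibration error (ECE), assuming equal-width bins; the function name and bin count are illustrative, and any launch threshold would have to be chosen empirically.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and the
    observed positive rate, over equal-width probability bins."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 (or slightly out-of-range p) stays in range.
        idx = max(0, min(int(p * n_bins), n_bins - 1))
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_p - rate)
    return ece
```

Comment probabilities are typically small, so with equal-width bins most impressions land in the lowest bin. Quantile-based bins may give a more informative signal for a rare event like commenting.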
