Design Comment Prediction Ranking System
Company: Reddit
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: Medium
Interview Round: Technical Screen
Design an end-to-end machine learning system that powers the following prediction API:
```
will_user_comment_on_posts(user_id, post_ids) -> scores
```
**Input**
- One `user_id`.
- A list of `post_ids` — potentially around 1,000 candidate posts produced by retrieval for a single home-feed request.
**Output**
- One floating-point score per `post_id`, representing the model's belief that the user *would comment* on that post **if they were exposed to it**.
**Context**
When a user opens Reddit, they see a personalized home feed. The feed-ranking system first retrieves a candidate set of posts, then orders them using several ranking signals. The output of `will_user_comment_on_posts` is **one component** of the overall ranking system — it is combined with other objectives (upvotes, dwell, freshness, diversity, safety) rather than used alone.
Design the full system: problem formulation, label generation, feature engineering, model architecture, training pipeline, offline and online evaluation, serving architecture, monitoring, and operational concerns (cold start, fallbacks, safety).
```hint Frame the prediction target precisely
The score is $P(\text{comment} \mid \text{exposure})$, not the raw probability that any (user, post) pair produces a comment. Logs contain only the posts the current ranker chose to show, so the training distribution is already biased by exposure. Think about what that means for both labels and evaluation.
```
```hint Selection / position bias
Higher-ranked posts get seen and commented on more, so naive impression labels teach the model to imitate the *existing* ranker. Consider exploration/randomization buckets, position-as-feature, and inverse-propensity weighting.
```
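As a minimal sketch of the inverse-propensity weighting mentioned in the hint above, the function below computes a clipped, self-normalized weighted log loss. It assumes each logged impression carries a `propensity` (the probability that the logging policy exposed that post, for example from a randomized bucket); the function name, the clipping threshold `max_weight`, and the self-normalization are illustrative choices, not something the question prescribes.

```python
import math


def ipw_log_loss(labels, preds, propensities, max_weight=10.0):
    """Clipped, self-normalized inverse-propensity-weighted log loss.

    labels:       0/1 comment labels, one per logged impression
    preds:        model probabilities of a comment given exposure
    propensities: assumed probability that the logging policy showed the post
    max_weight:   hypothetical cap on 1/propensity to limit variance
    """
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p, q in zip(labels, preds, propensities):
        # Rarely shown impressions get larger weights, up to the cap.
        w = min(1.0 / max(q, eps), max_weight)
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# With equal propensities the result reduces to the ordinary mean log loss.
loss = ipw_log_loss([1, 0], [0.8, 0.2], [0.5, 0.5])
```

Clipping trades a little bias for lower variance; whether that trade is acceptable depends on how much exploration traffic is available.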
```hint Latency at 1,000 candidates
You must score ~1,000 posts within a feed-request budget (tens of ms). Think two-stage scoring (cheap pre-ranker over all candidates, heavier model over the top-N), batched vectorized inference, and aggressive caching of post features / embeddings that are shared across users.
```
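The two-stage idea in the hint above could look roughly like the sketch below. `pre_ranker` and `ranker` are hypothetical batched callables (each takes a `user_id` and a list of post IDs and returns one score per post), and `top_n=200` is an arbitrary placeholder rather than a recommended value.

```python
import heapq


def score_two_stage(user_id, post_ids, pre_ranker, ranker, top_n=200):
    """Score every candidate cheaply, then re-score only the top_n with the heavy model.

    pre_ranker, ranker: hypothetical callables (user_id, post_ids) -> list of floats,
    each invoked once per request so inference stays batched.
    """
    # Stage 1: one batched call over all ~1,000 candidates.
    scores = dict(zip(post_ids, pre_ranker(user_id, post_ids)))

    # Stage 2: the heavier model only sees the most promising candidates.
    top = heapq.nlargest(top_n, post_ids, key=scores.__getitem__)
    scores.update(zip(top, ranker(user_id, top)))

    # Posts outside the top_n keep their pre-ranker score.
    return scores
```

Returning pre-ranker scores for the tail only makes sense if both models are calibrated to the same scale; otherwise the tail scores would need a separate mapping or an explicit marker for the downstream blender.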
### Constraints & Assumptions
- **Candidate set:** a separate retrieval stage returns ~1,000 post IDs per request; you do not design retrieval here.
- **Latency:** this is on the online ranking path, so a request scoring all candidates should complete in roughly tens of milliseconds (state your target and justify it).
- **Throughput:** every active home-feed view triggers a scoring call over hundreds-to-thousands of candidates — assume a high, peaky request rate.
- **Output contract:** a `post_id → score` map; ideally a *calibrated* probability or a stable monotonic engagement score, since it is later blended with other objectives.
- **Scope:** the comment-prediction score is one signal in a multi-objective ranker; you are not designing the final blending policy, but you should note how your score plugs into it.
### Clarifying Questions to Ask
- Is the output consumed as a calibrated probability, or only as a relative score for ordering? Does the downstream blender require calibration across model versions?
- What is the attribution window for a "comment" (immediate, 1 hour, 24 hours)? Do replies, top-level comments, and removed/deleted comments all count?
- Do we have exploration/randomized-ranking traffic available for unbiased label collection, or only logs from the production ranker?
- What is the strict latency and availability SLA for the scoring service, and what is the acceptable fallback behavior if it is breached?
- Are there safety/quality constraints — e.g., must we avoid up-ranking outrage-bait — and which guardrail metrics gate a launch?
- What are the privacy/policy limits on user and author features (sensitive attributes, retention)?
### What a Strong Answer Covers
- **Problem framing:** identifies the conditional-on-exposure target and why it differs from the raw comment rates observed in logs; states the output contract (calibrated vs. relative).
- **Labeling:** impression-level positives/negatives, attribution window, handling of repeat impressions, and filtering of bot/spam/removed content.
- **Bias handling:** explicit treatment of exposure/position/selection bias (exploration logging, propensity weighting, position features).
- **Features:** user, post, user×post interaction, and context features, with attention to point-in-time correctness / leakage.
- **Model:** a baseline-to-production progression (gradient-boosted trees or logistic regression → two-tower or neural ranker), justified by the latency and scale constraints, including the two-stage pre-ranker/ranker split.
- **Training pipeline:** log joins, leakage-free feature construction, time-based splits, quality gates, canary and A/B rollout.
- **Evaluation:** ranking + calibration offline metrics, slice analysis, and online A/B with primary and guardrail metrics.
- **Serving:** the read path, batching, feature stores (online/offline), caching strategy, and freshness tiers.
- **Operations:** cold start (new user/post/community), monitoring & training-serving skew, fallbacks, and safety/privacy.
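To make the labeling bullet concrete, here is a minimal sketch of impression-level label generation with an attribution window. The tuple layouts, the function name, and the 24-hour default are assumptions for illustration (the window is one of the open clarifying questions). Each comment is credited to the most recent impression of that post for that user within the window, which is one possible way to handle repeat impressions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_labels(impressions, comments, window=timedelta(hours=24)):
    """Join impression and comment logs into (user, post, shown_at, label) rows.

    impressions: list of (user_id, post_id, shown_at)      -- assumed log shape
    comments:    list of (user_id, post_id, commented_at)  -- assumed log shape
    window:      attribution window; 24 hours is a placeholder, not a given
    """
    shown = defaultdict(list)
    for user_id, post_id, shown_at in impressions:
        shown[(user_id, post_id)].append(shown_at)

    positives = set()
    for user_id, post_id, commented_at in comments:
        # Impressions that precede the comment and fall inside the window.
        eligible = [t for t in shown.get((user_id, post_id), ())
                    if t <= commented_at <= t + window]
        if eligible:
            # Credit only the latest eligible impression.
            positives.add((user_id, post_id, max(eligible)))

    return [(u, p, t, int((u, p, t) in positives)) for u, p, t in impressions]


rows = build_labels(
    impressions=[("u1", "p1", datetime(2024, 1, 1, 10)),
                 ("u1", "p1", datetime(2024, 1, 1, 12))],
    comments=[("u1", "p1", datetime(2024, 1, 1, 12, 30))],
)
# Only the 12:00 impression is labeled positive.
```

Bot, spam, and removed-content filtering would happen before this join, and the `shown_at` timestamp is what a time-based train/validation split would key on.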
### Follow-up Questions
- The score is blended with upvote, dwell, and safety objectives. How do you keep the comment score *comparable* (calibrated) across daily model retrains so the blend weights remain stable?
- Optimizing comment rate can up-rank controversial or outrage-inducing content. How would you detect this and prevent the model from learning that shortcut?
- Your training data comes almost entirely from what the current ranker chose to show. Concretely, how much exploration traffic do you need, and how do you bound its cost to user experience?
- A popular post's comment-velocity features change minute-to-minute. How do you keep those features fresh online without creating training-serving skew against the batch-computed training features?
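For the calibration follow-up, one check that could run on every retrain is expected calibration error (ECE) on a fixed holdout, so that successive model versions are compared on the same data. The sketch below uses equal-width bins for simplicity; the function name and `n_bins=10` are illustrative assumptions, and with comment rates that are typically low, quantile bins may be a better fit.

```python
def expected_calibration_error(labels, preds, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg prediction - observed rate| per bin."""
    # Each bin tracks [count, sum of predictions, sum of labels].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for y, p in zip(labels, preds):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += y

    n = len(labels)
    return sum(
        (count / n) * abs(pred_sum / count - label_sum / count)
        for count, pred_sum, label_sum in bins
        if count
    )


# A model that predicts 0.25 where one in four impressions gets a comment
# is perfectly calibrated on this toy sample.
ece = expected_calibration_error([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25])
```

A quality gate might block a retrain whose ECE on the shared holdout drifts beyond a chosen tolerance relative to the currently serving model.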
Quick Answer: This question evaluates proficiency in end-to-end machine learning system design for predicting user engagement, specifically the probability that a user comments on a post given exposure. It covers problem formulation, label and feature engineering, model architecture, training and evaluation pipelines, low-latency serving, and operational concerns such as monitoring and cold start. It belongs to the ML System Design / Machine Learning Engineering domain and is commonly asked to assess an engineer's ability to build scalable, bias-aware, low-latency ranking signals that integrate into multi-objective recommenders. The focus is practical and system-level rather than purely theoretical.