Ranking, Recommendation, And Feedback Systems

What's being tested

Interviewers are probing whether you can design a ranking system as a production software system: request flow, candidate retrieval, scoring, feedback capture, latency budgets, reliability, and safe degradation. For OpenAI, this matters because many user-facing experiences involve choosing the best response, tool, suggestion, memory, document, or next action under tight latency and quality constraints. A strong Software Engineer answer treats the model as one component in a larger distributed system, not as magic: you should define interfaces, data contracts, serving paths, fallbacks, observability, and failure modes. The key signal is whether you can balance relevance, latency, freshness, cost, and safety without over-indexing on ML details outside the SWE lane.

Core knowledge

Two-stage ranking is the default architecture for large-scale recommendation: first retrieve hundreds or thousands of candidates cheaply, then rerank that subset with a more expensive scorer. If the corpus is $N = 10^8$, scoring every item on each request is usually infeasible within an online latency budget; use retrieval to reduce to $K \approx 100$ to $1000$ candidates before final ranking.
Candidate generation can use lexical search, rules, collaborative filtering, embeddings, or business constraints. For a SWE design, focus on service contracts: CandidateService.getCandidates(user_id, context, limit) should return IDs plus lightweight metadata, with timeout behavior and deduplication guarantees.
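The service contract above can be sketched as a typed interface plus a merge step that enforces the deduplication guarantee. This is a minimal sketch: the `Candidate` fields, the `merge_sources` helper, and the choice to keep the highest retrieval score per ID are illustrative assumptions, and it treats scores from different sources as comparable, which a real system would need to calibrate.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    source: str             # hypothetical: which generator produced the candidate
    retrieval_score: float  # hypothetical: cheap first-stage score


class CandidateService(Protocol):
    """Contract only; timeout handling is left to the caller's deadline."""

    def get_candidates(self, user_id: str, context: dict, limit: int) -> list[Candidate]:
        ...


def merge_sources(batches: list[list[Candidate]], limit: int) -> list[Candidate]:
    """Dedupe by candidate_id, keeping the highest-scoring copy of each ID."""
    best: dict[str, Candidate] = {}
    for batch in batches:
        for cand in batch:
            current = best.get(cand.candidate_id)
            if current is None or cand.retrieval_score > current.retrieval_score:
                best[cand.candidate_id] = cand
    # Deterministic order: score descending, then ID ascending.
    ordered = sorted(best.values(), key=lambda c: (-c.retrieval_score, c.candidate_id))
    return ordered[:limit]
```

Returning only IDs and lightweight metadata keeps the payload small; heavier features are fetched later, once the candidate pool has been capped.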
Approximate nearest neighbor (ANN) search is common for embedding retrieval. Libraries such as FAISS and ScaNN, graph indexes such as HNSW, and vector databases trade some recall for lower latency. A typical latency target might require returning 500 candidates in under 50 ms; exact brute-force search over millions of vectors often misses that budget, which is why an index is needed.
Top-K selection matters when merging large candidate pools. Use a min-heap of size $K$ for streaming candidates in $O(N \log K)$ time and $O(K)$ space, or quickselect for average $O(N)$ if the full list is in memory. Always define deterministic tie-breaking, such as (score DESC, recency DESC, id ASC).
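A bounded heap with the tie-breaking rule above might look like the following. It is a sketch under assumptions: the `Scored` record and its field names are hypothetical, and `recency` is taken to be a number where larger means newer. Because Python's `heapq` is a min-heap, the entry wrapper reverses comparisons so that the root is always the worst item currently kept.

```python
import heapq
from typing import Iterable, NamedTuple


class Scored(NamedTuple):
    id: str
    score: float
    recency: int  # assumed: larger means newer


def sort_key(c: Scored) -> tuple:
    # Ascending order on this key equals (score DESC, recency DESC, id ASC).
    return (-c.score, -c.recency, c.id)


class _Entry:
    """Heap entry with reversed ordering so heap[0] is the worst kept item."""
    __slots__ = ("key", "item")

    def __init__(self, item: Scored):
        self.key = sort_key(item)
        self.item = item

    def __lt__(self, other: "_Entry") -> bool:
        return self.key > other.key


def top_k(stream: Iterable[Scored], k: int) -> list[Scored]:
    """Keep the best k items from a stream in O(N log K) time and O(K) space."""
    if k <= 0:
        return []
    heap: list[_Entry] = []
    for item in stream:
        entry = _Entry(item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry.key < heap[0].key:  # strictly better than the worst kept item
            heapq.heapreplace(heap, entry)
    return [e.item for e in sorted(heap, key=lambda e: e.key)]
```

Including the ID as the last key component makes every key unique, so the output does not depend on arrival order.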
Feature hydration is often the hidden bottleneck. The online path should avoid fanout to many stores per candidate; batch fetch with getMany(ids) and cache stable fields. If you rerank 500 candidates and each one triggers its own remote call, p99 latency degrades sharply from tail amplification, because the response is only as fast as the slowest of those calls.
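One way to picture batched hydration is a helper that issues a single backend call for all cache misses. This is a sketch, not a real client: `hydrate`, the in-process `dict` cache, and the `get_many` callable are stand-ins for whatever store and cache the system actually uses.

```python
from typing import Callable


def hydrate(
    ids: list[str],
    cache: dict[str, dict],
    get_many: Callable[[list[str]], dict[str, dict]],
) -> dict[str, dict]:
    """Return features for ids, using one batched call for all cache misses."""
    unique_ids = list(dict.fromkeys(ids))  # dedupe while preserving order
    missing = [i for i in unique_ids if i not in cache]
    if missing:
        fetched = get_many(missing)  # one round trip instead of len(missing)
        cache.update(fetched)
    # IDs the backend did not return are omitted rather than raising.
    return {i: cache[i] for i in unique_ids if i in cache}
```

Omitting unknown IDs instead of failing lets the ranker drop or down-rank candidates with missing features rather than failing the whole request.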
Ranking score composition should be explainable at the system level even if the model is opaque. A practical score may combine model output and constraints:
$\text{score} = \text{modelScore} + \lambda_1 \cdot \text{freshnessBoost} - \lambda_2 \cdot \text{safetyPenalty} - \lambda_3 \cdot \text{repetitionPenalty}$
The SWE should describe where these terms are computed, versioned, and logged.
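The composition above can be made loggable by returning the per-term breakdown alongside the final score. In this sketch the `ScoreWeights` class, its field names, and the `weights_version` tag are hypothetical; the three weights correspond to $\lambda_1$, $\lambda_2$, and $\lambda_3$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreWeights:
    version: str       # hypothetical tag so logged scores can be traced to a config
    freshness: float   # lambda_1
    safety: float      # lambda_2
    repetition: float  # lambda_3


def compose_score(
    model_score: float,
    freshness_boost: float,
    safety_penalty: float,
    repetition_penalty: float,
    w: ScoreWeights,
) -> dict:
    """Combine the model output with weighted adjustments and keep each term."""
    terms = {
        "model": model_score,
        "freshness": w.freshness * freshness_boost,
        "safety": -w.safety * safety_penalty,
        "repetition": -w.repetition * repetition_penalty,
    }
    return {
        "score": sum(terms.values()),
        "terms": terms,                 # log this to explain a ranking decision
        "weights_version": w.version,
    }
```

Logging the individual terms with a weights version makes it possible to answer "why did this item rank here?" without re-running the model.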
Feedback loops require clean event semantics. Capture impressions, clicks, hides, dwell time, thumbs up/down, edits, regenerations, and accepted suggestions with request IDs. The important SWE point is attribution: feedback is only useful if it links user_id, request_id, candidate_id, rank_position, model_version, and timestamp.
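The attribution fields listed above map naturally onto a small event schema. The following is a sketch only: the `FeedbackEvent` and `EventType` names, the subset of event types, the 0-based rank convention, and millisecond timestamps are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class EventType(str, Enum):
    IMPRESSION = "impression"
    CLICK = "click"
    HIDE = "hide"
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"
    REGENERATE = "regenerate"


@dataclass(frozen=True)
class FeedbackEvent:
    event_type: EventType
    user_id: str
    request_id: str
    candidate_id: str
    rank_position: int  # assumed convention: 0-based position shown to the user
    model_version: str
    timestamp_ms: int

    def __post_init__(self) -> None:
        # Reject events that could not be attributed to a ranking decision later.
        if not self.request_id or not self.candidate_id:
            raise ValueError("request_id and candidate_id are required")
        if self.rank_position < 0:
            raise ValueError("rank_position must be >= 0")

    def to_json(self) -> str:
        record = asdict(self)
        record["event_type"] = self.event_type.value
        return json.dumps(record, sort_keys=True)
```

Validating at construction time pushes malformed events back to the producer instead of letting them silently pollute downstream logs.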
Position bias affects interpretation of feedback. Items ranked first naturally get more clicks, so click-through is not pure relevance. You do not need to derive causal estimators, but you should mention that logs must include rank position and exposure so downstream evaluation can correct for bias.
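A small illustration of why exposure logging matters: click-through rate per rank position can only be computed if impressions are logged with their positions. The sketch below assumes events are plain dictionaries with hypothetical `event_type` and `rank_position` keys; it reports the raw per-position rate and does not attempt any bias correction.

```python
from collections import Counter


def ctr_by_position(events: list[dict]) -> dict[int, float]:
    """Clicks divided by impressions for each rank position that was shown."""
    impressions: Counter = Counter()
    clicks: Counter = Counter()
    for event in events:
        if event["event_type"] == "impression":
            impressions[event["rank_position"]] += 1
        elif event["event_type"] == "click":
            clicks[event["rank_position"]] += 1
    # Only positions with at least one impression have a defined rate.
    return {pos: clicks[pos] / n for pos, n in sorted(impressions.items())}
```

Without the impression rows there is no denominator, so a high click count at rank 1 cannot be distinguished from rank 1 simply being shown more often.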
Online serving architecture usually separates API Gateway, RankingService, CandidateService, FeatureService, ModelScoringService, and LoggingService. The ranking service orchestrates deadlines, retries, partial results, dedupe, policy filters, and final ordering.
Latency budgeting should be explicit. For a 200ms p95 endpoint, you might budget 20ms auth/context, 50ms candidate generation, 40ms feature fetch, 60ms scoring, 20ms post-processing, and 10ms buffer. Parallelize independent calls, use deadlines, and return fallbacks instead of timing out.
Fallback behavior is part of correctness. If model scoring fails, serve cached popular items, recent items, lexical matches, or rule-ranked candidates. If safety filtering is unavailable, fail closed: return fewer candidates rather than risk unsafe content. Define degraded responses intentionally instead of letting exceptions shape UX.
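Deadlines and fallbacks can be combined in one orchestration step. The sketch below uses `asyncio.wait_for` to enforce a scoring deadline and falls back to a caller-supplied ordering on timeout or connection failure. The function name, the string candidate IDs, the `score_fn` and `fallback_order` callables, and the returned `"model"` / `"fallback"` label are all illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def rank_with_deadline(
    candidates: list[str],
    score_fn: Callable[[list[str]], Awaitable[dict[str, float]]],
    timeout_s: float,
    fallback_order: Callable[[list[str]], list[str]],
) -> tuple[list[str], str]:
    """Rank by model score within a deadline, else return a degraded ordering."""
    try:
        scores = await asyncio.wait_for(score_fn(candidates), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Degraded path is chosen deliberately, not left to an unhandled exception.
        return fallback_order(candidates), "fallback"
    # Score descending, then ID ascending for deterministic ties.
    ranked = sorted(candidates, key=lambda c: (-scores[c], c))
    return ranked, "model"
```

Returning which path served the request lets the caller log it, so the fallback rate shows up in dashboards instead of hiding behind successful responses.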
Observability should cover both system and ranking health. Track p50, p95, p99, timeout rate, empty-result rate, candidate-source contribution, score distribution, cache hit rate, feedback logging success, and model/version mix. Add request tracing so one bad dependency is visible in a single slow ranking call.

Worked example

For “Design a response-ranking ML system,” a strong candidate starts by clarifying scope: “Are we ranking multiple candidate assistant responses for a single user prompt, or ranking response templates/content items across sessions? What is the latency target, how many candidates arrive per request, and are safety filters mandatory before or after ranking?” Then they declare assumptions: perhaps 5–50 generated responses per prompt, an additional 300 ms budget for ranking, and hard exclusion of policy-violating responses.

The answer can be organized into four pillars: candidate intake, feature/scoring path, post-processing, and feedback/observability. Candidate intake defines a stable schema like response_id, prompt_id, generator_version, text metadata, safety flags, and generation cost. Feature/scoring describes a RankingService that batches candidates into one call to a ModelScoringService, adds lightweight deterministic boosts or penalties, and sorts with deterministic tie-breaking.

Post-processing handles deduplication, diversity, safety gates, and graceful degradation if scoring is unavailable. Feedback capture logs impressions, selected response, user edits, regeneration, thumbs up/down, and conversation continuation with request_id and rank_position. One explicit tradeoff to flag is whether to score synchronously or asynchronously: synchronous scoring improves quality but adds latency, while asynchronous reranking may require showing an initial response and updating later, which can be jarring.
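The deduplication and safety-gate steps can be sketched as a single pass over the ranked list. This is illustrative only: responses are assumed to be dictionaries with a `text` field, duplicates are detected by a naive whitespace and case normalization, and `is_safe` is a hypothetical checker. Any checker error or non-`True` result excludes the response, which is the fail-closed behavior.

```python
from typing import Callable, Optional


def post_process(
    ranked: list[dict],
    is_safe: Callable[[dict], Optional[bool]],
) -> list[dict]:
    """Drop near-identical texts and anything not positively marked safe."""
    seen: set[str] = set()
    kept: list[dict] = []
    for resp in ranked:
        normalized = " ".join(resp["text"].lower().split())
        if normalized in seen:
            continue  # a higher-ranked copy was already considered
        seen.add(normalized)
        try:
            verdict = is_safe(resp)
        except Exception:
            verdict = False  # fail closed when the checker itself errors
        if verdict is True:
            kept.append(resp)
    return kept
```

Because the input is already ordered, the first occurrence of each text is the highest-ranked one, so deduplication never demotes a better response.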

A good close is: “If I had more time, I’d discuss offline replay for regression testing, shadow deployments for new rankers, and dashboards that correlate ranking version with latency, empty responses, and user feedback.”

A second angle

For “Design an End-to-End ML System,” the same ideas apply, but the framing is broader: you are designing the full recommendation service rather than only ranking candidate responses. You would spend more time on system boundaries: request API, candidate sources, feature access, model serving, logging, monitoring, and deployment strategy. The item corpus may be much larger, so candidate generation and ANN retrieval matter more than in response ranking, where the candidate set may be small. The transferable core is the same: reduce a huge option space to a manageable candidate set, score it under a latency budget, apply business and safety constraints, log exposures and feedback, and operate the system reliably.

Common pitfalls

Pitfall: Jumping straight into “train a neural network to predict clicks” without designing the serving path.

That answer misses what a Software Engineer is being evaluated on. A better answer first defines APIs, services, request flow, data contracts, ranking stages, timeouts, and fallbacks; the model can be treated as a versioned scoring dependency behind an interface.

Pitfall: Ignoring exposure and attribution in feedback logging.

Logging only “user clicked item X” is insufficient because downstream systems cannot know what else was shown, at what rank, under which model version. A stronger design logs impressions and outcomes together with request_id, rank_position, candidate_id, model_version, and relevant context.

Pitfall: Designing an ideal ranking algorithm with no latency, cost, or failure tradeoffs.

Interviewers expect you to constrain the design with real production budgets. State a target like p95 < 200ms, batch feature fetches, cap candidates before expensive scoring, use caches for stable metadata, and define degraded ranking when dependencies are slow.

Connections

Interviewers may pivot from this topic into feature stores, model serving, A/B testing, vector search, distributed caching, or observability for ML-backed services. Be ready to discuss how ranking interacts with Redis caching, ANN indexes, request tracing, deployment rollouts, and online/offline evaluation boundaries.
