Design a Homepage Store Recommender
Company: DoorDash
Role: Data Scientist
Category: Machine Learning
Difficulty: Hard
Interview Round: Onsite
You are designing the homepage store recommendation system for a food-delivery app similar to DoorDash. When a user opens the app, the online request contains very little context: primarily `user_id` and the user's current latitude/longitude.
The system must return a ranked list of stores for the homepage feed under the following hard constraints:
- Every recommended store must be within the user's delivery range.
- Every recommended store must be currently open.
- The system serves high traffic, so online latency and reliability are critical.
- Assume each retrieval source has an aggressive timeout budget of about 15 ms.
Design the end-to-end ML system, and address the following:
1. **Overall architecture**
- How would you structure candidate retrieval, filtering, ranking, and serving?
- What are the main online and offline components?
2. **Candidate retrieval**
- How would you generate candidates given only `user_id` and location?
- What retrieval channels would you include (for example: nearby popular stores, user affinity, cuisine/category similarity, embedding-based retrieval, cold-start fallbacks)?
- How would you enforce the hard constraints on delivery eligibility and store open status?
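A strong answer to question 2 usually separates candidate generation from hard-constraint filtering: union the channels, dedupe, then drop anything closed or out of range. A minimal sketch (the store schema, field names, and haversine-based range check are illustrative assumptions, not a prescribed design):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def merge_and_filter(channels, user_lat, user_lon):
    """Union candidates from all retrieval channels, dedupe by store_id,
    then enforce the hard constraints: store is open AND the user sits
    inside the store's delivery radius."""
    seen, eligible = set(), []
    for channel in channels:
        for store in channel:
            sid = store["store_id"]
            if sid in seen:
                continue
            seen.add(sid)
            if not store["is_open"]:
                continue
            dist = haversine_km(user_lat, user_lon, store["lat"], store["lon"])
            if dist <= store["delivery_radius_km"]:
                eligible.append(store)
    return eligible
```

In practice the open/eligibility checks would hit precomputed indexes rather than per-store math, but the separation of concerns is the point: retrieval channels may be approximate, while this filter is exact.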
3. **Geospatial caching**
- How would you use a geospatial index such as Geohash, H3, or a grid system for caching or precomputing location-based candidate sets?
- What would the cache key look like?
- How would you handle cache invalidation when stores open/close or delivery eligibility changes?
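For question 3, one common pattern is to key the cache on a geospatial cell plus a coarse time bucket, so entries self-expire even when explicit invalidation is missed. The sketch below uses a simple lat/lon grid as a stand-in for Geohash or H3, and the cell size, TTL, and key layout are all illustrative assumptions:

```python
import time

CELL_DEG = 0.01  # ~1.1 km grid cell at the equator; illustrative stand-in for an H3 resolution

def cell_id(lat, lon, cell_deg=CELL_DEG):
    """Quantize a lat/lon into a fixed grid cell (stand-in for Geohash/H3)."""
    return (int(lat // cell_deg), int(lon // cell_deg))

def cache_key(lat, lon, ttl_s=300):
    """Cache key = grid cell + coarse time bucket. The time bucket bounds
    staleness: when a store opens/closes or changes delivery eligibility,
    the entry ages out within ttl_s even without explicit invalidation."""
    cy, cx = cell_id(lat, lon)
    bucket = int(time.time() // ttl_s)
    return f"homefeed:cell:{cy}:{cx}:t{bucket}"
```

An alternative to the time bucket is an explicit per-cell version counter bumped by store-status change events; the TTL then acts only as a safety net.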
4. **Extreme latency constraints**
- If each retrieval path must finish within about 15 ms, how would you optimize fan-out and parallel fetching?
- How would you degrade gracefully when one or more retrieval sources time out?
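For question 4, a standard answer is to fire every retrieval source concurrently under its own deadline, so total wall time is bounded by one 15 ms budget rather than the sum across sources, and a source that times out degrades to an empty contribution. A minimal asyncio sketch (source names and the exact budget are illustrative):

```python
import asyncio

async def fetch_with_timeout(name, coro, timeout_s=0.015):
    """Run one retrieval source under its own timeout budget; a timed-out
    or failed source contributes an empty list instead of failing the page."""
    try:
        return name, await asyncio.wait_for(coro, timeout=timeout_s)
    except Exception:
        return name, []

async def fan_out(sources, timeout_s=0.015):
    """Fire all retrieval sources in parallel. sources maps a channel name
    to an async callable; results come back as {name: candidate_list}."""
    tasks = [fetch_with_timeout(n, c(), timeout_s) for n, c in sources.items()]
    return dict(await asyncio.gather(*tasks))
```

Graceful degradation then becomes a policy question on top of this: as long as at least one cheap channel (for example, cached nearby-popular stores) survives, the homepage still renders.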
5. **Ranking and feature platform**
- How would you build the ranking layer?
- What objective would you optimize: click-through rate, order conversion, GMV, long-term retention, delivery quality, or some weighted combination?
- How would you avoid feedback loops, popularity bias, and over-optimization for short-term clicks?
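One way to frame the objective question in section 5 is a weighted blend of predicted engagement, conversion, and order value, with the weights tuned in experiments against long-term retention guardrails. The weights and signal names below are hypothetical, purely to make the tradeoff concrete:

```python
# Hypothetical blend weights; in practice these would be tuned via A/B tests
# against long-term guardrail metrics such as retention and delivery quality.
DEFAULT_WEIGHTS = {"ctr": 0.2, "conversion": 0.5, "gmv": 0.3}

def ranking_score(p_click, p_order, expected_gmv_norm, weights=DEFAULT_WEIGHTS):
    """Blend the per-store predictions into one ranking score.
    expected_gmv_norm is assumed to be normalized to [0, 1] so the three
    terms are on comparable scales."""
    return (weights["ctr"] * p_click
            + weights["conversion"] * p_order
            + weights["gmv"] * expected_gmv_norm)
```

Putting most of the weight on conversion rather than clicks is one lever against over-optimizing for short-term engagement; exploration traffic and position-debiased training labels address the feedback-loop and popularity-bias concerns.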
6. **Feature store design**
- Different feature types exist: dense embeddings, numeric features, and categorical features. How would you store them differently at the database layer?
- How would you key features using `user_id`, `store_id`, and possibly `user_id + store_id`?
- How would you support hourly offline refreshes while preserving high-concurrency, low-latency online reads?
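For question 6, a common keying scheme namespaces features by entity type and embeds a dataset version in the key, so hourly batch refreshes can write a new version and atomically flip a pointer without readers ever seeing a half-written refresh. The layout below is an illustrative sketch, not a prescribed schema:

```python
def feature_key(namespace, version, *entity_ids):
    """Feature-store key layout (illustrative). Entity keying:
      user features:        ("user",       v, user_id)
      store features:       ("store",      v, store_id)
      cross features:       ("user_store", v, user_id, store_id)
    Embedding the version in the key lets the hourly offline job write
    version v+1 in full, then flip a single "current version" pointer,
    keeping online reads consistent and lock-free."""
    return f"{namespace}:v{version}:" + ":".join(str(e) for e in entity_ids)
```

At the database layer, the key scheme is orthogonal to the value encoding: dense embeddings are typically stored as packed float arrays, numeric features as fixed-width values, and categoricals as small integer IDs with a separate vocabulary table.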
7. **Model iteration and experimentation**
- Suppose model version V2.0 adds several new features relative to V1.1. How should the infrastructure support multiple model versions at once?
- How would different A/B test treatments fetch different feature sets or feature configurations safely?
- What offline and online metrics would you use to evaluate the change?
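For question 7, treatment assignment is usually a deterministic hash of (experiment, user), and each arm carries its own feature configuration so V1.1 and V2.0 fetch exactly the feature sets they were trained on. The config contents and feature names below are hypothetical:

```python
import hashlib

# Hypothetical per-version feature configs; V2.0 adds features on top of V1.1.
MODEL_CONFIGS = {
    "v1.1": {"features": ["user_cuisine_affinity", "store_popularity"]},
    "v2.0": {"features": ["user_cuisine_affinity", "store_popularity",
                          "realtime_eta", "session_context"]},
}

def assign_treatment(user_id, experiment, treatments):
    """Deterministic hash bucketing: the same user always lands in the same
    arm for a given experiment, and each arm's model fetches only the
    feature set its version needs (via MODEL_CONFIGS)."""
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return treatments[h % len(treatments)]
```

Keeping feature lists in versioned configs, rather than hard-coded in serving code, is what makes it safe to run several model versions concurrently during an A/B test.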
8. **Real-time versus batch features**
- What are the tradeoffs between real-time features and offline batch-computed features in this system?
- What failure modes appear when you add real-time features under strict latency requirements, such as timeouts, missing values, training-serving skew, and stability issues?
- How would you decide which features must be real-time versus batch?
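A concrete way to handle the failure modes in section 8 is a layered resolver: try the real-time value under a tight sub-budget, fall back to the last batch snapshot, then to a neutral default, and record which path served the request so training-serving skew stays observable. The function shape and names here are an illustrative sketch:

```python
def resolve_feature(name, realtime_fetch, batch_snapshot, default, timeout_s=0.005):
    """Serve the real-time value when it arrives within timeout_s; otherwise
    fall back to the last batch snapshot, then to a neutral default.
    Returning the serving path alongside the value lets you log per-feature
    coverage and detect training-serving skew when real-time sources degrade."""
    try:
        value, source = realtime_fetch(timeout_s), "realtime"
    except TimeoutError:
        value, source = batch_snapshot.get(name), "batch"
    if value is None:
        value, source = default, "default"
    return value, source
```

The logged `source` distribution also answers the "which features must be real-time" question empirically: a feature whose batch fallback barely moves offline metrics is a poor candidate for a real-time path under a 15 ms budget.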
Your answer should include system architecture, storage choices, ML tradeoffs, experimentation strategy, and operational safeguards.
Quick Answer: This question evaluates system-level machine learning and recommender competencies, including candidate retrieval, filtering and ranking, feature-store design, geospatial caching, low-latency serving, and experimentation infrastructure.