Design a Low-Latency Store Recommender
Company: DoorDash
Role: Data Scientist
Category: Machine Learning
Difficulty: Hard
Interview Round: Onsite
You are designing the home-page store recommendation system for a food delivery app such as DoorDash.
A request contains very little context: primarily **user_id** and the user's **current latitude/longitude**. The system must return a ranked list of stores for the app home page.
## Hard constraints
- Recommended stores must be **within the deliverable area** for the user.
- Recommended stores must be **open at request time**.
- The system is latency-sensitive and powers the home page.
## Product goal
Design a recommendation system that maximizes long-term business value, such as orders or contribution profit, while balancing user engagement, relevance, freshness, and system latency. Discuss what primary metric you would optimize for and what guardrail metrics you would monitor.
## What to cover
1. **End-to-end architecture**
- Describe the online request flow from request intake to final ranked list.
- Explain how you would structure retrieval, pre-ranking, ranking, and post-processing.
- Discuss how you would handle cold-start users, sparse geographies, and new stores.
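A strong answer usually makes the retrieve → filter → pre-rank → rank → post-process funnel concrete. A minimal sketch of that flow is below; the scoring logic and field names (`popularity`, `distance_km`, `max_delivery_km`) are illustrative stand-ins, not a real model:

```python
def recommend(user_id, lat, lng, stores, top_k=20):
    """Toy end-to-end funnel: retrieval -> hard filters -> pre-rank -> rank."""
    # Retrieval + hard filters: in delivery range and open at request time.
    candidates = [
        s for s in stores
        if s["distance_km"] <= s["max_delivery_km"] and s["open_now"]
    ]
    # Pre-rank: a cheap heuristic cuts thousands of candidates to a shortlist.
    shortlist = sorted(candidates, key=lambda s: -s["popularity"])[:200]
    # Rank: a heavier score (stubbed as popularity minus a distance penalty).
    ranked = sorted(shortlist, key=lambda s: -(s["popularity"] - 0.1 * s["distance_km"]))
    # Post-process: diversity rules, business constraints, etc. would go here.
    return [s["store_id"] for s in ranked[:top_k]]
```

In a real system each stage is a separate service with its own latency budget; the point of the sketch is the shape of the funnel, not the scoring.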
2. **Retrieval design**
- Propose multiple candidate-generation strategies, given that the online inputs are limited.
- Explain how you would ensure all candidates satisfy delivery-range and open-now constraints.
- Discuss how you would merge, deduplicate, and budget candidates across retrieval channels.
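Merging across channels is often done with per-channel budgets and first-seen deduplication, so that no single channel crowds out the others. A small sketch (channel names and budget numbers are arbitrary; candidates are assumed to already satisfy the open-now and delivery-range filters):

```python
def merge_candidates(channels, budgets, total_cap=300):
    """Merge ranked candidate lists from several retrieval channels.

    channels: ordered dict of channel_name -> ranked list of store_ids
    budgets:  per-channel cap, so one channel cannot dominate the pool
    """
    seen, merged = set(), []
    for name, cands in channels.items():
        for store_id in cands[: budgets.get(name, 50)]:
            if store_id not in seen:       # first channel to surface a store wins
                seen.add(store_id)
                merged.append(store_id)
            if len(merged) >= total_cap:   # global budget for the ranker
                return merged
    return merged
```

A usage note: channel order encodes priority here; an alternative is round-robin interleaving, which trades priority for diversity.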
3. **Location-aware caching**
- How would you use a geospatial indexing scheme such as Geohash or H3 to support caching?
- Would you precompute popular stores per grid cell offline?
- What cache key, TTL, and invalidation strategy would you use, especially when store availability and open status change frequently?
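One common scheme is to key the cache on a fixed-precision grid cell rather than raw coordinates, so nearby users share entries. Production systems often use H3; the sketch below uses a minimal Geohash encoder (the standard interleaved-bit algorithm) purely to show how a cell becomes a cache key. The key prefix and precision choice are illustrative:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Standard Geohash: interleave lon/lat bisection bits, base32-encode."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    bits, even = [], True                       # encoding starts with longitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    return "".join(
        _BASE32[int("".join(map(str, bits[i : i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

def cache_key(lat, lon, precision=6):
    # Precision 6 cells are roughly 1.2 km x 0.6 km: coarse enough to share
    # across users, fine enough that delivery ranges stay mostly accurate.
    # Pair a short TTL (e.g. 60 s) with event-driven invalidation for
    # open/closed flips, since store status changes faster than popularity.
    return f"geo_popular:{geohash(lat, lon, precision)}"
```

The prefix property of Geohash (truncating the string coarsens the cell) also allows fallback lookups at lower precision when a cell's cache entry is cold.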
4. **Extreme latency constraint**
- Suppose each retrieval route has a very strict timeout budget, for example **15 ms**.
- How would you optimize parallel fan-out, partial results, fallback behavior, and service-level reliability under such a tight budget?
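The usual pattern under a hard per-route budget is to fan out to all retrieval routes in parallel, collect whatever finishes in time, cancel the rest, and serve partial results. A sketch with `asyncio` (route names and the 15 ms figure are taken from the prompt; the retriever coroutines are hypothetical):

```python
import asyncio

async def fan_out(retrievers, timeout_s=0.015):
    """Run all retrieval routes concurrently; keep whatever beats the budget."""
    tasks = {name: asyncio.create_task(coro()) for name, coro in retrievers.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()                      # late routes are dropped, not awaited
    results = {}
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            results[name] = task.result()  # failed routes are silently skipped
    return results                         # possibly partial; caller falls back
                                           # to a cached/default list if empty
```

The design choice worth calling out: returning partial results keeps p99 latency bounded, at the cost of occasionally serving a less diverse candidate pool, so the empty-result fallback (e.g. cached popular stores for the cell) matters.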
5. **Ranking and feature platform**
- Design the feature-serving infrastructure for different feature types: **embeddings**, **numeric features**, and **categorical features**.
- Explain how you would store and serve features keyed by **store_id**, **user_id**, and possibly **user-store pairs**.
- Assume offline feature pipelines refresh hourly. How would you support high-concurrency online inference while keeping features reasonably fresh and point-in-time correct?
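A common serving pattern for hourly-refreshed features is to publish each pipeline run as a full snapshot per entity type and swap it in atomically, so every request reads one consistent snapshot. A toy in-memory sketch (production would sit on Redis, RocksDB, or a managed feature store; the entity and feature names are made up):

```python
class FeatureStore:
    """Toy snapshot-swapping feature store keyed by entity type + id."""

    def __init__(self):
        # entity_type -> (snapshot_ts, {entity_id: feature dict})
        self._tables = {}

    def publish(self, entity_type, rows, snapshot_ts):
        # Offline pipeline publishes a whole hourly snapshot; replacing the
        # tuple is atomic, so readers never see a half-written table.
        self._tables[entity_type] = (snapshot_ts, dict(rows))

    def get(self, entity_type, entity_id, defaults=None):
        _, rows = self._tables.get(entity_type, (None, {}))
        # Missing entities fall back to defaults, which keeps inference
        # total and makes training-serving skew easier to reason about.
        return {**(defaults or {}), **rows.get(entity_id, {})}
```

The snapshot timestamp is what enables point-in-time correctness: training joins should use the snapshot that was live at each request's timestamp, not the latest one.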
6. **Model versioning and experimentation**
- Model version V2.0 adds new features relative to V1.1.
- How would your infrastructure support multiple model versions without breaking online serving?
- How would you configure different treatment groups in an A/B test to fetch different feature sets or model artifacts?
- What experiment design choices would you make, including randomization unit, success metrics, guardrails, and failure detection?
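Treatment assignment is typically deterministic and salted by experiment, so a user always lands in the same arm and different experiments are independent; a config layer then maps the arm to a model artifact and feature set. A sketch (the variant config values are placeholders, not real artifact names):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Salted, deterministic bucketing: same (experiment, user) -> same arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # approximately uniform on [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The serving layer resolves the arm to a model + feature set, so V1.1 and
# V2.0 (with its extra features) can run side by side without code branches.
VARIANT_CONFIG = {
    "control":   {"model": "ranker-v1.1", "feature_set": "fs_v1"},
    "treatment": {"model": "ranker-v2.0", "feature_set": "fs_v2"},
}
```

Randomizing by user (rather than request) keeps each user's experience stable and matches the unit over which long-term metrics like retention are measured.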
7. **Real-time versus batch features**
- Discuss the trade-offs between adding real-time features and relying on offline batch features.
- Under strict latency requirements, what can go wrong if you overuse real-time features?
- How would you design graceful degradation for feature timeouts, missing values, or upstream instability?
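Graceful degradation usually means a hard deadline on the real-time fetch plus sensible defaults, with the degradation logged so its frequency (and the resulting training-serving skew) is measurable. A sketch; the feature names and default values are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Defaults should be "neutral" values from the training distribution,
# not zeros, so a degraded request does not get a distorted score.
RT_DEFAULTS = {"rt_eta_minutes": 35.0, "rt_store_load": 0.5}
_pool = ThreadPoolExecutor(max_workers=8)

def realtime_features(fetch_fn, timeout_s=0.01):
    """Fetch real-time features under a hard deadline; degrade to defaults."""
    future = _pool.submit(fetch_fn)
    try:
        fetched = future.result(timeout=timeout_s) or {}
        degraded = False
    except Exception:                    # timeout, upstream error, bad payload
        fetched, degraded = {}, True
    merged = {**RT_DEFAULTS, **{k: v for k, v in fetched.items() if v is not None}}
    return merged, degraded              # emit `degraded` as a metric/log field
```

Filling `None` values per-feature (rather than all-or-nothing) lets one flaky upstream degrade only its own features instead of the whole request.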
Your answer should explicitly address modeling trade-offs, latency and reliability constraints, experimentation, and common production pitfalls such as training-serving skew, missing features, and marketplace side effects (for example, demand shifting between stores because of ranking changes).