Design a "Future Best-Sellers" Prediction and Recommendation System
Company: Intuit
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
An e-commerce platform wants a new feature on its category pages. A user selects a product **category** (for example, *running shoes* or *coffee makers*) and types in a **future time window** — anything from "the next 24 hours" to a custom range up to roughly 90 days out. The system returns a ranked list of the items predicted to be the **best sellers** in that category during that window, so shoppers can buy what is about to be popular and the business can surface high-intent inventory.
Design this prediction-and-recommendation system end to end: the modeling approach, the data and serving architecture, how you meet an interactive latency budget, and how you evaluate it. The interviewer is explicitly open to either a classical forecasting/ML approach or an LLM-/agent-based approach — part of the exercise is to choose deliberately and justify the trade-off.
### Constraints & Assumptions
- **Catalog scale:** tens of millions of active items spread across thousands of categories; the long tail of items has sparse sales.
- **Traffic:** the feature is surfaced on high-traffic category landing pages, so expect heavy, bursty read load.
- **Window:** user-specified, from `next 24h` up to `next 90 days`; the same `(category, window)` pair is requested by many users.
- **Output:** a ranked top-`K` (e.g., top 20) items per request, with enough signal to render cards (predicted rank, optional confidence).
- **Latency:** interactive feature — target p95 end-to-end under ~500 ms for the served response.
- **Data available:** historical order/transaction logs, clickstream (views, add-to-cart), per-item attributes, inventory, price/promotion calendars, and seasonality signals.
- **Freshness:** "best seller" should reflect recent momentum, so the underlying signals must update at least daily, ideally hourly.
### Clarifying Questions to Ask
- How is "best seller" defined for ranking — **units sold**, **revenue/GMV**, or a blended demand score? Does the business want absolute volume or *rising* momentum?
- Is the ranking **global** (same list for everyone viewing the category) or **personalized** per user? This dramatically changes the precompute strategy.
- How is the **category** defined — a fixed taxonomy node, or can users pick arbitrary filters (brand + price band + category)? Arbitrary filters explode the key space.
- What is the **freshness SLA** — can yesterday's precomputed forecast be served, or must it react to an item going viral in the last hour?
- How do we handle **cold-start** items (new SKUs with little history) and brand-new categories?
- What is the cost ceiling? An on-demand LLM/agent call per request has very different economics than a nightly batch forecast served from cache.
### Part 1 — Problem framing and modeling approach
Decide what the model actually predicts and how. Translate "best sellers in category C over window W" into a concrete learning/forecasting target, choose the modeling approach (classical demand forecasting + ranking vs. an LLM/agent pipeline), and specify the labels, features, and how arbitrary user windows are handled. Make an explicit recommendation between the traditional and LLM approaches and defend it against the latency and scale constraints above.
```hint How to frame the target
"Best sellers over a window" is **per-item demand forecasting followed by ranking**. For each candidate item in the category, forecast expected demand $\hat{d}_{i}(W)$ over window $W$, then sort. The window can be handled by forecasting a per-period rate (e.g., daily) and **summing/integrating** over the requested window rather than training a separate model per window length.
```
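The window-integration step in the hint above can be written out explicitly. The symbols $\hat{r}_{i}(t)$ (forecast demand for item $i$ in period $t$) and $w_{t}(W)$ (the fraction of period $t$ covered by window $W$) are notation introduced here for illustration; they are not part of the original prompt.

$$
\hat{d}_{i}(W) = \sum_{t} w_{t}(W)\,\hat{r}_{i}(t), \qquad w_{t}(W) \in [0, 1]
$$

The ranked list is then the $K$ items in category $C$ with the largest $\hat{d}_{i}(W)$. As a small worked case, assume daily forecasts of 5, 3, and 4 units and a window that covers all of day 1, all of day 2, and half of day 3:

$$
\hat{d}_{i}(W) = 1 \cdot 5 + 1 \cdot 3 + 0.5 \cdot 4 = 10
$$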
```hint Classical vs LLM — where each fits
A per-SKU statistical/ML forecaster (gradient-boosted trees on lag/seasonality/price/promo features, or a global deep forecaster) is cheap, batchable, and accurate where demand follows historical patterns. An LLM/agent is strong at reasoning over *unstructured* signals (reviews, trends, launch buzz) but is slow (multi-second) and expensive per call, so push it **offline** as a feature generator or reserve it for long-tail/cold-start items. Keep it off the synchronous serving path.
```
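To make the "lag/seasonality/price/promo features" in the hint above concrete, here is a minimal sketch of one feature row for a tree-based forecaster. The function name `build_feature_row`, the feature names, and the choice of 1-day and 7-day lags are illustrative assumptions, not a prescribed feature set.

```python
from datetime import date
from typing import Dict, List


def build_feature_row(
    history: List[float],  # daily units sold, oldest first, most recent last
    target_day: date,      # the day being forecast
    price: float,          # planned price on target_day (assumed known)
    on_promo: bool,        # whether a promotion is scheduled (assumed known)
) -> Dict[str, float]:
    """Build lag, rolling-mean, calendar, and price/promo features."""
    if len(history) < 7:
        raise ValueError("need at least 7 days of history")
    last_week = history[-7:]
    return {
        "lag_1": history[-1],                 # most recent day's units
        "lag_7": history[-7],                 # units 7 observations back
        "mean_7": sum(last_week) / 7.0,       # trailing weekly average
        "day_of_week": float(target_day.weekday()),  # 0 = Monday
        "price": price,
        "on_promo": 1.0 if on_promo else 0.0,
    }


row = build_feature_row(
    history=[2, 4, 6, 8, 10, 12, 14, 16],
    target_day=date(2025, 3, 3),
    price=59.99,
    on_promo=True,
)
```

Rows like this, one per item per day, would be the training input for the batch forecaster. Items with fewer than seven days of history are rejected here, which is one place the cold-start path would need to take over.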
### Part 2 — Data flow and system architecture
Lay out the architecture for both the **write/ingestion path** (turning raw orders and clickstream into model-ready features and forecasts) and the **read/serve path** (turning a `(category, window)` request into a ranked list). Cover the data model, where forecasts are computed and stored, and the components in between.
```hint Two paths, decoupled
Separate a **batch/streaming ingestion + training + scoring pipeline** (offline) from a thin **serving layer** (online). Orders/clickstream → stream (e.g., Kafka) → feature aggregation → feature store; a scheduled job retrains/refreshes forecasts and writes per-item, per-period demand to a fast key-value store keyed by item.
```
```hint What to materialize
The expensive forecasting should happen **ahead of the request**. Materialize per-item per-day demand forecasts; at request time the serving layer fans out over the category's candidate items, integrates the forecast over the requested window, ranks, and returns top-`K`. This keeps the synchronous path to lookups + a sort.
```
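The request-time path in the hint above (candidate lookup, window integration, sort) can be sketched as follows. The in-memory dictionaries `CATEGORY_ITEMS` and `DAILY_FORECAST` are hypothetical stand-ins for the category index and the key-value forecast store, and the half-open `[start, end)` day window is an assumption.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

DAY0 = date(2025, 1, 1)

# Hypothetical category -> candidate items index.
CATEGORY_ITEMS: Dict[str, List[str]] = {
    "running-shoes": ["sku-a", "sku-b", "sku-c"],
}

# Hypothetical materialized forecasts: item -> {day -> forecast units}.
DAILY_FORECAST: Dict[str, Dict[date, float]] = {
    "sku-a": {DAY0 + timedelta(days=n): u for n, u in enumerate([5.0, 5.0, 5.0])},
    "sku-b": {DAY0 + timedelta(days=n): u for n, u in enumerate([1.0, 10.0, 10.0])},
    "sku-c": {DAY0 + timedelta(days=n): u for n, u in enumerate([7.0, 0.0, 0.0])},
}


def window_demand(item_id: str, start: date, end: date) -> float:
    """Sum the daily forecast over [start, end); missing days count as 0."""
    per_day = DAILY_FORECAST.get(item_id, {})
    total = 0.0
    day = start
    while day < end:
        total += per_day.get(day, 0.0)
        day += timedelta(days=1)
    return total


def top_k(category: str, start: date, end: date, k: int = 20) -> List[Tuple[str, float]]:
    """Rank a category's candidates by forecast demand over the window."""
    scored = [
        (item, window_demand(item, start, end))
        for item in CATEGORY_ITEMS.get(category, [])
    ]
    # Highest demand first; item id breaks ties so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

With this toy data the one-day ranking leads with `sku-c`, while the three-day ranking leads with `sku-b`, which shows why the window has to be part of the request. If the per-day loop becomes a bottleneck for large categories, storing cumulative sums per item would reduce each window total to a single subtraction.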
#### Clarifying Questions for this Part
- Is there an existing **feature store / candidate retrieval** service (category → item set) we can reuse, or do we build the category-to-items index ourselves?
- What is the acceptable **staleness** of the materialized forecast (hourly vs daily refresh) — this sets the streaming-vs-batch boundary.
### Part 3 — Latency, caching, and cost
The feature is interactive (p95 < ~500 ms), yet a naive design — especially an LLM/agent pipeline that reasons per request — can take **20+ seconds** and cost a lot per call. Show how you hit the budget. Address what is precomputed vs computed on demand, the caching strategy and its keys, cache invalidation/freshness, and graceful degradation under load.
```hint Move work off the request path
The dominant lever is **precomputation**: forecasts are batch-produced offline so the synchronous request is just candidate lookup + window integration + sort. Never run a multi-second model (or an LLM agent) inside the user request.
```
```hint Cache key design
Because the same `(category, window)` is requested by many users, cache the **ranked result** keyed by `(category, normalized_window, taxonomy_version)`. Normalize/bucket windows (e.g., snap to day boundaries) so you get high hit rates instead of a unique key per arbitrary timestamp.
```
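A minimal sketch of the window normalization and cache key described in the hint above. The key layout, the `bestsellers:` prefix, the UTC day boundaries, and the 90-day cap check are illustrative assumptions; a real system might snap to the store's local time zone or to coarser buckets.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

MAX_WINDOW_DAYS = 90  # assumed cap, taken from the stated window range


def _floor_to_utc_day(ts: datetime) -> datetime:
    """Convert an aware timestamp to UTC and truncate it to midnight."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def normalize_window(start: datetime, end: datetime) -> Tuple[str, str]:
    """Snap a window outward to whole UTC days: floor the start, ceil the end."""
    if end <= start:
        raise ValueError("window end must be after start")
    if end - start > timedelta(days=MAX_WINDOW_DAYS):
        raise ValueError("window exceeds the supported range")
    day_start = _floor_to_utc_day(start)
    day_end = _floor_to_utc_day(end)
    if day_end < end.astimezone(timezone.utc):
        day_end += timedelta(days=1)  # partial final day rounds up
    return day_start.date().isoformat(), day_end.date().isoformat()


def cache_key(category_id: str, start: datetime, end: datetime, taxonomy_version: str) -> str:
    """Build the shared cache key for a (category, window) request."""
    s, e = normalize_window(start, end)
    return f"bestsellers:{category_id}:{s}:{e}:{taxonomy_version}"


morning = datetime(2025, 3, 1, 9, 15, tzinfo=timezone.utc)
evening = datetime(2025, 3, 1, 17, 40, tzinfo=timezone.utc)
key_a = cache_key("running-shoes", morning, morning + timedelta(hours=24), "v7")
key_b = cache_key("running-shoes", evening, evening + timedelta(hours=24), "v7")
```

Both "next 24h" requests map to the same key even though they were issued hours apart, which is what produces the high hit rate. The trade-off is that snapping outward makes a 24-hour window cover up to two calendar days; whether that is acceptable depends on the product's freshness expectations.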
```hint Invalidation and fallback
Tie the cache TTL to the forecast refresh cadence, and bump a version key whenever the forecasts or the taxonomy change. On a cache miss or downstream timeout, fall back to a cheaper precomputed proxy (e.g., trailing-window actual best-sellers) so the page always renders fast.
```
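One way to combine the TTL, version key, and fallback from the hint above is sketched below. The `TTLCache` class, the `serve` function, and the callables passed into it are hypothetical names. This variant attempts the ranking on a miss and falls back only if that attempt times out, which is one possible reading of the hint rather than the only one.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> Optional[List[str]]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: List[str]) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


def serve(
    category: str,
    window: str,
    forecast_version: str,
    cache: TTLCache,
    compute_ranking: Callable[[str, str], List[str]],
    trailing_best_sellers: Callable[[str], List[str]],
) -> Tuple[List[str], str]:
    """Return (ranked items, source), where source is cache, fresh, or fallback."""
    # A new forecast_version changes the key, so stale entries are never read.
    key = f"{category}:{window}:{forecast_version}"
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    try:
        ranking = compute_ranking(category, window)
    except TimeoutError:
        # Do not cache the proxy, so the next request retries the real ranking.
        return trailing_best_sellers(category), "fallback"
    cache.set(key, ranking)
    return ranking, "fresh"
```

Returning the source label alongside the list makes it straightforward to monitor how often the fallback is served.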
### Part 4 — Evaluation and monitoring
Define how you measure whether the system is good, both before launch and in production, and what you monitor to catch regressions.
```hint Two layers of metrics
Evaluate the **forecast** (e.g., WAPE or quantile loss on held-out future windows via backtesting; plain MAPE is undefined for items with zero actual sales, which are common in the sparse long tail) and the **ranking** (NDCG@K / recall@K of the predicted top-K vs the items that actually sold best in that window). Then validate the **product** outcome with an online A/B test on conversion / category revenue.
```
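The offline metrics named in the hint above can be sketched in a few lines. Using actual units sold as the NDCG gain is an assumption; a revenue-based or graded gain would work the same way. The function names are illustrative.

```python
import math
from typing import Dict, List


def wape(actual: List[float], forecast: List[float]) -> float:
    """Weighted absolute percentage error: sum|a - f| / sum|a|."""
    denom = sum(abs(a) for a in actual)
    if denom == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / denom


def recall_at_k(predicted: List[str], actual_units: Dict[str, float], k: int) -> float:
    """Share of the true top-k sellers that appear in the predicted top-k."""
    true_top = set(sorted(actual_units, key=lambda i: (-actual_units[i], i))[:k])
    if not true_top:
        return 0.0
    return len(true_top & set(predicted[:k])) / len(true_top)


def ndcg_at_k(predicted: List[str], actual_units: Dict[str, float], k: int) -> float:
    """NDCG@k with actual units sold as the gain (an assumed gain definition)."""

    def dcg(gains: List[float]) -> float:
        # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [actual_units.get(item, 0.0) for item in predicted[:k]]
    ideal = sorted(actual_units.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


actual_units = {"a": 10.0, "b": 5.0, "c": 1.0}
```

In a backtest these would be computed per `(category, window)` pair on held-out periods and then aggregated, so that a few very large categories do not dominate the score.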
### Follow-up Questions
- The forecast refresh runs nightly, but an item goes viral at 2 p.m. and starts selling out. How does your design surface it before the next batch — and what would you change to make "rising momentum" first-class?
- A user types a 90-day window. Forecast error compounds far into the future. How do you communicate or bound uncertainty in the ranking, and would you cap the window?
- The product team wants the list **personalized** per user instead of global. Quantify the impact on your caching/precompute strategy and propose how to keep latency under budget.
- Suppose you must incorporate an LLM/agent somewhere because reviews and external trend signals genuinely improve cold-start accuracy. Exactly where in the pipeline does it go, and how do you stop it from leaking onto the latency-critical path?
Quick Answer: This question evaluates a candidate's ability to design a large-scale ML forecasting system that combines predictive modeling with low-latency serving. It tests system architecture, data pipeline design, and reasoning about latency and evaluation trade-offs under real-time constraints, at the level of practical application expected in ML system design interviews.