System Design: Cross‑Site Product Price Tracking
Context
Design the backend for a price-history tool that tracks product prices across many e-commerce sites. The system must comply with robots.txt, scale to tens of millions of URLs, stay cost-aware, and provide user-facing search and alerting.
Assume roughly the following scale (tune as needed during the interview); a back-of-envelope throughput check follows the list:
- 10–50 million product URLs across 500–2,000 merchant domains
- Freshness target: 6–24 hours for popular products; 2–7 days for the long tail
- Concurrent fetches: O(10k–50k) globally, respecting per-domain limits
- Price changes per day: ~5–15% of offers
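A quick back-of-envelope check on these numbers; the 20/80 split between popular and long-tail URLs below is an illustrative assumption, not part of the prompt:

```python
# Sustained fetch rate implied by the freshness targets above.
# Assumed split: 20% of 50M URLs are "popular" (12 h refresh),
# the rest are long tail (4-day refresh). Both figures are illustrative.
POPULAR_URLS = 10_000_000            # 20% of 50M
LONG_TAIL_URLS = 40_000_000          # remaining 80%

popular_rps = POPULAR_URLS / (12 * 3600)       # ~231 fetches/s
long_tail_rps = LONG_TAIL_URLS / (4 * 86400)   # ~116 fetches/s
total = popular_rps + long_tail_rps            # ~347 fetches/s on average
print(f"popular {popular_rps:.0f}/s, long tail {long_tail_rps:.0f}/s, total {total:.0f}/s")
```

By Little's law, ~350 fetches/s at a 2–5 s average fetch time keeps only ~700–1,750 connections busy at once; the O(10k–50k) concurrency budget is therefore headroom for retries, slow merchants, and per-domain queuing.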
Requirements
Functional
- Ingest product URLs from users and internal seed lists.
- Schedule respectful crawls: honor robots.txt, per-domain rate limits, backoff, and time windows (politeness sketch after this list).
- Extract and normalize prices and currencies; handle tax/shipping signals when available (parsing sketch below).
- Deduplicate products across sellers into a canonical product catalog (identity-keying sketch below).
- Store price histories and page snapshots.
- Compute trends (e.g., % change, moving averages) and user alerts (thresholds, drops); an alert sketch follows the list.
- Expose search over products/offers and support user subscriptions/watchlists.
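A minimal sketch of the politeness requirement, assuming a per-domain token bucket checked after a robots.txt lookup; the user agent, rates, and lack of caching are illustrative simplifications:

```python
import time
from dataclasses import dataclass, field
from urllib import robotparser

USER_AGENT = "price-tracker-bot"   # illustrative bot name

@dataclass
class DomainBudget:
    """Token bucket enforcing a per-domain fetch rate (defaults are illustrative)."""
    rate_per_sec: float = 0.5      # at most one fetch every 2 s
    burst: float = 2.0
    tokens: float = 2.0
    last_refill: float = field(default_factory=time.monotonic)

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last_refill) * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def allowed_and_ready(url: str, domain: str, budgets: dict[str, DomainBudget]) -> bool:
    """robots.txt permission first, then the domain's rate budget."""
    rp = robotparser.RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()                      # in production, cache this per domain, not per URL
    if not rp.can_fetch(USER_AGENT, url):
        return False
    return budgets.setdefault(domain, DomainBudget()).try_acquire()
```

A robots.txt Crawl-delay (available via rp.crawl_delay(USER_AGENT)) can seed rate_per_sec, and 429/503 responses should shrink it further with exponential backoff.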
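Price extraction is mostly string normalization; a minimal sketch, assuming prices arrive as raw display strings and deliberately simplifying locale handling:

```python
import re
from decimal import Decimal

# Illustrative symbol map; a real system would use merchant metadata and a currency library.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY"}

def parse_price(raw: str, default_currency: str = "USD") -> tuple[Decimal, str]:
    """Parse '$1,299.00' or '1.299,00 €' into (Decimal('1299.00'), currency)."""
    currency = next((code for sym, code in CURRENCY_SYMBOLS.items() if sym in raw),
                    default_currency)
    digits = re.sub(r"[^\d.,]", "", raw)
    # Heuristic: the rightmost separator is the decimal point. Zero-decimal
    # currencies with grouping ("¥1,299") defeat this; real parsers use
    # per-locale rules.
    if digits.rfind(",") > digits.rfind("."):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    return Decimal(digits), currency

assert parse_price("$1,299.00") == (Decimal("1299.00"), "USD")
assert parse_price("1.299,00 €") == (Decimal("1299.00"), "EUR")
```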
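For cross-seller deduplication, one common pattern (assumed here, not prescribed by the prompt) is a tiered identity key: trust exact identifiers such as GTIN/UPC when present, and fall back to a normalized brand-plus-title hash:

```python
import hashlib
import re
from typing import Optional

def canonical_key(gtin: Optional[str], brand: str, title: str) -> str:
    """Tiered product identity: GTIN wins; otherwise hash normalized text."""
    if gtin:
        return f"gtin:{gtin.strip()}"
    norm = re.sub(r"[^a-z0-9 ]", "", f"{brand} {title}".lower())
    norm = " ".join(sorted(norm.split()))   # order-insensitive token set
    return "fuzzy:" + hashlib.sha1(norm.encode()).hexdigest()
```

The fuzzy tier both over- and under-merges on its own, so in practice it feeds a similarity-matching or human-review queue rather than merging records directly.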
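Trend and alert evaluation can be a small pure function over the stored series; a sketch assuming one price point per day and a user-defined percentage drop threshold (field names are illustrative):

```python
from statistics import mean

def should_alert(history: list[float], threshold_pct: float, window: int = 7) -> bool:
    """Alert when the latest price sits threshold_pct below the trailing average."""
    if len(history) < window + 1:
        return False
    baseline = mean(history[-(window + 1):-1])   # the `window` points before the latest
    return history[-1] <= baseline * (1 - threshold_pct / 100)

# Example: an ~12% drop against the 7-day average triggers a 10% alert.
prices = [100, 101, 99, 100, 102, 100, 100, 88]
assert should_alert(prices, threshold_pct=10)
```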
Non-Functional
- High availability and graceful degradation when merchants change layouts or block access.
- Multi-region deployment for latency/resiliency.
- Cost controls across compute, bandwidth, and storage.
Discussion Prompts
- Data model choices and storage technologies for products, offers, and time series (a conceptual sketch follows this list).
- Crawl orchestration at scale (frontier, politeness, prioritization, parsing); a frontier sketch follows.
- Anti-bot posture and CAPTCHA handling within legal/ethical boundaries.
- Backfill and re-crawl strategies (freshness SLAs, change detection, sitemaps); a change-detection sketch follows.
- Multi-region architecture and data replication.
- Cost-control levers and trade-offs.
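One conceptual data model, sketched as plain dataclasses to stay technology-neutral; entity and field names are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

@dataclass
class Product:                # canonical catalog entry, one per deduped product
    product_id: str
    canonical_key: str        # GTIN or fuzzy key from the dedup step
    brand: str
    title: str

@dataclass
class Offer:                  # one seller's listing of a product at one URL
    offer_id: str
    product_id: str           # references Product
    merchant_domain: str
    url: str

@dataclass
class PricePoint:             # append-only series, keyed (offer_id, observed_at)
    offer_id: str             # references Offer
    observed_at: datetime
    price: Decimal
    currency: str
    in_stock: bool
```

Price points are append-only and read mostly as (offer_id, time-range) scans, which is why a time-series or wide-column store is a natural fit; products and offers are low-churn and relational.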
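For crawl orchestration, a toy frontier: a min-heap ordered by ready time with priority as the tiebreaker. A production frontier would shard by domain and persist its state; this single-process sketch is an assumption for illustration:

```python
import heapq
import time
from typing import Optional

class Frontier:
    """Min-heap of (ready_time, -priority, url); earliest-ready URL pops first."""
    def __init__(self) -> None:
        self._heap: list[tuple[float, float, str]] = []

    def push(self, url: str, ready_time: float, priority: float) -> None:
        heapq.heappush(self._heap, (ready_time, -priority, url))

    def pop_ready(self) -> Optional[str]:
        """Next URL whose ready_time has passed (higher priority wins ties)."""
        if self._heap and self._heap[0][0] <= time.time():
            return heapq.heappop(self._heap)[2]
        return None
```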
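For re-crawl strategy, a common pattern (an assumption here, not part of the prompt) is to fingerprint only the fields that matter and adapt each offer's re-crawl interval to how often they actually change:

```python
import hashlib

MIN_INTERVAL_H, MAX_INTERVAL_H = 6.0, 168.0   # clamp to 6 h .. 7 days (illustrative)

def fingerprint(price: str, currency: str, in_stock: bool) -> str:
    """Hash only price-relevant fields so cosmetic page changes don't count."""
    return hashlib.sha256(f"{price}|{currency}|{in_stock}".encode()).hexdigest()

def next_interval_hours(current_h: float, changed: bool) -> float:
    """Halve the interval when the fingerprint changed; grow it 1.5x when it didn't."""
    nxt = current_h / 2 if changed else current_h * 1.5
    return max(MIN_INTERVAL_H, min(MAX_INTERVAL_H, nxt))
```

This keeps hot offers near the 6–24 hour freshness target while letting the long tail drift toward its 2–7 day budget.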
Deliverables
- High-level architecture, key components, and APIs.
- Data-modeling rationale and schemas (conceptual is fine).
- Scheduling, deduplication, trend/alert logic, and operational strategies.