Design a price tracking system
Company: Meta
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
##### Question
Design a price tracking system for e-commerce sites (similar to price-history tools such as CamelCamelCamel or Keepa). The system ingests product URLs, crawls prices respectfully over time, captures historical prices, visualizes trends, and notifies users on price drops or back-in-stock events. Walk through the high-level architecture and then go deep on the components the interviewer probes.
1. **Scope, scale, and SLOs.** State your assumptions: number of tracked URLs/offers (e.g. 10M–100M offers across thousands of domains), number of users, and freshness targets. Define freshness SLO tiers (e.g. hot/popular items refreshed within ~1 hour, the long tail within 24 hours) plus an extraction-accuracy target and a false-alert budget.
2. **Crawl ingestion and orchestration.** Design the scheduler/frontier that decides what to fetch next, per-domain rate limiting and politeness (token/leaky bucket, robots crawl-delay), a fetcher pool (mostly static HTTP with a small budgeted headless-browser pool for JS-rendered pages), and how you scale fetchers with backpressure off a message bus.
3. **Parsing, extraction, and normalization.** Extract price, currency, availability, and shipping from pages using a multi-strategy parser (structured data / JSON-LD / Schema.org first, then site-specific CSS/XPath templates, then heuristic/ML fallback). Normalize currency (convert to a reference currency with daily FX, keep native + normalized), timezones (UTC), and tax/shipping fields. Validate prices (positive, in a sane range, ignore obvious personalization/geo walls).
4. **URL canonicalization and deduplication.** Canonicalize URLs (strip tracking params, sort query params, normalize host/casing, resolve redirects) and dedupe content via a hash of the salient DOM/price block so you skip redundant history writes.
5. **Product identity resolution.** Distinguish an Offer (site+seller+variant listing) from a Product (canonical entity). Cluster offers to products using strong signals (UPC/EAN/GTIN, MPN, brand+model), medium signals (normalized title tokens, spec overlap, image embeddings), and weak signals; use blocking + pairwise scoring with accept/review/reject thresholds and a human-in-the-loop for ambiguous cases.
6. **Storage schema and database choices.** Choose fit-for-purpose stores: OLTP for offer/product metadata and subscriptions; a time-series/wide-column store (plus a cold object-store tier) for price histories; a search index for product search/faceting; a cache for hot endpoints; object storage for raw HTML snapshots. Lay out the core schemas (product, offer, offer_state, price_history, user/watchlist/alert, fx_rate).
7. **Change detection.** Write a history record only on a meaningful change in [price, availability, currency, shipping] (plus a periodic heartbeat), suppress noise (rounding thresholds, require stability across consecutive crawls for A/B-testing sites), and compute derived metrics like rolling 30/90-day min/max.
8. **Trends and alerting.** Support absolute-threshold, relative-drop, historical-low, and back-in-stock rules. Apply debounce/hysteresis to avoid flapping, per-user/per-offer cooldowns, and fan out notifications (email/push/signed webhooks with retry + DLQ).
9. **Search, subscriptions, and APIs.** Expose product search, price-history, and watchlist/alert APIs (REST/GraphQL) with caching/ETag, rate limiting, and auth.
10. **Backfill and replay.** Seed via sitemaps/merchant feeds/affiliate APIs with per-domain depth caps; keep raw pages in object storage and Kafka with retention so you can re-parse history when the parser is updated (pin a parser_version) and re-consume from offsets idempotently.
11. **Failure recovery and reliability.** Make fetch jobs idempotent (deterministic job_id), retry with exponential backoff + jitter, route persistent failures (paywalls/CAPTCHAs) to dead-letter queues, and run canary/golden-page regression tests for parsers.
12. **Anti-bot, CAPTCHA, and legal/robots compliance.** Stay compliance-first: honor robots.txt and crawl-delay, identify your crawler with contact info, prefer official/affiliate APIs, back off on 403/429, and do NOT bypass auth/paywalls or solve CAPTCHAs without explicit permission.
13. **Multi-region deployment.** Place compute pools (and edge fetchers) close to merchants to reduce latency and the risk of being blocked, replicate/shard the control-plane frontier metadata, mirror Kafka and object storage cross-region, and handle data residency (e.g. EU user data in EU).
14. **Cost controls.** Run spot/preemptible fetchers that autoscale on queue depth, strictly budget headless rendering, use conditional GETs (If-Modified-Since, or If-None-Match with a stored ETag) and compression, write on change and downsample/tier old history to Parquet, keep per-merchant cost dashboards, and add kill switches for runaway domains.
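As an illustration of item 4, the canonicalization and dedup steps might be sketched as below. This is a minimal sketch, not a definitive implementation: the tracking-parameter lists and the function names `canonicalize_url` and `price_block_fingerprint` are assumptions made for the example, and redirect resolution is left out because it needs a network fetch.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters; a real system would likely maintain these per domain.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Lowercase scheme/host, drop default ports, tracking params, and fragments, sort the query."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname excludes any userinfo and port
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    )
    return urlunsplit((scheme, host, parts.path or "/", urlencode(query), ""))


def price_block_fingerprint(price: str, currency: str, availability: str) -> str:
    """Hash only the salient fields so unrelated page changes do not trigger a history write."""
    payload = "|".join(
        [price.strip(), currency.strip().upper(), availability.strip().lower()]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two crawls whose fingerprints match can skip the history write, while the canonical URL serves as the dedup key for the frontier.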
Throughout, discuss data-model trade-offs, the price-change math (e.g. percent drop = (old − new)/old × 100), and the operational concerns (capacity planning, observability, freshness SLOs).
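To make the price-change math concrete, here is a small sketch of a relative-drop check combined with an absolute threshold. The names `percent_drop` and `should_alert` and their parameters are assumptions for illustration; `Decimal` is used so that currency arithmetic avoids binary floating-point rounding.

```python
from decimal import Decimal
from typing import Optional


def percent_drop(old: Decimal, new: Decimal) -> Decimal:
    """Drop from old to new as a percentage of old; a negative result means the price rose."""
    if old <= 0:
        raise ValueError("old price must be positive")
    return (old - new) / old * 100


def should_alert(
    old: Decimal,
    new: Decimal,
    min_drop_pct: Decimal,
    target_price: Optional[Decimal] = None,
) -> bool:
    """Fire when the absolute target is met or the relative drop reaches the threshold."""
    if target_price is not None and new <= target_price:
        return True
    return percent_drop(old, new) >= min_drop_pct
```

For example, a move from 80 to 60 is a 25% drop, so a rule with a 20% threshold fires, while a move from 80 to 78 (2.5%) does not. Debounce and per-user cooldowns would sit on top of this check.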
Quick Answer: Meta software-engineer system-design technical-screen question: design a price tracking system for e-commerce that crawls product prices respectfully, stores historical prices, detects changes, and alerts users on price drops. It tests web-crawl orchestration and politeness, multi-strategy parsing and currency normalization, Offer-vs-Product identity resolution, time-series storage, change detection and alerting, plus freshness SLOs, multi-region deployment, and cost controls.
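The write-on-change rule from the change-detection step could look roughly like the sketch below. The `Observation` type, the 7-day heartbeat, and the one-cent noise threshold are all assumed values for illustration; requiring stability across consecutive crawls is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal
from typing import Optional

HEARTBEAT = timedelta(days=7)      # assumed: write at least one record per week
MIN_DELTA = Decimal("0.01")        # assumed: ignore sub-cent rounding noise


@dataclass(frozen=True)
class Observation:
    price: Decimal
    currency: str
    in_stock: bool
    shipping: Decimal
    observed_at: datetime          # expected to be timezone-aware UTC


def should_write(last: Optional[Observation], new: Observation) -> bool:
    """Return True when the new observation deserves a price_history row."""
    if last is None:
        return True                # first observation for this offer
    if new.currency != last.currency or new.in_stock != last.in_stock:
        return True
    if abs(new.price - last.price) >= MIN_DELTA:
        return True
    if abs(new.shipping - last.shipping) >= MIN_DELTA:
        return True
    # Nothing changed: write only if the heartbeat interval has elapsed.
    return new.observed_at - last.observed_at >= HEARTBEAT
```

Keeping the last written observation in an `offer_state` row lets this check run without reading the history store on every crawl.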