Meta software-engineer system-design technical-screen question: design a price tracking system for e-commerce that crawls product prices respectfully, stores historical prices, detects changes, and alerts users on price drops. It tests web-crawl orchestration and politeness, multi-strategy parsing and currency normalization, Offer-vs-Product identity resolution, time-series storage, change detection and alerting, plus freshness SLOs, multi-region deployment, and cost controls.

What difficulty level is this interview question?

This is a hard System Design question, commonly asked during Technical Screen rounds at Meta.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Meta during technical interviews.

Design a price tracking system | Meta Interview Question

##### Question

Design a price tracking system for e-commerce sites (similar to price-history tools such as CamelCamelCamel or Keepa). The system ingests product URLs, crawls prices respectfully over time, captures historical prices, visualizes trends, and notifies users on price drops or back-in-stock events. Walk through the high-level architecture and then go deep on the components the interviewer probes.

1. **Scope, scale, and SLOs.** State your assumptions: number of tracked URLs/offers (e.g. 10M–100M offers across thousands of domains), number of users, and freshness targets. Define freshness SLA tiers (e.g. hot/popular items refreshed within ~1 hour, the long tail within 24 hours) and an extraction-accuracy and false-alert budget.
2. **Crawl ingestion and orchestration.** Design the scheduler/frontier that decides what to fetch next, per-domain rate limiting and politeness (token/leaky bucket, robots crawl-delay), a fetcher pool (mostly static HTTP with a small budgeted headless-browser pool for JS-rendered pages), and how you scale fetchers with backpressure off a message bus.
3. **Parsing, extraction, and normalization.** Extract price, currency, availability, and shipping from pages using a multi-strategy parser (structured data / JSON-LD / Schema.org first, then site-specific CSS/XPath templates, then heuristic/ML fallback). Normalize currency (convert to a reference currency with daily FX, keep native + normalized), timezones (UTC), and tax/shipping fields. Validate prices (positive, in a sane range, ignore obvious personalization/geo walls).
4. **URL canonicalization and deduplication.** Canonicalize URLs (strip tracking params, sort query params, normalize host/casing, resolve redirects) and dedupe content via a hash of the salient DOM/price block so you skip redundant history writes.
5. **Product identity resolution.** Distinguish an Offer (site+seller+variant listing) from a Product (canonical entity). Cluster offers to products using strong signals (UPC/EAN/GTIN, MPN, brand+model), medium signals (normalized title tokens, spec overlap, image embeddings), and weak signals; use blocking + pairwise scoring with accept/review/reject thresholds and a human-in-the-loop for ambiguous cases.
6. **Storage schema and database choices.** Choose fit-for-purpose stores: OLTP for offer/product metadata and subscriptions; a time-series/wide-column store (plus a cold object-store tier) for price histories; a search index for product search/faceting; a cache for hot endpoints; object storage for raw HTML snapshots. Lay out the core schemas (product, offer, offer_state, price_history, user/watchlist/alert, fx_rate).
7. **Change detection.** Write a history record only on a meaningful change in [price, availability, currency, shipping] (plus a periodic heartbeat), suppress noise (rounding thresholds, require stability across consecutive crawls for A/B-testing sites), and compute derived metrics like rolling 30/90-day min/max.
8. **Trends and alerting.** Support absolute-threshold, relative-drop, historical-low, and back-in-stock rules. Apply debounce/hysteresis to avoid flapping, per-user/per-offer cooldowns, and fan out notifications (email/push/signed webhooks with retry + DLQ).
9. **Search, subscriptions, and APIs.** Expose product search, price-history, and watchlist/alert APIs (REST/GraphQL) with caching/ETag, rate limiting, and auth.
10. **Backfill and replay.** Seed via sitemaps/merchant feeds/affiliate APIs with per-domain depth caps; keep raw pages in object storage and Kafka with retention so you can re-parse history when the parser is updated (pin a parser_version) and re-consume from offsets idempotently.
11. **Failure recovery and reliability.** Idempotent fetch jobs (deterministic job_id), retries with exponential backoff + jitter, dead-letter queues for persistent failures (paywalls/CAPTCHAs), and canary/golden-page regression tests for parsers.
12. **Anti-bot, CAPTCHA, and legal/robots compliance.** Stay compliance-first: honor robots.txt and crawl-delay, identify your crawler with contact info, prefer official/affiliate APIs, back off on 403/429, and do NOT bypass auth/paywalls or solve CAPTCHAs without explicit permission.
13. **Multi-region deployment.** Place compute pools (and edge fetchers) close to merchants to cut latency and block risk, replicate/shard the control-plane frontier metadata, mirror Kafka and object storage cross-region, and handle data residency (e.g. EU user data in EU).
14. **Cost controls.** Use spot/preemptible fetchers that autoscale on queue depth, strictly budget headless rendering, conditional GETs (If-Modified-Since/ETag) and compression, write-on-change + downsampling/tiering of old history to Parquet, per-merchant cost dashboards, and kill switches for runaway domains.
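
The per-domain politeness asked for in the crawl-orchestration component could be sketched as one token bucket per domain. This is a minimal illustration, not a prescribed design: the class names `TokenBucket` and `DomainLimiter`, the default rate, and the caller-supplied clock value `now` are all assumptions introduced here so the logic can be tested deterministically.

```python
from typing import Dict


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so the first fetch is allowed
        self.last_refill = now

    def try_acquire(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # spend one token per fetch
            return True
        return False


class DomainLimiter:
    """Hypothetical per-domain limiter: one independent bucket per domain."""

    def __init__(self, default_rate: float = 1.0, burst: float = 1.0) -> None:
        self.default_rate = default_rate  # assumed default: 1 request/second
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, domain: str, now: float) -> bool:
        bucket = self.buckets.get(domain)
        if bucket is None:
            bucket = TokenBucket(self.default_rate, self.burst, now)
            self.buckets[domain] = bucket
        return bucket.try_acquire(now)
```

A scheduler would call `allow(domain, time.monotonic())` before dispatching a fetch and requeue the job when it returns `False`. A robots crawl-delay of `d` seconds could map to `rate = 1 / d` for that domain, and in a multi-worker deployment the bucket state would need to live in a shared store rather than process memory.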

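The URL canonicalization steps listed above (strip tracking params, sort query params, normalize host/casing) can be sketched with the standard library alone. The function name `canonicalize_url` and the tracking-parameter lists are assumptions for illustration; a real deployment would likely maintain per-merchant rules, and redirect resolution is left out because it needs a network fetch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking parameters; not an exhaustive list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Return a normalized form of `url` suitable as a dedup key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname drops userinfo and port

    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"

    # Drop tracking params, then sort the rest so order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    pairs.sort()

    # Paths can be case-sensitive, so only an empty path is normalized.
    path = parts.path or "/"

    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((scheme, host, path, urlencode(pairs), ""))
```

Two URLs that differ only in tracking parameters, parameter order, host casing, or fragment then map to the same key, which is what lets the frontier avoid scheduling the same offer twice.
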
Throughout, discuss data-model trade-offs, the price-change math (e.g. percent change = (new − old)/old, which is negative for a price drop), and the operational concerns (capacity planning, observability, freshness SLOs).
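
As a worked illustration of the price-change math: the quantity (new − old)/old is a signed relative change, so it comes out negative when the price falls, and a relative-drop rule compares its negation against a threshold. The sketch below assumes prices are held as `Decimal` to avoid float rounding; the function names and the 10% default threshold are hypothetical.

```python
from decimal import Decimal


def percent_change(old: Decimal, new: Decimal) -> Decimal:
    """Signed relative change (new - old) / old; negative means a drop."""
    if old <= 0:
        raise ValueError("old price must be positive")
    return (new - old) / old


def is_relative_drop(
    old: Decimal, new: Decimal, min_drop: Decimal = Decimal("0.10")
) -> bool:
    """True when the price fell by at least `min_drop` (a fraction of old)."""
    return -percent_change(old, new) >= min_drop
```

For example, a move from 100 to 80 gives a change of −0.2, a 20% drop that fires the assumed 10% rule, while a move from 100 to 95 (a 5% drop) does not.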
