Design the high-level pipeline of a web search engine.
Assume you need to support internet-scale search (billions of web pages) with low-latency queries. Describe the major components and data flow for both:
- Offline / batch side: from discovering web pages to building and maintaining an index.
- Online / serving side: from when a user types a query to when they see ranked results.
In your answer, cover at least:
- How you would discover and fetch documents from the web.
- How you would parse, process, and index documents (e.g., inverted index, sharding, replication).
- How a user query is processed, including query understanding/normalization.
- How you would retrieve candidate documents efficiently.
- How you would rank results (you may optionally mention ML ranking models).
- How you would ensure low latency, scalability, and fault tolerance.
- How you would log user interactions for future improvements.
You do not need exact APIs or code; focus on architecture, components, and data flow.
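An answer does not need code, but for orientation, here is a rough toy sketch of what "inverted index, sharding", "query normalization", and "retrieve candidate documents" refer to in the list above. Everything in it is an assumption made for illustration: the names `Shard`, `build_index`, `query`, and `NUM_SHARDS` are hypothetical, sharding is by document ID modulo the shard count, retrieval uses AND semantics, and the score is a plain term-frequency sum rather than a real ranking model.

```python
import re
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumed for the sketch; a real system uses many replicated shards


def normalize(text: str) -> list[str]:
    """Lowercase and tokenize; a stand-in for real query understanding."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Shard:
    """One document-partitioned slice of the inverted index (hypothetical)."""

    def __init__(self) -> None:
        # term -> {doc_id: term frequency in that document}
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)

    def add(self, doc_id: int, text: str) -> None:
        for term, tf in Counter(normalize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, terms: list[str]) -> list[tuple[int, int]]:
        """Return (score, doc_id) pairs for documents containing every term."""
        if not terms:
            return []
        # Candidate retrieval: intersect the posting lists (AND semantics).
        doc_sets = [set(self.postings.get(t, {})) for t in terms]
        candidates = set.intersection(*doc_sets)
        # Toy scoring: sum of term frequencies over the query terms.
        return [(sum(self.postings[t][d] for t in terms), d) for d in candidates]


def build_index(docs: dict[int, str]) -> list[Shard]:
    """Offline side: route each document to a shard and index it there."""
    shards = [Shard() for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[doc_id % NUM_SHARDS].add(doc_id, text)
    return shards


def query(shards: list[Shard], q: str, k: int = 10) -> list[int]:
    """Online side: normalize, fan out to all shards, merge, keep the top k."""
    terms = normalize(q)
    hits = [hit for shard in shards for hit in shard.search(terms)]
    hits.sort(key=lambda h: (-h[0], h[1]))  # score descending, then doc_id
    return [doc_id for _, doc_id in hits[:k]]
```

The sketch leaves out crawling, replication, caching, ML ranking, and interaction logging, which a full answer would still need to cover at the architecture level.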