This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling; it falls under the Coding & Algorithms domain.

You are given:
- startUrl (e.g., "http://news.example.com/a/index.html").
- HtmlParser, with a method getUrls(url) that returns all URLs found on the page at url.
Your task is to crawl web pages starting from startUrl and return all unique pages that are reachable by following links, subject to the rules below.
- Crawl only links that are under the same hostname as startUrl.
- The hostname is the part between the protocol (http:// or https://) and the next /.
- URLs may contain a fragment introduced by # (e.g., http://a.com/x#section2).
- Sanitize each URL by removing its fragment (the # and everything after it) before deduplication: http://a.com/x#p1 and http://a.com/x#p2 should be treated as the same page: http://a.com/x.
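As a concrete illustration of the two URL rules above, here is a minimal Python sketch; the helper names get_hostname and sanitize are hypothetical, not part of the problem's API.

```python
def get_hostname(url):
    # The hostname is the text between the protocol ("http://" or
    # "https://") and the next "/".
    return url.split("://", 1)[1].split("/", 1)[0]

def sanitize(url):
    # Strip the fragment: drop "#" and everything after it.
    return url.split("#", 1)[0]

assert get_hostname("http://news.example.com/a/index.html") == "news.example.com"
assert sanitize("http://a.com/x#p1") == sanitize("http://a.com/x#p2") == "http://a.com/x"
```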
First, implement a single-threaded crawler that returns the set (or list) of visited URLs (after stripping fragments), restricted to the same hostname.
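One natural approach is an iterative BFS over the link graph, deduplicating on sanitized URLs. The sketch below assumes the get_hostname and sanitize helpers above and an html_parser object exposing the problem's getUrls(url) method:

```python
from collections import deque

def crawl(start_url, html_parser):
    # BFS from start_url, visiting each sanitized URL at most once and
    # staying on the same hostname as start_url.
    host = get_hostname(start_url)
    start = sanitize(start_url)
    visited = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in html_parser.getUrls(url):
            clean = sanitize(link)
            if get_hostname(clean) == host and clean not in visited:
                visited.add(clean)
                queue.append(clean)
    return list(visited)
```

A DFS with an explicit stack (or recursion) works equally well; only the deduplication on sanitized URLs matters for correctness.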
Then, implement a multi-threaded crawler to speed up crawling; calls to HtmlParser.getUrls may run concurrently.
Return all unique sanitized URLs that are reachable from startUrl and satisfy the hostname constraint.
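One common design, sketched below under the same assumptions as before, hands the slow getUrls calls to a thread pool while the main thread alone dispatches work and updates the visited set; crawl_concurrent and the workers parameter are hypothetical names, not part of the problem statement.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def crawl_concurrent(start_url, html_parser, workers=8):
    # Only the (slow, IO-like) getUrls calls run on worker threads; the
    # main thread owns the visited set, so this design needs no lock.
    host = get_hostname(start_url)
    start = sanitize(start_url)
    visited = {start}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(html_parser.getUrls, start)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for link in future.result():
                    clean = sanitize(link)
                    if get_hostname(clean) == host and clean not in visited:
                        visited.add(clean)
                        pending.add(pool.submit(html_parser.getUrls, clean))
    return list(visited)
```

An alternative is to let worker threads discover and enqueue links themselves; that variant must guard the shared visited set with a mutex (e.g., threading.Lock), which is the thread-safety concern this part of the question probes.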
HtmlParser.getUrls(url) is a black box and may be slow (network/IO-like). All URLs returned by getUrls are absolute URLs.