This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.
You are implementing a simple web crawler. You are given a starting URL, startUrl, and a method

List<String> getUrls(String url)

which returns all URLs (as strings) found on the page at url.

Return all unique URLs that are reachable from startUrl by repeatedly calling getUrls, subject to these constraints:

- Crawl only URLs whose hostname is the same as the hostname of startUrl.
- Do not fetch the same URL more than once.
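One way to satisfy these constraints is a breadth-first traversal with a visited set for duplicate detection and a hostname check as the filter. The sketch below is a minimal single-threaded version; getUrls is stubbed with a small in-memory link graph (the a.com and b.com pages are hypothetical, used only so the example runs standalone):

```java
import java.util.*;

public class Crawler {
    // Hypothetical stub for the provided getUrls(url) method: a tiny
    // in-memory link graph standing in for real HTTP fetches.
    static final Map<String, List<String>> PAGES = Map.of(
        "http://a.com/",  List.of("http://a.com/x", "http://b.com/"),
        "http://a.com/x", List.of("http://a.com/", "http://a.com/y"),
        "http://a.com/y", List.of(),
        "http://b.com/",  List.of("http://b.com/z"));

    static List<String> getUrls(String url) {
        return PAGES.getOrDefault(url, List.of());
    }

    // Extract the hostname, e.g. "http://a.com/x" -> "a.com".
    static String hostname(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static List<String> crawl(String startUrl) {
        String host = hostname(startUrl);
        Set<String> visited = new HashSet<>();    // duplicate detection
        Deque<String> queue = new ArrayDeque<>(); // BFS frontier
        visited.add(startUrl);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : getUrls(url)) {
                // Same-hostname filter plus visited check; add() returns
                // false when the URL has already been seen.
                if (hostname(next).equals(host) && visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        List<String> result = crawl("http://a.com/");
        Collections.sort(result);
        System.out.println(result); // the three reachable a.com pages
    }
}
```

A depth-first traversal works equally well here; BFS is shown because the explicit queue maps directly onto the work queue used in the multithreaded variant.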
How would you modify your crawler to use multiple threads to improve throughput while still ensuring that:

- each URL is fetched at most once, and
- the set of visited URLs is updated in a thread-safe manner?
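One possible design (a sketch, not the only correct answer): a fixed thread pool fetches pages concurrently, a ConcurrentHashMap-backed set provides an atomic claim-before-fetch check so each URL is fetched at most once, and a Phaser counts outstanding tasks so the main thread knows when the crawl has finished. The page graph is again a hypothetical in-memory stub:

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentCrawler {
    // Hypothetical stub for getUrls(url); a real crawler would fetch over HTTP.
    static final Map<String, List<String>> PAGES = Map.of(
        "http://a.com/",  List.of("http://a.com/x", "http://b.com/"),
        "http://a.com/x", List.of("http://a.com/y"),
        "http://a.com/y", List.of("http://a.com/"));

    static List<String> getUrls(String url) {
        return PAGES.getOrDefault(url, List.of());
    }

    static String hostname(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static List<String> crawl(String startUrl) {
        String host = hostname(startUrl);
        // Thread-safe set: add() atomically claims a URL, returning false
        // if another thread already claimed it, so no page is fetched twice.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Phaser party count = outstanding tasks + 1 for the main thread.
        Phaser phaser = new Phaser(1);
        visited.add(startUrl);
        submit(startUrl, host, visited, pool, phaser);
        phaser.arriveAndAwaitAdvance(); // block until every task has arrived
        pool.shutdown();
        return new ArrayList<>(visited);
    }

    static void submit(String url, String host, Set<String> visited,
                       ExecutorService pool, Phaser phaser) {
        // Register before submitting so the task is counted even if the
        // parent finishes first.
        phaser.register();
        pool.submit(() -> {
            try {
                for (String next : getUrls(url)) {
                    if (hostname(next).equals(host) && visited.add(next)) {
                        submit(next, host, visited, pool, phaser);
                    }
                }
            } finally {
                phaser.arriveAndDeregister(); // this task is done
            }
        });
    }

    public static void main(String[] args) {
        List<String> result = crawl("http://a.com/");
        Collections.sort(result);
        System.out.println(result);
    }
}
```

The key correctness point is that visited.add(next) is the single atomic decision point: whichever thread wins the add() is the one that submits the fetch task, so duplicate detection and work scheduling cannot race. A Phaser is used rather than a CountDownLatch because the number of tasks is not known up front.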