A web crawler starts from a URL and explores reachable pages level by level.
You are given:
- start_url: a string
- an interface fetch_links(url: string) -> List[string], which returns all hyperlinks found on that page
Write a function that returns all pages reachable from start_url that belong to the same hostname as start_url.
Requirements:
- Use breadth-first traversal.
- Visit each URL at most once.
- Ignore links whose hostname differs from the hostname of start_url.
- fetch_links may return duplicate links.
- You may return the result in BFS visitation order or in any order, as long as every valid reachable page is included exactly once.
Possible follow-up discussion:
- How do you avoid revisiting the same page?
- What are the time and space complexities?
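One possible solution is sketched below. The function name crawl and the idea of passing fetch_links as a callable parameter are assumptions for illustration; the problem only specifies that such an interface exists. Duplicate avoidance is handled by adding a URL to the visited set at the moment it is enqueued, never when it is dequeued, so a URL can never enter the queue twice.

```python
from collections import deque
from typing import Callable, List
from urllib.parse import urlparse


def crawl(start_url: str, fetch_links: Callable[[str], List[str]]) -> List[str]:
    """Return all same-hostname pages reachable from start_url, in BFS order."""
    host = urlparse(start_url).netloc
    visited = {start_url}          # mark URLs as seen when enqueued, not when dequeued
    queue = deque([start_url])
    result = []
    while queue:
        url = queue.popleft()
        result.append(url)
        for link in fetch_links(url):
            # skip other hostnames; the visited check also absorbs duplicate links
            if urlparse(link).netloc == host and link not in visited:
                visited.add(link)
                queue.append(link)
    return result
```

With V reachable same-hostname pages and E links among them, each page is fetched once and each link is examined once, giving O(V + E) time (plus the cost of the fetches themselves) and O(V) extra space for the visited set and queue.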