How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require a solid grasp of core concepts and regular practice. PracHub provides solutions with explanations to help you prepare for Coding & Algorithms interviews.

What difficulty level is this interview question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Salesforce.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Salesforce during technical interviews.

Build a concurrent site crawler | Salesforce Interview Question

Q: Build a concurrent site crawler

This question evaluates a candidate's ability to implement a concurrent web crawler, covering concurrency control, URL normalization and deduplication, HTTP handling, HTML parsing, and result serialization.

Implement a small crawler in Python for a single target website.

Requirements:

- Input: a start URL, a maximum crawl depth, and an output CSV path.
- Crawl only pages within the same domain as the start URL.
- Use concurrency so that many pages can be fetched in parallel.
- For each successfully fetched HTML page, extract:
  - page URL
  - page title
  - HTTP status code
- Deduplicate pages by normalized URL.
- Ignore non-HTML resources and skip pages that cannot be fetched.
- After crawling completes, sort the results by page title in ascending order and then by URL in ascending order.
- Export the final result set to a CSV file with columns: url, title, status_code.
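
The per-page requirements above (normalization for deduplication, the same-domain check, title extraction, and the sorted CSV export) can be sketched with the standard library alone. This is a minimal sketch, not a reference solution. The helper names `normalize_url`, `same_domain`, `extract_title`, and `export_results` are hypothetical, and the normalization rules (lowercase host, drop default ports and fragments, strip a trailing slash) are assumptions, since the question does not define what "normalized" means.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical string used as the dedup key (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # Assumption: "/a/b/" and "/a/b" are the same page; some sites disagree.
    path = parts.path.rstrip("/") or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, parts.query, ""))


def same_domain(url, start_url):
    """Exact host match; subdomains are treated as different sites here."""
    host = (urlsplit(url).hostname or "").lower()
    return host == (urlsplit(start_url).hostname or "").lower()


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.chunks.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if absent."""
    parser = _TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())


def export_results(rows, path):
    """Sort by (title, url) ascending and write url,title,status_code."""
    ordered = sorted(rows, key=lambda r: (r["title"], r["url"]))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title", "status_code"])
        writer.writeheader()
        writer.writerows(ordered)
    return ordered
```

Sorting on the tuple `(title, url)` gives the required two-level ordering in one pass. The comparison is case-sensitive; whether titles should be compared case-insensitively is a question worth raising with the interviewer.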

Discuss the main edge cases and how you would improve this into a production-grade crawler.
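
One possible shape for the concurrent, depth-limited crawl is a level-by-level breadth-first traversal that fetches each level through a thread pool. The sketch below is an illustration under stated assumptions, not a definitive implementation: `crawl`, `http_fetch`, and `LinkParser` are hypothetical names, the fetcher is injectable so the loop can be exercised without network access, and deduplication here only strips fragments. It deliberately omits production concerns such as robots.txt, rate limiting, retries, and redirect handling, which belong in the edge-case discussion.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def http_fetch(url, timeout=10):
    """Hypothetical default fetcher: (status, html), or None to skip the page."""
    try:
        req = Request(url, headers={"User-Agent": "example-crawler"})
        with urlopen(req, timeout=timeout) as resp:
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return None  # non-HTML resource
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.status, resp.read().decode(charset, errors="replace")
    except Exception:
        return None  # any failure means the page is skipped


def crawl(start_url, max_depth, fetch=http_fetch, workers=8):
    """Breadth-first crawl; each depth level is fetched in parallel."""
    start = urldefrag(start_url).url
    domain = urlsplit(start).hostname
    seen = {start}  # only the main thread touches this set, so no lock
    frontier = [start]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for depth in range(max_depth + 1):
            if not frontier:
                break
            # pool.map preserves input order, so pages line up with frontier.
            pages = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, page in zip(frontier, pages):
                if page is None:
                    continue
                status, html = page
                # Title extraction would also happen here; omitted for brevity.
                results.append({"url": url, "status_code": status, "depth": depth})
                if depth == max_depth:
                    continue  # do not expand links past the depth limit
                parser = LinkParser()
                parser.feed(html)
                for href in parser.links:
                    try:
                        link = urldefrag(urljoin(url, href)).url
                        parts = urlsplit(link)
                        host = parts.hostname
                    except ValueError:
                        continue  # malformed href
                    if parts.scheme in ("http", "https") and host == domain and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results
```

Keeping the `seen` set and the frontier on the main thread means the worker threads only perform I/O, which avoids locking at the cost of waiting for the slowest page in each level. A queue-based worker design or `asyncio` would remove that barrier and is a reasonable follow-up to mention.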
