PracHub

Build a concurrent site crawler

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to implement a concurrent web crawler, covering concurrency control, URL normalization and deduplication, HTTP handling, HTML parsing, and result serialization.

Company: Salesforce

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Jan 12, 2026, 12:00 AM

Implement a small crawler in Python for a single target website.

Requirements:

  • Input: a start URL, a maximum crawl depth, and an output CSV path.
  • Crawl only pages within the same domain as the start URL.
  • Use concurrency so that many pages can be fetched in parallel.
  • For each successfully fetched HTML page, extract:
    • page URL
    • page title
    • HTTP status code
  • Deduplicate pages by normalized URL.
  • Ignore non-HTML resources and skip pages that cannot be fetched.
  • After crawling completes, sort the results by page title in ascending order and then by URL in ascending order.
  • Export the final result set to a CSV file with columns: url, title, status_code.
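
The per-page requirements above (normalization, same-domain filtering, title extraction, sorting, and CSV export) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a reference solution: the names `normalize_url`, `same_domain`, `extract_title`, and `write_csv` are hypothetical, and the normalization policy (lowercase host, drop fragment and default port, empty path becomes `/`) is one reasonable choice among several.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed policy: lowercase scheme/host, drop fragment and default port."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # An empty path and "/" refer to the same page, so store "/" for both.
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def same_domain(url: str, start_url: str) -> bool:
    """Exact hostname match; subdomains are treated as different sites here."""
    return urlsplit(url).hostname == urlsplit(start_url).hostname


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> str:
        # Collapse runs of whitespace so multi-line titles become one line.
        return " ".join("".join(self._chunks).split())


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def write_csv(rows, out) -> None:
    """rows: iterable of (url, title, status_code); out: a text file object."""
    writer = csv.writer(out)
    writer.writerow(["url", "title", "status_code"])
    # Sort by title first, then by URL, both ascending.
    for url, title, status in sorted(rows, key=lambda r: (r[1], r[0])):
        writer.writerow([url, title, status])
```

When writing to a real file, opening it with `open(path, "w", newline="", encoding="utf-8")` avoids doubled line endings on some platforms. The sort here is case-sensitive; whether titles should be compared case-insensitively is worth asking the interviewer.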

Discuss the main edge cases and how you would extend this into a production-grade crawler.
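
One possible shape for the concurrent part is a breadth-first crawl that fetches each depth level in parallel with a thread pool. The sketch below makes several assumptions: `crawl`, `extract_links`, and `LinkParser` are hypothetical names; `fetch` is an injected callable returning `(status_code, content_type, body)` or `None` on failure, so the HTTP client is left open; the start page counts as depth 0; and normalization is reduced to resolving relative links and dropping fragments.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop fragments.
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def crawl(start_url, max_depth, fetch, max_workers=8):
    """Return (url, status_code, body) for each HTML page within max_depth.

    `fetch` is a hypothetical callable: fetch(url) returns
    (status_code, content_type, body), or None if the page cannot be fetched.
    It should return None rather than raise.
    """
    host = urlsplit(start_url).hostname
    start = urldefrag(start_url)[0]
    seen = {start}
    frontier = [start]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for depth in range(max_depth + 1):
            if not frontier:
                break
            # Fetch the whole level in parallel; map() preserves input order.
            responses = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, response in zip(frontier, responses):
                if response is None:
                    continue  # skip pages that cannot be fetched
                status, content_type, body = response
                if "text/html" not in content_type.lower():
                    continue  # ignore non-HTML resources
                results.append((url, status, body))
                if depth == max_depth:
                    continue  # do not expand links past the depth limit
                for link in extract_links(url, body):
                    # Only the main thread touches `seen`, so no lock is needed.
                    if urlsplit(link).hostname == host and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results
```

Keeping deduplication on the main thread avoids locking, at the cost of waiting for the slowest page in each level. A production version would likely add per-request timeouts, retries with backoff, robots.txt handling, rate limiting, and redirect-aware deduplication, and might swap the thread pool for an asyncio-based client.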
