PracHub

Design file deduplication at scale

Last updated: Mar 29, 2026

Quick Overview

This question tests file-system algorithms, hashing and deduplication techniques, scalable I/O and memory management, and correct handling of hash collisions. It belongs to the Coding & Algorithms domain.


Company: HubSpot

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Design an algorithm to identify duplicate files in a large directory tree. You are given an iterator over files providing (path, size) and a function read_chunks(path) -> Iterator[bytes]. Requirements: minimize I/O by comparing sizes and using rolling or cryptographic hashes; handle hash collisions safely; support datasets that do not fit in memory; and output groups of paths that are byte-for-byte identical. Explain time/space trade-offs and how you would parallelize the solution.


Sep 6, 2025, 12:00 AM

