PracHub

Group duplicate files by content

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of file deduplication, efficient I/O and hashing strategies, streaming large-file processing, and scalable algorithm design for identifying files with identical byte-for-byte contents.


Company: Applied

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Onsite


Feb 12, 2026, 12:00 AM

Problem

You are given a list of files in a filesystem. Each file has:

  • a full path (string)
  • a file size in bytes (integer)
  • file contents (conceptually; assume you can stream/read the file when needed)

Two files are duplicates if their contents are identical (byte-for-byte), regardless of file name or directory.

Task: Return groups of duplicate files. Each group should contain the full paths of files that have identical content, and groups of size 1 should be omitted.

Constraints / Expectations

  • There can be millions of files.
  • Reading file contents is expensive.
  • Aim to minimize unnecessary content reads / hashing.
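One way to honor these constraints is to use the file size, which is already part of the input, as a free first filter: files with different sizes cannot be byte-identical, so only sizes shared by two or more files need any content read. The sketch below assumes the input arrives as `(path, size)` pairs; the function name `candidate_groups_by_size` is hypothetical.

```python
from collections import defaultdict


def candidate_groups_by_size(files):
    """Group paths by size without reading any file contents.

    `files` is assumed to be an iterable of (path, size) pairs.
    Returns only the buckets holding two or more paths, since a file
    with a unique size cannot have a duplicate.
    """
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)
    return [paths for paths in by_size.values() if len(paths) > 1]


files = [
    ("/a/x.txt", 4),
    ("/b/y.txt", 4),
    ("/c/z.txt", 4),
    ("/d/w.txt", 7),
]
print(candidate_groups_by_size(files))
# [['/a/x.txt', '/b/y.txt', '/c/z.txt']]
```

Only the surviving buckets go on to hashing or comparison, so `/d/w.txt` is never opened.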

Follow-ups

  1. Describe an approach that avoids hashing every file if most files are unique.
  2. How would you handle extremely large files (e.g., multi-gigabyte) without loading them fully into memory?
  3. If hash collisions are a concern, how do you confirm duplicates?
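The three follow-ups can be sketched as successive refinements of a bucket of same-size files: a cheap hash of the first few kilobytes, then a full hash streamed in fixed-size chunks, then a byte-for-byte comparison to rule out collisions. This is a minimal sketch and not a prescribed solution. The helper names (`head_hash`, `full_hash`, `split_by`, `confirm`, `refine`), the 4 KiB prefix, the 1 MiB chunk size, and the choice of SHA-256 are all assumptions made for illustration.

```python
import filecmp
import hashlib
from collections import defaultdict

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def head_hash(path, nbytes=4096):
    """Hash only the first `nbytes` bytes: a cheap filter for mostly-unique files."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(nbytes)).digest()


def full_hash(path):
    """Hash the whole file in chunks so memory use stays bounded by CHUNK."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.digest()


def split_by(paths, key):
    """Partition paths by key(path), dropping partitions with a single member."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def confirm(paths):
    """Split a hash-equal bucket into groups that are byte-for-byte identical."""
    groups = []
    for p in paths:
        for g in groups:
            # shallow=False forces a content comparison instead of a stat check.
            if filecmp.cmp(g[0], p, shallow=False):
                g.append(p)
                break
        else:
            groups.append([p])
    return [g for g in groups if len(g) > 1]


def refine(same_size_paths):
    """Refine one same-size bucket: prefix hash, then full hash, then byte compare."""
    result = []
    for head_bucket in split_by(same_size_paths, head_hash):
        for full_bucket in split_by(head_bucket, full_hash):
            result.extend(confirm(full_bucket))
    return result
```

Each stage reads more data than the one before it, but only for files that survived the earlier stage, so a mostly-unique corpus rarely reaches the full hash. The final `confirm` step is optional when a cryptographic hash is considered trustworthy enough; it is included here because the third follow-up asks how to rule out collisions.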

Example

Input files (path, size, content):

  • /a/x.txt, 4, "ABCD"
  • /b/y.txt, 4, "ABCD"
  • /c/z.txt, 4, "WXYZ"
  • /d/w.txt, 7, "1234567"

Output:

  • ["/a/x.txt", "/b/y.txt"]
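To check the expected output, the example can be reproduced with a small reference implementation that holds contents in memory as bytes. This is only an oracle for tiny inputs, not an approach that meets the constraints above, and the function name `group_duplicates_in_memory` is hypothetical.

```python
from collections import defaultdict


def group_duplicates_in_memory(files):
    """Reference oracle: `files` is a list of (path, size, content_bytes) triples.

    Groups paths whose size and content match exactly and omits groups of size 1.
    """
    groups = defaultdict(list)
    for path, size, content in files:
        groups[(size, content)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


example = [
    ("/a/x.txt", 4, b"ABCD"),
    ("/b/y.txt", 4, b"ABCD"),
    ("/c/z.txt", 4, b"WXYZ"),
    ("/d/w.txt", 7, b"1234567"),
]
print(group_duplicates_in_memory(example))
# [['/a/x.txt', '/b/y.txt']]
```

`/c/z.txt` shares a size with the first two files but differs in content, and `/d/w.txt` has a unique size, so both fall into groups of size 1 and are omitted.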
