PracHub

Quick Overview

This question evaluates understanding of file deduplication, efficient I/O and hashing strategies, streaming large-file processing, and scalable algorithm design for identifying files with identical byte-for-byte contents.

Group duplicate files by content

Company: Applied

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Onsite

## Problem

You are given a list of files in a filesystem. Each file has:

- a full path (string)
- a file size in bytes (integer)
- file contents (conceptually; assume you can stream/read the file when needed)

Two files are **duplicates** if their contents are identical (byte-for-byte), regardless of file name or directory.

**Task:** Return groups of duplicate files. Each group should contain the full paths of files that have identical content, and groups of size 1 should be omitted.

### Constraints / Expectations

- There can be millions of files.
- Reading file contents is expensive.
- Aim to minimize unnecessary content reads / hashing.

### Follow-ups

1. Describe an approach that avoids hashing every file if most files are unique.
2. How would you handle extremely large files (e.g., >GB) without loading them fully into memory?
3. If hash collisions are a concern, how do you confirm duplicates?

### Example

Input files (path, size, content):

- /a/x.txt, 4, "ABCD"
- /b/y.txt, 4, "ABCD"
- /c/z.txt, 4, "WXYZ"
- /d/w.txt, 7, "1234567"

Output:

- ["/a/x.txt", "/b/y.txt"]
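One possible way to address the three follow-ups on a real filesystem is a staged filter: bucket by size, then by a hash of a small prefix, then by a full hash computed in fixed-size chunks, and finally confirm byte-for-byte. The sketch below is only an illustration under assumptions not stated in the problem: files are reachable through ordinary paths, SHA-256 is used as the hash, and the names `find_duplicates`, `_digest`, `_split`, `_confirm`, `CHUNK`, and `head_bytes` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from filecmp import cmp as files_equal

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def _digest(path, limit=None):
    """SHA-256 of the first `limit` bytes (whole file if None), read in chunks."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            chunk = f.read(n)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.digest()


def _split(paths, key):
    """Partition paths by key(path), keeping only buckets with 2+ members."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def _confirm(paths):
    """Byte-for-byte check inside one hash bucket, guarding against collisions."""
    groups = []
    for p in paths:
        for g in groups:
            if files_equal(g[0], p, shallow=False):
                g.append(p)
                break
        else:
            groups.append([p])
    return [g for g in groups if len(g) > 1]


def find_duplicates(paths, head_bytes=4096):
    """Return groups of paths with identical contents, omitting singletons."""
    result = []
    # Stage 1: size only (no content reads). Unique sizes drop out here.
    for same_size in _split(paths, os.path.getsize):
        # Stage 2: hash of a short prefix, cheap even for very large files.
        for same_head in _split(same_size, lambda p: _digest(p, head_bytes)):
            # Stage 3: full streaming hash, at most CHUNK bytes in memory.
            for same_hash in _split(same_head, _digest):
                # Stage 4: direct comparison to rule out hash collisions.
                result.extend(_confirm(same_hash))
    return result
```

Each stage reads content only for files that survived the previous one, so a file with a unique size is never opened. The final comparison re-reads the candidates, which is a deliberate trade of extra I/O for certainty; whether that is worthwhile depends on how much a false positive would cost.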

Quick Answer: Given file records (path, size, content), bucket them by size, compare contents only within each bucket, and return the groups of paths with identical contents, omitting singleton groups.

Constraints

  • Inputs are Python literals matching the function signature.
  • Return a deterministic exact-match value.

Examples

Input: ([("/a/x.txt",4,"ABCD"),("/b/y.txt",4,"ABCD"),("/c/z.txt",4,"WXYZ"),("/d/w.txt",7,"1234567")],)

Expected Output: [['/a/x.txt', '/b/y.txt']]

Explanation: Only the two ABCD files are duplicates.

Input: ([("a",1,"x"),("b",1,"y"),("c",1,"x"),("d",2,"x")],)

Expected Output: [['a', 'c']]

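Explanation: Files a and c share size 1 and content "x", so they form a group. File d also has content "x", but its declared size is 2, so it lands in a different size bucket and is never compared with them.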

Input: ([],)

Expected Output: []

Explanation: No files means no duplicate groups.

Hints

  1. Group by size before checking contents.
  2. Return groups in first-seen order for deterministic grading.
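
For the graded, in-memory version of the task, the two hints can be combined in a few lines. This is a minimal sketch, assuming the function is called `group_duplicates` (a hypothetical name; the page does not show the signature) and that "first-seen order" means groups are ordered by the position of their earliest member in the input.

```python
def group_duplicates(files):
    """files: iterable of (path, size, content) records."""
    groups = {}
    for path, size, content in files:
        # Keying on (size, content) means records with different sizes are
        # never placed together, even if their content fields are equal.
        # A dict keeps insertion order, which gives first-seen ordering.
        groups.setdefault((size, content), []).append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


print(group_duplicates([("a", 1, "x"), ("b", 1, "y"), ("c", 1, "x"), ("d", 2, "x")]))
# prints [['a', 'c']]
```

Because the contents are already in memory here, there is nothing to gain from hashing in stages; the size check matters only for matching the expected output on records like "d" above.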

Last updated: Jun 27, 2026


Related Coding Questions

  • Multi-Agent Collision Simulator (Unicycle Kinematics) - Applied (medium)
  • Merge Overlapping Collinear Segments - Applied (hard)
  • Implement a Fixed-Capacity Deque - Applied (medium)
  • Implement Nested Transactional Key-Value Store - Applied (hard)
  • Merge overlapping 2D line segments - Applied (medium)