How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a easy difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Applied.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Applied during technical interviews.

Group duplicate files by content | Applied Coding Question

Quick Overview

This question evaluates understanding of file deduplication, efficient I/O and hashing strategies, streaming large-file processing, and scalable algorithm design for identifying files with identical byte-for-byte contents.

Group duplicate files by content

Company: Applied

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Onsite

## Problem You are given a list of files in a filesystem. Each file has: - a full path (string) - a file size in bytes (integer) - file contents (conceptually; assume you can stream/read the file when needed) Two files are **duplicates** if their contents are identical (byte-for-byte), regardless of file name or directory. **Task:** Return groups of duplicate files. Each group should contain the full paths of files that have identical content, and groups of size 1 should be omitted. ### Constraints / Expectations - There can be millions of files. - Reading file contents is expensive. - Aim to minimize unnecessary content reads / hashing. ### Follow-ups 1. Describe an approach that avoids hashing every file if most files are unique. 2. How would you handle extremely large files (e.g., >GB) without loading them fully into memory? 3. If hash collisions are a concern, how do you confirm duplicates? ### Example Input files (path, size, content): - /a/x.txt, 4, "ABCD" - /b/y.txt, 4, "ABCD" - /c/z.txt, 4, "WXYZ" - /d/w.txt, 7, "1234567" Output: - ["/a/x.txt", "/b/y.txt"]

Quick Answer: This question evaluates understanding of file deduplication, efficient I/O and hashing strategies, streaming large-file processing, and scalable algorithm design for identifying files with identical byte-for-byte contents.

Given file records [path, size, content], return duplicate path groups for identical contents, omitting singleton groups.

Constraints

Inputs are Python literals matching the function signature.
Return a deterministic exact-match value.

Examples

Input: ([("/a/x.txt",4,"ABCD"),("/b/y.txt",4,"ABCD"),("/c/z.txt",4,"WXYZ"),("/d/w.txt",7,"1234567")],)

Expected Output: [['/a/x.txt', '/b/y.txt']]

Explanation: Only the two ABCD files are duplicates.

Input: ([("a",1,"x"),("b",1,"y"),("c",1,"x"),("d",2,"x")],)

Expected Output: [['a', 'c']]

Explanation: Content matches are considered only after size grouping.

Input: ([],)

Expected Output: []

Explanation: No files means no duplicate groups.

Hints

Group by size before checking contents.
Return groups in first-seen order for deterministic grading.

Last updated: Jun 27, 2026

Loading coding console...