Find duplicates in a list efficiently
Company: TCS
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Given a Python list that can contain integers (possibly negative) and may be very large, write code and explain approaches to find all values that appear more than once, along with their occurrence counts and the index of their first appearance, while preserving the original order of first occurrence. For example, input: [3, 1, 2, 3, -1, 2, 3, 4, 1] → output should enumerate 3 (count 3, first idx 0), 2 (count 2, first idx 2), 1 (count 2, first idx 1) in that order. Then answer:
1) Provide an O(n) time, O(k) extra space solution (k = number of distinct values) and analyze time/space complexities. Explain how you maintain stable output order.
2) If the integers are guaranteed in the range 0..n-1 and in-place modification is allowed, design an O(n) time, O(1) extra space method (e.g., index marking or modular counting). Explain how to adapt it when negatives may appear.
3) For streaming data that cannot fit in memory (n up to 1e8), outline a solution using fixed memory (e.g., hashing to buckets on disk, Count-Min Sketch + exact verification). Analyze error bounds if you use probabilistic structures.
4) Discuss trade-offs among hash-based counting, sorting-based methods (O(n log n), O(1) extra space if in-place), and bitset/bitmap approaches for large integer domains. Include worst-case behaviors and cache implications.
5) Provide thorough tests covering empty list, all-unique, all-duplicates, very large inputs, and mixtures of positive/negative values.
Quick Answer: This question evaluates a candidate's ability to design efficient algorithms for duplicate detection, occurrence counting, and preserving first-occurrence order while reasoning about time and space complexity across in-memory, in-place, and streaming data scenarios.
Given a list of integers (possibly negative, possibly very large), return all values that appear more than once. For each such value, return a triple `[value, count, first_index]` where `count` is the total number of occurrences and `first_index` is the 0-based index where the value first appears. The results must be ordered by first occurrence in the input (i.e., by `first_index` ascending).
For example, `[3, 1, 2, 3, -1, 2, 3, 4, 1]` yields `[[3, 3, 0], [1, 2, 1], [2, 2, 2]]`: value 3 occurs 3 times first seen at index 0, value 1 occurs 2 times first seen at index 1, value 2 occurs 2 times first seen at index 2. Values that appear only once (like -1 and 4) are excluded.
Return an empty list if there are no duplicates. Aim for O(n) time and O(k) extra space where k is the number of distinct values.
Constraints
- 0 <= len(nums) <= 10^7
- Values may be negative.
- Values may be arbitrarily large or small (no fixed range assumed for the general solution).
- Output order must follow the first-occurrence index of each duplicate value.
Examples
Input: ([3, 1, 2, 3, -1, 2, 3, 4, 1],)
Expected Output: [[3, 3, 0], [1, 2, 1], [2, 2, 2]]
Explanation: 3 appears 3x (first idx 0), 1 appears 2x (first idx 1), 2 appears 2x (first idx 2); -1 and 4 appear once so are excluded. Ordered by first occurrence.
Input: ([],)
Expected Output: []
Explanation: Empty input has no duplicates.
Input: ([1, 2, 3, 4, 5],)
Expected Output: []
Explanation: All values are unique, so no duplicates are returned.
Input: ([7, 7, 7, 7],)
Expected Output: [[7, 4, 0]]
Explanation: A single value repeated 4 times, first seen at index 0.
Input: ([-5, -5, -3, -3, -3, 0, 0],)
Expected Output: [[-5, 2, 0], [-3, 3, 2], [0, 2, 5]]
Explanation: Mix of negatives and zero; all three values duplicate, ordered by first index 0, 2, 5.
Input: ([42],)
Expected Output: []
Explanation: Single element cannot be a duplicate.
Input: ([2, 1, 2, 1, 3, 3],)
Expected Output: [[2, 2, 0], [1, 2, 1], [3, 2, 4]]
Explanation: Interleaved duplicates; output ordered by first-occurrence index (2 at 0, 1 at 1, 3 at 4).
Hints
- A single pass with a hash map keyed by value can track both an occurrence count and the index of first appearance simultaneously.
- To preserve stable output order, record values in a separate list the moment you see them for the first time, then emit only those whose count exceeds 1 in that recorded order.
- Storing [count, first_index] together per value avoids a second map and keeps everything O(n) time with O(k) extra space.