Implement Reservoir Sampling; Analyze Time and Space Complexity
Company: Amazon
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of streaming randomized algorithms and probabilistic reasoning, specifically reservoir sampling, together with time and space complexity analysis and the use of hash-based structures for duplicate detection. It falls in the Coding & Algorithms domain and tests both conceptual probability invariants and practical implementation skills. Such problems are commonly asked in technical interviews to gauge a candidate's ability to design one-pass, linear-time solutions for unbounded data streams and to reason about algorithmic correctness and resource bounds.
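The reservoir sampling idea named above can be sketched as follows. This is a minimal illustration of the classic single-pass approach (often called Algorithm R), not a definitive implementation; the function name `reservoir_sample` and the parameters `stream`, `k`, and `rng` are assumptions made for this example. The invariant is that after `i + 1` items have been seen, each of them sits in the reservoir with probability `k / (i + 1)`.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from stream (hypothetical helper)."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is uniform over i + 1 values
            # and the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 5, random.Random(0))
print(len(sample))  # 5
```

Under these assumptions the sketch runs in O(n) time for a stream of n items and keeps only O(k) items in memory, so the stream length need not be known in advance.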
Constraints
- 0 <= len(nums) <= 200000
- nums[i] fits in a 32-bit signed integer
- Time complexity must be O(n)
- Use only set() and enumerate(); do not use dict, Counter, or sorting
- Return indices in increasing order
Solution
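from typing import List


def duplicate_indices(nums: List[int]) -> List[int]:
    seen = set()      # values encountered at least once
    dup_vals = set()  # values encountered more than once
    # First pass: record which values repeat.
    for x in nums:
        if x in seen:
            dup_vals.add(x)
        else:
            seen.add(x)
    # Second pass: enumerate yields indices in increasing order.
    return [i for i, x in enumerate(nums) if x in dup_vals]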
Explanation
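Time complexity: O(n). The solution makes two passes over nums, and each set lookup or insertion takes O(1) time on average. Space complexity: O(u) additional space for the two sets, where u is the number of unique values, plus O(k) for the output, where k is the number of indices returned.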
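For contrast with the O(n) bound, a quadratic reference can be handy for checking results on small inputs. The sketch below is an assumption for illustration only: the name `duplicate_indices_bruteforce` is hypothetical, and the use of `list.count` ignores the set-and-enumerate restriction in the constraints, so it is not an acceptable answer to the question itself.

```python
from typing import List


def duplicate_indices_bruteforce(nums: List[int]) -> List[int]:
    """O(n^2) reference: an index qualifies if its value occurs more than once."""
    # nums.count(x) scans the whole list, and it runs once per element.
    return [i for i, x in enumerate(nums) if nums.count(x) > 1]


print(duplicate_indices_bruteforce([1, 2, 1, 3, 2, 4]))  # [0, 1, 2, 4]
```

At the upper constraint of 200000 elements this reference would perform on the order of 4 * 10^10 comparisons, which is why the set-based passes are required.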
Hints
- First pass: use a set to track seen values and collect which values are duplicates.
- Second pass: use enumerate to collect indices whose values are in the duplicates set.
- No need to sort; enumerating maintains increasing index order.
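The two passes described in the hints can be traced on a small input. The sketch below restates the approach in a self-contained form and returns the intermediate duplicates set alongside the indices; the helper name `trace_duplicate_indices` is hypothetical and exists only for this illustration.

```python
from typing import List, Set, Tuple


def trace_duplicate_indices(nums: List[int]) -> Tuple[Set[int], List[int]]:
    """Return (values seen more than once, indices holding those values)."""
    seen: Set[int] = set()
    dup_vals: Set[int] = set()
    # First pass: a value already in `seen` is a duplicate.
    for x in nums:
        if x in seen:
            dup_vals.add(x)
        else:
            seen.add(x)
    # Second pass: enumerate visits indices in increasing order, so no sort is needed.
    indices = [i for i, x in enumerate(nums) if x in dup_vals]
    return dup_vals, indices


dup_vals, indices = trace_duplicate_indices([1, 2, 1, 3, 2, 4])
print(sorted(dup_vals))  # [1, 2]
print(indices)           # [0, 1, 2, 4]
```

Here the values 1 and 2 each appear twice, so every index holding either value (0, 1, 2, and 4) is returned, while the indices of 3 and 4 are skipped.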