Implement anagram check and stable deduplication
Company: Google
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Part A — Anagram checker:
Write a function is_anagram(a: str, b: str, locale: str = 'en') -> bool that returns True iff a and b are anagrams under the following rules:
- Ignore case, whitespace, and Unicode punctuation.
- Normalize Unicode using NFKD; when locale='en', strip combining marks so accented letters compare equal to their base letters (e.g., 'é' -> 'e').
- Consider only letters and digits after normalization.
- Must work for inputs up to 10^7 characters each with O(σ) additional memory where σ is the alphabet size under the rules above, and O(n) time. Streaming solutions that do a single pass over each string are preferred; do not materialize full normalized strings.
- Return False immediately if, after filtering, the multiset lengths differ.
Follow-ups:
1) How would you adapt it to be locale-aware where 'ß' in German equals 'ss'? 2) How to handle right-to-left scripts without breaking normalization?
Part B — Stable de-duplication:
Write a function unique_stable(iterable, key=None, treat_nan_as_equal=False) that returns a lazy generator yielding the first occurrence of each distinct element while preserving original order.
- If key is None, distinctness is by the element itself (assume elements are hashable). If key is provided, distinctness is by key(x).
- If treat_nan_as_equal=True, treat all NaN floats as equal for distinctness even though NaN != NaN in Python.
- Target O(n) time and O(m) space where m is the number of distinct keys seen. The implementation must be lazy (streaming) over the input and must not sort.
Provide tight big-O bounds, and brief tests covering edge cases like empty input, all-duplicates, Unicode strings, very large inputs, and NaN handling.
Quick Answer: This question evaluates streaming algorithm design, Unicode normalization and locale-aware text processing, multiset frequency counting under tight time and memory bounds for an anagram checker, and stable, order-preserving deduplication with custom keys and NaN handling; it is in the Coding & Algorithms domain for a Data Scientist position.