Part A — Anagram checker: Write a function is_anagram(a: str, b: str, locale: str = 'en') -> bool that returns True iff a and b are anagrams under the following rules: - Ignore case, whitespace, and Unicode punctuation. - Normalize Unicode using NFKD; when locale='en', strip combining marks so accented letters compare equal to their base letters (e.g., 'é' -> 'e'). - Consider only letters and digits after normalization. - Must work for inputs up to 10^7 characters each with O(σ) additional memory where σ is the alphabet size under the rules above, and O(n) time. Streaming solutions that do a single pass over each string are preferred; do not materialize full normalized strings. - Return False immediately if, after filtering, the multiset lengths differ. Follow-ups: 1) How would you adapt it to be locale-aware where 'ß' in German equals 'ss'? 2) How to handle right-to-left scripts without breaking normalization? Part B — Stable de-duplication: Write a function unique_stable(iterable, key=None, treat_nan_as_equal=False) that returns a lazy generator yielding the first occurrence of each distinct element while preserving original order. - If key is None, distinctness is by the element itself (assume elements are hashable). If key is provided, distinctness is by key(x). - If treat_nan_as_equal=True, treat all NaN floats as equal for distinctness even though NaN != NaN in Python. - Target O(n) time and O(m) space where m is the number of distinct keys seen. The implementation must be lazy (streaming) over the input and must not sort. Provide tight big-O bounds, and brief tests covering edge cases like empty input, all-duplicates, Unicode strings, very large inputs, and NaN handling.

This question evaluates streaming algorithm design, Unicode normalization and locale-aware text processing, multiset frequency counting under tight time and memory bounds for an anagram checker, and stable, order-preserving deduplication with custom keys and NaN handling; it is in the Coding & Algorithms domain for a Data Scientist position.

How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master coding & algorithms interviews.

What difficulty level is this interview question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Google during technical interviews.

Implement anagram check and stable deduplication

Part A — Anagram checker: Write a function is_anagram(a: str, b: str, locale: str = 'en') -> bool that returns True iff a and b are anagrams under the following rules:

Ignore case, whitespace, and Unicode punctuation.
Normalize Unicode using NFKD; when locale='en', strip combining marks so accented letters compare equal to their base letters (e.g., 'é' -> 'e').
Consider only letters and digits after normalization.
Must work for inputs up to 10^7 characters each with O(σ) additional memory where σ is the alphabet size under the rules above, and O(n) time. Streaming solutions that do a single pass over each string are preferred; do not materialize full normalized strings.
Return False immediately if, after filtering, the multiset lengths differ. Follow-ups:

How would you adapt it to be locale-aware where 'ß' in German equals 'ss'? 2) How to handle right-to-left scripts without breaking normalization?

Part B — Stable de-duplication: Write a function unique_stable(iterable, key=None, treat_nan_as_equal=False) that returns a lazy generator yielding the first occurrence of each distinct element while preserving original order.

If key is None, distinctness is by the element itself (assume elements are hashable). If key is provided, distinctness is by key(x).
If treat_nan_as_equal=True, treat all NaN floats as equal for distinctness even though NaN != NaN in Python.
Target O(n) time and O(m) space where m is the number of distinct keys seen. The implementation must be lazy (streaming) over the input and must not sort. Provide tight big-O bounds, and brief tests covering edge cases like empty input, all-duplicates, Unicode strings, very large inputs, and NaN handling.

Part A — Anagram checker: Write a function is_anagram(a: str, b: str, locale: str = 'en') -> bool that returns True iff a and b are anagrams under the following rules:

Ignore case, whitespace, and Unicode punctuation.
Normalize Unicode using NFKD; when locale='en', strip combining marks so accented letters compare equal to their base letters (e.g., 'é' -> 'e').
Consider only letters and digits after normalization.
Must work for inputs up to 10^7 characters each with O(σ) additional memory where σ is the alphabet size under the rules above, and O(n) time. Streaming solutions that do a single pass over each string are preferred; do not materialize full normalized strings.
Return False immediately if, after filtering, the multiset lengths differ. Follow-ups:

How would you adapt it to be locale-aware where 'ß' in German equals 'ss'? 2) How to handle right-to-left scripts without breaking normalization?

If key is None, distinctness is by the element itself (assume elements are hashable). If key is provided, distinctness is by key(x).
If treat_nan_as_equal=True, treat all NaN floats as equal for distinctness even though NaN != NaN in Python.
Target O(n) time and O(m) space where m is the number of distinct keys seen. The implementation must be lazy (streaming) over the input and must not sort. Provide tight big-O bounds, and brief tests covering edge cases like empty input, all-duplicates, Unicode strings, very large inputs, and NaN handling.

Implement anagram check and stable deduplication

Quick Overview

Submit Your Answer to Earn 20XP

Implement anagram check and stable deduplication

Quick Overview

Submit Your Answer to Earn 20XP