Match alphanumeric patterns in a stream
Company: Optiver
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Take-home Project
Quick Answer: This question evaluates skills in exact and approximate alphanumeric string matching, streaming pattern lookup, noise-tolerant comparison, and performance-oriented data structures and encodings.
Constraints
- 1 <= len(reference) <= 64
- 0 <= len(candidates) <= 10^4
- Each candidate string has length 0 to 64
- Codes contain ASCII letters and digits only
- 0 <= maxHammingDistance <= len(reference)
- Indices in the result must be in strictly ascending order
Examples
Input: ('A1B2C3D', ['A1B2C3D', 'X9Y8Z7Q', 'A1B2C3X', 'A1B2C3D'], False, 0)
Expected Output: [0, 3]
Explanation: Exact match, no normalization, threshold 0. Indices 0 and 3 are identical to the reference; index 2 differs in the last char; index 1 is entirely different.
Input: ('A1B2C3D', ['AIB2C3D', 'A1B2C30', 'a1b2c3d', 'A1B2C3D'], True, 0)
Expected Output: [0, 2, 3]
Explanation: With normalization, reference canonicalizes to 'AIB2C3D'. Index 0 ('AIB2C3D') and index 3 already match; index 2 ('a1b2c3d' -> 'AIB2C3D') matches via case+1->I folding. Index 1 ('A1B2C30' -> 'AIB2C3O') differs in the last position, so it is excluded.
Input: ('ABC1234', ['ABC1235', 'ABC1244', 'ABC9234', 'XYZ0000'], False, 1)
Expected Output: [0, 1, 2]
Explanation: Threshold 1: indices 0, 1, 2 each differ in exactly one position from 'ABC1234'. Index 3 differs in many positions and is excluded.
Input: ('CODE123', ['CODE124', 'CODX125', 'CODE123'], False, 2)
Expected Output: [0, 1, 2]
Explanation: Threshold 2: index 0 differs in 1 position, index 1 in 2 positions, index 2 in 0 positions; all are within the allowed distance.
Input: ('SHORT12', ['SHORT1', 'SHORT123', 'SHORT12'], False, 0)
Expected Output: [2]
Explanation: Length pre-filter: 'SHORT1' (6 chars) and 'SHORT123' (8 chars) differ in length from the 7-char reference and never match; only the exact 'SHORT12' at index 2 qualifies.
Input: ('Z9Z9Z9Z', [], False, 0)
Expected Output: []
Explanation: Empty candidate list yields an empty result.
Input: ('OOIIILOO', ['00IIIL00', 'OOI1ILOO', '12345678'], True, 0)
Expected Output: [0, 1]
Explanation: With normalization, '00IIIL00' folds 0->O to 'OOIIILOO' (match) and 'OOI1ILOO' folds 1->I to 'OOIIILOO' (match). '12345678' folds to 'I2345678', which does not match.
Hints
- Length is a cheap pre-filter: any candidate whose length differs from the reference can never match, regardless of the Hamming threshold.
- Do the lookalike/case canonicalization once per string up front, then compare canonical forms character by character.
- When counting differing positions, short-circuit as soon as the running count exceeds maxHammingDistance to stay fast on long batches.
- Order matters in the canonicalization: uppercase first (so '1'->I and '0'->O folding applies uniformly to lowercase input too).