Implement phrase search and JSON path
Company: Netflix
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Quick Answer: This question evaluates the ability to build an inverted index for in-memory exact phrase search and to implement JSON path traversal with wildcard matching, testing competency in data structures, string normalization and tokenization, and recursive or iterative traversal of nested maps.
Part 1: In-Memory Document Phrase Search
Constraints
- 0 <= len(documents) <= 10^4
- The total number of whitespace-separated tokens across all documents is at most 2 * 10^5
- Matching is case-insensitive
- Tokenization is done with standard whitespace splitting only
- An empty or whitespace-only query should return []
Examples
Input: (['The quick brown fox', 'quick brown', 'brown quick'], 'quick brown')
Expected Output: [0, 1]
Explanation: Documents 0 and 1 contain the exact phrase 'quick brown'. Document 2 has the words in reverse order.
Input: (['Alpha beta', 'gamma ALPHA gamma', 'beta'], 'alpha')
Expected Output: [0, 1]
Explanation: Single-word search is case-insensitive, so both 'Alpha' and 'ALPHA' match.
Input: (['a a a', 'a b a', 'a a'], 'a a')
Expected Output: [0, 2]
Explanation: Document 0 contains the phrase more than once, and document 2 contains it once. Document 1 does not have consecutive 'a a'.
Input: ([], 'anything')
Expected Output: []
Explanation: Edge case: an empty corpus has no matches.
Input: (['one two three'], ' ')
Expected Output: []
Explanation: Edge case: a whitespace-only query is treated as empty.
Hints
- For phrase queries, storing only document IDs is not enough; you also need word positions.
- First narrow down candidate documents by intersecting the document sets for all query words, then verify consecutive positions.
Part 2: JSON Path Query with Wildcards
Constraints
- The structure contains only dictionaries and scalar values; arrays are not used
- Keys are non-empty strings and do not contain '.'
- 0 <= total number of keys in the structure <= 10^5
- Leading dots in the path should be ignored
- If wildcard expansion yields multiple matches, return them in dictionary iteration order
Examples
Input: ({'contacts': {'cell': '111', 'home': '222'}}, '.contacts.cell')
Expected Output: '111'
Explanation: Leading dots are ignored, so the path resolves to the scalar value at contacts -> cell.
Input: ({'contacts': {'cell': '111', 'home': '222'}}, 'contacts.*')
Expected Output: ['111', '222']
Explanation: The wildcard at the end matches both keys inside 'contacts'.
Input: ({'contacts': {'cell': '111', 'home': '222'}, 'work': {'cell': '333'}, 'meta': 5}, '.*.cell')
Expected Output: ['111', '333']
Explanation: The top-level wildcard visits each top-level value. Only dictionary branches with a 'cell' key produce matches.
Input: ({'a': {'b': 1}}, 'a.c')
Expected Output: None
Explanation: The path does not exist, so the function should return None.
Input: ({'a': 1}, '.')
Expected Output: {'a': 1}
Explanation: Edge case: an empty path after removing dots returns the original object.
Hints
- Split the path by '.' and ignore empty tokens created by leading dots.
- A recursive or DFS-style traversal works well: when you see '*', try every child at the current dictionary level.