Implement file word count
Company: Adobe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's skills in text processing, large-scale I/O, token normalization, memory-constrained computation and algorithmic trade-offs for frequency aggregation, and it falls under the Coding & Algorithms domain because it combines parsing, data structures and external processing concerns.
Constraints
- 0 <= len(chunks) <= 100000
- 0 <= total number of characters across all chunks <= 10^7
- 0 <= k
- Chunks must be processed in order, and words may span chunk boundaries
Examples
Input: (['Hello, world! HELLO... world?', 'hello'], 2)
Expected Output: [('hello', 3), ('world', 2)]
Explanation: After normalization, the words are hello, world, hello, world, hello. The top 2 are hello (3) and world (2).
Input: (["Don", "'t stop belie", "vin'!", " Don't, stop."], 3)
Expected Output: [("don't", 2), ('stop', 2), ('believin', 1)]
Explanation: The parser must join words across chunk boundaries. `Don't` appears twice, `stop` appears twice, and `believin'` is normalized to `believin` because the trailing apostrophe is not internal.
Input: ([], 5)
Expected Output: []
Explanation: An empty file contains no words.
Input: (['Straße stra', 'sse café CAFÉ 123 123', " O’Reilly o'reilly"], 4)
Expected Output: [('123', 2), ('café', 2), ("o'reilly", 2), ('strasse', 2)]
Explanation: `Straße` and `strasse` normalize to `strasse`, both café variants normalize to `café`, and both apostrophe styles normalize to `o'reilly`. All four words have frequency 2, so tie-breaking is lexicographical.
Input: (['One two two'], 0)
Expected Output: []
Explanation: If k is 0, the function should return no results.
Solution
def solution(chunks, k):
import heapq
import unicodedata
if k <= 0:
return []
counts = {}
token = []
pending_apostrophes = 0
apostrophes = {"'", '’', 'ʼ', '''}
def finalize():
nonlocal token, pending_apostrophes
if token:
word = ''.join(token)
counts[word] = counts.get(word, 0) + 1
token = []
pending_apostrophes = 0
for chunk in chunks:
chunk = unicodedata.normalize('NFKC', chunk)
for ch in chunk:
if ch.isalnum():
if pending_apostrophes:
token.extend("'" for _ in range(pending_apostrophes))
pending_apostrophes = 0
token.append(ch.casefold())
elif ch in apostrophes:
if token:
pending_apostrophes += 1
else:
finalize()
if token:
word = ''.join(token)
counts[word] = counts.get(word, 0) + 1
if not counts:
return []
class Entry:
__slots__ = ('count', 'word')
def __init__(self, count, word):
self.count = count
self.word = word
def __lt__(self, other):
if self.count != other.count:
return self.count < other.count
return self.word > other.word
heap = []
for word, count in counts.items():
entry = Entry(count, word)
if len(heap) < k:
heapq.heappush(heap, entry)
else:
worst = heap[0]
if count > worst.count or (count == worst.count and word < worst.word):
heapq.heapreplace(heap, entry)
result = [(entry.word, entry.count) for entry in heap]
result.sort(key=lambda item: (-item[1], item[0]))
return resultTime complexity: O(T + U log k), where T is the total number of characters across all chunks and U is the number of unique normalized words.. Space complexity: O(U + k + L), where U is the number of unique normalized words and L is the length of the current token being buffered across chunk boundaries..
Hints
- Keep a current token and any pending apostrophes between chunks so a word split across two chunks is still counted correctly.
- After building the frequency map, use a min-heap of size `k` to avoid sorting every distinct word when `k` is much smaller than the number of unique words.