PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates a candidate's skills in text processing, large-scale I/O, token normalization, memory-constrained computation and algorithmic trade-offs for frequency aggregation, and it falls under the Coding & Algorithms domain because it combines parsing, data structures and external processing concerns.

  • Medium
  • Adobe
  • Coding & Algorithms
  • Software Engineer

Implement file word count

Company: Adobe

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Write a function that reads a large text file and returns the frequency count of each word. Define how you will normalize tokens (case, Unicode, punctuation, contractions), handle memory limits (streaming, chunking), and output the top-k most frequent words efficiently. Analyze time and space complexity and discuss trade-offs between hash maps, tries, and external sorting when the file barely fits in memory.

Quick Answer: This question evaluates a candidate's skills in text processing, large-scale I/O, token normalization, memory-constrained computation and algorithmic trade-offs for frequency aggregation, and it falls under the Coding & Algorithms domain because it combines parsing, data structures and external processing concerns.

A very large text file is provided as a list of text chunks in reading order to simulate streaming file reads. Write a function `solution(chunks, k)` that returns the top-`k` most frequent words in the file without concatenating the entire file into one giant string. Normalize tokens using these rules: 1. Apply Unicode normalization with NFKC. 2. Compare words case-insensitively using Unicode `casefold()`. 3. A word consists of Unicode letters and digits, and may contain apostrophes only when the apostrophe is inside the word (for example, `don't` stays one word, but `'hello'` becomes `hello`, and `believin'` becomes `believin`). Treat common Unicode apostrophes like `’` as `'`. 4. Any other character is a separator. 5. Words may be split across chunk boundaries, so your parser must preserve partial tokens between chunks. Return the result as a list of `(word, count)` tuples sorted by descending frequency, then ascending lexicographical word order for ties. If `k` is 0 or there are no words, return an empty list. For the coding portion, implement the exact in-memory approach using a hash map for counts and an efficient top-`k` extraction strategy. In discussion, candidates should be able to compare this approach with tries and external sorting when the distinct-word set barely fits in memory.

Constraints

  • 0 <= len(chunks) <= 100000
  • 0 <= total number of characters across all chunks <= 10^7
  • 0 <= k
  • Chunks must be processed in order, and words may span chunk boundaries

Examples

Input: (['Hello, world! HELLO... world?', 'hello'], 2)

Expected Output: [('hello', 3), ('world', 2)]

Explanation: After normalization, the words are hello, world, hello, world, hello. The top 2 are hello (3) and world (2).

Input: (["Don", "'t stop belie", "vin'!", " Don't, stop."], 3)

Expected Output: [("don't", 2), ('stop', 2), ('believin', 1)]

Explanation: The parser must join words across chunk boundaries. `Don't` appears twice, `stop` appears twice, and `believin'` is normalized to `believin` because the trailing apostrophe is not internal.

Input: ([], 5)

Expected Output: []

Explanation: An empty file contains no words.

Input: (['Straße stra', 'sse café CAFÉ 123 123', " O’Reilly o'reilly"], 4)

Expected Output: [('123', 2), ('café', 2), ("o'reilly", 2), ('strasse', 2)]

Explanation: `Straße` and `strasse` normalize to `strasse`, both café variants normalize to `café`, and both apostrophe styles normalize to `o'reilly`. All four words have frequency 2, so tie-breaking is lexicographical.

Input: (['One two two'], 0)

Expected Output: []

Explanation: If k is 0, the function should return no results.

Solution

def solution(chunks, k):
    import heapq
    import unicodedata

    if k <= 0:
        return []

    counts = {}
    token = []
    pending_apostrophes = 0
    apostrophes = {"'", '’', 'ʼ', '''}

    def finalize():
        nonlocal token, pending_apostrophes
        if token:
            word = ''.join(token)
            counts[word] = counts.get(word, 0) + 1
            token = []
        pending_apostrophes = 0

    for chunk in chunks:
        chunk = unicodedata.normalize('NFKC', chunk)
        for ch in chunk:
            if ch.isalnum():
                if pending_apostrophes:
                    token.extend("'" for _ in range(pending_apostrophes))
                    pending_apostrophes = 0
                token.append(ch.casefold())
            elif ch in apostrophes:
                if token:
                    pending_apostrophes += 1
            else:
                finalize()

    if token:
        word = ''.join(token)
        counts[word] = counts.get(word, 0) + 1

    if not counts:
        return []

    class Entry:
        __slots__ = ('count', 'word')

        def __init__(self, count, word):
            self.count = count
            self.word = word

        def __lt__(self, other):
            if self.count != other.count:
                return self.count < other.count
            return self.word > other.word

    heap = []
    for word, count in counts.items():
        entry = Entry(count, word)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            worst = heap[0]
            if count > worst.count or (count == worst.count and word < worst.word):
                heapq.heapreplace(heap, entry)

    result = [(entry.word, entry.count) for entry in heap]
    result.sort(key=lambda item: (-item[1], item[0]))
    return result

Time complexity: O(T + U log k), where T is the total number of characters across all chunks and U is the number of unique normalized words.. Space complexity: O(U + k + L), where U is the number of unique normalized words and L is the length of the current token being buffered across chunk boundaries..

Hints

  1. Keep a current token and any pending apostrophes between chunks so a word split across two chunks is still counted correctly.
  2. After building the frequency map, use a min-heap of size `k` to avoid sorting every distinct word when `k` is much smaller than the number of unique words.
Last updated: May 16, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Traverse a path and print directory tree - Adobe (medium)
  • Build a React team builder with role constraints - Adobe (medium)
  • Implement K-means clustering from scratch - Adobe (medium)
  • Design a nested-list iterator - Adobe (Medium)
  • Maximize pay by flipping k rest days - Adobe (Medium)