Word Count
Asked of: Software Engineer
Last updated
What's being tested
These problems test scalable text processing and frequency-aggregation skills: tokenization/normalization, memory‑efficient counting, and top‑K extraction. Interviewers expect clear tradeoffs between exact vs. approximate counting, external‑memory algorithms, and simple distributed sharding/merge strategies.
Patterns & templates
- Single-pass streaming tokenization — read fixed-size chunks, handle boundary tokens, normalize (lowercase, unicode NFKC), O(n) time, O(1) extra per byte.
- In-memory hash aggregation — use
hashmap/Counterto tally tokens, O(n) time, O(u) space where u = distinct tokens; spill when memory high. - External hashing/bucketing — hash tokens to M buckets on disk so each bucket fits RAM, then count buckets individually and merge counts.
- External merge sort + group-by — sort token list on disk (external sort), then single linear pass to aggregate adjacent equal tokens; I/O cost ~ O(N log N) via runs.
- Partial-aggregation for distributed — local counts per shard, then shuffle only (token,partial_count) to reducers; reduces network by combining locally.
- Top-K via min-heap — maintain size-K
heapqmin-heap of (count,token), O(u log K) time, memory O(K). - Approximate counting — Count‑Min Sketch for streaming approximate frequencies, O(1) update, small memory, has probabilistic overestimate bounds.
Common pitfalls
Pitfall: Loading the entire file into memory — declare streaming or external approach; naive
read()causes O(N) memory blowups.
Pitfall: Incorrect token boundaries across chunk reads — always keep the trailing partial token and prepend to next chunk.
Pitfall: Forgetting normalization (case/unicode/punctuation) produces fragmented keys; define normalization rules upfront.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Related concepts
- Hash Map Counting And Frequency AnalysisCoding & Algorithms
- Dynamic Programming And Combinatorial CountingCoding & Algorithms
- Arrays, Strings, Hash Maps, And Frequency CountingCoding & Algorithms
- Arrays, Strings, and HashingCoding & Algorithms
- Streaming, Large Inputs, And External MemorySoftware Engineering Fundamentals
- Count Data Modeling