Given a text document, return the number of words under a precise definition. First, state the tokenization rules you will use (e.g., treat contractions like "it's" as one word; decide how to handle hyphenated terms like "state-of-the-art", numbers like "3.14", punctuation, Unicode apostrophes/quotes, and runs of whitespace). Then implement a function that counts words accordingly and handles very large files/streams, and write unit tests for corner cases (empty input, punctuation-only input, mixed languages). Finally, analyze time and space complexity and discuss the trade-offs between regex-based tokenization and a manual scanner.
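
A minimal sketch of one acceptable answer, in Python, assuming one particular rule set: contractions ("it's") and hyphenated terms ("state-of-the-art") each count as one word, decimal numbers like "3.14" are one word, punctuation and whitespace are separators, and curly apostrophes behave like ASCII ones. The names WORD_RE, count_words, and count_words_in_stream are illustrative, not required by the prompt.

```python
import io
import re
import unittest

# Rule set (one possible choice): a word is a maximal run of letters/digits,
# optionally joined by internal apostrophes (ASCII or curly), hyphens, or dots.
# Everything else (punctuation, quotes, whitespace) separates words.
WORD_RE = re.compile(r"[^\W_]+(?:['’\-.][^\W_]+)*")

def count_words(text: str) -> int:
    """Count words in an in-memory string under the rules above."""
    return len(WORD_RE.findall(text))

def count_words_in_stream(lines) -> int:
    """Count words over any iterable of text lines (e.g., an open file).

    Processes one line at a time, so memory stays proportional to the longest
    line rather than the whole document; assumes words do not span line breaks.
    finditer avoids materializing a list of matches per line.
    """
    return sum(sum(1 for _ in WORD_RE.finditer(line)) for line in lines)

class CountWordsTests(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(count_words(""), 0)

    def test_only_punctuation(self):
        self.assertEqual(count_words("... !!! -- ??"), 0)

    def test_contractions_hyphens_numbers(self):
        # It’s / a / state-of-the-art / result / 3.14 -> 5 words
        self.assertEqual(count_words("It’s a state-of-the-art result: 3.14!"), 5)

    def test_mixed_languages_and_whitespace(self):
        # Counts runs of word characters; CJK text is one "word" per run,
        # since proper CJK segmentation is outside this rule set.
        self.assertEqual(count_words("café   naïve\tСлово  词语"), 4)

    def test_streaming_matches_in_memory(self):
        text = "one two\nthree, four!\n"
        self.assertEqual(count_words_in_stream(io.StringIO(text)),
                         count_words(text))

if __name__ == "__main__":
    unittest.main()
```

Under these assumptions the counter runs in O(n) time over the input and, in the streaming variant, O(L) extra space where L is the longest line. The regex version is short and easy to adjust when the rules change; a manual character-by-character scanner (a small state machine tracking "inside a word" vs. "between words") avoids regex-engine overhead and gives finer control over edge cases such as apostrophes at word boundaries, at the cost of more code to write and test.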