PracHub

Count words in a document robustly

Last updated: Mar 29, 2026

Quick Overview

This question evaluates text-processing and algorithmic engineering skills, specifically precise tokenization rule definition, robust handling of Unicode and punctuation, streaming/large-file processing, unit testing for corner cases, and analysis of time and space complexity.

  • Medium
  • Microsoft
  • Data Manipulation (SQL/Python)
  • Software Engineer

Interview Round: Onsite

Given a text document, return the number of words under a precise definition. First, state the tokenization rules you will use (e.g., treat contractions like "it's" as one word, decide how to handle hyphenated terms like "state-of-the-art", numbers like "3.14", punctuation, Unicode apostrophes/quotes, and multiple whitespace). Then implement a function that counts words accordingly, handles very large files/streams, and includes unit tests for corner cases (empty input, only punctuation, mixed languages). Analyze time and space complexity and discuss trade-offs between regex-based tokenization and a manual scanner.
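One possible answer is sketched below as a manual scanner. It is an illustration under an assumed rule set, not an official solution, and the names `WordCounter`, `count_words`, and `count_words_in_stream` are hypothetical. The assumed rules are:

  • A word is a maximal run of Unicode letters and digits; combining marks inside a word stay part of it.
  • A single apostrophe (' or ’) or hyphen (- or ‐) between two letters or digits joins them, so "it's" and "state-of-the-art" each count once.
  • A single "." or "," joins only when it sits between two digits, so "3.14" and "1,000" each count once but "e.g." counts as two words.
  • Everything else, including underscores, quotes, and any run of whitespace, is a separator. Scripts written without spaces (for example Chinese) are not segmented: each unbroken run counts as one word.

```python
import unicodedata
from typing import TextIO

# Assumed rule set for this sketch; other definitions are equally defensible.
JOINERS = {"'", "\u2019", "-", "\u2010"}   # apostrophes and hyphens
DIGIT_SEPARATORS = {".", ","}              # join only between two digits


class WordCounter:
    """Incremental scanner whose state survives across feed() calls."""

    def __init__(self) -> None:
        self.count = 0
        self._in_word = False
        self._pending = ""            # connector seen inside a word, not yet resolved
        self._last_was_digit = False

    def feed(self, text: str) -> None:
        for ch in text:
            if ch.isalnum():
                is_digit = ch.isdecimal()
                if not self._in_word:
                    self.count += 1   # first character of a new word
                elif self._pending in DIGIT_SEPARATORS and not (
                    self._last_was_digit and is_digit
                ):
                    self.count += 1   # "." or "," not between digits splits the word
                self._in_word = True
                self._pending = ""
                self._last_was_digit = is_digit
            elif self._in_word and not self._pending:
                if ch in JOINERS or ch in DIGIT_SEPARATORS:
                    self._pending = ch    # joins only if a letter/digit follows
                elif unicodedata.category(ch).startswith("M"):
                    pass                  # combining mark: still inside the word
                else:
                    self._in_word = False
            else:
                # Separator, or a second connector in a row: the word ends here.
                self._in_word = False
                self._pending = ""


def count_words(text: str) -> int:
    """Count words in an in-memory string."""
    counter = WordCounter()
    counter.feed(text)
    return counter.count


def count_words_in_stream(stream: TextIO, chunk_size: int = 1 << 16) -> int:
    """Count words in a text stream using memory bounded by chunk_size."""
    counter = WordCounter()
    while chunk := stream.read(chunk_size):
        counter.feed(chunk)
    return counter.count
```

Because the scanner keeps only three small fields between characters, a word that straddles a chunk boundary is still counted once, time is O(n) in the number of characters, and extra space is O(chunk_size). Under these rules, empty input and punctuation-only input such as "... --- !?" count as 0, and "Hello 世界 привет" counts as 3. A regex-based tokenizer over the same rules would be shorter and typically faster in Python, since matching runs in compiled code, but it needs extra carry-over logic for tokens that span chunk boundaries and is harder to extend with rules such as the combining-mark case.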

Aug 14, 2025, 12:00 AM