Given a text document, return the number of words under a precise definition. First, state the tokenization rules you will use (e.g., treat contractions like "it's" as one word; decide how to handle hyphenated terms like "state-of-the-art", numbers like "3.14", punctuation, Unicode apostrophes/quotes, and runs of whitespace). Then implement a function that counts words accordingly and handles very large files/streams, and write unit tests for corner cases (empty input, punctuation-only input, mixed languages). Finally, analyze time and space complexity and discuss the trade-offs between regex-based tokenization and a manual scanner.
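
A minimal sketch of one acceptable answer, in Python, assuming one particular rule set: contractions ("it's") and hyphenated terms ("state-of-the-art") each count as one word, decimal numbers like "3.14" are one word, punctuation and whitespace are separators, and curly apostrophes behave like ASCII ones. The names WORD_RE, count_words, and count_words_in_stream are illustrative, not required by the prompt.

```python
import io
import re
import unittest

# Rule set (one possible choice): a word is a maximal run of letters/digits,
# optionally joined by internal apostrophes (ASCII or curly), hyphens, or dots.
# Everything else (punctuation, quotes, whitespace) separates words.
WORD_RE = re.compile(r"[^\W_]+(?:['’\-.][^\W_]+)*")

def count_words(text: str) -> int:
    """Count words in an in-memory string under the rules above."""
    return len(WORD_RE.findall(text))

def count_words_in_stream(lines) -> int:
    """Count words over any iterable of text lines (e.g., an open file).

    Processes one line at a time, so memory stays proportional to the longest
    line rather than the whole document; assumes words do not span line breaks.
    finditer avoids materializing a list of matches per line.
    """
    return sum(sum(1 for _ in WORD_RE.finditer(line)) for line in lines)

class CountWordsTests(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(count_words(""), 0)

    def test_only_punctuation(self):
        self.assertEqual(count_words("... !!! -- ??"), 0)

    def test_contractions_hyphens_numbers(self):
        # It’s / a / state-of-the-art / result / 3.14 -> 5 words
        self.assertEqual(count_words("It’s a state-of-the-art result: 3.14!"), 5)

    def test_mixed_languages_and_whitespace(self):
        # Counts runs of word characters; CJK text is one "word" per run,
        # since proper CJK segmentation is outside this rule set.
        self.assertEqual(count_words("café   naïve\tСлово  词语"), 4)

    def test_streaming_matches_in_memory(self):
        text = "one two\nthree, four!\n"
        self.assertEqual(count_words_in_stream(io.StringIO(text)),
                         count_words(text))

if __name__ == "__main__":
    unittest.main()
```

Under these assumptions the counter runs in O(n) time over the input and, in the streaming variant, O(L) extra space where L is the longest line. The regex version is short and easy to adjust when the rules change; a manual character-by-character scanner (a small state machine tracking "inside a word" vs. "between words") avoids regex-engine overhead and gives finer control over edge cases such as apostrophes at word boundaries, at the cost of more code to write and test.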