Describe your text preprocessing pipeline given the source modality: typed text, scanned/handwritten OCR, or speech-to-text. Specify language handling; normalization (casing, punctuation, Unicode); the tokenization choice (whitespace, rule-based, or subword methods such as BPE/WordPiece); stopwording; lemmatization vs. stemming; handling of emojis, URLs, and code; and treatment of OOV terms.

You used 1–3 n-grams: justify these choices theoretically and empirically. Discuss sparsity, vocabulary size, and context length, and the differing effects on linear models versus tree-based and neural models; report how performance and feature importances changed across the 1-gram, 1–2, and 1–3 settings. Contrast word and character n-grams and explain when each helps (e.g., misspellings, rich morphology).

Finally, outline how you would validate the pipeline (train/validation split, leakage checks) and compare this approach with a modern transformer-based tokenizer/embedding. Illustrative sketches for each of these steps follow.
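For the normalization and tokenization step, a minimal sketch of what an answer might include for typed text is shown below. The function name `normalize_text`, the placeholder tokens `<url>` and `<emoji>`, and the rough emoji character range are illustrative assumptions, not a prescribed implementation; production pipelines often use a dedicated emoji library.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji range for illustration; real pipelines often use a dedicated library.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def normalize_text(text: str) -> list[str]:
    """Normalize Unicode, fold case, and mask URLs/emojis before tokenizing."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = text.lower()                         # case folding
    text = URL_RE.sub(" <url> ", text)          # collapse URLs to one token
    text = EMOJI_RE.sub(" <emoji> ", text)      # collapse emojis to one token
    # Rule-based tokenization: placeholder tokens, words, or single punctuation marks.
    return re.findall(r"<url>|<emoji>|\w+|[^\w\s]", text)

print(normalize_text("Check https://example.com it's GREAT 😀"))
# ['check', '<url>', 'it', "'", 's', 'great', '<emoji>']
```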
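To ground the n-gram discussion empirically, a sketch using scikit-learn's `TfidfVectorizer` follows. The toy corpus is a placeholder for the real training texts; the point is that vocabulary size (and hence sparsity) grows sharply as longer n-grams are added, and that character n-grams trade an even larger vocabulary for robustness to misspellings and morphology.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are sitting",
]  # toy stand-in for the real training texts

# Vocabulary (and sparsity) grows sharply as longer n-grams are added.
for lo, hi in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=(lo, hi))
    X = vec.fit_transform(corpus)
    print(f"word {lo}-{hi} grams: vocab={len(vec.vocabulary_)}, shape={X.shape}")

# 'char_wb' builds character n-grams within word boundaries; these are more
# robust to misspellings and rich morphology than word n-grams.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
Xc = char_vec.fit_transform(corpus)
print(f"char 3-5 grams: vocab={len(char_vec.vocabulary_)}")
```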
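For leakage-safe validation, one common pattern is to wrap the vectorizer and classifier in a single scikit-learn `Pipeline`, so the vocabulary and IDF statistics are fit only on training data and refit inside every cross-validation fold. The sketch below uses placeholder data; `LogisticRegression` is one illustrative choice of linear model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

texts = ["good movie", "bad movie"] * 50   # placeholder corpus
labels = [1, 0] * 50                        # placeholder labels

# Hold out the test set first, so no statistic (IDF, vocabulary) ever sees it.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score refits the whole pipeline per fold, avoiding vocabulary leakage.
print("CV accuracy:", cross_val_score(pipe, X_tr, y_tr, cv=5).mean())
pipe.fit(X_tr, y_tr)
print("held-out accuracy:", pipe.score(X_te, y_te))
```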
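For the transformer comparison, the key contrast is OOV handling: a subword tokenizer decomposes rare or misspelled words into known pieces rather than mapping them to a single unknown token. The sketch below uses Hugging Face's `AutoTokenizer`; the checkpoint `bert-base-uncased` (a WordPiece tokenizer) is one illustrative choice, and the exact subword pieces will depend on the vocabulary.

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with BERT; other checkpoints behave similarly.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or misspelled word is split into known subword pieces instead of
# collapsing to an OOV token, unlike a fixed word-level vocabulary.
print(tok.tokenize("tokenization"))   # pieces such as ['token', '##ization']
print(tok.tokenize("tokenizashun"))   # misspelling still yields subword pieces
```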