How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium-difficulty Machine Learning question, commonly asked during onsite rounds at Thumbtack.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Thumbtack during technical interviews.

Detail NLP preprocessing and n‑gram choices

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competency in NLP preprocessing and feature engineering, covering modality-specific text normalization, tokenization and subword choices, n-gram selection and sparsity trade-offs, handling of OOV terms/emojis/URLs/code, and empirical validation and model comparison.

Thumbtack

Oct 13, 2025, 9:49 PM

Medium • Data Scientist • Onsite • Machine Learning

Describe your text preprocessing pipeline for the source modality: typed text, OCR output from scanned or handwritten documents, or speech-to-text transcripts. Specify language handling, normalization (casing, punctuation, Unicode), tokenization choice (whitespace vs. rule-based vs. subword methods such as BPE or WordPiece), stop-word removal, lemmatization or stemming, and handling of emojis, URLs, code, and out-of-vocabulary (OOV) terms.

You used n-grams of length 1 to 3. Justify this choice theoretically and empirically: discuss sparsity, vocabulary size, context length, and the effects on linear models vs. tree-based or neural models, and report how performance and feature importances changed across the 1-gram, 1–2-gram, and 1–3-gram settings.

Contrast word and character n-grams and explain when each helps (for example, misspellings and morphology).

Finally, outline how you would validate the pipeline (train/validation split, leakage checks) and compare this approach with a modern transformer-based tokenizer and embedding.
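To make the terms in the question concrete, here is a minimal standard-library sketch of the normalization, word n-gram, and character n-gram steps it asks about. It is an illustration under stated assumptions, not a reference answer: the function names (`normalize`, `word_tokens`, `word_ngrams`, `char_ngrams`, `vocab_size`, `jaccard`), the `<url>` placeholder, and the regular expressions are all hypothetical choices made for this example.

```python
import re
import unicodedata

# Assumption: URLs are collapsed to a single placeholder token rather than dropped.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# Placeholder, runs of word characters, or any single non-space symbol
# (so punctuation marks and emojis each become their own token).
TOKEN_RE = re.compile(r"<url>|\w+|[^\w\s]")


def normalize(text):
    """NFKC-normalize, replace URLs with a placeholder, then case-fold."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" <url> ", text)
    return text.casefold()


def word_tokens(text):
    """Rule-based tokenization of the normalized text."""
    return TOKEN_RE.findall(normalize(text))


def word_ngrams(tokens, n_min, n_max):
    """All contiguous word n-grams with n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n):
    """Character n-grams over the text padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def vocab_size(docs, n_max):
    """Number of distinct word n-grams (1..n_max) across a list of documents."""
    vocab = set()
    for doc in docs:
        vocab.update(word_ngrams(word_tokens(doc), 1, n_max))
    return len(vocab)


def jaccard(a, b):
    """Jaccard overlap between two collections of features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Toy documents invented for illustration only.
DOCS = [
    "Great plumber, fixed the leak fast!",
    "Plumber was late. See https://example.com/review",
    "great plumbr fixed the leek 👍",
]
```

Calling `vocab_size(DOCS, 1)`, `vocab_size(DOCS, 2)`, and `vocab_size(DOCS, 3)` shows the vocabulary growing as the n-gram range widens, which is the sparsity cost the question asks you to weigh against the added context. Comparing `jaccard(char_ngrams("plumber", 3), char_ngrams("plumbr", 3))` with the word-level overlap of the same two strings shows why character n-grams are more robust to misspellings: the misspelled word shares no word feature with the correct one but still shares several trigrams. In practice, a library vectorizer such as scikit-learn's `CountVectorizer` (with its `ngram_range` and `analyzer` parameters) would likely replace these hand-written helpers.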
