PracHub

Detail NLP preprocessing and n‑gram choices

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competency in NLP preprocessing and feature engineering, covering modality-specific text normalization, tokenization and subword choices, n-gram selection and sparsity trade-offs, handling of OOV terms/emojis/URLs/code, and empirical validation and model comparison.

Company: Thumbtack

Role: Data Scientist

Category: Machine Learning

Difficulty: Medium

Interview Round: Onsite

Describe your text preprocessing pipeline given the source modality: typed text, scanned/handwritten OCR, or speech-to-text. Specify language handling, normalization (casing, punctuation, Unicode), tokenization choice (whitespace vs. rule-based vs. subword methods like BPE/WordPiece), stopword removal, lemmatization/stemming, handling of emojis/URLs/code, and handling of OOV terms.

You used n-grams of length 1–3 (unigrams through trigrams): justify this choice theoretically and empirically. Discuss sparsity, vocabulary size, context length, and effects on linear models vs. tree/NN models; report how performance and feature importances changed across the 1-gram, 1–2-gram, and 1–3-gram settings.

Contrast word vs. character n-grams and explain when each helps (misspellings, morphology).

Finally, outline how you would validate the pipeline (train/validation split, leakage checks) and compare this approach with a modern transformer-based tokenizer/embedding.
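The vocabulary-size and word-vs-character points in the question can be made concrete with a small sketch. The following is an illustration only, not a reference answer: it uses the Python standard library, and the function names (normalize, word_ngrams, char_ngrams, vocab_size, jaccard), the regular expressions, the `<url>` placeholder, and the toy documents are all assumptions made up for this example. It shows how the word n-gram vocabulary grows as the maximum n goes from 1 to 3, and how character trigrams still overlap for a misspelled word that shares no word-level token with the correct spelling.

```python
import re
import unicodedata

# Assumed patterns for this sketch; real pipelines usually need broader rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Keeps the <url> placeholder and word-like tokens. Emojis and punctuation are
# dropped by this pattern; mapping them to placeholder tokens is an alternative.
TOKEN_RE = re.compile(r"<url>|\w+(?:'\w+)?")


def normalize(text):
    """Apply Unicode NFKC, replace URLs with a placeholder, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" <url> ", text)
    return text.lower()


def word_ngrams(text, n_min=1, n_max=3):
    """Return word n-grams for every n in [n_min, n_max], shortest first."""
    tokens = TOKEN_RE.findall(normalize(text))
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n=3):
    """Return the set of character n-grams, padding with one space per side."""
    padded = f" {normalize(text)} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def vocab_size(docs, n_max):
    """Count distinct word n-grams (n = 1..n_max) across all documents."""
    return len({gram for doc in docs for gram in word_ngrams(doc, 1, n_max)})


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


# Toy documents, invented for illustration.
DOCS = [
    "Need a plumber to fix a leaking sink",
    "Plumber needed: leaking kitchen sink, see https://example.com/job",
    "Looking for a house cleaner this weekend",
]

for n_max in (1, 2, 3):
    print(f"1-{n_max} grams: vocabulary size = {vocab_size(DOCS, n_max)}")

# A misspelling shares no word unigram with the correct form,
# but the two still share several character trigrams.
print(jaccard(char_ngrams("plumber"), char_ngrams("plummer")))
```

On a real corpus the same counts would be paired with validation metrics for each n-gram setting, since a larger vocabulary only helps when the added features occur often enough to be learned.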

Related Interview Questions

  • Choose clustering vs regression; explain KNN - Thumbtack (Medium)
  • Build a defensible ML pipeline end-to-end - Thumbtack (Hard)
  • Forecast response-rate trends with backtesting - Thumbtack (Medium)
Oct 13, 2025, 9:49 PM
