Describe your text preprocessing pipeline given the source modality: typed text, scanned/handwritten OCR, or speech-to-text. Specify language handling; normalization (casing, punctuation, Unicode); the tokenization choice (whitespace, rule-based, or subword methods such as BPE/WordPiece); stopwording; lemmatization vs. stemming; handling of emojis, URLs, and code; and treatment of OOV terms.

You used 1–3 n-grams: justify these choices theoretically and empirically. Discuss sparsity, vocabulary size, and context length, and the differing effects on linear models versus tree-based and neural models; report how performance and feature importances changed across the 1-gram, 1–2, and 1–3 settings. Contrast word and character n-grams and explain when each helps (e.g., misspellings, rich morphology).

Finally, outline how you would validate the pipeline (train/validation split, leakage checks) and compare this approach with a modern transformer-based tokenizer/embedding. Illustrative sketches for each of these steps follow.
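For the normalization and tokenization step, a minimal sketch of what an answer might include for typed text is shown below. The function name `normalize_text`, the placeholder tokens `<url>` and `<emoji>`, and the rough emoji character range are illustrative assumptions, not a prescribed implementation; production pipelines often use a dedicated emoji library.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji range for illustration; real pipelines often use a dedicated library.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def normalize_text(text: str) -> list[str]:
    """Normalize Unicode, fold case, and mask URLs/emojis before tokenizing."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = text.lower()                         # case folding
    text = URL_RE.sub(" <url> ", text)          # collapse URLs to one token
    text = EMOJI_RE.sub(" <emoji> ", text)      # collapse emojis to one token
    # Rule-based tokenization: placeholder tokens, words, or single punctuation marks.
    return re.findall(r"<url>|<emoji>|\w+|[^\w\s]", text)

print(normalize_text("Check https://example.com it's GREAT 😀"))
# ['check', '<url>', 'it', "'", 's', 'great', '<emoji>']
```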
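To ground the n-gram discussion empirically, a sketch using scikit-learn's `TfidfVectorizer` follows. The toy corpus is a placeholder for the real training texts; the point is that vocabulary size (and hence sparsity) grows sharply as longer n-grams are added, and that character n-grams trade an even larger vocabulary for robustness to misspellings and morphology.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are sitting",
]  # toy stand-in for the real training texts

# Vocabulary (and sparsity) grows sharply as longer n-grams are added.
for lo, hi in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=(lo, hi))
    X = vec.fit_transform(corpus)
    print(f"word {lo}-{hi} grams: vocab={len(vec.vocabulary_)}, shape={X.shape}")

# 'char_wb' builds character n-grams within word boundaries; these are more
# robust to misspellings and rich morphology than word n-grams.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
Xc = char_vec.fit_transform(corpus)
print(f"char 3-5 grams: vocab={len(char_vec.vocabulary_)}")
```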
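For leakage-safe validation, one common pattern is to wrap the vectorizer and classifier in a single scikit-learn `Pipeline`, so the vocabulary and IDF statistics are fit only on training data and refit inside every cross-validation fold. The sketch below uses placeholder data; `LogisticRegression` is one illustrative choice of linear model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

texts = ["good movie", "bad movie"] * 50   # placeholder corpus
labels = [1, 0] * 50                        # placeholder labels

# Hold out the test set first, so no statistic (IDF, vocabulary) ever sees it.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score refits the whole pipeline per fold, avoiding vocabulary leakage.
print("CV accuracy:", cross_val_score(pipe, X_tr, y_tr, cv=5).mean())
pipe.fit(X_tr, y_tr)
print("held-out accuracy:", pipe.score(X_te, y_te))
```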
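For the transformer comparison, the key contrast is OOV handling: a subword tokenizer decomposes rare or misspelled words into known pieces rather than mapping them to a single unknown token. The sketch below uses Hugging Face's `AutoTokenizer`; the checkpoint `bert-base-uncased` (a WordPiece tokenizer) is one illustrative choice, and the exact subword pieces will depend on the vocabulary.

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with BERT; other checkpoints behave similarly.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or misspelled word is split into known subword pieces instead of
# collapsing to an OOV token, unlike a fixed word-level vocabulary.
print(tok.tokenize("tokenization"))   # pieces such as ['token', '##ization']
print(tok.tokenize("tokenizashun"))   # misspelling still yields subword pieces
```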