Implement n-gram model and select n
Company: Runway
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Implement an n-gram language model class with fit and generate methods. The fit(file_path, n) method should read a text file, tokenize it consistently, build n-gram and (n−1)-gram frequency counts, and compute conditional probabilities with smoothing (e.g., add-k or Kneser–Ney). The generate(start_tokens, max_len, sampling_strategy) method should sample next tokens according to the learned probabilities (e.g., multinomial, top-k, or temperature sampling) to produce text.
Discuss how to select the optimal n given data size and domain: propose validation procedures (e.g., a train/validation split), metrics (e.g., perplexity), regularization via backoff or interpolation, and analyze the time/space complexity and memory footprint for different values of n.
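One possible shape for the requested class is sketched below. It is a minimal illustration, not a reference solution: it assumes add-k smoothing (Kneser–Ney is not implemented), a simple lowercase regex tokenizer, one sentence per line of the input file, and k > 0. The names NGramModel, tokenize, prob, top_k, temperature, and the <s> and </s> padding symbols are choices made for this sketch rather than part of the question.

```python
import random
import re
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sentence boundary symbols (assumed convention)


def tokenize(text):
    # Lowercase, then split into word tokens and single punctuation marks.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


class NGramModel:
    def __init__(self, k=0.1, seed=None):
        self.k = k  # add-k smoothing constant; assumed to be > 0
        self.rng = random.Random(seed)

    def fit(self, file_path, n):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # context -> next-token counts
        self.context_counts = Counter()           # (n-1)-gram counts
        vocab = {EOS}
        with open(file_path, encoding="utf-8") as f:
            for line in f:  # each non-empty line is treated as one sentence
                tokens = tokenize(line)
                if not tokens:
                    continue
                vocab.update(tokens)
                padded = [BOS] * (n - 1) + tokens + [EOS]
                for i in range(n - 1, len(padded)):
                    context = tuple(padded[i - n + 1:i])
                    self.ngram_counts[context][padded[i]] += 1
                    self.context_counts[context] += 1
        self.vocab = sorted(vocab)
        return self

    def prob(self, token, context):
        # Add-k estimate: (c(context, token) + k) / (c(context) + k * |V|)
        context = tuple(context)
        num = self.ngram_counts.get(context, {}).get(token, 0) + self.k
        den = self.context_counts.get(context, 0) + self.k * len(self.vocab)
        return num / den

    def generate(self, start_tokens, max_len, sampling_strategy="multinomial",
                 top_k=5, temperature=1.0):
        out = list(start_tokens)  # assumed to be tokenized like the training text
        history = [BOS] * (self.n - 1) + out
        while len(out) < max_len:
            # Last n-1 tokens; this slice is empty when n == 1.
            context = history[len(history) - (self.n - 1):]
            candidates = self.vocab
            weights = [self.prob(t, context) for t in candidates]
            if sampling_strategy == "top_k":
                ranked = sorted(zip(weights, candidates), reverse=True)[:top_k]
                weights = [w for w, _ in ranked]
                candidates = [t for _, t in ranked]
            elif sampling_strategy == "temperature":
                # p ** (1/T) is proportional to softmax(log p / T).
                weights = [w ** (1.0 / temperature) for w in weights]
            elif sampling_strategy != "multinomial":
                raise ValueError(f"unknown strategy: {sampling_strategy}")
            # random.choices normalizes the relative weights internally.
            token = self.rng.choices(candidates, weights=weights, k=1)[0]
            if token == EOS:
                break
            out.append(token)
            history.append(token)
        return out
```

A call such as NGramModel(k=0.1, seed=0).fit("corpus.txt", n=3).generate(["the"], max_len=20, sampling_strategy="top_k") would then return a list of tokens, where corpus.txt is a placeholder path. Scoring the whole vocabulary at every step costs O(|V|) per generated token; restricting candidates to tokens observed after the current context is a common shortcut when add-k mass on unseen tokens is not needed.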
Quick Answer: This question evaluates competency in probabilistic language modeling and practical engineering of n-gram systems, covering n-gram construction, smoothing methods, sampling strategies, model selection and complexity analysis within the Machine Learning domain.
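For the model-selection part, one common procedure is to train on one split, compute perplexity on a held-out split for each candidate n, and keep the n with the lowest value. The sketch below assumes add-k smoothing as a stand-in (interpolation or backoff would usually hold up better at larger n), pre-tokenized sentences, and a single reserved slot for out-of-vocabulary tokens. The function names count_ngrams, perplexity, and select_n are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary symbols (assumed convention)


def count_ngrams(sentences, n):
    """Return n-gram counts, (n-1)-gram context counts, and the vocabulary."""
    ngrams, contexts, vocab = Counter(), Counter(), {EOS}
    for tokens in sentences:
        vocab.update(tokens)
        padded = [BOS] * (n - 1) + list(tokens) + [EOS]
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1:i + 1])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab


def perplexity(train, heldout, n, k=0.1):
    """Perplexity of an add-k n-gram model (fit on train) over heldout."""
    ngrams, contexts, vocab = count_ngrams(train, n)
    v = len(vocab) + 1  # +1: all unseen tokens share one <unk>-style slot
    log_prob, total = 0.0, 0
    for tokens in heldout:
        padded = [BOS] * (n - 1) + list(tokens) + [EOS]
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1:i + 1])
            p = (ngrams[gram] + k) / (contexts[gram[:-1]] + k * v)
            log_prob += math.log(p)
            total += 1
    # exp of the negative mean log-probability per predicted token
    return math.exp(-log_prob / total)


def select_n(train, heldout, candidates=(1, 2, 3, 4), k=0.1):
    """Return the candidate n with the lowest held-out perplexity, plus all scores."""
    scores = {n: perplexity(train, heldout, n, k) for n in candidates}
    return min(scores, key=scores.get), scores
```

Counting takes one pass over the training tokens for each n, so time is O(N) per candidate for N tokens. The number of distinct n-grams stored is at most min(N, |V|^n), so the memory footprint grows with n until it saturates near the corpus size, while the counts behind each estimate become sparser.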