Task: Implement an n-gram Language Model with Training, Sampling, and Model Selection Guidance
Objective
Implement an n-gram language model class with the following methods and discuss model selection and complexity trade-offs:
- fit(file_path, n):
  - Read and tokenize a text file consistently.
  - Build n-gram and (n−1)-gram frequency counts.
  - Compute conditional probabilities with smoothing (e.g., add-k or Kneser–Ney).
- generate(start_tokens, max_len, sampling_strategy):
  - Sample next tokens according to the learned probabilities (e.g., multinomial, top-k, temperature) to produce text (a minimal class sketch follows this list).
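One possible class layout is sketched below, assuming whitespace tokenization, dictionary-based count tables, and add-k smoothing; the names NGramModel, _tokenize, BOS, and EOS are illustrative choices, not prescribed by the task.

```python
from collections import defaultdict
import random

BOS, EOS = "<s>", "</s>"  # illustrative sentence-boundary tokens

class NGramModel:
    def __init__(self):
        self.n = None
        self.ngram_counts = defaultdict(int)    # counts of full n-grams
        self.context_counts = defaultdict(int)  # counts of (n-1)-gram contexts
        self.vocab = set()

    def _tokenize(self, text):
        # Simple whitespace tokenization; the same function must be reused at generation time.
        return text.lower().split()

    def fit(self, file_path, n):
        self.n = n
        with open(file_path, encoding="utf-8") as f:
            for line in f:  # each line treated as one sentence
                tokens = [BOS] * (n - 1) + self._tokenize(line) + [EOS]
                self.vocab.update(tokens)
                for i in range(len(tokens) - n + 1):
                    context = tuple(tokens[i:i + n - 1])
                    word = tokens[i + n - 1]
                    self.ngram_counts[context + (word,)] += 1
                    self.context_counts[context] += 1

    def prob(self, context, word, k=1.0):
        # Add-k smoothed conditional probability P(word | context).
        V = len(self.vocab)
        num = self.ngram_counts[tuple(context) + (word,)] + k
        den = self.context_counts[tuple(context)] + k * V
        return num / den

    def generate(self, start_tokens=None, max_len=20):
        tokens = [BOS] * (self.n - 1) + list(start_tokens or [])
        for _ in range(max_len):
            context = tuple(tokens[-(self.n - 1):]) if self.n > 1 else ()
            words = [w for w in self.vocab if w != BOS]
            weights = [self.prob(context, w) for w in words]
            word = random.choices(words, weights=weights)[0]  # multinomial sampling
            if word == EOS:
                break
            tokens.append(word)
        return tokens[self.n - 1:]  # drop the BOS padding
```

A usage example would be `m = NGramModel(); m.fit("corpus.txt", 3); print(" ".join(m.generate(["the"], 15)))`, where corpus.txt is a placeholder file name.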
Requirements and Notes
- Tokenization must be consistent between training and generation. Include BOS/EOS handling if generating sentences.
- Smoothing options: implement at least add-k; explain, and if possible implement, interpolated Kneser–Ney (the add-k and interpolation formulas are sketched after this list).
- Sampling strategies: support multinomial sampling; add top-k filtering and temperature scaling (see the sampling sketch below).
- Model selection: discuss how to choose the optimal n given the data size and domain. Propose validation procedures (train/validation split), metrics (perplexity), and regularization such as backoff/interpolation (see the perplexity sketch below).
- Analyze the time/space complexity and memory footprint for different values of n (a small counting experiment follows this list).
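For reference, a hedged statement of the add-k estimate and of simple linear interpolation of n-gram orders (interpolated Kneser–Ney is more involved and not reproduced here); the symbols C for counts, V for the vocabulary, h for the history, and λ for interpolation weights are notation chosen for illustration:

```latex
P_{\text{add-}k}(w \mid h) = \frac{C(h, w) + k}{C(h) + k\,|V|}
\qquad
P_{\text{interp}}(w \mid w_{i-2}, w_{i-1}) =
  \lambda_3\, P(w \mid w_{i-2}, w_{i-1}) + \lambda_2\, P(w \mid w_{i-1}) + \lambda_1\, P(w),
\quad \sum_j \lambda_j = 1
```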
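A possible sketch of temperature scaling and top-k filtering applied to a next-token distribution; the helper name sample_next and the use of a plain {token: probability} dict are assumptions for illustration:

```python
import math
import random

def sample_next(probs, temperature=1.0, top_k=None):
    """Sample one token from a {token: probability} dict.

    temperature < 1 sharpens the distribution, > 1 flattens it;
    top_k restricts sampling to the k most probable tokens.
    """
    items = list(probs.items())
    if top_k is not None:
        items = sorted(items, key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature scaling: raise each probability to the power 1/T, then renormalize.
    scaled = [(w, math.exp(math.log(p) / temperature)) for w, p in items if p > 0]
    total = sum(s for _, s in scaled)
    r, acc = random.random() * total, 0.0
    for w, s in scaled:
        acc += s
        if acc >= r:
            return w
    return scaled[-1][0]

# Example: a peaked distribution sampled with sharpening and top-k filtering.
dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(sample_next(dist, temperature=0.5, top_k=3))
```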
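One way to choose n is to fit models of increasing order on a training split and compare perplexity on a held-out split, picking the order with the lowest validation perplexity. The sketch below assumes the NGramModel class from the earlier sketch; the file names train.txt and valid.txt are placeholders.

```python
import math

def perplexity(model, file_path, k=1.0):
    """Perplexity of a held-out file under the add-k smoothed model."""
    log_prob, count = 0.0, 0
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            tokens = [BOS] * (model.n - 1) + model._tokenize(line) + [EOS]
            for i in range(model.n - 1, len(tokens)):
                context = tuple(tokens[i - model.n + 1:i])
                log_prob += math.log(model.prob(context, tokens[i], k=k))
                count += 1
    return math.exp(-log_prob / count)

# Pick the order with the lowest validation perplexity.
best_n, best_ppl = None, float("inf")
for n in range(1, 6):
    m = NGramModel()
    m.fit("train.txt", n)             # hypothetical training split
    ppl = perplexity(m, "valid.txt")  # hypothetical validation split
    if ppl < best_ppl:
        best_n, best_ppl = n, ppl
```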
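To make the memory discussion concrete, a small counting experiment shows how the number of distinct n-grams (and hence the size of the count tables) grows with n on a fixed corpus; corpus.txt is a placeholder file name:

```python
from collections import Counter

def distinct_ngrams(tokens, n):
    """Number of distinct n-grams in a token sequence."""
    return len(Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)))

with open("corpus.txt", encoding="utf-8") as f:  # placeholder corpus
    tokens = f.read().lower().split()

for n in range(1, 6):
    # Storage is bounded above by both |V|**n and the corpus length,
    # but in practice it approaches the corpus length quickly as n grows.
    print(n, distinct_ngrams(tokens, n))
```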
Deliverables
- Description of the class design and data structures.
- Clear pseudocode (or a concise code sketch) for fit and generate.
- Explanation of the smoothing methods, with formulas.
- Explanation of the sampling methods.
- Strategy for choosing n using validation and perplexity.
- Complexity and memory analysis.