Tokenization and Transformer Architecture Deep Dive
You are asked to explain common tokenization approaches and modern Transformer design choices used in large language models.
Answer the following:
- SentencePiece
  - What is SentencePiece, and how does it work at a high level? (A minimal usage sketch follows this list.)
- Tokenizers used in BERT and typical Transformer-based LMs
  - Which tokenizers do BERT and common decoder-only LMs (e.g., GPT-style, LLaMA, Qwen) typically use, and why? (A toy BPE training loop is sketched after this list.)
- Transformer block internals
  - Enumerate the core components inside a Transformer block and briefly describe the role of each. (See the minimal block implementation after this list.)
- Architectural comparisons and design trade-offs
  - Compare a vanilla Transformer (Vaswani et al., 2017) to modern LLaMA and Qwen architectures.
  - Discuss the benefits and trade-offs of choices such as Mixture-of-Experts (MoE), RMSNorm, and rotary positional embeddings (RoPE). (RMSNorm, RoPE, and MoE routing sketches follow this list.)
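For the SentencePiece question, the following is a minimal sketch of training and using a model with the `sentencepiece` Python package. The file name `corpus.txt`, the model prefix `spm_demo`, the vocabulary size, and the choice of the unigram model type are illustrative placeholders, not requirements; the point is that SentencePiece works directly on raw text without language-specific pre-tokenization.

```python
# Minimal SentencePiece sketch (assumes `pip install sentencepiece` and a
# plain-text training file `corpus.txt`; file names and hyperparameters here
# are placeholders chosen for illustration).
import sentencepiece as spm

# Train a subword model on raw text. SentencePiece treats the input as a raw
# character stream (whitespace becomes the meta symbol "▁"), so the same
# pipeline works for languages with or without spaces.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, one sentence per line (assumed to exist)
    model_prefix="spm_demo",   # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",      # "bpe" is the other common choice
)

# Load the trained model and round-trip a sentence.
sp = spm.SentencePieceProcessor()
sp.load("spm_demo.model")

text = "Tokenization splits text into subwords."
print(sp.encode_as_pieces(text))   # e.g. ['▁Token', 'ization', '▁splits', ...]
print(sp.decode_ids(sp.encode_as_ids(text)))   # lossless reconstruction of the input
```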
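For the tokenizer comparison, the sketch below is a deliberately simplified BPE training loop: it repeatedly merges the most frequent adjacent symbol pair. It illustrates the idea behind the BPE family (GPT-style byte-level BPE, SentencePiece-BPE) and, with a different pair-scoring rule, WordPiece; it is not a drop-in replacement for any production tokenizer. The toy corpus and merge count are made up for the example.

```python
# Toy BPE training: merge the most frequent adjacent symbol pair, num_merges times.
from collections import Counter

def merge_word(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    # Start from character sequences; real tokenizers start from bytes (GPT-2-style
    # byte-level BPE) or from SentencePiece's raw character stream.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merges.append(best)
        words = {tuple(merge_word(list(s), best)): f for s, f in words.items()}
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe(corpus, num_merges=5))
# [('e', 's'), ('es', 't'), ('l', 'o'), ...]
```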
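For the block-internals question, here is a minimal pre-norm Transformer block in PyTorch showing the core components: multi-head self-attention, a position-wise feed-forward network, residual connections, and normalization. The dimensions and the GELU activation are illustrative choices, not taken from any particular model.

```python
# Minimal pre-norm Transformer block (illustrative hyperparameters).
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(              # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # Pre-norm: normalize, attend, then add the residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Same pattern for the feed-forward sublayer.
        x = x + self.ffn(self.norm2(x))
        return x

# Quick shape check: batch of 2 sequences, 16 tokens, 512-dim embeddings.
block = TransformerBlock()
print(block(torch.randn(2, 16, 512)).shape)    # torch.Size([2, 16, 512])
```

Note for the comparison question: the vanilla Transformer of Vaswani et al. applies normalization after each sublayer (post-norm) with LayerNorm, whereas LLaMA- and Qwen-style models use pre-norm with RMSNorm, as sketched next.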
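A minimal RMSNorm module, relevant to the RMSNorm trade-off question: it rescales activations by their root mean square with a learned per-channel gain, and unlike LayerNorm it subtracts no mean and has no bias term.

```python
# Minimal RMSNorm (LLaMA-style normalization).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

print(RMSNorm(8)(torch.randn(2, 4, 8)).shape)   # torch.Size([2, 4, 8])
```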
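For the RoPE trade-off question, this is a sketch of rotary positional embeddings in the "rotate_half" formulation used by LLaMA-style implementations: pairs of channels in each query/key head are rotated by a position-dependent angle, so dot-product attention scores depend on relative positions. The base of 10000 follows the original RoPE paper; the tensor shapes below are illustrative.

```python
# Minimal rotary positional embedding (RoPE) sketch.
import torch

def rope_cos_sin(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per channel pair, geometrically spaced.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq, dim/2)
    angles = torch.cat([angles, angles], dim=-1)                    # (seq, dim)
    return angles.cos(), angles.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); cos/sin broadcast over batch and heads.
    return x * cos + rotate_half(x) * sin

q = torch.randn(1, 8, 16, 64)                 # (batch, heads, seq, head_dim)
cos, sin = rope_cos_sin(seq_len=16, head_dim=64)
print(apply_rope(q, cos, sin).shape)          # torch.Size([1, 8, 16, 64])
```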
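Finally, for the MoE trade-off question, a minimal token-level top-2 Mixture-of-Experts layer: a router scores experts per token, each token is processed only by its top-k experts, and their outputs are combined with renormalized router weights. This is only a sketch of the parameter-vs-compute trade-off; real systems add load-balancing losses, capacity limits, and expert parallelism, all omitted here, and the sizes are placeholders.

```python
# Minimal top-k MoE layer (no load balancing or capacity constraints).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- sequences are flattened into one token batch.
        scores = self.router(x)                              # (tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)         # per-token top-k experts
        top_w = F.softmax(top_w, dim=-1)                     # renormalize the k weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=256)
print(moe(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```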