Task: Deep Learning Optimization and Parallelism
You are asked to explain optimization techniques commonly used to improve deep learning training and inference. Address the following:
Part A: Optimization Techniques
Describe common deep learning optimization techniques for both training and inference. For each technique, state:
- Goal(s)
- How it works (at a high level)
- Typical benefits
- Trade-offs and pitfalls
Cover at least these categories and examples:
- Quantization (e.g., INT8, FP8, PTQ vs QAT); a PTQ sketch follows this list
- Pruning (unstructured vs structured, N:M sparsity)
- Knowledge distillation (teacher–student)
- Kernel/operator fusion (e.g., bias+GELU, FlashAttention)
- Memory optimizations (e.g., activation checkpointing, sharding/offload, KV cache)
- Throughput/latency optimizations (e.g., mixed precision, CUDA Graphs/compilation, batching, overlap of compute/communication); a sketch combining mixed precision with activation checkpointing also follows this list
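To illustrate the kind of concrete example an answer might include, here is a minimal post-training dynamic quantization sketch in PyTorch. The toy model, layer sizes, and the choice of torch.ao.quantization.quantize_dynamic are illustrative assumptions, not a prescribed solution.

```python
# A post-training dynamic quantization sketch (PTQ, no calibration data needed).
# The tiny model and layer sizes are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
model.eval()

# Quantize Linear weights to INT8; activations are quantized on the fly at
# runtime, which is why no calibration pass is required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape; smaller weights, faster CPU matmuls
```

Static PTQ and QAT build on the same idea but add a calibration pass or fake-quantization during training, a contrast the Part A answer should draw out.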
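A second sketch, covering the memory and throughput items above, combines automatic mixed precision with activation checkpointing. The toy block, batch size, and optimizer settings are assumptions for illustration only.

```python
# Mixed precision (torch.autocast + GradScaler) combined with activation
# checkpointing (torch.utils.checkpoint). Model, batch, and hyperparameters
# are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
head = nn.Linear(1024, 10).to(device)
opt = torch.optim.AdamW(list(block.parameters()) + list(head.parameters()), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 1024, device=device)
y = torch.randint(0, 10, (8,), device=device)

opt.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    # checkpoint() discards block's intermediate activations in the forward
    # pass and recomputes them during backward: less memory, more compute.
    h = checkpoint(block, x, use_reentrant=False)
    loss = nn.functional.cross_entropy(head(h), y)

# GradScaler rescales the loss so FP16 gradients do not underflow.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```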
Part B: Model Parallelism Comparison
Compare tensor parallelism and pipeline parallelism:
- How each works; a tensor-parallel sketch follows this list
- Communication patterns and collectives used
- When to use each (practical scenarios)
- Typical performance bottlenecks and how to mitigate them
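To ground the comparison, a single-process sketch of the Megatron-style column/row split of an MLP follows. The shapes, the 2-way split, and the use of a plain Python sum in place of a torch.distributed all_reduce are simplifying assumptions.

```python
# Megatron-style tensor-parallel MLP split, shown single-process with plain
# slicing; in a real setup each shard lives on its own GPU and the final sum
# is a torch.distributed all_reduce. Shapes and the 2-way split are arbitrary.
import torch

torch.manual_seed(0)
d_model, d_ff, tp = 8, 16, 2           # hidden size, FFN size, tensor-parallel degree
x = torch.randn(4, d_model)            # a batch of token activations
W1 = torch.randn(d_model, d_ff)        # first linear: column-parallel split
W2 = torch.randn(d_ff, d_model)        # second linear: row-parallel split

ref = torch.relu(x @ W1) @ W2          # unsharded reference forward pass

# Column-parallel: each rank computes its slice of the intermediate activation
# locally, with no communication (ReLU is elementwise, so slices stay independent).
W1_shards = W1.chunk(tp, dim=1)
# Row-parallel: each rank multiplies by the matching rows of W2, producing a
# partial output that must be summed across ranks.
W2_shards = W2.chunk(tp, dim=0)

partials = [torch.relu(x @ W1_shards[r]) @ W2_shards[r] for r in range(tp)]
out = sum(partials)                    # stands in for dist.all_reduce on each rank

print(torch.allclose(out, ref, atol=1e-5))  # True: sharded math matches the reference
```

The all-reduce after the row-parallel linear is the communication pattern a Part B answer should contrast with pipeline parallelism's point-to-point activation sends between stages.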