Accelerating PyTorch Inference: TorchDynamo, Techniques, and Benchmark Design
Context
You are asked to explain how PyTorch's compilation stack accelerates inference and to design a fair, reproducible benchmark for measuring improvements over a vanilla PyTorch baseline.
Tasks
A) TorchDynamo (aka PyTorch Dynamo) for Inference
Describe:
- What TorchDynamo does to accelerate inference.
- How it captures and compiles graphs (graph breaks, guards, shape specialization).
- How it relates to TorchInductor and to external backends (e.g., TensorRT); see the sketch after this list.
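For part A, a minimal sketch of the usual entry point is shown below, assuming a CUDA device and a toy MLP; the model, the shapes, and the `mode="reduce-overhead"` setting are illustrative choices, not part of the brief.

```python
import torch
import torch.nn as nn

# Illustrative model; any nn.Module is handled the same way.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).eval().cuda()

# torch.compile drives TorchDynamo's graph capture and, by default, hands the
# captured FX graphs to TorchInductor for code generation. mode="reduce-overhead"
# additionally wraps the compiled region in CUDA Graphs to cut launch overhead.
# An external backend (e.g., torch-tensorrt, if installed) would be selected via
# the backend= argument instead of the default TorchInductor backend.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda")
with torch.inference_mode():
    # The first call compiles and installs guards (dtype, device, shapes);
    # later calls that satisfy the guards reuse the compiled code, while a
    # shape change triggers recompilation unless dynamic=True is passed.
    out = compiled(x)
```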
B) Techniques to Speed Up Inference (briefly list and define)
Include, at minimum:
- Data/model/pipeline parallelism
- Effective batching
- Operator fusion
- Quantization
- Kernel autotuning
- CUDA Graphs
- Overlapping compute and data transfer (pinned memory, streams); a minimal sketch follows this list
- Sparsity
- Caching (e.g., KV-cache, allocator)
- Graph-level compilers
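To make one item from the list concrete (overlapping compute and data transfer), the sketch below prefetches pinned host batches on a side CUDA stream while the default stream runs the model; the model, batch shapes, and the `prefetch` helper are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).eval().cuda()   # illustrative model
copy_stream = torch.cuda.Stream()             # side stream for host-to-device copies

# Page-locked (pinned) host memory lets the copies run asynchronously.
batches = [torch.randn(64, 4096).pin_memory() for _ in range(8)]

def prefetch(i):
    # Enqueue the copy of batch i on the side stream.
    with torch.cuda.stream(copy_stream):
        return batches[i].to("cuda", non_blocking=True)

with torch.inference_mode():
    cur = prefetch(0)
    for i in range(len(batches)):
        # The compute (default) stream waits for the copy that produced `cur`.
        torch.cuda.current_stream().wait_stream(copy_stream)
        # Tell the caching allocator `cur` is also consumed on the compute stream.
        cur.record_stream(torch.cuda.current_stream())
        # Start copying the next batch while the current one is being computed on.
        nxt = prefetch(i + 1) if i + 1 < len(batches) else None
        out = model(cur)                      # compute overlaps the in-flight copy
        cur = nxt
    torch.cuda.synchronize()
```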
C) Design a Fair Inference Benchmark
Specify:
- Metrics: latency (p50/p95/p99), throughput, GPU/SM utilization, memory.
- Test setup: GPU model, software versions, precision, batch size, sequence length, warmup and iteration counts, concurrency.
- Baselines: vanilla PyTorch eager.
- Reporting: absolute values and percentage improvements (a minimal harness sketch follows this list).
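As a starting point for part C, a minimal timing harness is sketched below, assuming a CUDA device; the model, batch size, warmup and iteration counts, and the `bench`/`report` helper names are placeholders to be replaced by the actual workload and the configuration reported with the results.

```python
import statistics
import time

import torch
import torch.nn as nn

def bench(fn, x, warmup=20, iters=200):
    """Return per-iteration latencies in milliseconds."""
    with torch.inference_mode():
        for _ in range(warmup):        # warmup absorbs compilation, autotuning, caches
            fn(x)
        torch.cuda.synchronize()
        times = []
        for _ in range(iters):
            start = time.perf_counter()
            fn(x)
            torch.cuda.synchronize()   # include kernel completion in each measurement
            times.append((time.perf_counter() - start) * 1e3)
    return times

def report(name, times, batch_size):
    times = sorted(times)
    p = lambda q: times[int(q * (len(times) - 1))]   # simple nearest-rank percentile
    throughput = batch_size / (statistics.mean(times) / 1e3)
    print(f"{name}: p50={p(0.50):.2f} ms  p95={p(0.95):.2f} ms  "
          f"p99={p(0.99):.2f} ms  throughput={throughput:.1f} samples/s")

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(),
                          nn.Linear(4096, 1024)).eval().cuda()
    x = torch.randn(32, 1024, device="cuda")

    report("eager", bench(model, x), batch_size=32)
    report("torch.compile", bench(torch.compile(model), x), batch_size=32)
```

Absolute latencies and throughput come straight from the printed numbers; percentage improvements are then computed against the eager run.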