This question evaluates knowledge of the end-to-end model-to-GPU execution pipeline, including frontend model representations, intermediate representations, lowering/compilation to device code, runtime memory management and scheduling, and common compiler/runtime optimizations with their trade-offs.
You are asked to explain the end-to-end path a machine learning model takes from authoring to high-performance inference on a GPU.
Walk through the stages below and describe what happens at each step:
1. Frontend model representation in the authoring framework
2. Export or conversion to an intermediate representation (IR)
3. Lowering and compilation of the IR to device code
4. Runtime execution on the GPU, including memory management and scheduling
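For concreteness, here is a minimal sketch of the first two stages, assuming PyTorch as the authoring framework and ONNX as the IR; the TinyMLP model, the file name, and the tensor names are illustrative only:

```python
# Sketch: frontend graph -> ONNX IR (assumes torch and onnx are installed;
# TinyMLP and "tiny_mlp.onnx" are made up for illustration).
import torch
import torch.nn as nn
import onnx

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
dummy = torch.randn(1, 16)

# Stage 1 -> 2: export the framework-level graph to a framework-agnostic IR.
torch.onnx.export(model, dummy, "tiny_mlp.onnx", input_names=["x"], output_names=["y"])

# The IR is a graph of generic ops (Gemm, Relu, ...) that a compiler/runtime
# later lowers to device kernels.
print(onnx.helper.printable_graph(onnx.load("tiny_mlp.onnx").graph))
```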
Then, discuss common compiler/runtime optimization techniques and their trade-offs (for example, operator fusion, constant folding, data-layout changes, reduced precision and quantization, kernel autotuning, and memory planning).
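As one concrete optimization with a visible trade-off, here is a minimal dynamic-quantization sketch using ONNX Runtime's quantization tooling; the file names are illustrative and carried over from the export sketch above:

```python
# Dynamic INT8 quantization: weights are stored as int8, trading a possible
# small accuracy drop for a smaller model and often lower latency/bandwidth.
# Assumes onnxruntime is installed; file names are illustrative.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="tiny_mlp.onnx",        # FP32 model from the export sketch
    model_output="tiny_mlp.int8.onnx",  # quantized artifact
    weight_type=QuantType.QInt8,
)
```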
Finally, contrast the considerations for data-center versus edge hardware targets (for example, throughput versus latency, batch sizes, memory and power budgets, and supported precisions).
Assume a modern GPU software stack with an IR (e.g., ONNX/MLIR/Relay/XLA), a compiler/runtime (e.g., ONNX Runtime, vendor-specific runtimes), and access to common math libraries (e.g., BLAS/DNN). Keep your explanation structured and concise.
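Under these assumptions, a minimal end-to-end inference sketch might use ONNX Runtime with its CUDA execution provider (falling back to CPU) and graph-level optimizations enabled; the model file and tensor names come from the illustrative export above:

```python
import numpy as np
import onnxruntime as ort

# Let the runtime apply all graph-level rewrites (constant folding, fusions,
# layout changes) before kernels are selected.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Prefer the GPU execution provider; fall back to CPU if no CUDA device is found.
sess = ort.InferenceSession(
    "tiny_mlp.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.randn(1, 16).astype(np.float32)
(y,) = sess.run(None, {"x": x})  # "x"/"y" match the names used at export time
print(y.shape)
```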