Kernel Fusion

What's being tested

Interviewers are probing whether you can reason about ML inference performance as a systems problem: memory bandwidth, launch overhead, GPU occupancy, compiler lowering, and hardware fit. For NVIDIA, this matters because small framework-level choices can determine whether a model fully uses CUDA cores, Tensor Cores, memory hierarchy, and interconnect bandwidth. A strong Software Engineer answer should connect high-level operators from PyTorch or TensorFlow to lower-level execution: graph capture, compiler IR, scheduling, layout, tiling, and runtime benchmarking. The goal is not to invent a new neural architecture; it is to explain how to make an existing model execute faster, cheaper, and more predictably on GPU hardware.

Core knowledge

Kernel fusion combines multiple adjacent operations into one GPU kernel to reduce kernel launch overhead, avoid intermediate global-memory writes, and improve cache/register reuse. Common examples include bias + activation, matmul + bias + GELU, layernorm variants, and elementwise chains like add -> relu -> dropout.
Memory bandwidth is often the bottleneck for elementwise and normalization-heavy workloads. A simple unfused chain may read and write tensors multiple times; fusion can reduce traffic from roughly $k \times N$ reads/writes to closer to one streaming pass, improving arithmetic intensity: $\text{arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes moved}}.$
Kernel launch overhead is nontrivial for small tensors or many tiny operators. Even if each individual kernel is fast, launching hundreds of kernels per inference request adds latency. Fusion helps most when the workload has many small memory-bound ops rather than one dominant compute-bound GEMM.
Operator fusion has limits. Fusing too much can increase register pressure, reduce occupancy, hurt instruction cache behavior, or prevent reuse of highly optimized library kernels like cuBLAS, cuDNN, or CUTLASS. A strong answer says “fuse cheap surrounding ops into expensive kernels when profitable,” not “fuse everything.”
ML compilation stacks typically move from eager framework code to graph capture, IR optimization, code generation, and runtime execution. In PyTorch 2.x, TorchDynamo captures Python bytecode into FX graphs, AOTAutograd traces the forward and backward graphs ahead of time, and TorchInductor lowers those graphs to generated Triton kernels on GPU or C++ on CPU.
Graph breaks are a practical failure mode. Dynamic Python control flow, data-dependent shapes, unsupported ops, mutation, or custom extensions can split the captured graph, force the offending region back to eager execution, and block fusion across the break. In an interview, mention inspecting compiler logs, generated kernels, and fallback paths rather than assuming the whole model compiles.
Layout selection affects fusion and hardware utilization. Formats like NCHW, NHWC, blocked layouts, or tensor-core-friendly tiling determine whether memory accesses coalesce and vectorize. For NVIDIA GPUs, dimensions aligned to multiples such as 8, 16, or 32 often matter for Tensor Core efficiency, depending on dtype and architecture.
Tiling and scheduling decide how work maps to threads, warps, shared memory, and registers. For matrix-like operations, good tiling maximizes data reuse and coalesced loads; for reductions, scheduling must handle parallel reduction, synchronization, and numerical behavior. Poor tiling can erase the benefit of fusion.
Quantization and fusion interact. INT8 or FP8 inference may fuse dequantization, matrix multiply, scaling, bias, and activation to avoid extra conversions. But quantized pipelines require careful calibration, supported hardware instructions, and validation that accuracy and latency both improve.
Memory planning reduces allocation overhead and peak memory by reusing buffers whose lifetimes do not overlap. Compilers can perform liveness analysis across a static graph; dynamic shapes make this harder. For inference services, lower peak memory can increase batch capacity and reduce p99 latency spikes from allocator pressure.
Benchmarking must separate warmup, compilation time, steady-state latency, throughput, and tail latency. Use torch.compile, CUDA Graphs, or TensorRT-style engines carefully: first-run compile/build time may be irrelevant for long-running services but unacceptable for cold-start workloads.
Parallelism choices are adjacent but distinct. Tensor parallelism splits individual tensor operations across GPUs, often requiring collectives like all-reduce; pipeline parallelism splits layers across stages, improving model capacity but introducing bubbles. Fusion primarily optimizes local execution, while parallelism addresses model size and multi-GPU throughput.
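
To make the memory-traffic argument concrete, here is a minimal sketch that counts global-memory bytes for an elementwise chain with and without fusion. It assumes every op is unary with one FLOP per element, which simplifies real ops such as a bias add; the function names and tensor size are illustrative, not taken from any library.

```python
def chain_traffic_bytes(num_elements, num_ops, bytes_per_element=2, fused=False):
    """Global-memory bytes moved by a chain of unary elementwise ops.

    Unfused: every op reads its input and writes its output (2 passes per op).
    Fused: the chain reads the input once and writes the final output once.
    """
    passes = 2 if fused else 2 * num_ops
    return passes * num_elements * bytes_per_element


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of global-memory traffic."""
    return flops / bytes_moved


n = 1 << 20      # elements in the tensor (assumed size)
k = 3            # ops in the chain, e.g. add -> relu -> dropout
flops = k * n    # assumption: one FLOP per element per op

unfused = chain_traffic_bytes(n, k, fused=False)
fused = chain_traffic_bytes(n, k, fused=True)

print(f"unfused: {unfused / 1e6:.1f} MB, {arithmetic_intensity(flops, unfused):.2f} FLOP/B")
print(f"fused:   {fused / 1e6:.1f} MB, {arithmetic_intensity(flops, fused):.2f} FLOP/B")
```

Under these assumptions the fused chain moves one third of the bytes for the same FLOPs, so arithmetic intensity rises by a factor of k. Both cases stay memory-bound, which is why the saving tends to show up almost directly as latency.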

Worked example

For “Design and benchmark optimized inference pipelines”, start by framing the workload: “I’d first ask about model type, batch-size range, latency SLO, target GPU, precision constraints, dynamic shapes, and whether this is online serving or offline batch inference.” Then declare an assumption, such as “we have a PyTorch transformer-like model deployed on NVIDIA GPUs with strict p50 and p99 latency targets.” Organize the answer around four pillars: graph capture/compilation, kernel-level optimizations, runtime serving configuration, and measurement methodology.

For graph capture, mention trying torch.compile with TorchInductor, checking graph breaks, and validating generated Triton kernels or fallback eager ops. For kernel-level optimization, describe fusing elementwise chains, fusing bias + GELU, using optimized attention kernels where applicable, choosing FP16, BF16, INT8, or FP8 based on accuracy and hardware, and avoiding unnecessary host-device synchronization. For runtime, discuss batching strategy, CUDA Graphs for stable shapes, memory preallocation, and pinned host memory if requests involve CPU-GPU transfer.

A specific tradeoff to flag: aggressive dynamic batching can improve throughput but may worsen p99 latency; for online inference, you might cap queue delay or maintain separate engines for common shape buckets. Close by saying you would compare eager PyTorch, torch.compile, TensorRT, and possibly custom Triton kernels using identical inputs, warmups, synchronization, and profiler traces. If you had more time, you would add production observability: per-stage latency, GPU utilization, memory bandwidth counters, kernel timeline analysis, and regression tests for accuracy drift after quantization.
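
The queue-delay cap can be illustrated with a small, framework-free sketch. It models one possible policy, not the behavior of any particular serving system: a batch is dispatched when it fills up or when its oldest request has waited `max_queue_delay_ms`. The function names and arrival times are hypothetical.

```python
def form_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times into batches.

    A batch is dispatched when it is full, or when its oldest request has
    waited max_queue_delay_ms, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival times]) tuples.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The open batch hit its deadline before this request arrived.
        if current and t > current[0] + max_queue_delay_ms:
            batches.append((current[0] + max_queue_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # Leftover requests are flushed at the deadline of the oldest one.
        batches.append((current[0] + max_queue_delay_ms, current))
    return batches


def max_wait_ms(batches):
    """Worst-case time any request spent queued before dispatch."""
    return max(dispatch - min(reqs) for dispatch, reqs in batches)


arrivals = [0, 1, 2, 10, 11, 30]  # hypothetical arrival times in ms
tight = form_batches(arrivals, max_batch_size=4, max_queue_delay_ms=5)
loose = form_batches(arrivals, max_batch_size=4, max_queue_delay_ms=50)

print(len(tight), "batches, worst wait", max_wait_ms(tight), "ms")
print(len(loose), "batches, worst wait", max_wait_ms(loose), "ms")
```

With the looser cap the same traffic forms fewer, larger batches, but the worst-case queueing time grows with the cap. That is the throughput versus p99 tension described above.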

A second angle

For “Explain ML compilation optimizations and hardware fit”, the same concept is less about designing a full serving pipeline and more about explaining how compiler transformations map to GPU realities. Start from the model graph and walk downward: canonicalization, constant folding, dead-code elimination, fusion, layout propagation, memory planning, scheduling, and code generation. The constraint is that every compiler optimization should be tied to hardware effects: fewer global-memory round trips, better coalescing, higher occupancy, fewer launches, or use of Tensor Cores. A good answer also names when compiler automation is insufficient: custom ops, unstable dynamic shapes, numerically sensitive reductions, or cases where library kernels outperform generated fused kernels. This framing shows you understand not just what fusion is, but why a compiler may or may not legally and profitably apply it.
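
Of the passes listed, memory planning is easy to sketch in isolation. The following is a simplified greedy allocator driven by tensor lifetimes, assuming a static graph whose steps are already ordered. It is an illustration of liveness-based reuse, not how any specific compiler implements it, and the tensor names and sizes are hypothetical.

```python
def plan_buffers(lifetimes):
    """Greedy buffer assignment from tensor lifetimes.

    lifetimes: dict name -> (first_use_step, last_use_step, size_bytes)
    Returns (dict name -> buffer index, list of buffer sizes in bytes).
    """
    buffers = []      # each entry: [size_bytes, last step at which it is busy]
    assignment = {}
    by_start = sorted(lifetimes.items(), key=lambda kv: kv[1][0])
    for name, (start, end, size) in by_start:
        for idx, buf in enumerate(buffers):
            # Reuse a buffer only if its previous tenant is dead and it fits.
            if buf[1] < start and buf[0] >= size:
                buf[1] = end
                assignment[name] = idx
                break
        else:
            buffers.append([size, end])
            assignment[name] = len(buffers) - 1
    return assignment, [b[0] for b in buffers]


# Hypothetical chain: each intermediate is produced at one step and
# consumed at the next, so only two tensors are ever live at once.
lifetimes = {
    "a": (0, 1, 4096),
    "b": (1, 2, 4096),
    "c": (2, 3, 4096),
    "d": (3, 4, 4096),
}
assignment, sizes = plan_buffers(lifetimes)
naive_bytes = sum(size for _, _, size in lifetimes.values())

print(assignment)
print("planned:", sum(sizes), "bytes, naive:", naive_bytes, "bytes")
```

Here four intermediates share two ping-pong buffers, halving peak memory for this chain. With dynamic shapes the sizes are not known ahead of time, which is why this analysis gets harder.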

Common pitfalls

Pitfall: Saying “kernel fusion always makes inference faster.”

This is wrong because fusion can increase register usage, reduce occupancy, block use of cuBLAS/cuDNN, or duplicate computation across branches. A better answer describes a cost model: fuse memory-bound producer-consumer chains when intermediate tensors are large or launch overhead dominates, but benchmark against optimized vendor kernels.
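
One way to picture such a cost model is a roofline-style estimate: launch overhead plus the slower of memory time and compute time. The sketch below is deliberately crude. The hardware numbers are made-up round figures, and `fused_slowdown` is a hypothetical stand-in for effects such as register pressure or lost occupancy, not a measured quantity.

```python
def kernel_time_us(bytes_moved, flops, bandwidth_gb_per_s, peak_gflops, launch_us):
    """Launch overhead plus the slower of memory time and compute time."""
    mem_us = bytes_moved / (bandwidth_gb_per_s * 1e9) * 1e6
    compute_us = flops / (peak_gflops * 1e9) * 1e6
    return launch_us + max(mem_us, compute_us)


def compare_fusion(num_elements, num_ops, bytes_per_element, hw, fused_slowdown=1.0):
    """Return (fuse?, unfused_us, fused_us) for a chain of unary elementwise ops."""
    pass_bytes = 2 * num_elements * bytes_per_element  # one read plus one write
    unfused_us = num_ops * kernel_time_us(pass_bytes, num_elements, **hw)
    fused_body_us = (
        kernel_time_us(pass_bytes, num_ops * num_elements, **hw) - hw["launch_us"]
    )
    fused_us = hw["launch_us"] + fused_slowdown * fused_body_us
    return fused_us < unfused_us, unfused_us, fused_us


# Assumed round numbers, not the specification of any real GPU.
HW = {"bandwidth_gb_per_s": 1000.0, "peak_gflops": 10000.0, "launch_us": 5.0}

small = compare_fusion(1 << 10, 3, 2, HW)                       # launch-dominated
large = compare_fusion(1 << 24, 3, 2, HW)                       # bandwidth-dominated
penalized = compare_fusion(1 << 24, 3, 2, HW, fused_slowdown=4.0)

for label, (fuse, unfused_us, fused_us) in [
    ("small", small), ("large", large), ("penalized", penalized)
]:
    print(f"{label}: fuse={fuse}, unfused={unfused_us:.1f} us, fused={fused_us:.1f} us")
```

The model favors fusion for tiny tensors because launches dominate, and for large tensors because traffic dominates. It flips only when the fused kernel is assumed to run much less efficiently, which is the case that has to be confirmed by benchmarking rather than by the model.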

Pitfall: Staying at framework buzzword level.

Answers like “use torch.compile, TensorRT, and quantization” sound shallow if you cannot explain what changes at runtime. A better answer traces one operator chain, such as matmul -> bias -> GELU -> dropout, and explains which part remains a library GEMM, which part can be fused, and how memory traffic changes. Note that dropout is a no-op in inference mode, so it only matters when the same chain is discussed for training.
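
A toy partitioner makes that trace explicit. It assumes the GEMM stays a separate library call and that consecutive elementwise ops are merged into one generated kernel; it then counts how many times the activation tensor crosses global memory, ignoring the GEMM's operand reads. The op names and the `ELEMENTWISE` set are illustrative only.

```python
ELEMENTWISE = {"bias", "gelu", "dropout", "add", "relu"}


def partition_chain(ops):
    """Group a linear op chain into kernels.

    Non-elementwise ops (e.g. matmul) get their own kernel; runs of
    consecutive elementwise ops are merged into a single fused kernel.
    """
    groups = []
    for op in ops:
        if op in ELEMENTWISE and groups and groups[-1][0] in ELEMENTWISE:
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups


def activation_passes(groups):
    """Count global-memory passes over the activation tensor.

    A GEMM kernel writes its output once; an elementwise kernel reads its
    input once and writes its output once, however many ops it contains.
    """
    return sum(2 if g[0] in ELEMENTWISE else 1 for g in groups)


chain = ["matmul", "bias", "gelu", "dropout"]
unfused_groups = [[op] for op in chain]
fused_groups = partition_chain(chain)

print(fused_groups)
print("unfused passes:", activation_passes(unfused_groups))
print("fused passes:  ", activation_passes(fused_groups))
```

In this model the chain goes from four kernels and seven activation passes to two kernels and three passes. Folding the bias and activation into the GEMM's epilogue, where the library supports it, would remove the separate elementwise kernel as well.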

Pitfall: Ignoring measurement correctness.

A tempting but weak benchmark is timing Python code with time.time() around asynchronous GPU calls, which measures how long it takes to enqueue work rather than to finish it. A stronger answer mentions warmup, torch.cuda.synchronize() or CUDA events, and fixed seeds and inputs; it separates compile/build time from steady state and reports both latency distribution and throughput under realistic batch sizes.
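
The shape of such a harness can be sketched without any GPU dependency. In the version below, `sync` is a placeholder for a device barrier; on a CUDA build of PyTorch one would pass `torch.cuda.synchronize`. The default no-op and the CPU workload are only there so the sketch runs anywhere, and the percentile method is a simple nearest-rank choice.

```python
import statistics
import time


def benchmark(fn, sync=lambda: None, warmup=10, iters=100):
    """Time fn() in steady state and report latency percentiles.

    sync must block until all queued device work has finished; without it,
    asynchronous launches make the measured time meaninglessly small.
    """
    for _ in range(warmup):      # absorbs compilation, caching, allocator warmup
        fn()
    sync()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()                   # wait for completion before reading the clock
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()

    def pct(p):
        index = min(len(samples_ms) - 1, int(p / 100 * len(samples_ms)))
        return samples_ms[index]

    return {
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "mean_ms": statistics.mean(samples_ms),
        "throughput_per_s": 1e3 * iters / sum(samples_ms),
    }


# Stand-in CPU workload so the sketch runs without a GPU.
stats = benchmark(lambda: sum(range(10_000)), warmup=5, iters=50)
print(stats)
```

Synchronizing on every iteration measures per-call latency. For throughput on a real device, a common variant times a whole batch of iterations with a single barrier at the end, or uses CUDA events so the host clock is not part of the measurement.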

Connections

Interviewers may pivot from fusion into quantization, GPU memory hierarchy, distributed model parallelism, or compiler IR design. Be ready to compare TorchInductor, Triton, TensorRT, XLA, and TVM at a high level, especially how they represent graphs, lower ops, and decide whether an optimization is legal and profitable.
