Low-Latency/Batch Inference and GPU Resource Management

What's being tested

Interviewers are probing your ability to design and operate low-latency inference pipelines that meet strict SLOs while maximizing GPU utilization and controlling cost. They expect you to reason about queueing, batching, memory residency, hardware primitives (e.g., tensor cores, NVLink), and tradeoffs between latency, throughput, and model accuracy — all from the Machine Learning Engineer’s scope: deployment, runtime behavior, and model-level optimizations, not raw network or cluster orchestration.

Core knowledge

Latency decomposition — end-to-end inference latency = queue_wait + host↔device_transfer + model_compute + postprocess; reduce whichever dominates your p99 tail first.
Key SLOs and metrics — monitor latency at p50/p95/p99, throughput (qps), GPU utilization, queue depth, and cold-start rate; SLOs for user-facing services are typically expressed on p99 latency.
Batching tradeoff — increasing batch_size improves throughput until the hardware saturates, but latency rises with it: each request waits for its batch to fill (queue_wait grows with batch_fill_delay) and compute(batch_size) also grows. Throughput ≈ batch_size / (transfer + compute(batch_size)). Tune max_batch_size and max_queue_delay per model.
Dynamic batching algorithms — use time-window coalescing, size-thresholding, or adaptive controllers that expand or contract batch windows based on load to hit latency SLOs while maximizing utilization.
Model residency & warmup — keep hot models resident in GPU memory; warmup by running representative batches to populate JIT/TF-TRT/ONNX caches and avoid first-inference tails.
Frameworks & runtimes — pragmatic choices: Triton Inference Server for multi-model serving and dynamic batching, TensorRT and ONNX Runtime for optimized kernels, TorchScript/TorchServe for PyTorch models. Know the constraints and batching knobs each one exposes.
Precision & accuracy tradeoffs — mixed precision (FP16/bfloat16) and INT8 quantization reduce memory and increase throughput; INT8 additionally requires calibration to bound accuracy drift. Use post-training quantization or quantization-aware training as needed.
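GPU sharing options — NVIDIA MIG (A100 and later data-center GPUs) partitions a GPU into hardware-isolated slices with dedicated memory and compute, which keeps latency variance low; CUDA MPS lets multiple processes share one GPU concurrently; default time-slicing between processes is the simplest option but adds context-switch overhead and tends to increase latency variance. Measure before choosing.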
Memory accounting — inference memory ≈ model_weights + activation_memory(batch_size) + workspace; estimate activation_memory ≈ batch_size × per-sample activation bytes. If the sum exceeds GPU memory, use smaller batches, lower precision, or shard the model across devices.
Data transfer overlap — overlap host→device transfers (over PCIe, or NVLink where the platform provides it) with compute using CUDA streams and pinned memory; staging inputs ahead of time reduces CPU-side bottlenecks and improves p99.
Model optimizations — operator fusion, kernel autotuning, pruning, distillation, and conversion to TensorRT/ONNX can yield 2–10× speedups, depending on the model and hardware; re-measure both accuracy and latency after conversion.
Autoscaling & cost controls — prefer instance-level autoscaling with warm pools; use predictive scaling to pre-warm GPUs for diurnal patterns and avoid cold-start p99 spikes.
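
The size-threshold and time-window policies described under "Dynamic batching algorithms" can be sketched as an offline simulation. This is a minimal illustration, not any server's actual scheduler: the function name coalesce is hypothetical, and the parameters reuse the max_batch_size and max_queue_delay knobs named above. A real server would dispatch on a timer when the deadline passes; here the flush happens when the next arrival is seen, which produces the same grouping.

```python
from typing import List


def coalesce(arrivals_ms: List[float],
             max_batch_size: int,
             max_queue_delay_ms: float) -> List[List[float]]:
    """Group sorted arrival timestamps into batches.

    A batch is dispatched when it reaches max_batch_size, or when its
    oldest request has waited longer than max_queue_delay_ms.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrivals_ms:
        # Window opened by the oldest queued request has expired: flush it.
        if current and t - current[0] > max_queue_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # Size threshold reached: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print(coalesce([0, 1, 2, 3, 20, 21, 40], max_batch_size=3, max_queue_delay_ms=5))
# -> [[0, 1, 2], [3], [20, 21], [40]]
```

Under bursty load the size threshold fires first and batches are full; under sparse load the time window bounds how long a lone request can wait.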

Tip: Prototype performance envelopes (latency vs batch) on representative hardware; synthetic microbenchmarks rarely predict real-system tails.
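
One way to turn measurements into a latency-vs-batch envelope is a small sweep like the sketch below. It assumes a linear compute model (fixed_ms + per_item_ms × batch_size) and a steady arrival rate, and it ignores queueing behind earlier batches; all coefficients are placeholder values that would come from profiling on the target hardware, and the function names are hypothetical.

```python
from typing import Iterable, Optional


def worst_case_latency_ms(batch_size: int, qps: float, fixed_ms: float,
                          per_item_ms: float, transfer_ms: float) -> float:
    """Latency seen by the first request of a batch under steady arrivals."""
    # The first request waits for the remaining (batch_size - 1) arrivals.
    fill_ms = (batch_size - 1) / qps * 1000.0
    # Assumed linear compute model; replace with measured values.
    compute_ms = fixed_ms + per_item_ms * batch_size
    return fill_ms + transfer_ms + compute_ms


def largest_batch_within_budget(
        budget_ms: float, qps: float, fixed_ms: float, per_item_ms: float,
        transfer_ms: float,
        candidates: Iterable[int] = (1, 2, 4, 8, 16, 32, 64, 128)) -> Optional[int]:
    """Return the largest candidate batch size that fits the latency budget."""
    ok = [b for b in candidates
          if worst_case_latency_ms(b, qps, fixed_ms, per_item_ms, transfer_ms) <= budget_ms]
    return max(ok) if ok else None


# Placeholder numbers: 2k qps, 45 ms left after postprocessing.
print(largest_batch_within_budget(45.0, 2000, fixed_ms=4.0, per_item_ms=0.25, transfer_ms=2.0))
# -> 32
```

With these placeholder coefficients a batch of 64 would take about 53.5 ms in the worst case, so 32 is the largest size that fits.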

Worked example

Example prompt: “Design a GPU-backed inference service for a ranking model that must meet 50 ms p99 latency at 2k qps.” First 30s: clarify SLOs (p99 vs p95), payload sizes, model size, accuracy constraints, and burstiness. Skeleton answer pillars: (1) latency budget split summing to the 50 ms target (queueing ≤10ms, transfer ≤5ms, compute ≤30ms, postprocess ≤5ms), (2) batching policy (max_batch_size, max_queue_delay), (3) runtime choices (Triton + TensorRT with FP16) and model residency, (4) autoscaling & warm pools to ensure capacity. A concrete tradeoff to flag: larger batches raise throughput but increase queue_wait and p99; choose adaptive batching where max_queue_delay is tuned so p99 remains ≤50ms while GPU utilization stays high. Also discuss using MIG if the model is small, so that multiple tenants can share one GPU and reduce cost. Close with scope-of-work: “if I had more time I’d run end-to-end load tests with realistic request jitter, collect latency percentiles (p50/p95/p99) under both steady load and cold starts, and iterate batch size and warm pool sizing using the observed latency envelope.”
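
The capacity side of this answer can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the batch size of 32, the 35 ms batch service time (transfer plus compute with no overlap), and the 70% utilization headroom are assumptions, not measurements, and replicas_needed is a hypothetical helper.

```python
import math

# Latency budget split from the worked example, in milliseconds.
BUDGET_MS = {"queueing": 10, "transfer": 5, "compute": 30, "postprocess": 5}


def replicas_needed(target_qps: float, batch_size: int,
                    batch_service_ms: float, headroom: float = 0.7) -> int:
    """Replicas required to sustain target_qps at a given utilization headroom."""
    # Throughput of one replica that processes one batch at a time.
    per_replica_qps = batch_size / (batch_service_ms / 1000.0)
    return math.ceil(target_qps / (per_replica_qps * headroom))


print(sum(BUDGET_MS.values()))        # -> 50
print(replicas_needed(2000, 32, 35))  # -> 4
```

The headroom factor leaves slack for bursts; running replicas near 100% utilization is what usually blows up queueing delay and the p99.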

A second angle

Consider a nightly batch scoring job that must process 100M users within a 4-hour window using the same GPU cluster. The core concept is identical (maximize GPU throughput under memory constraints), but the constraints shift from p99 latency to throughput and cost. Emphasize large fixed batches, operator fusion, INT8 quantization, and sharding the workload across many GPUs (shard the model itself only if it does not fit on one device). Here you prefer large batches that saturate tensor cores, use TensorRT engine caching, and exploit NVLink for multi-GPU aggregation. You’d trade warm pools for spot instances and checkpointed pipelines, and measure overall wall-clock throughput (examples/sec) and cost per 100k inferences.
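
The throughput target implied by this job follows directly from the numbers above. In the sketch below, the per-GPU throughput (2,000 examples/sec) and the price (2.0 per GPU-hour) are made-up placeholders to show the arithmetic, and the function names are hypothetical.

```python
import math


def required_throughput(num_examples: int, window_hours: float) -> float:
    """Examples per second needed to finish inside the window."""
    return num_examples / (window_hours * 3600)


def gpus_needed(num_examples: int, window_hours: float,
                per_gpu_examples_per_sec: float) -> int:
    """GPU count needed, assuming throughput scales linearly with GPUs."""
    return math.ceil(required_throughput(num_examples, window_hours)
                     / per_gpu_examples_per_sec)


def cost_per_100k(num_gpus: int, hours: float, price_per_gpu_hour: float,
                  num_examples: int) -> float:
    """Cost per 100k inferences, billing every GPU for the full duration."""
    return num_gpus * hours * price_per_gpu_hour / (num_examples / 100_000)


print(round(required_throughput(100_000_000, 4), 1))  # -> 6944.4
print(gpus_needed(100_000_000, 4, 2000))              # -> 4
```

Billing the full 4-hour window gives an upper bound on cost; with spot instances the checkpointing overhead and preemption rate would need to be folded into the per-GPU throughput estimate.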

Common pitfalls

Pitfall: Thinking “maximize batch size always” — this ignores queueing tails; a large batch may push p99 well beyond the SLO even as throughput improves. Always tune for the tail.

Pitfall: Ignoring data-transfer costs — even when compute is fast, a large host→device transfer can dominate latency; overlapping DMA with compute using pinned memory and CUDA streams is often necessary to meet tight SLOs.
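
A simple timing model shows why overlap matters for a stream of batches. This is a sketch of an idealized two-stage pipeline with constant per-batch transfer and compute times, not a measurement; the function names and the 4 ms / 6 ms figures are hypothetical. Note that overlap shortens the time to drain a queue of batches; a single isolated batch still pays transfer plus compute.

```python
def serial_ms(num_batches: int, transfer_ms: float, compute_ms: float) -> float:
    """Total time when each batch is copied, then computed, one after another."""
    return num_batches * (transfer_ms + compute_ms)


def pipelined_ms(num_batches: int, transfer_ms: float, compute_ms: float) -> float:
    """Total time when the copy of batch i+1 overlaps the compute of batch i."""
    if num_batches == 0:
        return 0.0
    # First copy cannot be hidden, last compute cannot be hidden;
    # in between, the slower stage sets the pace.
    return transfer_ms + (num_batches - 1) * max(transfer_ms, compute_ms) + compute_ms


print(serial_ms(10, 4.0, 6.0))     # -> 100.0
print(pipelined_ms(10, 4.0, 6.0))  # -> 64.0
```

When transfer time exceeds compute time, the pipeline becomes transfer-bound and further kernel optimization no longer reduces total time.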

Pitfall: Treating quantization as push-button — INT8 can introduce non-negligible accuracy regressions; always run calibration and label-based QA and have an accuracy guardrail for production rollouts.
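
An accuracy guardrail of the kind mentioned above can be as simple as comparing the quantized model against the full-precision baseline on a labeled set. The sketch below uses plain accuracy and a hypothetical max_drop threshold; a ranking model would more likely use a ranking metric, and the function names are illustrative.

```python
from typing import Dict, Sequence, Union


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def quantization_guardrail(fp_preds: Sequence[int], int8_preds: Sequence[int],
                           labels: Sequence[int],
                           max_drop: float = 0.01) -> Dict[str, Union[float, bool]]:
    """Compare baseline and quantized accuracy; flag drops above max_drop."""
    baseline = accuracy(fp_preds, labels)
    quantized = accuracy(int8_preds, labels)
    drop = baseline - quantized
    return {"baseline": baseline, "quantized": quantized,
            "drop": drop, "ok": drop <= max_drop}


labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
fp_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]    # one error
int8_preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # two errors
print(quantization_guardrail(fp_preds, int8_preds, labels, max_drop=0.05)["ok"])
# -> False
```

Wiring a check like this into the rollout pipeline makes the accuracy budget explicit instead of relying on a one-off calibration run.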

Connections

Autoscaling and workload forecasting (predictive warm pools) often intersect with SRE/infra, but an MLE must own the warmup strategy and the SLOs.
Model monitoring and drift detection: p99 latency shifts can coincide with input distribution drift (for example, when inputs grow longer or larger); bridge to model-quality monitors and feature-store freshness.