Low-Latency/Batch Inference and GPU Resource Management
Asked of: Machine Learning Engineer
What's being tested
Interviewers are probing your ability to design and operate low-latency inference pipelines that meet strict SLOs while maximizing GPU utilization and controlling cost. They expect you to reason about queueing, batching, memory residency, hardware primitives (e.g., tensor cores, NVLink), and tradeoffs between latency, throughput, and model accuracy — all from the Machine Learning Engineer’s scope: deployment, runtime behavior, and model-level optimizations, not raw network or cluster orchestration.
Core knowledge
- Latency decomposition: end-to-end inference latency = queue_wait + host↔device_transfer + model_compute + postprocess; reduce whichever term dominates your p99 tail first.
- Key SLAs and metrics: monitor p50/p95/p99, throughput (qps), GPU utilization, queue depth, and cold-start rate; SLOs are typically expressed on p99 for user-facing services.
- Batching tradeoff: increasing batch_size improves throughput until the hardware saturates, while latency grows because queue_wait ∝ batch_fill_delay. Without transfer/compute overlap, throughput ≈ batch_size / (transfer + compute(batch_size)). Tune max_batch_size and max_queue_delay per model.
- Dynamic batching algorithms: use time-window coalescing, size-thresholding, or adaptive controllers that expand or contract batch windows based on load to hit latency SLOs while maximizing utilization.
- Model residency & warmup: keep hot models resident in GPU memory; warm up by running representative batches to populate JIT/TF-TRT/ONNX caches and avoid first-inference tails.
- Frameworks & runtimes: pragmatic choices are Triton Inference Server for multi-model serving with dynamic batching, TensorRT and ONNX Runtime for optimized kernels, and TorchScript/TorchServe for PyTorch models. Know the constraints and batching knobs each exposes.
- Precision & accuracy tradeoffs: mixed precision (FP16/bfloat16) and INT8 quantization reduce memory and increase throughput, but INT8 in particular requires calibration to bound accuracy drift; use post-training quantization or quantization-aware training as needed.
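- GPU sharing options: NVIDIA MIG (A100, H100, and other supported GPUs) gives hardware-isolated slices with predictable latency; CUDA MPS lets multiple processes share one GPU concurrently; default time-slicing across processes adds context-switch overhead and can increase latency variance. Measure before choosing.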
- Memory accounting: inference memory = model_weights + activation_memory(batch) + workspace; estimate activation_memory ≈ batch_size × per-sample activation bytes. If the sum exceeds GPU memory, use smaller batches or shard the model.
- Data transfer overlap: overlap PCIe/NVLink host→device transfers with compute using CUDA streams and pinned memory; staging data ahead of compute reduces CPU bottlenecks and improves p99.
- Model optimizations: operator fusion, kernel autotuning, pruning, distillation, and conversion to TensorRT/ONNX can yield 2–10× speedups depending on model and hardware; measure accuracy/latency tradeoffs after conversion.
- Autoscaling & cost controls: prefer instance-level autoscaling with warm pools; use predictive scaling to pre-warm GPUs for diurnal patterns and avoid cold-start p99 spikes.
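The dynamic batching item in the list above (time-window coalescing combined with a size threshold) can be sketched with the Python standard library alone. This is a minimal illustration, not how Triton or any particular server implements it; `collect_batch`, `max_batch_size`, and `max_queue_delay_s` are names chosen for this sketch.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, max_queue_delay_s):
    """Block for the first request, then coalesce more requests until the
    batch is full or the delay window (started at the first request) expires."""
    batch = [request_queue.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_queue_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: ship a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived inside the window
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for request_id in range(10):
        q.put(request_id)
    # The first call fills to max_batch_size; the second drains what is left
    # and returns a partial batch once the window closes.
    print(collect_batch(q, max_batch_size=4, max_queue_delay_s=0.05))
    print(collect_batch(q, max_batch_size=8, max_queue_delay_s=0.05))
```

The two knobs map directly onto the tradeoff in the list: a larger `max_batch_size` raises throughput per GPU call, while `max_queue_delay_s` caps how much queue_wait any single request can accumulate.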
Tip: Prototype performance envelopes (latency vs batch) on representative hardware; synthetic microbenchmarks rarely predict real-system tails.
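One way to start on the latency-vs-batch envelope from the tip is a small timing harness. The sketch below is an assumption-laden starting point: `measure_envelope`, `percentile`, and `fake_model` are hypothetical names, `fake_model` is a sleep-based stand-in for a real model call, and the loop is closed (one batch at a time), so it captures compute latency only and says nothing about queueing tails under real traffic.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def measure_envelope(run_batch, batch_sizes, trials=200, warmup=10):
    """Time run_batch(batch_size) and report latency percentiles per size.

    For real GPU work, run_batch must block until the device has finished
    (in PyTorch, for example, by calling torch.cuda.synchronize()).
    """
    results = {}
    for bs in batch_sizes:
        for _ in range(warmup):  # discard warmup runs (caches, JIT, etc.)
            run_batch(bs)
        latencies_ms = []
        for _ in range(trials):
            start = time.perf_counter()
            run_batch(bs)
            latencies_ms.append((time.perf_counter() - start) * 1e3)
        p50 = percentile(latencies_ms, 50)
        results[bs] = {
            "p50_ms": p50,
            "p99_ms": percentile(latencies_ms, 99),
            "items_per_s": bs / (p50 / 1e3) if p50 > 0 else float("inf"),
        }
    return results


def fake_model(batch_size):
    """Stand-in for a model call: fixed overhead plus a per-item cost."""
    time.sleep(0.001 + 0.0001 * batch_size)


if __name__ == "__main__":
    for bs, stats in measure_envelope(fake_model, [1, 8, 32], trials=20).items():
        print(bs, stats)
```

Swapping `fake_model` for the real inference call on representative hardware gives the compute part of the envelope; an open-loop load test is still needed to see how queueing shapes p99.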
Worked example
Example prompt: “Design a GPU-backed inference service for a ranking model that must meet 50 ms p99 latency at 2k qps.” First 30s: clarify SLOs (p99 vs p95), payload sizes, model size, accuracy constraints, and burstiness.
Skeleton answer pillars: (1) latency budget split (queueing ≤10 ms, transfer ≤5 ms, compute ≤30 ms, postprocess ≤5 ms, summing to the 50 ms budget), (2) batching policy (max_batch_size, max_queue_delay), (3) runtime choices (Triton + TensorRT with FP16) and model residency, (4) autoscaling and warm pools to ensure capacity.
A concrete tradeoff to flag: larger batches raise throughput but increase queue_wait and p99; choose adaptive batching where max_queue_delay is tuned so p99 stays ≤50 ms while GPU utilization stays high. Also discuss using MIG if the model is small, so that multiple tenants can share one GPU and reduce cost. Close with scope-of-work: “if I had more time I’d run end-to-end load tests with realistic request jitter, collect latency at p50/p95/p99 including cold starts, and iterate on batch size and warm pool sizing using the observed latency envelope.”
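The budget and capacity arithmetic behind this answer can be checked in a few lines. This is a rough sketch: the budget split comes from the example above, but `plan`, the `headroom` factor, and the batch service time are assumptions for illustration, and treating per-stage budgets as additive is a simplification because percentiles do not add exactly.

```python
import math

SLO_P99_MS = 50.0
TARGET_QPS = 2000.0

# Budget split from the worked example, in milliseconds.
budget_ms = {"queue": 10.0, "transfer": 5.0, "compute": 30.0, "postprocess": 5.0}


def plan(batch_size, batch_service_ms, target_qps=TARGET_QPS, headroom=0.7):
    """Return (per-replica qps, replica count, batch fill time in ms).

    batch_service_ms is the measured time to serve one batch
    (transfer + compute); the value used below is made up.
    headroom is a target utilization that leaves slack for bursts.
    """
    per_replica_qps = batch_size / (batch_service_ms / 1e3)
    replicas = math.ceil(target_qps / (per_replica_qps * headroom))
    # Time for one replica's share of the traffic to fill a whole batch.
    fill_ms = batch_size / (target_qps / replicas) * 1e3
    return per_replica_qps, replicas, fill_ms


if __name__ == "__main__":
    assert sum(budget_ms.values()) <= SLO_P99_MS, "budget exceeds the SLO"
    qps, replicas, fill_ms = plan(batch_size=16, batch_service_ms=20.0)
    print(f"per-replica qps={qps:.0f}, replicas={replicas}, fill={fill_ms:.1f} ms")
    if fill_ms > budget_ms["queue"]:
        print("batches will not fill within the queue budget; expect partial batches")
```

With these hypothetical inputs, four replicas each see 500 qps, so a batch of 16 takes 32 ms to fill, well over the 10 ms queue budget. That is the tradeoff to call out: either accept partial batches capped by max_queue_delay or pick a smaller max_batch_size.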
A second angle
Consider a nightly batch scoring job that must process 100M users within a 4-hour window using the same GPU cluster. The core concept is identical (maximize GPU throughput under memory constraints), but the constraints shift from p99 latency to throughput and cost. Emphasize large fixed batches, operator fusion, INT8 quantization, and sharding across many GPUs. Here you prefer batches large enough to saturate tensor cores, cache built TensorRT engines, and exploit NVLink for multi-GPU aggregation. You’d trade warm pools for spot instances and checkpointed pipelines, and measure overall wall-clock throughput (examples/sec) and cost per 100k inferences.
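The sizing for this job follows from the 100M-user, 4-hour target and the memory accounting formula from the core knowledge list. The sketch below is illustrative only: the function names are hypothetical, and every hardware number (GPU memory, model size, per-GPU throughput, hourly price) is a made-up placeholder to be replaced with measured values.

```python
import math

USERS = 100_000_000
WINDOW_S = 4 * 3600  # 4-hour window in seconds

GiB = 1024 ** 3
MiB = 1024 ** 2


def max_batch_size(gpu_mem_bytes, weights_bytes, workspace_bytes, act_bytes_per_sample):
    """Largest batch with weights + batch * activation + workspace <= GPU memory."""
    free = gpu_mem_bytes - weights_bytes - workspace_bytes
    return max(0, free // act_bytes_per_sample)


def gpus_needed(per_gpu_items_per_s, users=USERS, window_s=WINDOW_S):
    """GPUs required to finish inside the window at a given per-GPU rate."""
    required_items_per_s = users / window_s
    return math.ceil(required_items_per_s / per_gpu_items_per_s)


def cost_per_100k(gpu_hourly_cost, per_gpu_items_per_s):
    """Cost of 100k inferences on one GPU at the given hourly price."""
    return gpu_hourly_cost / (per_gpu_items_per_s * 3600) * 100_000


if __name__ == "__main__":
    # Hypothetical: 40 GiB GPU, 2 GiB weights, 1 GiB workspace, 8 MiB per sample.
    print("max batch:", max_batch_size(40 * GiB, 2 * GiB, 1 * GiB, 8 * MiB))
    # Hypothetical measured throughput of 1500 items/s per GPU at 2.00 per hour.
    print("required items/s:", round(USERS / WINDOW_S, 1))
    print("gpus needed:", gpus_needed(1500))
    print("cost per 100k:", round(cost_per_100k(2.0, 1500), 4))
```

The required rate is about 6,944 users per second, so the GPU count is driven almost entirely by the measured per-GPU throughput; spot interruptions and retries argue for padding that count.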
Common pitfalls
Pitfall: Thinking “always maximize batch size.” This ignores queueing tails; a large batch may push p99 well beyond the SLO even if average latency improves. Always tune for the tail.
Pitfall: Ignoring data-transfer costs. Even with fast compute, a large host→device transfer can dominate latency; overlapping DMA with compute using pinned memory is often required to meet tight SLOs.
Pitfall: Treating quantization as push-button. INT8 can introduce non-negligible accuracy regressions; always run calibration and label-based QA, and keep an accuracy guardrail for production rollouts.
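The accuracy guardrail mentioned in the quantization pitfall can be as simple as a gate that compares the quantized model against the baseline on a labeled holdout set. This is a minimal sketch under assumptions: `passes_guardrail` and the `max_drop` threshold are hypothetical, and plain accuracy stands in for whatever metric the model is actually judged on (a ranking model would more likely use NDCG or AUC).

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def passes_guardrail(baseline_preds, quantized_preds, labels, max_drop=0.005):
    """Return (ok, baseline_acc, quantized_acc).

    ok is True when the quantized model loses at most max_drop absolute
    accuracy relative to the baseline on the same labeled set.
    """
    base = accuracy(baseline_preds, labels)
    quant = accuracy(quantized_preds, labels)
    return (base - quant) <= max_drop, base, quant


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    fp16_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # one mistake
    int8_preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # two mistakes
    ok, base, quant = passes_guardrail(fp16_preds, int8_preds, labels)
    print(f"baseline={base:.2f} quantized={quant:.2f} rollout_allowed={ok}")
```

Wiring a check like this into the rollout pipeline turns the guardrail into a blocking gate rather than a dashboard someone has to remember to read.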
Connections
- Autoscaling and workload forecasting (predictive warm pools) often intersect with SRE/infra, but an MLE must own warmup strategy and SLOs.
- Model monitoring and drift detection: p99 latency shifts can correlate with input distribution drift; bridge to model-quality monitors and feature-store freshness.
Further reading
- NVIDIA Triton Inference Server docs: practical knobs for dynamic batching and model residency.
- TensorRT Optimization Guide: patterns for mixed precision, INT8 calibration, and engine building.
Related concepts
- ML Inference APIs And GPU Batching (ML System Design)
- GPU And Batch Inference Operations
- Real-Time Edge Inference Optimization (ML System Design)
- Distributed Training and GPU Efficiency for Autonomy Models
- LLM Serving, Inference Scaling, KV Cache, and Latency-Cost Tradeoffs
- LLM Inference Serving, Batching, And KV Cache