GPU And Batch Inference Operations
Asked of: Machine Learning Engineer
What's being tested
Interviewers probe whether you can design and reason about GPU-backed batch inference pipelines that meet real SLAs (throughput, latency, cost) while remaining operable at Reddit scale. They want to see both system-level tradeoffs (batching strategy, scheduling, data transfer) and ML-specific knobs (precision, model partitioning, warm-up) a Machine Learning Engineer would own when deploying models to GPU clusters.
Core knowledge
- Batching tradeoff: Larger batch sizes increase GPU throughput (images/sec) but raise end-to-end latency; quantify with throughput ≈ batch_size × batches_per_second (equivalently batch_size / batch_latency) and measure p50/p95/p99 latencies against SLOs.
- GPU utilization vs latency SLOs: Target >60–70% GPU utilization for cost-efficiency; if the SLO requires low tail latency, accept lower utilization or use micro-batching / pre-warmed instances.
- Precision reduction: Use FP16 or INT8 (via TensorRT/ONNX Runtime) to reduce memory and increase throughput; validate numeric stability and downstream metric drift after quantization.
- Model optimization tooling: NVIDIA TensorRT, ONNX Runtime, and TorchScript convert and optimize networks for inference, and NVIDIA Triton Inference Server serves them with dynamic batching; each has different graph-fusion and kernel advantages.
- Data transfer costs: Moving tensors CPU↔GPU over PCIe (or NVLink) is non-trivial; amortize by batching pre/post-processing on CPU and overlapping transfers with compute via CUDA streams or gdr_copy/GPUDirect where available.
- Scheduling and orchestration: Use Kubernetes (with device plugins), Ray, or batch schedulers for large runs; pre-warm GPU pods to avoid cold-start latency and use node labels to match GPU type (e.g., A100, T4) to the workload.
- Model parallelism patterns: Use data parallelism for independent inferences; use model sharding (pipeline or tensor parallelism) only when a single model's memory exceeds one GPU. Prefer sharding for very large LLMs and accept the increased communication via NCCL.
- Throughput math & memory: GPU memory required ≈ model_weights + activation_size(batch) + workspace. Activation_size grows roughly linearly with batch size; monitor for OOM and tune batch_size accordingly.
- Micro-batching and request coalescing: Implement a short queuing window (e.g., 5–50 ms) to coalesce requests into GPU-friendly batches; measure the effect on tail latency (p99).
- Failure modes and retries: Define idempotency at the request level and backoff strategies; for long-running batch jobs, checkpoint progress and support chunked re-runs rather than all-or-nothing.
- Monitoring & observability: Track p50/p95/p99 latencies, GPU utilization, memory pressure, batch size distribution, queue length, and cost per million inferences; correlate model output metrics with changes in precision or batching.
- Cost optimization levers: Choose GPU type by performance-per-dollar (e.g., T4 for throughput-bound lightweight models, A100 for large transformers); use spot/preemptible instances with checkpointing only when preemptions are acceptable for the SLOs.
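The queuing-window idea from the micro-batching item can be explored offline before touching a serving stack. The following is a minimal simulation sketch, not a production batcher: the names `coalesce` and `queue_waits` are hypothetical, and it assumes a batch is dispatched either when it reaches `max_batch` or when `window_ms` has elapsed since its first request arrived.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    dispatch_ms: float        # time the batch is handed to the GPU
    arrivals_ms: List[float]  # arrival times of the requests in the batch


def coalesce(arrivals_ms: List[float], window_ms: float, max_batch: int) -> List[Batch]:
    """Group request arrival times into batches (hypothetical policy).

    A batch closes when it is full, or when window_ms has passed since
    its first request arrived, whichever happens first.
    """
    batches: List[Batch] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived: dispatch it.
        if current and t - current[0] > window_ms:
            batches.append(Batch(current[0] + window_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately, no further waiting.
        if len(current) == max_batch:
            batches.append(Batch(t, current))
            current = []
    if current:
        batches.append(Batch(current[0] + window_ms, current))
    return batches


def queue_waits(batches: List[Batch]) -> List[float]:
    """Time each request spent waiting in the queue before dispatch."""
    return [b.dispatch_ms - t for b in batches for t in b.arrivals_ms]


if __name__ == "__main__":
    arrivals = [0, 2, 4, 30, 31, 100]
    for window in (5, 10, 50):
        bs = coalesce(arrivals, window_ms=window, max_batch=4)
        print(window, [len(b.arrivals_ms) for b in bs], max(queue_waits(bs)))
```

Under this policy no request waits longer than `window_ms` in the queue, so the window is a direct upper bound on the latency added by coalescing; GPU execution time comes on top of it.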
Worked example — "Design a GPU batch inference pipeline for 1M images/day with 100 ms p95 latency"
Frame: start by clarifying SLAs (exact p95 target, burstiness, availability), data source (object store vs streaming), acceptable cost envelope, and model size. Organize the answer into three pillars: (1) inference model optimization (quantization, graph fusion), (2) serving architecture (Triton on Kubernetes, pre-warmed pods, request-coalescing micro-batcher), and (3) operational controls (autoscaling, monitoring, retry semantics). Quantify: 1M images/day ≈ 11.6 images/sec sustained; plan capacity for bursts (e.g., 10× peak) and choose a batch_size that keeps the GPU at 60–80% utilization while meeting the 100 ms p95; micro-batches of 8–32 are likely if per-image model inference is ~2–5 ms on a T4. Flag the tradeoff: a tighter p95 pushes you toward more instances and higher cost; relaxing p95 allows larger batches and fewer GPUs. Close by noting validation: run A/B traffic, measure production accuracy drift after FP16/INT8 conversion, and, if time permits, propose automated batch-size tuning and integration with a cost-aware autoscaler.
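The capacity arithmetic in this answer can be written out as a short back-of-envelope sketch. The function names are hypothetical, and two simplifying assumptions are made that the source does not state: batch latency scales linearly with batch size (a pessimistic bound, since GPUs usually amortize better than that), and worst-case request latency is the coalescing window plus one batch execution, ignoring queueing behind in-flight batches.

```python
import math

SECONDS_PER_DAY = 86_400


def sustained_rate(items_per_day: float) -> float:
    """Average arrival rate in items per second."""
    return items_per_day / SECONDS_PER_DAY


def gpus_needed(peak_rate: float, batch_size: int,
                batch_latency_ms: float, target_util: float) -> int:
    """GPUs required to serve peak_rate while staying at target_util."""
    # Items/sec one GPU delivers if it runs batches back to back.
    per_gpu = batch_size / (batch_latency_ms / 1000.0)
    return math.ceil(peak_rate / (per_gpu * target_util))


def worst_case_latency_ms(window_ms: float, batch_latency_ms: float) -> float:
    """Simplified bound: wait out the full window, then one batch execution."""
    return window_ms + batch_latency_ms


if __name__ == "__main__":
    avg = sustained_rate(1_000_000)   # about 11.6 images/sec
    peak = 10 * avg                   # 10x burst factor from the example
    batch_size = 16
    batch_ms = batch_size * 5.0       # assumed: 5 ms/image, linear scaling
    print(f"sustained {avg:.1f}/s, peak {peak:.1f}/s")
    print("GPUs at 70% utilization:", gpus_needed(peak, batch_size, batch_ms, 0.7))
    print("latency bound (10 ms window):", worst_case_latency_ms(10.0, batch_ms), "ms")
```

With these assumed numbers the raw compute need is small, so in an interview the GPU count is more likely to be driven by redundancy and burst headroom than by average throughput.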
A second angle — "Maximize throughput for nightly offline batch inference of 500M records"
Here latency SLOs are loose, so the focus shifts to throughput and cost. Use large batch sizes, multi-GPU data-parallel jobs, and possibly model sharding if a single model is too large for one GPU. Tradeoffs include checkpointing progress to survive preemptions (spot VMs), staging inputs as TFRecords/Parquet for efficient IO, and overlapping CPU preprocessing with GPU execution via pipelined workers. Use Ray or Spark for distributed orchestration, and Triton or a containerized custom runner per GPU for efficient scaling. Highlight validation: end-to-end throughput benchmarks, numeric equivalence after quantization, and downstream metric spot-checking.
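Checkpointed, chunked re-runs are the part of this design that most often gets hand-waved, so a small sketch may help. Everything here is illustrative: `run_job`, `predict`, and `sink` are hypothetical names, the checkpoint is a local JSON file rather than an object store, and the sink is assumed to be idempotent per chunk, because a preemption between writing results and saving the checkpoint causes that chunk to run again.

```python
import json
import os
from typing import Callable, List, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the next chunk to process (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_chunk"]


def save_checkpoint(path: str, next_chunk: int) -> None:
    """Write the checkpoint via a temp file and an atomic rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_chunk": next_chunk}, f)
    os.replace(tmp, path)


def run_job(records: Sequence, chunk_size: int,
            predict: Callable[[Sequence], List],
            checkpoint_path: str,
            sink: Callable[[int, List], None]) -> int:
    """Process records chunk by chunk, resuming from the last checkpoint.

    Returns the number of chunks processed in this invocation.
    """
    n_chunks = (len(records) + chunk_size - 1) // chunk_size
    start = load_checkpoint(checkpoint_path)
    for i in range(start, n_chunks):
        chunk = records[i * chunk_size:(i + 1) * chunk_size]
        sink(i, predict(chunk))                  # must be idempotent per chunk
        save_checkpoint(checkpoint_path, i + 1)  # commit only after the write
    return n_chunks - start
```

Committing the checkpoint only after the sink succeeds gives at-least-once processing per chunk; keying outputs by chunk index is one simple way to make the repeated write harmless.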
Common pitfalls
Pitfall: Ignoring transfer and CPU bottlenecks. A tempting design is "throw big batch sizes at GPUs" without measuring CPU preprocessing or PCIe transfer times; the GPU can be starved or memory-swapped, collapsing throughput. Always profile the full pipeline.
Pitfall: Not asking SLO/traffic-shape questions. Presenting a single-architecture answer without clarifying p95/p99, burstiness, or acceptable cost will sound incomplete; specify assumptions or ask them up front.
Pitfall: Over-optimizing one knob. Focusing only on precision (INT8) or only on orchestration (spot instances) without end-to-end validation risks numeric drift or brittle production behavior. Show balanced tradeoffs and monitoring plans.
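To make the first pitfall concrete, here is a rough sketch of how per-stage timings expose a starved GPU. It assumes the stages are fully overlapped (pipelined), so steady-state throughput is set by the slowest stage; the stage names and timings are invented for illustration and would come from a profiler in practice.

```python
from typing import Dict, Tuple


def pipeline_throughput(stage_ms_per_batch: Dict[str, float],
                        batch_size: int) -> Tuple[str, float]:
    """Return (slowest stage, items/sec) for a fully overlapped pipeline."""
    slowest = max(stage_ms_per_batch, key=stage_ms_per_batch.get)
    items_per_sec = batch_size / (stage_ms_per_batch[slowest] / 1000.0)
    return slowest, items_per_sec


def gpu_busy_fraction(stage_ms_per_batch: Dict[str, float],
                      gpu_stage: str = "gpu_infer") -> float:
    """Fraction of time the GPU stage is busy when the slowest stage paces the pipeline."""
    return stage_ms_per_batch[gpu_stage] / max(stage_ms_per_batch.values())


if __name__ == "__main__":
    # Hypothetical per-batch timings in milliseconds for a batch of 32.
    stages = {
        "cpu_preprocess": 120.0,
        "h2d_copy": 15.0,
        "gpu_infer": 40.0,
        "postprocess": 10.0,
    }
    stage, rate = pipeline_throughput(stages, batch_size=32)
    print(f"bottleneck: {stage}, about {rate:.0f} items/sec")
    print(f"GPU busy: {gpu_busy_fraction(stages):.0%}")
```

In this invented case, increasing the batch size would not help: CPU preprocessing paces the pipeline and the GPU sits idle about two thirds of the time, so the fix is more preprocessing workers or moving preprocessing onto the GPU.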
Connections
This area commonly leads to questions about model compression & distillation, feature store consistency for offline vs online inference, and autoscaling policies and cost allocation. Be prepared to pivot to feature-latency parity, model freshness, or experiment design for inference-quality validation.
Further reading
- NVIDIA Triton Inference Server: practical deployment patterns and batching features.
- MLPerf Inference: benchmarks and performance best practices for GPU inference.
- NVIDIA TensorRT: guidelines for precision reduction and kernel optimizations.