GPU And Batch Inference Operations

What's being tested

Interviewers probe whether you can design and reason about GPU-backed batch inference pipelines that meet real SLAs (throughput, latency, cost) while remaining operable at Reddit scale. They want to see both system-level tradeoffs (batching strategy, scheduling, data transfer) and ML-specific knobs (precision, model partitioning, warm-up) a Machine Learning Engineer would own when deploying models to GPU clusters.

Core knowledge

Batching tradeoff — Larger batches raise GPU throughput (images/sec) but also raise end-to-end latency, because each request waits for its batch to fill and then for the whole batch to finish. Quantify with throughput ≈ batch_size / per_batch_latency, and measure p50/p95/p99 latencies against SLOs.
GPU utilization vs latency SLOs — Target roughly 60–70% GPU utilization or higher for cost-efficiency; if the SLO requires low tail latency, accept lower utilization or use micro-batching / pre-warmed instances.
Precision reduction — Use FP16 or INT8 (via TensorRT/ONNX Runtime) to reduce memory and increase throughput; validate numeric stability and downstream metric drift after quantization.
Model optimization and serving tooling — NVIDIA TensorRT, ONNX Runtime, and TorchScript convert and optimize networks for inference, each with different graph-fusion and kernel advantages; NVIDIA Triton Inference Server serves the resulting models and adds dynamic batching and concurrent model execution.
Data transfer costs — Moving tensors CPU↔GPU over PCIe (or NVLink) is non-trivial; amortize by batching pre/post-processing on CPU and overlapping transfers with compute via CUDA streams and pinned host memory, or GDRCopy/GPUDirect where available.
Scheduling and orchestration — Use Kubernetes (with device plugins), Ray, or batch schedulers for large runs; pre-warm GPU pods to avoid cold-start latency and use node labeling for GPU type (e.g., A100, T4) matching workload.
Model parallelism patterns — Use data parallelism for independent inferences, and model sharding (pipeline or tensor parallelism) only when a single model's memory exceeds one GPU; sharding is the usual choice for very large LLMs, at the cost of extra inter-GPU communication via NCCL.
Throughput math & memory — GPU memory required ≈ model_weights + activation_size(batch) + workspace. Activation_size grows roughly linearly with batch; monitor OOM and tune batch_size accordingly.
Micro-batching and request coalescing — Implement a short queuing window (e.g., 5–50 ms) to coalesce requests into GPU-friendly batches; measure the effect on tail latency (p95/p99).
Failure modes and retries — Define idempotency at request level and backoff strategies; for long-running batch jobs, checkpoint progress and support chunked re-runs rather than all-or-nothing.
Monitoring & observability — Track p50/p95/p99 latencies, GPU utilization, memory pressure, batch size distribution, queue length, and cost per million inferences; correlate model output metrics to changes in precision or batching.
Cost optimization levers — Choose GPU type by performance-per-dollar (e.g., T4 for throughput-bound lightweight models, A100 for large transformers), leverage spot/preemptible instances with checkpointing only when acceptable for SLOs.
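
The micro-batching item above can be sketched in a few lines. This is a minimal illustration, not a production batcher: the function name collect_batch and its parameters are hypothetical, and a real deployment would more likely rely on the serving layer's built-in dynamic batching (Triton provides one) than on hand-rolled queue code.

```python
import queue
import time


def collect_batch(requests, max_batch_size, window_s):
    """Coalesce queued requests into one batch.

    Blocks until the first request arrives, then keeps collecting until
    either the batch is full or the queuing window has elapsed.
    """
    batch = [requests.get()]  # wait for the first request, however long it takes
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived inside the window
    return batch
```

The window bounds the extra latency any request pays for batching: the first request in a batch waits at most window_s before the batch is dispatched, which is why the window has to be budgeted against the p95/p99 target.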

Worked example — "Design a GPU batch inference pipeline for 1M images/day with 100 ms p95 latency"

Frame: start by clarifying the SLAs (exact p95 target, burstiness, availability), the data source (object store vs. streaming), the acceptable cost envelope, and the model size. Organize the answer into three pillars: (1) inference model optimization (quantization, graph fusion), (2) serving architecture (Triton on Kubernetes, pre-warmed pods, a request-coalescing micro-batcher), and (3) operational controls (autoscaling, monitoring, retry semantics). Quantify: 1M images/day ≈ 11.6 images/sec sustained; plan capacity for the burst peak (e.g., 10×, about 116 images/sec) and choose a batch_size that keeps the GPU at 60–80% utilization while meeting the 100 ms p95. That likely means micro-batches of 8–32 if per-image model inference is ~2–5 ms on a T4. Flag the tradeoff: a tighter p95 pushes you toward more instances and higher cost, while relaxing p95 allows larger batches and fewer GPUs. Close by noting validation: run A/B traffic, measure production accuracy drift after FP16/INT8 conversion, and, if time allows, propose automated batch-size tuning and integration with a cost-aware autoscaler.
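
The arithmetic in this answer is easy to script, which helps when an interviewer changes an input mid-discussion. The sketch below is a back-of-envelope model only: plan_capacity, worst_case_latency_s, and every numeric input (10× peak, batch of 16, 40 ms per batch, 70% utilization, 20 ms window) are illustrative assumptions, not measurements.

```python
import math

SECONDS_PER_DAY = 86_400


def plan_capacity(items_per_day, peak_factor, batch_size, batch_latency_s,
                  target_utilization):
    """Estimate how many GPUs are needed to absorb peak traffic."""
    sustained = items_per_day / SECONDS_PER_DAY
    peak = sustained * peak_factor
    # Throughput of one GPU if it ran batches back to back.
    per_gpu = batch_size / batch_latency_s
    # Only plan on the fraction of that we are willing to keep busy.
    usable = per_gpu * target_utilization
    return {
        "sustained_per_s": sustained,
        "peak_per_s": peak,
        "per_gpu_per_s": per_gpu,
        "gpus": max(1, math.ceil(peak / usable)),
    }


def worst_case_latency_s(window_s, batch_latency_s):
    """Rough bound: wait out the coalescing window, then run one batch.

    Ignores queueing behind earlier batches, transfers, and network time.
    """
    return window_s + batch_latency_s


# Assumed inputs: 1M images/day, 10x bursts, batch of 16 taking 40 ms.
plan = plan_capacity(1_000_000, 10, 16, 0.040, 0.70)
fits_slo = worst_case_latency_s(0.020, 0.040) <= 0.100
```

Under these assumed numbers a single GPU covers the peak and the latency bound leaves headroom against 100 ms, so in practice availability and burst shape, not raw throughput, would drive the instance count.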

A second angle — "Maximize throughput for nightly offline batch inference of 500M records"

Here latency SLOs are loose, so the focus shifts to throughput and cost. Use large batch sizes, multi-GPU data-parallel jobs, and possibly model sharding if a single model does not fit on one GPU. Key design points include checkpointing progress to survive preemptions (spot VMs), staging inputs as TFRecords/Parquet for efficient IO, and overlapping CPU preprocessing with GPU execution via pipelined workers. Use Ray or Spark for distributed orchestration, and Triton or a containerized custom runner per GPU for efficient scaling. Highlight validation: end-to-end throughput benchmarks, numeric equivalence after quantization, and downstream metric spot-checking.
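
Checkpointed, chunked execution is the piece of this design most worth being able to sketch on a whiteboard. The following is a minimal single-process illustration, assuming an in-memory list of records and a local JSON checkpoint file; run_in_chunks, process_chunk, and the checkpoint layout are hypothetical names, and a real job would checkpoint to durable storage and shard across workers.

```python
import json
import os


def run_in_chunks(records, chunk_size, process_chunk, checkpoint_path):
    """Process records chunk by chunk, resuming from the last checkpoint.

    The checkpoint stores the offset of the next unprocessed chunk, so a
    preempted job restarts there instead of from the beginning.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    for offset in range(start, len(records), chunk_size):
        process_chunk(records[offset:offset + chunk_size])
        # Write to a temp file and rename, so a crash mid-write never
        # leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_offset": offset + chunk_size}, f)
        os.replace(tmp_path, checkpoint_path)
```

A crash after process_chunk returns but before the checkpoint is written causes that chunk to run again on restart, so process_chunk needs to be idempotent (for example, by overwriting outputs keyed on the chunk offset).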

Common pitfalls

Pitfall: Ignoring transfer and CPU bottlenecks. A tempting design is to "throw big batch sizes at GPUs" without measuring CPU preprocessing or PCIe transfer times; the GPU can then sit starved for input, or the host can start swapping under memory pressure, collapsing throughput. Always profile the full pipeline.
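
One lightweight way to profile the full pipeline is to accumulate wall-clock time per stage and see which stage dominates. The helper below is a sketch with hypothetical names (StageTimer, stage, bottleneck). Note that GPU work is launched asynchronously, so when timing a real GPU stage the caller would need to synchronize the device before the timer stops; that step is assumed here, not shown.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds spent in each named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised.
            self.totals[name] += time.perf_counter() - start

    def bottleneck(self):
        """Return the name of the stage with the largest total time."""
        return max(self.totals, key=self.totals.get)
```

Wrapping each step of the loop (for example, with timer.stage("preprocess"): ...) and checking bottleneck() after a few hundred batches shows whether a larger batch size would help at all, or whether CPU preprocessing or transfer is the real limit.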

Pitfall: Not asking SLO/traffic-shape questions. Presenting a single-architecture answer without clarifying p95/p99, burstiness, or acceptable cost will sound incomplete; specify assumptions or ask them up front.

Pitfall: Over-optimizing one knob. Focusing only on precision (INT8) or only on orchestration (spot instances) without end-to-end validation risks numeric drift or brittle production behavior. Show balanced tradeoffs and monitoring plans.

Connections

This area commonly leads to questions about model compression & distillation, feature store consistency for offline vs online inference, and autoscaling policies and cost allocation. Be prepared to pivot to feature-latency parity, model freshness, or experiment design for inference-quality validation.
