RLHF And Preference Optimization Basics

What's being tested

You must demonstrate practical ownership of a production RLHF-style loop: building and operating the reward model and preference optimization components, integrating them into training and serving pipelines, and evaluating their reliability. Interviewers probe engineering choices around data collection, model training (scaling, reproducibility), safe optimization controls (KL penalties, constraints), and production monitoring/rollback. They want an ML Engineer who can translate research recipes into robust, observable, and maintainable pipelines that keep model behavior aligned with human preferences.

Core knowledge

Reward modeling (pairwise): the common formulation is a Bradley–Terry (softmax) model over scalar scores $s(\cdot)$: $P(a>b)=\frac{\exp(s(a))}{\exp(s(a))+\exp(s(b))}=\sigma(s(a)-s(b))$ ; the loss is $-\log P(a>b)$ where $a$ is the preferred response, i.e. binary cross-entropy on labeled pairwise comparisons.
Preference data pipeline: label schema, annotation UI, deduping, metadata (annotator id, prompt, temperature), storage in S3/datastore; include versioned dataset manifests and checksums for reproducibility.
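Policy optimization algorithm: production RLHF usually uses PPO with a frozen reference policy and a KL penalty or constraint. The objective to maximize is $L = \mathbb{E}\left[ \min\left(r_t A_t, \text{clip}(r_t,1-\varepsilon,1+\varepsilon) A_t\right)\right] - \lambda\, \text{KL}(\pi \,\|\, \pi_{\text{ref}})$ , where $r_t = \pi(a_t \mid s_t)/\pi_{\text{old}}(a_t \mid s_t)$ is the probability ratio, $A_t$ is the advantage estimate, $\varepsilon$ is the clip range, and $\lambda$ controls how far the policy may deviate from $\pi_{\text{ref}}$.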
Reward overfitting & regularization: guard with early stopping, held-out RM validation, mix supervised fine-tuning (SFT) data, and explicit KL/entropy penalties to prevent reward hacking.
Offline-to-online parity: simulate deployment via importance sampling / off-policy evaluation; track correlation (Spearman/Pearson) between RM scores and human utility; calibrate the RM (Platt scaling or isotonic regression) if its scores are misaligned with human judgments.
Scaling & compute: RM and policy training can require GPU fleets; use sharded checkpoints, mixed precision, and gradient accumulation to reach large effective batch sizes; use distributed training frameworks such as PyTorch DDP or DeepSpeed once training spans multiple GPUs or nodes.
Serving architecture: expose the RM as a low-latency gRPC/HTTP service, horizontally autoscaled on Kubernetes, with versioned endpoints and canary routing; cache RM scores for frequent (prompt, response) pairs to reduce p99 latency and cost.
Observability & metrics: log p99 latency, throughput, RM AUC/accuracy, calibration error, KL drift vs baseline policy, offline vs online correlation, and annotator-level disagreement distributions.
Data quality & label noise: track annotator consistency and inter-rater agreement (Cohen’s kappa for two raters, Fleiss’ kappa for more), and use label aggregation per pair (majority vote or Dawid–Skene) or per-annotator calibration weights when labels are noisy.
Reproducibility & checkpoints: store random seeds, optimizer state, dataset manifests, and use content-addressable model artifacts with immutable IDs for rollbacks.
Deployment safety controls: runtime KL guards, response filters, and human-in-the-loop escalation for model updates that exceed allowed divergence from reference.
Cost vs fidelity tradeoffs: cheaper RM (smaller model) for fast iteration, larger RM for final alignment; evaluate marginal benefit per GPU-hour; consider distillation for serving.
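
As an illustration of the pairwise reward-modeling item above, here is a minimal sketch of the Bradley–Terry loss on plain floats. The function names (`pair_loss`, `bt_prob`, `mean_pairwise_loss`) are hypothetical, and the scores stand in for the outputs of a real reward model. A production version would compute the same quantity on tensors inside the training framework.

```python
import math

def pair_loss(score_chosen: float, score_rejected: float) -> float:
    """-log P(chosen > rejected) under Bradley-Terry.

    Equals softplus(-(s_chosen - s_rejected)), written in a form that
    avoids overflow for large score differences.
    """
    d = score_chosen - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def bt_prob(score_chosen: float, score_rejected: float) -> float:
    """P(chosen > rejected) = sigmoid(s_chosen - s_rejected)."""
    return math.exp(-pair_loss(score_chosen, score_rejected))

def mean_pairwise_loss(pairs) -> float:
    """Mean loss over an iterable of (score_chosen, score_rejected) tuples."""
    pairs = list(pairs)
    return sum(pair_loss(c, r) for c, r in pairs) / len(pairs)
```

When the two scores are equal, the model assigns probability 0.5 and the loss is $\log 2$; the loss falls toward 0 as the chosen response's score pulls ahead of the rejected one.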

Worked example — designing a production RLHF training + serving pipeline

Frame the problem: clarify expected scale (queries/sec), annotation cadence, latency SLOs, and whether preference labels are pairwise or scalar, and state the defaults you will assume. Organize the answer into three pillars:
(1) Data ingestion & labeling: collect pairwise annotations, store versioned manifests, and attach annotator metadata and QC checks.
(2) Model training: train the reward model with pairwise cross-entropy, hold out a validation set, checkpoint artifacts to S3 with reproducible metadata, then run policy optimization via PPO with a KL penalty to a frozen reference policy.
(3) Serving & monitoring: deploy the RM as a versioned gRPC service with canary routing and runtime KL checks, and surface metrics to dashboards.
Flag a key tradeoff: heavier KL regularization reduces reward hacking but may under-deliver on preference gains; tune $\lambda$ using offline correlation between RM scores and human judgments, and validate it with conservative canary rollouts. Close by saying what you'd do with more time: implement offline policy evaluation (importance sampling) to estimate online uplift, add annotator-trust weighting, and build automated rollbacks tied to degradation thresholds.
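
To make the "PPO with KL penalty to a frozen reference policy" step concrete, the following is a minimal sketch of the per-batch objective on plain Python lists. It assumes per-sample log-probabilities under the current, old (sampling), and reference policies are already available. The function name and default values for `eps` and `lam` are illustrative, and the KL term is a simple sample-based estimate (mean of $\log\pi - \log\pi_{\text{ref}}$ on the sampled actions), not an exact divergence.

```python
import math

def ppo_kl_objective(logp_new, logp_old, logp_ref, advantages,
                     eps=0.2, lam=0.1):
    """Clipped PPO surrogate minus a KL penalty (a quantity to maximize).

    All arguments are equal-length sequences of floats, one per sampled action.
    """
    n = len(advantages)
    surrogate = 0.0
    kl_estimate = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        surrogate += min(ratio * adv, clipped * adv)   # pessimistic bound
        kl_estimate += ln - lr                         # sample KL estimate (assumption)
    return surrogate / n - lam * kl_estimate / n
```

Raising `lam` pulls the policy back toward the reference, which is the lever behind the tradeoff described above: less room for reward hacking, but also smaller preference gains.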

A second angle — evaluating and operating the reward model as a production service

Here, focus on RM reliability and drift detection. Start by validating the RM offline: AUC on held-out pairs, calibration, and rank correlation with held-out human scores. For deployment, add online checks: track the correlation between RM scores and sampled human labels, and monitor distribution shift in both RM outputs and input prompts (for example, KL divergence between score histograms or drift in the population of prompt embeddings). Implement automated alerts for reduced human-RM correlation, rising annotator disagreement, or sudden RM score shifts. Explain the engineering tradeoff: frequent human labeling improves detection but costs money, so use representative, stratified sampling to maximize catch rate per label-dollar.
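
The human-RM correlation check described above can be sketched with a small Spearman implementation in pure Python. The `correlation_alert` helper and its `threshold` default are hypothetical; a real deployment would pick the threshold from historical data and would likely use a library routine for the correlation itself.

```python
import math

def ranks(values):
    """1-based ranks, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def correlation_alert(rm_scores, human_scores, threshold=0.5):
    """Return (rho, should_alert) for a sampled batch of labeled responses."""
    rho = spearman(rm_scores, human_scores)
    return rho, rho < threshold
```

Running this on each freshly labeled sample batch gives a single number to chart over time, and the boolean can feed the alerting pipeline.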

Common pitfalls

Pitfall: Assuming RM accuracy implies aligned behavior.
High RM AUC doesn't prevent reward hacking; policies can exploit spurious patterns the RM has learned. Test the RM and the optimized policy with adversarial prompts, and constrain policy updates with KL/entropy penalties.

Pitfall: Ignoring annotation metadata.
Treating labels as IID loses signal: annotator bias and prompt clusters matter. Aggregate with an annotator model or weight labels by annotator reliability.
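
One simple way to weight labels by annotator reliability is sketched below. This is a deliberately simplified one-pass heuristic (agreement with the unweighted majority, with Laplace smoothing), not Dawid–Skene; the function names and the `(pair_id, annotator_id, label)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict

def annotator_reliability(votes):
    """Smoothed rate at which each annotator agrees with the majority.

    votes: iterable of (pair_id, annotator_id, label), label 1 if response A
    was preferred and 0 otherwise. Pairs with a tied vote are skipped.
    Note that the majority includes the annotator's own vote.
    """
    by_pair = defaultdict(list)
    for pair_id, annotator, label in votes:
        by_pair[pair_id].append((annotator, label))
    agree, total = defaultdict(int), defaultdict(int)
    for items in by_pair.values():
        ones = sum(label for _, label in items)
        if 2 * ones == len(items):
            continue  # tie: no majority to compare against
        majority = 1 if 2 * ones > len(items) else 0
        for annotator, label in items:
            total[annotator] += 1
            agree[annotator] += int(label == majority)
    # Laplace smoothing keeps weights away from exactly 0 or 1.
    return {a: (agree[a] + 1) / (total[a] + 2) for a in total}

def weighted_labels(votes, reliability, default_weight=0.5):
    """Aggregate each pair's label by reliability-weighted vote."""
    score = defaultdict(float)
    for pair_id, annotator, label in votes:
        w = reliability.get(annotator, default_weight)
        score[pair_id] += w if label == 1 else -w
    return {pair_id: 1 if s > 0 else 0 for pair_id, s in score.items()}
```

With this weighting, a pair labeled by one consistently reliable and one consistently unreliable annotator resolves toward the reliable one instead of ending in a tie.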

Pitfall: Over-optimizing offline metrics without rollout safety.
Deploying a policy optimized solely for RM score without canaries or runtime KL constraints can produce regressions in user-facing quality and safety.
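
A canary gate of the kind described here might look like the sketch below. Everything in it is hypothetical: the `CanaryStats` fields, the threshold defaults, and the three-way decision are placeholders to show the shape of the check, and real limits would come from the team's own SLOs and historical baselines.

```python
from dataclasses import dataclass

@dataclass
class CanaryStats:
    mean_kl_to_ref: float    # average per-response KL estimate vs. the reference policy
    human_win_rate: float    # fraction of sampled comparisons won against the baseline
    safety_flag_rate: float  # fraction of responses caught by response filters

def rollout_decision(stats: CanaryStats,
                     max_kl: float = 10.0,
                     min_win_rate: float = 0.5,
                     max_flag_rate: float = 0.01) -> str:
    """Return "rollback", "hold", or "promote" for a canary policy version."""
    # Hard limits: excessive divergence or safety regressions trigger rollback.
    if stats.mean_kl_to_ref > max_kl or stats.safety_flag_rate > max_flag_rate:
        return "rollback"
    # Within limits but not beating the baseline: keep at canary traffic.
    if stats.human_win_rate < min_win_rate:
        return "hold"
    return "promote"
```

Separating hard limits (rollback) from soft ones (hold) keeps a merely unimpressive candidate from being treated the same as an unsafe one.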

Connections

MLEs should be ready to pivot to adjacent topics: fine-tuning & parameter-efficient methods (e.g., LoRA/PEFT) for faster iterations, and offline policy evaluation / importance sampling techniques for safer rollouts. Also expect intersections with dataset engineering (annotation tooling) and SRE (service SLIs/SLAs and autoscaling).