Autonomy Data Engine and Active Learning

What's being tested

Interviewers are probing your ability to design and operate an active learning loop inside an autonomy data engine: selecting informative driving data, integrating human annotation, retraining models reliably, and measuring real-world improvement under cost and safety constraints. They want to see technical judgment on uncertainty estimation, sample selection vs. representativeness, pipeline reproducibility, online/offline parity, and evaluation metrics that matter for safety-critical perception and planning models.

Core knowledge

Active learning loop components: model scoring (acquisition), selection policy, annotation queue, retraining, and evaluation; version datasets deterministically for reproducibility (e.g., DVC) and track experiments alongside them (e.g., MLflow).
Acquisition functions: uncertainty sampling (entropy $H = -\sum_i p_i \log p_i$), margin sampling ($p_1 - p_2$, the gap between the two most probable classes; a smaller gap means a more ambiguous sample), BALD (mutual information $I[y; w \mid x, D] = H[y \mid x, D] - \mathbb{E}_{p(w \mid D)}[H[y \mid x, w]]$), and query-by-committee; scoring cost scales with the number of ensemble members or MC-dropout passes.
Diversity/coverage strategies: use k-Center (coreset), clustering (k-means on embeddings), or determinantal point processes to avoid redundant selections; scale to $N \approx 10^7$ samples with approximate nearest-neighbor search (FAISS) and subsampling.
Rare-event / tail sampling: combine importance weighting and stratified sampling; maintain per-class sampling quotas for safety-critical labels (e.g., pedestrians at night) instead of naive top-k uncertainty.
Labeling cost model: estimate expected utility per sample as $\Delta\text{metric} / \text{labeling cost}$; prioritize samples with high expected model change per unit cost, and account for annotation latency and annotator disagreement.
Calibration & uncertainty: models are often miscalibrated; apply temperature scaling or Platt scaling, and check reliability diagrams and ECE (expected calibration error) before trusting uncertainty scores for selection.
Evaluation metrics: for perception, report standard detection metrics (mAP, IoU) but emphasize safety-first metrics such as false-negative rate, recall at critical distance, and scenario-level metrics (e.g., missed critical events per 1000 km).
Offline-to-online parity: replay the selection policy on a historical stream using logged model outputs (counterfactual logging), and quantify the distribution shift between training and deployed data (population stability index (PSI), KL divergence).
Drift monitoring: monitor feature distribution drift, confidence shift, and label distribution drift; trigger re-selection or retraining when drift crosses thresholds.
Human annotation workflows: pre-label with model outputs to reduce annotator time; use hierarchical labeling (coarse → fine) and consensus labeling with arbitration for low-agreement items; track inter-annotator agreement (Cohen's kappa for two annotators, Fleiss' kappa for more).
Dataset governance: maintain immutable dataset snapshots, provenance, and schema; ensure experiment tracking logs model seeds, acquisition seeds, and selection criteria for auditability.
Continuous training cadence: decide between batch retraining (periodic full retrain) and incremental updates (warm-start fine-tuning or online learning) based on model architecture and data volume (e.g., retraining from scratch vs. fine-tuning the previous PyTorch checkpoint on newly labeled data).
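
The acquisition functions listed above can be sketched in a few lines. This is a minimal, pure-Python illustration rather than a production scorer: it assumes each input comes with a class-probability vector, and for BALD a list of T such vectors from ensemble members or MC-dropout passes. The function names and the toy inputs are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the two largest class probabilities; small means ambiguous."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def bald(mc_probs):
    """BALD score from T stochastic forward passes on the same input.

    mc_probs: list of T probability vectors (assumed input format).
    Returns H[mean prediction] - mean(H[each prediction]).
    """
    t = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(p[c] for p in mc_probs) / t for c in range(n_classes)]
    expected_entropy = sum(entropy(p) for p in mc_probs) / t
    return entropy(mean_probs) - expected_entropy

# Toy inputs: both cases have the same mean prediction [0.5, 0.5],
# but only the second shows disagreement between passes.
agree = [[0.5, 0.5], [0.5, 0.5]]
disagree = [[0.99, 0.01], [0.01, 0.99]]
bald_agree = bald(agree)        # passes agree: no epistemic signal
bald_disagree = bald(disagree)  # passes disagree: high epistemic signal
```

Entropy alone would score both toy inputs identically, since it only sees the mean prediction; BALD separates them because it subtracts the average per-pass entropy. That distinction is the usual argument for paying the extra scoring cost.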

Worked example — "Design an active learning pipeline to collect edge-case driving data for perception models"

First 30 seconds: clarify the safety constraints (which false-negative rates are critical?), the labeling budget and latency, and whether the models produce calibrated probabilities. Frame the answer around four pillars: (1) scoring and acquisition, (2) selection that balances uncertainty against coverage, (3) annotation workflow and QA, and (4) the retrain, evaluation, and monitoring loop. Propose an ensemble or MC-dropout to produce uncertainty estimates, then an acquisition step that combines BALD for epistemic uncertainty with a diversity sampler (k-means on embeddings, with FAISS) to avoid near-duplicates. Describe a cost-aware scheduler that ranks samples by utility per unit of labeling time and enforces quotas for predefined rare scenarios. Flag the tradeoff: ensembles and MC passes are expensive at scoring time, so mitigate with two-stage filtering (a cheap heuristic first, then the expensive uncertainty estimate on the surviving candidates). Close with next steps: A/B test the retraining cadence, replay selection on logged data to estimate the gain, and add calibration checks and QA metrics (label agreement) to validate annotations.
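
A rough sketch of the two-stage idea follows: keep the most uncertain candidates, then diversify among them. Note the substitution: this uses greedy k-center (the coreset strategy from the core-knowledge list) in place of k-means with FAISS, because it fits in a few lines of standard-library Python. The function names, scores, and 2-D embeddings are all hypothetical.

```python
import math

def k_center_greedy(points, k, start=0):
    """Greedy k-center: repeatedly add the point farthest from all chosen ones.

    Assumes points is non-empty and k >= 1.
    """
    chosen = [start]
    # Distance from every point to its nearest chosen center so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        remaining = [i for i in range(len(points)) if i not in chosen]
        nxt = max(remaining, key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen

def two_stage_select(scores, embeddings, n_candidates, k):
    """Stage 1: top-n_candidates by uncertainty. Stage 2: k-center for diversity."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    candidates = ranked[:n_candidates]
    # start=0 seeds the diversity step with the most uncertain candidate.
    local = k_center_greedy([embeddings[i] for i in candidates], k, start=0)
    return [candidates[j] for j in local]

# Frames 0 and 1 are near-duplicates; frame 2 is far away in embedding space.
scores = [0.90, 0.89, 0.50, 0.10]
embeddings = [(0.0, 0.0), (0.01, 0.0), (5.0, 5.0), (0.0, 0.02)]
selected = two_stage_select(scores, embeddings, n_candidates=3, k=2)
```

Plain top-2 by score would pick the near-duplicate frames 0 and 1; the diversity stage swaps frame 1 for frame 2. At fleet scale the exact distance loop would be replaced by an approximate nearest-neighbor index.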

A second angle — "How to prioritize labeling under a strict budget for long-tail weather/night driving?"

Here the framing shifts from maximal information gain to constrained budget allocation and risk management. Use stratified sampling: partition by metadata (time of day, weather) and allocate budget in proportion to scenario criticality and estimated model error rate. Within each stratum, run targeted uncertainty sampling plus a diversity step to get samples that are both representative and informative. Apply importance weighting during training to correct the sampling bias this introduces. Emphasize measurable objectives: define utility functions ($\Delta\text{recall}$ in the stratum per label), maintain a per-stratum minimum coverage, and monitor downstream safety metrics rather than just loss reduction.
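
One way to make the allocation concrete is sketched below. It assumes a priority of criticality times estimated error rate per stratum (one reasonable reading of "proportional to", not the only one), a fixed per-stratum floor, and importance weights defined as population share divided by labeled share. The strata, numbers, and function names are hypothetical.

```python
def allocate_budget(strata, total_budget, min_per_stratum):
    """Split a label budget across strata.

    strata: {name: (criticality, estimated_error_rate)} (assumed format).
    Each stratum gets the floor, then a share proportional to
    criticality * error rate. Assumes at least one priority is positive.
    """
    remaining = total_budget - min_per_stratum * len(strata)
    if remaining < 0:
        raise ValueError("budget cannot cover the per-stratum minimum")
    priority = {s: crit * err for s, (crit, err) in strata.items()}
    total_priority = sum(priority.values())
    alloc = {s: min_per_stratum + int(remaining * p / total_priority)
             for s, p in priority.items()}
    # Rounding down can leave a few labels unassigned; give them to the
    # highest-priority stratum.
    leftover = total_budget - sum(alloc.values())
    alloc[max(priority, key=priority.get)] += leftover
    return alloc

def importance_weights(population_counts, labeled_counts):
    """Per-stratum training weight = population share / labeled share."""
    pop_total = sum(population_counts.values())
    lab_total = sum(labeled_counts.values())
    return {s: (population_counts[s] / pop_total) / (labeled_counts[s] / lab_total)
            for s in labeled_counts if labeled_counts[s] > 0}

strata = {"day_clear": (1.0, 0.0625), "night_rain": (3.0, 0.25)}
alloc = allocate_budget(strata, total_budget=1000, min_per_stratum=100)
weights = importance_weights({"day_clear": 9000, "night_rain": 1000}, alloc)
```

The rare stratum receives most of the budget, so its samples are down-weighted at training time while the common stratum is up-weighted. Safety metrics should still be reported per stratum, since a weighted average can hide a regression in the tail.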

Common pitfalls

Pitfall: Relying solely on raw softmax probabilities for uncertainty. Softmax overconfidence leads to poor acquisition; always calibrate or use Bayesian/ensemble approaches.
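
A quick calibration check before trusting confidence scores might look like the sketch below. It computes ECE with equal-width bins; the bin count, the half-open binning convention, and the toy predictions are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin.

    confidences: top-class probability per prediction, in [0, 1].
    correct: whether each prediction was right.
    Bins are [lo, hi) here, with confidence 1.0 placed in the last bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Overconfident model: claims 0.95 but is right only half the time.
ece_overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False])
# Calibrated model: claims 0.75 and is right three times out of four.
ece_calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False])
```

A large gap like the first case suggests that ranking by raw confidence would be unreliable, and that calibration or an ensemble-based score is worth the effort before selection.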

Pitfall: Optimizing for acquisition score without preserving dataset representativeness. This creates blind spots; always mix uncertainty-selected samples with random/stratified samples.
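
A simple guard is to reserve part of every labeling batch for uniformly random samples. The sketch below assumes a 20% random share and a seeded generator so the selection can be reproduced and audited; both choices, and the function name, are illustrative.

```python
import random

def mixed_batch(scores, batch_size, random_fraction=0.2, seed=0):
    """Return (top_indices, random_indices) for one labeling batch.

    Most of the batch is the highest-scoring samples; the rest is drawn
    uniformly from everything else to preserve representativeness.
    """
    rng = random.Random(seed)  # log this seed with the dataset snapshot
    n_random = int(batch_size * random_fraction)
    n_top = batch_size - n_random
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n_top]
    rest = ranked[n_top:]
    rand = rng.sample(rest, min(n_random, len(rest)))
    return top, rand

scores = [i / 100 for i in range(100)]
top, rand = mixed_batch(scores, batch_size=10, random_fraction=0.2, seed=0)
```

Keeping the two index lists separate lets the random slice double as an unbiased evaluation sample, which helps estimate whether the uncertainty-selected slice is actually improving the model.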

Pitfall: Treating labeling as a black box. Ignoring annotation latency, disagreement rates, and pre-label quality underestimates real-world turnaround and harms retraining cadence.

Connections

Interviewers may pivot to continual learning / online adaptation (catastrophic forgetting, replay buffers) or to dataset shift detection and remediation (covariate vs label shift). They may also ask about scaling selection at production rates (approximate nearest neighbors, sharding) or instrumentation for safety dashboards.
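
If the conversation does pivot to shift detection, the population stability index is easy to sketch. The version below assumes fixed bin edges chosen on the training data, clamps out-of-range values into the end bins, and floors empty bins at a small epsilon to keep the logarithm finite; the edges and the toy confidence values are hypothetical, and alert thresholds are application-specific.

```python
import bisect
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of one scalar feature.

    edges: sorted bin edges, e.g. [0.0, 0.25, 0.5, 0.75, 1.0].
    PSI = sum over bins of (a - e) * ln(a / e), with e and a the bin fractions.
    """
    n_bins = len(edges) - 1

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = bisect.bisect_right(edges, v) - 1
            idx = min(max(idx, 0), n_bins - 1)  # clamp into the end bins
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
train_conf = [0.1, 0.3, 0.5, 0.7, 0.9] * 20   # confidences seen at training time
deployed_conf = [0.1] * 60 + [0.9] * 40       # confidences polarized in deployment
psi_same = psi(train_conf, train_conf, edges)
psi_shifted = psi(train_conf, deployed_conf, edges)
```

Every term in the sum is non-negative, so the index is zero only when the binned distributions match. In a monitoring job the same function would typically run per feature and per time window, with the result feeding the retraining trigger.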