Autonomous Driving Perception Models

What's being tested

Interviewers are probing your ability to design, train, evaluate, and operate production-grade perception models for autonomous driving under real-world constraints: latency, safety margins, class imbalance, domain shift, and continual data drift. They'll assess whether you can translate a functional requirement (detect/predict/segment) into a reproducible training pipeline, meaningful evaluation metrics, robust deployment strategy, and monitoring/rollback controls consistent with a large-scale fleet. At Tesla, this maps to delivering models that are accurate in the lab and reliable in production under strict latency and safety SLAs.

Core knowledge

Perception task taxonomy: know differences between object detection, semantic segmentation, instance segmentation, and tracking; each has distinct labels, losses, and evaluation metrics like mAP, IoU, and CLEAR MOT.
Evaluation metrics & tradeoffs: compute precision, recall, and $F_1 = \frac{2 \cdot \text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$; prioritize recall for safety-critical classes, and measure p95/p99 latency and per-class false-negative rates.
Labeling & class imbalance: rare-object classes need dedicated strategies: oversampling, reweighted losses (e.g., class-weighted cross-entropy or focal loss), and targeted data collection; expect heavy-tailed distributions with millions of background examples vs. thousands of rare positives.
Losses & calibration: use focal loss or weighted cross-entropy for imbalance; enforce probabilistic calibration (temperature scaling) so output confidences map to observed frequencies for downstream decision-making.
Data augmentation & sim2real: apply geometric, photometric, and domain-randomization augmentations; for simulation-to-reality (sim2real) use domain adaptation techniques like adversarial feature alignment or self-supervised pseudo-labeling.
Training infra & scaling: distributed data-parallel training with PyTorch or TensorFlow over NCCL typically scales to roughly 100 GPUs; beyond that, shard the dataset across workers and use gradient accumulation and mixed precision (AMP) to reach larger effective batch sizes, while checking that the larger batches do not hurt generalization.
Model compression & runtime: plan quantization-aware training, pruning, and distillation for deployment; export to ONNX, then optimize with TensorRT (serving through NVIDIA Triton Inference Server where a serving layer is used) to meet latency targets under limited compute.
Online/offline parity & shadow testing: validate offline metrics against shadow-deployed models running on real inputs; measure distributional shift and prediction delta before any rollout.
Monitoring & drift detection: track input-distribution stats, per-class TPR/FPR, calibration drift, and model confidence histograms; use population stability index (PSI) or KL divergence thresholds to trigger retraining.
Versioning & rollout: use a model registry with metadata (dataset hash, seed, commit), and deploy with a canary or phased rollout; maintain reproducible TFRecord/Parquet manifests and deterministic data splits.
Safety & degradation strategies: define graceful fallback policies (e.g., reduce automation level) for low-confidence or detected distributional anomalies; quantify end-to-end fallbacks' impact on system safety metrics.
Label noise & QA: anticipate 1–5% label noise; detect via loss-based outlier mining, model disagreement ensembles, and human-in-the-loop relabeling prioritization.
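
The focal loss named in the list above can be sketched for the binary case as follows. This is a minimal illustration, not a production loss: the function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are commonly quoted starting points assumed here, not values taken from this guide.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single prediction.

    p: predicted probability of the positive class; y: label (0 or 1).
    gamma down-weights easy examples; alpha balances positives vs. negatives.
    """
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of (probability, label) pairs."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# A confidently rejected background example vs. a missed rare positive.
easy_background = binary_focal_loss(0.05, 0)
hard_positive = binary_focal_loss(0.05, 1)
```

With these inputs the missed positive contributes a loss several orders of magnitude larger than the easy background example, which is the behavior that keeps millions of easy negatives from dominating the gradient.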

Worked example — "Design an object-detection model pipeline for urban driving"

Start by clarifying scope: which sensors count as inputs (camera only vs. multisensor), the latency budget in milliseconds, the target classes and their failure-cost hierarchy (pedestrian > cyclist > vehicle), and whether 2D or 3D bounding boxes are needed. Organize your answer around four pillars: data curation and augmentation strategy; model architecture and loss choices; offline evaluation with safety-focused metrics; and the deployment and monitoring plan. For model choice, justify a lighter backbone (e.g., efficient ResNet/MobileNet variants) against a heavier Transformer backbone based on the latency budget and target hardware. For imbalance, propose focal loss plus targeted rare-class collection and synthetic augmentation. Flag one tradeoff explicitly: pushing pedestrian recall higher tends to raise false positives and therefore downstream braking activations, so discuss threshold tuning and coordination with the downstream modules. Close by describing rollout: shadow evaluation on the fleet, a phased canary, and automated rollback triggers based on TPR and latency breaches. Then say, "with more time I'd instrument per-scenario metrics (night/rain/intersection) and implement online active learning to capture new edge cases."
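
The recall vs. false-positive tradeoff can be made concrete with a small threshold sweep. The sketch below assumes detections have already been matched to ground truth (for example by IoU), so each score carries a 0/1 label; the function names and the toy scores are hypothetical.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall, and F1 when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def highest_threshold_meeting_recall(scores, labels, target_recall):
    """Largest candidate threshold whose recall reaches target_recall, else None.

    Recall can only grow as the threshold drops, so the first hit while
    scanning from high to low is the highest qualifying threshold.
    """
    for t in sorted(set(scores), reverse=True):
        _, recall, _ = precision_recall_f1(scores, labels, t)
        if recall >= target_recall:
            return t
    return None

# Toy pedestrian scores (hypothetical): label 1 = real pedestrian, 0 = false alarm.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0]

t_75 = highest_threshold_meeting_recall(scores, labels, 0.75)
t_100 = highest_threshold_meeting_recall(scores, labels, 1.0)
p_75, r_75, f1_75 = precision_recall_f1(scores, labels, t_75)
p_100, r_100, f1_100 = precision_recall_f1(scores, labels, t_100)
```

On this toy data, demanding full recall moves the threshold from 0.6 down to 0.3 and lowers precision from 0.75 to about 0.67, which is the kind of number to bring to the planning team when negotiating braking behavior.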

A second angle — "Handle domain shift between simulation and real-world camera data"

The same pipeline priorities apply but emphasize domain adaptation and validation design. Start with a sim-to-real gap analysis: compare color distributions, noise, and occlusion statistics; instrument per-channel covariate shift metrics. Propose adaptation: combine large simulated labeled sets with smaller real unlabeled sets using unsupervised domain-adversarial training or self-training (pseudo-labeling) with confidence filtering. In deployment, stress the need for rigorous shadow validation on real-world logs and per-weather/per-time-of-day sub-cohort evaluation. The main tradeoff is simulation scale vs. adaptation complexity: large sim data reduces collection costs but requires stronger adaptation to avoid overfitting simulation artifacts.
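
The confidence-filtering step of self-training can be sketched as below. This is an illustration under assumptions: the data layout, the function name, and the 0.9 cutoff are hypothetical, and in practice the cutoff would be tuned per class on a labeled real-world validation set.

```python
def select_pseudo_labels(predictions, min_confidence=0.9):
    """Keep only unlabeled samples whose top class probability is high enough.

    predictions: list of (sample_id, {class_name: probability}) pairs produced
    by the current model on unlabeled real-world data.
    Returns a list of (sample_id, pseudo_label, confidence) tuples.
    """
    selected = []
    for sample_id, class_probs in predictions:
        label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
        if confidence >= min_confidence:
            selected.append((sample_id, label, confidence))
    return selected

# Hypothetical model outputs on three unlabeled real frames.
unlabeled_predictions = [
    ("frame_001", {"pedestrian": 0.97, "cyclist": 0.02, "vehicle": 0.01}),
    ("frame_002", {"pedestrian": 0.55, "cyclist": 0.40, "vehicle": 0.05}),
    ("frame_003", {"pedestrian": 0.03, "cyclist": 0.05, "vehicle": 0.92}),
]

pseudo_labeled = select_pseudo_labels(unlabeled_predictions)
```

Only the confident frames are added to the next training round; the ambiguous one is left out so that a likely wrong label does not get reinforced.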

Common pitfalls

Pitfall: Optimizing only global metrics like mAP while ignoring per-class safety requirements.
Interviewers will mark down candidates who don't propose class-prioritized metrics or thresholds; always present per-class TPR/FNR and scenario-specific slices (night, occlusion).
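
A per-class, per-slice breakdown of this kind could be computed roughly as follows. The record layout (one dict per ground-truth object with a `detected` flag) and the function name are assumptions made for the example.

```python
from collections import defaultdict

def sliced_tpr_fnr(records):
    """Compute TPR and FNR per (class, slice).

    records: iterable of dicts with keys 'cls', 'slice', and 'detected' (bool),
    one record per ground-truth object.
    Returns {(cls, slice): (tpr, fnr, n_objects)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [detected, total]
    for r in records:
        key = (r["cls"], r["slice"])
        counts[key][0] += int(r["detected"])
        counts[key][1] += 1
    return {k: (d / n, (n - d) / n, n) for k, (d, n) in counts.items()}

# Hypothetical evaluation log: pedestrians are found by day but missed at night.
records = (
    [{"cls": "pedestrian", "slice": "day", "detected": True} for _ in range(4)]
    + [{"cls": "pedestrian", "slice": "night", "detected": d}
       for d in (True, True, False, False)]
)

by_slice = sliced_tpr_fnr(records)
overall_tpr = sum(r["detected"] for r in records) / len(records)
```

Here the overall pedestrian TPR of 0.75 hides a night-time TPR of 0.5, which is the failure mode a single aggregate metric would not reveal.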

Pitfall: Treating deployment as “model export” only.
A common wrong answer omits monitoring, shadow tests, calibration checks, and rollback criteria — describe the full lifecycle from training data version to production telemetry.
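
One piece of that production telemetry, the population stability index (PSI) drift check, can be sketched as below. The function names are hypothetical, and the 0.2 trigger is a widely used rule of thumb assumed here rather than a value from this guide.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two histograms over the same bins.

    expected_counts: bin counts from the reference (training-time) data.
    actual_counts: bin counts from recent production data.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # floor avoids log(0) for empty bins
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

def should_flag_drift(expected_counts, actual_counts, threshold=0.2):
    """True when PSI exceeds the (assumed) alerting threshold."""
    return psi(expected_counts, actual_counts) > threshold

# Hypothetical confidence histograms: reference vs. a shifted production week.
reference = [50, 30, 15, 5]
stable_week = [50, 30, 15, 5]
shifted_week = [10, 20, 30, 40]

stable_flag = should_flag_drift(reference, stable_week)
shifted_flag = should_flag_drift(reference, shifted_week)
```

The same function applies to any binned signal (image brightness, detection counts per frame, confidence scores), so one alerting path can cover several of the monitored statistics.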

Pitfall: Assuming more data always fixes edge cases.
Collecting uncurated data can reinforce label noise or spurious correlations; propose targeted data collection, active learning, and quality thresholds rather than indiscriminate scaling.
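
As one possible form of the active-learning step, uncertainty sampling sends only the frames the model is least sure about to labeling. The sketch assumes per-frame class probabilities are available; the function names and data are hypothetical, and entropy is just one of several reasonable acquisition scores.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_for_labeling(candidates, budget):
    """Return ids of the `budget` samples with the highest predictive entropy.

    candidates: list of (sample_id, class_probabilities) pairs.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:budget]]

# Hypothetical fleet frames with the model's class probabilities.
candidates = [
    ("a", [0.98, 0.01, 0.01]),  # confident: little to learn from labeling it
    ("b", [0.40, 0.35, 0.25]),  # very uncertain
    ("c", [0.60, 0.30, 0.10]),  # moderately uncertain
]

to_label = pick_for_labeling(candidates, budget=2)
```

With a budget of two, the confident frame is skipped and the labeling effort goes to the two uncertain ones.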

Connections

Interviewers may pivot to sensor fusion (how perception models consume camera + radar + lidar), motion prediction and planning integration (how perception confidences feed downstream), or data platform topics like annotation pipelines and feature stores for learned modules.