Computer Vision For Google MLE

What's being tested

Interviewers probe whether you can take a computer-vision model from research to production: choose appropriate architectures, build robust training and labeling pipelines, satisfy latency/throughput SLOs, and operate reliable monitoring and rollouts. They want evidence you can reason about tradeoffs (accuracy vs latency, cost vs quality), validate offline metrics that predict online behavior, and design deployment/monitoring that prevents silent failure in the wild. For a Google MLE, emphasis is on productionization: reproducible pipelines, model serving at scale, offline/online parity, and clear instrumentation for drift and slice-level performance.

Core knowledge

Evaluation metrics: know mean Average Precision (mAP) for detection, the Intersection over Union (IoU) formula $\mathrm{IoU}=\frac{\mathrm{Area}_{\mathrm{intersection}}}{\mathrm{Area}_{\mathrm{union}}}$, mIoU for segmentation, and precision/recall/F1 tradeoffs across thresholds.
Offline→online parity: identical preprocessing, deterministic transforms, and seed-handling; mismatches (e.g., different image resizing or normalization) explain why offline gains don’t replicate online.
Model families & tradeoffs: single-stage detectors (YOLO, SSD) prioritize speed, typically at some cost in mAP; two-stage detectors (e.g., Faster R-CNN, available in frameworks such as Detectron2) prioritize accuracy at higher latency; choose by SLO and hardware.
Latency optimizations: quantization (post-training int8 quantization shrinks float32 weights by roughly 4x), pruning, operator fusion, and optimized runtimes such as TensorRT and TFLite. Quantization can reduce accuracy and shift confidence calibration, so validate on a holdout set.
Model compression approaches: knowledge distillation transfers a large teacher's behavior to a smaller student; pruning reduces parameters and nominal FLOPs, but unstructured sparsity rarely yields real speedups unless the hardware or runtime supports sparse kernels, so prefer structured pruning (removing whole channels or filters) when latency is the goal.
Training scale & infra: distributed data-parallel training for large models (e.g., >100M parameters) or large datasets; synchronous SGD with learning-rate warmup, a decay schedule (e.g., cosine), and an optimizer such as AdamW for stability; checkpointing and reproducible hyperparameters are essential.
Data & annotation quality: label noise, inconsistent bounding-box policies, and class imbalance dominate error; use annotation guidelines, consensus labeling, and active learning to prioritize labeling budget.
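Monitoring & drift detection: monitor input-feature distributions (KL divergence, PSI), per-slice metrics, calibration (Expected Calibration Error), and online business metrics. Alert on large PSI shifts (a common rule of thumb treats values above roughly 0.2 as significant) and on mAP degradation measured on freshly labeled samples, since mAP cannot be computed online without ground truth.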
Serving patterns: edge vs cloud tradeoffs — TFLite or on-device models reduce network cost and privacy exposure but limit model size; cloud serving enables larger models and batching, using TensorFlow Serving/custom gRPC endpoints with autoscaling.
Throughput vs latency: batching increases throughput but raises tail latency (p99); design dynamic batching with latency budgets and max-batch-size limits. Always quantify both average and tail latencies.
Rollout & validation: use canary rollouts and shadow traffic (the candidate model scores live requests without its outputs being served) to detect regressions before full launch. Maintain a model registry and dataset hashes for traceability.
Class-imbalance & long-tail handling: use focal loss, class-aware sampling, and per-class thresholding in production; include per-slice evaluation and targeted data collection for low-frequency classes.
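
The IoU formula from the evaluation-metrics item can be made concrete with a short sketch. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form; that convention and the names `Box` and `iou` are illustrative choices, not something the notes prescribe.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate zero-area boxes are treated as having IoU 0.
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one pixel in each direction overlap in a 1x1 square, so the IoU is 1 / (4 + 4 - 1) = 1/7. Detection mAP is built on top of this by counting a prediction as a true positive when its IoU with a ground-truth box exceeds a chosen threshold.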

Tip: build tooling that deterministically replays historical traffic through new preprocessing and model variants; this is the fastest way to validate offline/online parity.
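
A replay-based parity check might look like the sketch below. The two preprocessing functions are hypothetical stand-ins for a training-time and a serving-time path with a deliberate normalization mismatch, and `parity_report` is an illustrative helper, not an existing tool.

```python
import numpy as np


def preprocess_offline(img: np.ndarray) -> np.ndarray:
    # Hypothetical training-time path: scale to [0, 1], then normalize to [-1, 1].
    return (img.astype(np.float32) / 255.0 - 0.5) / 0.5


def preprocess_online(img: np.ndarray) -> np.ndarray:
    # Hypothetical serving-time path that forgot the second normalization step.
    return img.astype(np.float32) / 255.0


def parity_report(frames, fn_a, fn_b, atol=1e-6):
    """Replay the same frames through two pipelines and compare the outputs."""
    worst = 0.0
    mismatched = 0
    for frame in frames:
        diff = float(np.max(np.abs(fn_a(frame) - fn_b(frame))))
        worst = max(worst, diff)
        if diff > atol:
            mismatched += 1
    return {"frames": len(frames), "mismatched": mismatched, "max_abs_diff": worst}


# Seeded synthetic frames stand in for logged historical traffic.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(5)]
report = parity_report(frames, preprocess_offline, preprocess_online)
```

In a real setup the frames would come from logged requests, and the same comparison could be applied to model outputs as well as preprocessed tensors.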

Worked example

"Design an object-detection serving pipeline that meets a 200ms p99 latency SLO for live 30fps video." In the first 30 seconds ask: target accuracy (mAP), hardware constraints (edge CPU, mobile GPU, cloud GPU), allowed downsampling, acceptable false-positive vs false-negative costs, and whether batching is permitted. Organize your response around three pillars: (1) model selection and compression — pick a lightweight backbone (e.g., MobileNetV3 + SSD or a pruned YOLO), apply quantization-aware training or post-training quantization, and consider distillation from a high-accuracy teacher; (2) serving architecture — push minimal preprocessing to the camera, use on-device inference with TFLite if privacy/latency require, or cloud inference with dynamic batching and TensorRT instances for GPUs; (3) validation & rollout — shadow traffic for a % of live stream, synthetic benchmarks for latency, and per-class monitoring. Explicit tradeoff: batching on the cloud improves throughput but risks violating 200ms tail latency; favor micro-batching or adaptive batching with max-latency cap. Close by stating measurable next steps: implement a microbenchmark, measure p50/p90/p99 on target hardware, and if more time, run a small A/B canary with shadow logging and a targeted active-learning loop to collect failure cases.

A second angle

"Design a data-collection and monitoring strategy to address long-tail class drift for a face-recognition pipeline." The core technical ideas are the same — measure per-slice performance, prioritize label collection, and ensure offline/online parity — but constraints differ: heavy privacy requirements, rare classes with few examples, and potentially user-facing false-reject harms. Emphasize privacy-preserving telemetry (aggregate/hashed metrics), active learning to select ambiguous or underrepresented cohorts, and automated slice alerting for sudden PSI increases on demographic features. Operationally, prefer lightweight on-device embeddings with server-side nearest-neighbor lookup (FAISS) and add periodic re-training using a curator-driven labeling workflow; for rare classes, synthetic augmentation and class-aware sampling help bootstrap models.

Common pitfalls

Pitfall: Optimizing only for a single metric (e.g., peak mAP) without considering production SLOs like p99 latency or memory constraints; always pair accuracy gains with serving-cost and latency analysis.

Pitfall: Failing to ask about hardware and deployment context — proposing a heavy two-stage detector without confirming available GPUs or edge constraints will derail the design.

Pitfall: Ignoring data quality and annotation policy. Tuning the model architecture is a tempting first fix, but label inconsistencies or skew usually explain most real-world failures.

Connections

Expect interviewer pivots to embedding retrieval (image embeddings + FAISS), experiment design / A/B testing for model changes, or privacy/federated learning for on-device pipelines. Be ready to connect model design decisions to cost, user-impact metrics, and reproducibility tooling.