Camera Calibration and 3D Geometry for Autonomy

What's being tested

Interviewers probe whether you can connect image formation and calibration math to practical ML pipelines: how to convert pixels to rays, use intrinsics/extrinsics in training and inference, quantify geometric error, and design data and metrics that expose calibration drift. Tesla cares because learned perception models must consume geometrically correct inputs (undistorted images, registered 3D data) and because small calibration errors cascade into large depth and pose errors during autonomy. Expect clarifying questions about coordinate frames, units, and where calibration lives in the stack.

Core knowledge

Pinhole camera model and intrinsic matrix: $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, projection $[u,v,1]^T \propto K [X,Y,Z]^T$, where pixel coordinates are obtained by dividing by $Z$; this is the basic mapping behind reprojection losses and geometry-aware augmentations.
Camera intrinsics: focal lengths ( $f_x,f_y$ ), principal point ( $c_x,c_y$ ), and skew (usually 0); intrinsics are in pixels and must match image resolution and rectification pipeline.
Distortion models: Brown–Conrady radial ( $k_1,k_2,k_3$ ) and tangential ( $p_1,p_2$ ) terms; undistortion maps are built with OpenCV's initUndistortRectifyMap and applied with remap. Unmodeled distortion biases learned features.
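Extrinsics: rigid transform (rotation R, translation t) between camera and vehicle/LiDAR frames; expressed as $T_{cam}^{veh} = [R|t]$ (here taken to map camera-frame points into the vehicle frame) and used to transform back-projected rays into the vehicle frame and, combined with the vehicle pose, into world coordinates. State the direction convention explicitly, since applying a transform where its inverse is needed is a common bug.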
Stereo geometry & depth: disparity $d = x_L - x_R$, depth $Z = \frac{f B}{d}$ for rectified stereo (focal length $f$ in pixels, baseline $B$); a disparity error $\delta d$ produces a depth error of roughly $\frac{Z^2}{fB}\,\delta d$, so long-range depth is fragile.
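Epipolar constraints & matrices: the Fundamental matrix $F$ (uncalibrated) acts on pixel coordinates, $x_2^\top F x_1 = 0$; the Essential matrix $E$ (calibrated) acts on normalized coordinates $\hat{x} = K^{-1}x$, $\hat{x}_2^\top E \hat{x}_1 = 0$, with $E = K_2^\top F K_1$. Both are used for outlier rejection and self-supervised losses.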
PnP and pose estimation: given 3D-2D correspondences, solve Perspective-n-Point (PnP) for camera pose; RANSAC for robust inliers. Accuracy depends on distribution of 3D points (depth spread, non-coplanar).
Bundle adjustment & calibration refinement: joint optimization of poses, intrinsics, and 3D points; implemented with Ceres Solver or g2o; costly but gives global consistency — used offline or as a refinement stage.
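Metrics: reprojection error in pixels (mean and median; as a rough rule of thumb, values around 0.5 px or lower indicate a good target-based calibration), depth RMSE (meters) and the percentage of points within error thresholds; for stereo, also report disparity error in pixels. Monitor per camera, per temperature band, and per lens.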
Differentiable reprojection: integrate camera transforms into training with losses like reprojection, photometric, or geometric consistency; ensure gradients flow through camera intrinsics if you learn them.
Rolling-shutter & temporal sync: rolling shutter warps projection for moving platforms; timestamp alignment across sensors is critical — sync errors appear as geometric residuals and should be instrumented in datasets.
Data practices for MLEs: produce both undistorted and raw images, save K, distortion params, and T_cam_to_vehicle per-file; store calibration metadata in training manifests for reproducibility and drift analysis.
Synthetic data & domain gap: simulate correct intrinsics/distortion and add realistic noise (motion blur, sensor noise) to narrow sim2real; consider learning per-frame calibration offsets if cameras have small time-varying biases.
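
The pixel-to-ray and extrinsics items above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: the intrinsics `K`, the rotation `R_veh_cam`, and the function names are hypothetical, the pixel is assumed to be already undistorted, and the vehicle frame is assumed to be x-forward, y-left, z-up.

```python
import numpy as np

def pixel_to_ray(u, v, K):
    """Back-project an undistorted pixel to a unit ray in the camera frame."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

def ray_to_vehicle(ray_cam, R_veh_cam):
    """Rotate a camera-frame direction into the vehicle frame.

    Directions are unaffected by translation, so only R is applied.
    """
    return R_veh_cam @ ray_cam

# Hypothetical intrinsics for a 1280x720 image (units: pixels).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

# Hypothetical camera-to-vehicle rotation for a forward-facing camera:
# camera z (forward) -> vehicle x, camera x (right) -> vehicle -y,
# camera y (down) -> vehicle -z.
R_veh_cam = np.array([[0.0, 0.0, 1.0],
                      [-1.0, 0.0, 0.0],
                      [0.0, -1.0, 0.0]])

ray_cam = pixel_to_ray(640.0, 360.0, K)       # principal point -> optical axis
ray_veh = ray_to_vehicle(ray_cam, R_veh_cam)  # optical axis -> vehicle forward
```

A useful sanity check in an interview is exactly this one: the principal point should map to the optical axis, and the optical axis of a forward-facing camera should map to the vehicle's forward direction.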

Worked example — common interview prompt: "Project 3D points into camera and compute reprojection error"

Frame it: ask whether the points are in the same coordinate frame as the camera, whether intrinsics and distortion are already known, and whether to report mean or RMS pixel error. Skeleton: (1) transform 3D points into the camera frame using extrinsics, $[X_c,Y_c,Z_c]^T = R[X,Y,Z]^T + t$, where $R, t$ map the point frame into the camera frame (the inverse of a camera-to-vehicle extrinsic); (2) apply pinhole projection $u = f_x X_c/Z_c + c_x$, $v = f_y Y_c/Z_c + c_y$; (3) either apply the distortion model to the projected points or undistort the observed pixels, so both sides live in the same image space; (4) compute per-point pixel residuals and summarise (mean, median, percentage above 1 px). Tradeoff to flag: whether to undistort the observations first or project and then distort to match raw observations; both are valid but must be consistent with how the ground-truth keypoints were measured. Close by noting practicalities: discard points with $Z_c \le 0$, robustify with RANSAC or a Huber loss, and, if time allows, propose bundle adjustment to jointly refine pose and intrinsics.
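
The skeleton above might look like the following NumPy sketch. It assumes an ideal pinhole camera with observations that are already undistorted (step 3 is therefore a no-op here), and that `R`, `t` map the point frame into the camera frame. All function names and numeric values are hypothetical.

```python
import numpy as np

def project_points(points, R, t, K):
    """Project Nx3 points to pixels with an ideal pinhole model (no distortion).

    Returns (pixels, valid), where valid marks points in front of the camera.
    """
    pts_cam = points @ R.T + t            # step 1: X_c = R X + t
    z = pts_cam[:, 2]
    valid = z > 0                         # discard points with Z_c <= 0
    safe_z = np.where(valid, z, 1.0)      # placeholder depth avoids divide-by-zero
    u = K[0, 0] * pts_cam[:, 0] / safe_z + K[0, 2]   # step 2: pinhole projection
    v = K[1, 1] * pts_cam[:, 1] / safe_z + K[1, 2]
    return np.stack([u, v], axis=1), valid

def reprojection_stats(points, observed, R, t, K):
    """Step 4: per-point pixel residuals and summary statistics."""
    pixels, valid = project_points(points, R, t, K)
    err = np.linalg.norm(pixels[valid] - observed[valid], axis=1)
    return {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "rms": float(np.sqrt((err ** 2).mean())),
        "pct_over_1px": float((err > 1.0).mean() * 100.0),
    }

# Hypothetical setup: identity extrinsics, 1280x720 camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 10.0],
                   [1.0, 0.0, 10.0],
                   [0.0, 1.0, 5.0],
                   [0.0, 0.0, -1.0]])     # last point is behind the camera
observed = np.array([[640.0, 360.0],
                     [740.0, 363.0],
                     [644.0, 560.0],
                     [0.0, 0.0]])
stats = reprojection_stats(points, observed, R, t, K)
```

With these made-up values the three valid points have residuals of 0, 3, and 4 px, so the mean and median already disagree on a tiny sample, which is a convenient hook for discussing why more than one statistic should be reported.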

A second angle — "Estimate depth to a lane marker using calibrated stereo while accounting for low texture"

Here the constraint changes: you must reason about disparity quantization, matching quality, and uncertainty propagation. Outline: rectify the image pair (in OpenCV, stereoRectify followed by initUndistortRectifyMap and remap), compute disparity (block matcher or learned network), convert to depth $Z = fB/d$, and propagate the disparity standard deviation $\sigma_d$ into depth uncertainty $\sigma_Z \approx \frac{fB}{d^2}\sigma_d = \frac{Z^2}{fB}\sigma_d$. Practical MLE moves: filter by confidence maps, fuse LiDAR when available, and train stereo networks with geometric consistency and photometric augmentation to handle low-texture regions. Emphasize baseline selection, subpixel refinement, and metrics that penalize long-range depth errors more.
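
The depth conversion and first-order uncertainty propagation can be checked numerically with a short sketch. The function name, the focal length, the baseline, and the default `sigma_d_px` of a quarter pixel are all assumptions chosen for illustration, not values from any particular rig.

```python
def stereo_depth(disparity_px, focal_px, baseline_m, sigma_d_px=0.25):
    """Depth and first-order depth uncertainty for rectified stereo.

    Z = f * B / d, and sigma_Z ~= (f * B / d**2) * sigma_d.
    sigma_d_px is an assumed disparity standard deviation in pixels.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    depth_m = focal_px * baseline_m / disparity_px
    sigma_z_m = focal_px * baseline_m / disparity_px ** 2 * sigma_d_px
    return depth_m, sigma_z_m

# Hypothetical rig: f = 1000 px, B = 0.3 m.
near_depth, near_sigma = stereo_depth(10.0, focal_px=1000.0, baseline_m=0.3)
far_depth, far_sigma = stereo_depth(3.0, focal_px=1000.0, baseline_m=0.3)
```

Under these assumptions the near point sits at about 30 m with roughly 0.75 m of uncertainty, while the far point sits at about 100 m with more than 8 m, which illustrates the quadratic growth of depth error with range for a fixed disparity noise.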

Common pitfalls

Pitfall: Treating intrinsics as immutable constants. In practice, intrinsics drift (thermal, focus changes); a better answer explains monitoring, per-drive re-calibration triggers, or learning small per-frame intrinsics offsets during training.

Pitfall: Applying undistortion inconsistently. A tempting but wrong approach is to undistort only the training images while inference still consumes raw frames. Always state whether your model expects rectified/undistorted images, and document the conversion in the runtime pipeline.

Pitfall: Reporting only mean reprojection error. The mean hides heavy-tailed failures; report the median, percentiles, and per-scene breakdowns, and name the robust methods (RANSAC, Huber loss) you would add.
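
A small sketch of why the mean alone misleads, and of what a Huber loss does to an outlier. The residual values are made up, and the function names and the threshold `delta=1.0` are illustrative assumptions.

```python
import numpy as np

def summarize_residuals(err_px):
    """Summary statistics for per-point reprojection errors (pixels)."""
    err = np.asarray(err_px, dtype=float)
    return {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "p95": float(np.percentile(err, 95)),
        "max": float(err.max()),
    }

def huber(err_px, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond it."""
    err = np.abs(np.asarray(err_px, dtype=float))
    return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

# Made-up residuals: nine good matches and one gross outlier.
residuals = [0.2] * 9 + [20.0]
summary = summarize_residuals(residuals)
losses = huber(residuals)
```

Here a single 20 px outlier drags the mean to about 2.2 px while the median stays at 0.2 px. The Huber loss charges that outlier linearly (19.5 with `delta=1.0`) instead of the 200 that a squared loss of the same form would assign, so it has far less pull on an optimizer.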

Connections

Sensor fusion & state estimation (visual-inertial odometry, LiDAR-camera calibration) — interviewers may pivot to fusing calibrated camera rays with IMU or LiDAR.
Self-supervised geometry (depth/pose networks) and SLAM — expect pivots to end-to-end learning of depth with geometric losses and drift correction.