Sequence Models and Model Predictive Control

What's being tested

Interviewers are probing your ability to choose and justify sequence-modeling architectures versus a control-based solution under production constraints: predictive accuracy over time, handling long-range dependencies, optimization/learning trade-offs, and deployment/latency/robustness implications that an ML Engineer must own. They're checking you can reason about data-driven sequence learners (RNN, LSTM, Transformer) alongside an optimization-driven controller (Model Predictive Control (MPC)), quantify tradeoffs, and propose practical training, validation, and serving strategies that satisfy Tesla-grade latency, safety, and monitoring requirements.

Core knowledge

Recurrent Neural Network (RNN): stateful sequence model with hidden state update $h_t = \phi(W x_t + U h_{t-1})$; per-step compute is cheap at $O(d^2)$ for hidden size $d$, but training suffers from vanishing gradients on long sequences and the model has limited long-range memory.
Long Short-Term Memory (LSTM): gated RNN that mitigates vanishing gradients via input/forget/output gates; better for moderate-length dependencies (hundreds of steps), but still sequential, so training is slower than for parallel models and inference must step through the sequence one element at a time.
Transformer: uses self-attention to connect any pair of positions; each attention layer costs $O(L^2 \cdot d)$ compute and $O(L^2)$ memory for sequence length $L$, enabling long-range modeling but costly for large $L$; sparse or locality-restricted attention can reduce the cost toward linear in $L$.
Model Predictive Control (MPC): online optimization over control horizon $H$, solving $\min_{u_{0:H-1}} \sum_{t=0}^{H-1} \ell(x_t,u_t) \quad\text{s.t.}\quad x_{t+1}=f(x_t,u_t),\; g(x_t,u_t)\le 0$. It is receding-horizon (apply only the first control, then re-solve at the next step), handles constraints explicitly, and gives deterministic guarantees when the model is accurate, but needs fast solvers and accurate dynamics.
Data-vs-modeling tradeoff: learned models (RNN/LSTM/Transformer) approximate $f$ or predict future observations; MPC uses an explicit $f$ and optimizes over it. The two combine by learning the dynamics (a neural $f$) or by using a learned policy to warm-start MPC.
Training pitfalls: teacher forcing (feeding ground-truth previous outputs during training) creates a training-inference mismatch: the model never sees its own errors at train time, which causes exposure bias. Remedies: scheduled sampling or sequence-level losses.
Evaluation: use open-loop (prediction horizon) and closed-loop (rollout) metrics; report per-step RMSE and cumulative control cost, and safety metrics (constraint violations per 1000 episodes).
Deployment constraints: measure median and p99 latency for the end-to-end loop, model memory and CPU/GPU availability, and required control frequency (e.g., 10 Hz vs 100 Hz) to decide between heavy Transformers and lightweight LSTMs or a learned MPC policy.
Robustness & monitoring: track distributional drift in inputs, online replay for concept-drift detection, and maintain a shadow MPC or rule-based fallback if learned model confidence or constraint checks fail.
Hybrid approaches: Differentiable MPC or learning residual dynamics where neural net corrects an analytic model, combining MPC's constraints with learned flexibility; requires differentiable solvers or implicit differentiation.
Complexity heuristics: as a rough rule of thumb, Transformers pay off when $L \gtrsim 200$ and parallel training matters; for real-time control with small $H$ and tight latency, prefer an LSTM or a model-predictive policy approximator.
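To make the receding-horizon idea concrete, here is a minimal sketch under stated assumptions: a double-integrator plant, a quadratic stage cost, and a box constraint $|u| \le u_{max}$, solved by projected gradient descent. The plant, weights, horizon, and function names are all illustrative choices, not taken from any particular system, and the cost is summed over $x_1..x_H$ rather than $x_0..x_{H-1}$ for convenience. A production controller would more likely use a dedicated QP solver.

```python
import numpy as np

def prediction_matrices(A, B, H):
    """Return Sx, Su with [x_1; ...; x_H] = Sx @ x0 + Su @ [u_0; ...; u_{H-1}]."""
    n, m = B.shape
    Sx = np.zeros((H * n, n))
    Su = np.zeros((H * n, H * m))
    for t in range(H):
        Sx[t * n:(t + 1) * n] = np.linalg.matrix_power(A, t + 1)
        for k in range(t + 1):
            # u_k influences x_{t+1} through A^(t-k) B
            Su[t * n:(t + 1) * n, k * m:(k + 1) * m] = (
                np.linalg.matrix_power(A, t - k) @ B
            )
    return Sx, Su

def solve_box_mpc(x0, Sx, Su, Qbar, r, u_max, iters=200):
    """Minimize sum_t x_t^T Q x_t + r u_t^2 subject to |u_t| <= u_max."""
    P = Su.T @ Qbar @ Su + r * np.eye(Su.shape[1])    # half of the Hessian
    step = 1.0 / (2.0 * np.linalg.eigvalsh(P).max())  # 1 / Lipschitz constant
    c = Su.T @ Qbar @ (Sx @ x0)
    u = np.zeros(Su.shape[1])
    for _ in range(iters):
        grad = 2.0 * (P @ u + c)
        u = np.clip(u - step * grad, -u_max, u_max)   # project onto the box
    return u

def run_closed_loop(x0, steps=100, H=20, u_max=1.0):
    dt = 0.1                                          # assumed 10 Hz loop
    A = np.array([[1.0, dt], [0.0, 1.0]])             # state: position, velocity
    B = np.array([[0.5 * dt**2], [dt]])
    Q, r = np.diag([1.0, 0.1]), 0.1
    Sx, Su = prediction_matrices(A, B, H)
    Qbar = np.kron(np.eye(H), Q)
    x = np.array(x0, dtype=float)
    applied = []
    for _ in range(steps):
        u_plan = solve_box_mpc(x, Sx, Su, Qbar, r, u_max)
        u0 = u_plan[0]                                # apply only the first control
        applied.append(u0)
        x = A @ x + B[:, 0] * u0                      # plant step, then re-plan
    return x, np.array(applied)

x_final, u_hist = run_closed_loop([1.0, 0.0])
print("final state:", x_final, "max |u|:", np.abs(u_hist).max())
```

The projection step is what enforces the input constraint at every solver iteration; state constraints would need a more general solver than this sketch provides.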

Worked example — "Compare RNNs, LSTMs, Transformers, and MPC"

Frame: Ask clarifying questions first: required control/prediction frequency, sequence lengths of interest, hard constraints, availability of system dynamics or a simulator, and compute budget at inference. A strong structure: (1) Functional capability: what temporal dependencies and constraints each method can represent; (2) Training and data demands: sample complexity, supervised vs model-based; (3) Deployment: latency, determinism, safety, fallback; (4) Hybrid options and risk mitigation. Make the tradeoff explicit: Transformers offer the best long-range pattern modeling but incur $O(L^2)$ memory, so for a 1 kHz loop or $L>500$ a Transformer may be infeasible without sparse/streaming attention. Close by proposing an MLE plan: benchmark a small LSTM baseline for latency, train a Transformer offline for batch forecasting, implement an MPC baseline using known dynamics, and measure closed-loop cost; if closed-loop errors persist, try residual learning (a neural correction to the dynamics). If there is more time, propose experiments (ablation of horizon $H$, scheduled sampling rates, MPC horizon sweep) and safety validation protocols.
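The residual-learning step in that plan can be illustrated with a toy sketch. Everything here is assumed for illustration: a one-dimensional velocity model whose true dynamics include a drag term that the analytic model omits, and a linear least-squares fit standing in for the neural correction.

```python
import numpy as np

DT = 0.1
DRAG = 0.5  # hypothetical unmodeled drag coefficient

def true_step(v, u):
    """Ground-truth dynamics, including drag."""
    return v + DT * (u - DRAG * v)

def analytic_step(v, u):
    """Known analytic model, which ignores drag."""
    return v + DT * u

def make_data(rng, n):
    v = rng.uniform(-2.0, 2.0, size=n)
    u = rng.uniform(-1.0, 1.0, size=n)
    v_next = true_step(v, u) + rng.normal(0.0, 1e-3, size=n)  # sensor noise
    return v, u, v_next

rng = np.random.default_rng(0)
v_tr, u_tr, vn_tr = make_data(rng, 500)
v_te, u_te, vn_te = make_data(rng, 200)

# Fit only the part the analytic model gets wrong.
residual = vn_tr - analytic_step(v_tr, u_tr)
coef, *_ = np.linalg.lstsq(np.column_stack([v_tr, u_tr]), residual, rcond=None)

def corrected_step(v, u):
    """Analytic model plus learned residual correction."""
    return analytic_step(v, u) + np.column_stack([v, u]) @ coef

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

analytic_rmse = rmse(analytic_step(v_te, u_te), vn_te)
corrected_rmse = rmse(corrected_step(v_te, u_te), vn_te)
print(f"analytic RMSE {analytic_rmse:.4f}, corrected RMSE {corrected_rmse:.4f}")
```

Keeping the analytic model and learning only the correction means the learned part has less to explain, which is the usual argument for needing less data than learning $f$ from scratch.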

A second angle

Reframe to a production constraint: suppose you must provide 100 Hz control with a 10 ms budget and limited GPU. The same concepts apply but the emphasis shifts: sequential inference cost dominates, so an LSTM or a distilled policy network becomes more attractive than a full Transformer; MPC is feasible only with a very small horizon or with a tailored quadratic program solver that meets the latency budget; otherwise, use an MPC-generated dataset to train a policy network (behavior cloning + DAgger) and keep MPC as a safety monitor. This shows transfer: the balance between modeling power and operational constraints dictates whether to favor learned sequential models, online optimization, or a hybrid with model-based warm-starting.
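A budget like this is only meaningful once it is measured on the target hardware. The harness below is a rough sketch: `measure_latency_ms`, `fits_budget`, the 50% safety margin, and the matrix-vector stand-in for a model step are all hypothetical names and choices, not an established API.

```python
import time
import numpy as np

def measure_latency_ms(step_fn, n_warmup=20, n_trials=500):
    """Time repeated calls to step_fn and report latency percentiles in ms."""
    for _ in range(n_warmup):          # warm caches before timing
        step_fn()
    samples = np.empty(n_trials)
    for i in range(n_trials):
        start = time.perf_counter()
        step_fn()
        samples[i] = (time.perf_counter() - start) * 1e3
    return {
        "p50": float(np.percentile(samples, 50)),
        "p99": float(np.percentile(samples, 99)),
        "max": float(samples.max()),
    }

def fits_budget(stats, budget_ms, safety_margin=0.5):
    """Require p99 to use at most a fraction of the loop budget."""
    return stats["p99"] <= budget_ms * safety_margin

# Stand-in for one recurrent inference step: a 256x256 matrix-vector product.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
h = rng.standard_normal(256)

stats = measure_latency_ms(lambda: np.tanh(W @ h))
print(stats, "fits 10 ms budget:", fits_budget(stats, budget_ms=10.0))
```

Gating on p99 rather than the mean reflects that a control loop misses its deadline on the slow calls, not the typical ones.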

Common pitfalls

Pitfall: Over-emphasizing training-set forecasting metrics like open-loop RMSE without validating closed-loop performance. A model can have low one-step error yet produce catastrophic drift when used autoregressively; always include rollout/closed-loop tests.
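A toy illustration of that gap, assuming a scalar linear system whose learned coefficient is off by 0.02 (both coefficients are made-up values): the one-step error stays small, while the autoregressive rollout compounds it.

```python
import numpy as np

A_TRUE = 0.99    # true system: x_{t+1} = A_TRUE * x_t (slowly decaying)
A_MODEL = 1.01   # learned model with a small coefficient error
T = 200

x_true = A_TRUE ** np.arange(T + 1)          # trajectory starting at x_0 = 1

# Open-loop, one-step: predict x_{t+1} from the TRUE x_t each time.
one_step_pred = A_MODEL * x_true[:-1]
one_step_rmse = float(np.sqrt(np.mean((one_step_pred - x_true[1:]) ** 2)))

# Autoregressive rollout: feed the model its own previous prediction.
rollout = np.empty(T + 1)
rollout[0] = x_true[0]
for t in range(T):
    rollout[t + 1] = A_MODEL * rollout[t]
rollout_rmse = float(np.sqrt(np.mean((rollout[1:] - x_true[1:]) ** 2)))

print(f"one-step RMSE: {one_step_rmse:.4f}, rollout RMSE: {rollout_rmse:.4f}")
```

Here the model has crossed from a stable to an unstable regime, which one-step metrics cannot reveal because the true state resets the error at every step.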

Pitfall: Choosing Transformer purely for "state-of-the-art" without accounting for $O(L^2)$ memory and inference latency constraints. For real-time loops, quantify end-to-end latency on target hardware first.
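For a quick back-of-the-envelope check on the quadratic term, the sketch below counts the bytes needed to materialize the attention score matrix for one layer. It assumes naive attention that stores the full $L \times L$ scores per head in fp16; the function name and defaults are illustrative, and fused attention kernels can avoid materializing this matrix entirely.

```python
def attention_score_bytes(seq_len, n_heads, bytes_per_element=2, batch_size=1):
    """Bytes to hold the L x L attention scores for one layer (naive attention)."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_element

MIB = 1024 * 1024
for seq_len in (512, 4096):
    size = attention_score_bytes(seq_len, n_heads=8)
    print(f"L={seq_len}: {size / MIB:.0f} MiB per layer")
# L=512:  4 MiB per layer
# L=4096: 256 MiB per layer
```

An 8x longer sequence costs 64x the score memory, which is why sequence length should be fixed before committing to an architecture.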

Pitfall: Presenting MPC as a silver bullet for safety; MPC requires accurate dynamics and fast reliable solvers. If dynamics are learned, you must account for model uncertainty and provide fallback policies or robustification.
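One simple way to act on model uncertainty is to gate the learned controller on ensemble disagreement. The sketch below is hypothetical throughout: `choose_action`, its thresholds, and the use of per-dimension standard deviation as the disagreement measure are illustrative assumptions, not a standard interface.

```python
import numpy as np

def choose_action(ensemble_next_states, learned_action, fallback_action,
                  max_std, state_limit):
    """Use the learned action only if the ensemble agrees and stays in bounds.

    ensemble_next_states: array of shape (n_models, state_dim), each row one
    ensemble member's predicted next state under the learned action.
    """
    preds = np.asarray(ensemble_next_states, dtype=float)
    disagreement = float(preds.std(axis=0).max())   # worst state dimension
    worst_case = float(np.abs(preds).max())         # most extreme prediction
    if disagreement > max_std or worst_case > state_limit:
        return fallback_action, "fallback"
    return learned_action, "learned"

agreeing = [[1.00, 0.10], [1.02, 0.12], [0.98, 0.09]]
disagreeing = [[1.0, 0.1], [2.0, 0.5], [0.0, -0.3]]

print(choose_action(agreeing, 0.3, 0.0, max_std=0.1, state_limit=5.0))
print(choose_action(disagreeing, 0.3, 0.0, max_std=0.1, state_limit=5.0))
```

Disagreement is only a proxy for uncertainty; an ensemble can be confidently wrong, so the bound check and the fallback policy still matter.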

Connections

Interviewers may pivot to reinforcement learning (model-based vs model-free) when discussing closed-loop control, or to model compression (quantization, pruning, knowledge distillation) to meet latency. They may also ask about calibration and uncertainty estimation (e.g., ensembles, Bayesian neural networks) for safe decision thresholds.
