Multimodal LLM System Design

What's being tested

Candidates must demonstrate practical design and engineering judgment for building multimodal LLM systems: how inputs from different modalities are preprocessed, fused, trained, evaluated, and served at scale while controlling cost, latency, and quality. Interviewers probe whether you can choose architectures (encoders, fusion), construct training objectives (contrastive, generative, multitask), design offline/online evaluation, and specify deployment tradeoffs that an ML Engineer owns (training pipelines, serving stack, monitoring), not adjacent infra or product strategy.

Core knowledge

Modal encoders: common choices are pretrained vision encoders (CLIP, ViT, ResNet) and text models (BERT-style encoders, or a GPT-family decoder as the language backbone); embedding widths typically fall in $d \in [512, 2048]$. Keep pretrained backbones frozen for sample efficiency, and fine-tune them when the modalities need tighter alignment.
Fusion strategies: early fusion (concatenate low-level features or tokens from all modalities, then attend jointly) versus late fusion (separate encoders whose outputs are merged afterwards, through cross-attention or at the logit level). Early fusion increases model compute and multimodal context length; late fusion preserves modality specialization and simplifies retrieval.
Contrastive vs generative objectives: contrastive loss (InfoNCE) for alignment and retrieval; autoregressive / seq2seq generative loss for multimodal generation. Multi-objective training combines the two with weights $\lambda$: $L = \lambda_{gen} L_{gen} + \lambda_{ctr} L_{ctr}$. Tune $\lambda_{gen}$ and $\lambda_{ctr}$ to balance generation against retrieval quality.
Retrieval-augmented systems: use dual-encoder retrieval (encode query and corpus separately) with an approximate nearest neighbor index (e.g., FAISS) for scale; an IVF+PQ index scales to tens or hundreds of millions of vectors, with single-query latency ranging from sub-millisecond to a few milliseconds depending on hardware and index settings.
Training scale & infrastructure: once model and optimizer state no longer fit on a single accelerator (which happens well before 100B params), use model/data parallelism frameworks (DeepSpeed ZeRO stage 3, torch.distributed). Batch size, gradient accumulation, and learning-rate schedules (linear warmup + cosine decay) critically affect convergence.
Quantization & distillation for serving: 8-bit quantization, lower-bit post-training methods such as GPTQ, and sequence-level distillation reduce model size and latency; speedups of roughly 2–4× are plausible but hardware-dependent, so measure them and validate the metric drop on downstream multimodal tasks.
Offline/online evaluation: combine automatic metrics (BLEU/METEOR are weak proxies for generation quality; CLIPScore measures image-text alignment; FID measures generated-image quality) with task-specific metrics (accuracy for VQA, ROUGE for summarization). Complement with human evals focusing on hallucination, relevance, and safety.
Drift & monitoring: monitor input-modality distributions (image histograms, audio SNR), embedding drift (cosine similarity to a historical centroid), and serving metrics (p95 latency, error rate). Trigger retraining when distribution shift or metric degradation exceeds predefined thresholds.
Data curation & labels: curate paired multimodal data, deduplicate near-duplicates, and track provenance; synthetic augmentation (caption paraphrasing, image augmentations) helps but can increase hallucination if alignment noise is high.
Latency & cost tradeoffs: cross-attention fusion typically adds $O(L_{text} \times L_{patch})$ compute per layer. Approximate numbers: ~196 ViT patches (a 224×224 image with 16×16 patches) and ~512 text tokens give roughly 100K query-key scores per head per layer, and the cost grows with the product, so prefer late fusion or adapter modules for low-latency inference.
Evaluation for hallucination and safety: design targeted tests (adversarial image/text prompts) and automated detectors (consistency checks against retrieved evidence). Tune beam size and sampling temperature; greedy or low-temperature decoding tends to reduce hallucination at the cost of diversity.
Feature stores & online/offline parity: apply the same preprocessing transforms in training and inference (tokenization, image resizing, normalization with identical mean/std) via shared transform libraries (torchvision.transforms pipelines or serialized preprocessing containers) to avoid training/serving skew.
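The weighted objective from the "Contrastive vs generative objectives" note can be made concrete with a small, dependency-free sketch. It assumes a batch where image i and text i form the only positive pair, uses a symmetric image-to-text and text-to-image InfoNCE term, and picks an illustrative temperature of 0.07 and illustrative weights; none of these values come from the notes above, and the function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _l2_normalize(vec):
    norm = math.sqrt(_dot(vec, vec)) or 1.0  # leave zero vectors unchanged
    return [x / norm for x in vec]


def _log_softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row i of each list is assumed to be a matched pair."""
    imgs = [_l2_normalize(v) for v in image_embs]
    txts = [_l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [[_dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def combined_loss(l_gen, l_ctr, lambda_gen=1.0, lambda_ctr=0.5):
    """L = lambda_gen * L_gen + lambda_ctr * L_ctr (weights are illustrative)."""
    return lambda_gen * l_gen + lambda_ctr * l_ctr
```

With perfectly aligned pairs the contrastive term approaches zero, and with identical embeddings for every item it equals log(batch size), which is a handy sanity check when wiring the loss into a real training loop.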
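The "linear warmup + cosine decay" schedule mentioned under training scale is easy to state as a pure function of the step index. The sketch below is one common formulation, not the only one; the function name, the choice to reach the base rate on the last warmup step, and the `min_lr` floor are assumptions made for illustration.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate for a 0-indexed step under linear warmup + cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches base_lr on the final warmup step.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

In an interview it is usually enough to describe the shape: a short ramp to avoid early instability, then a smooth decay toward a small floor.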
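The embedding-drift check from the "Drift & monitoring" note (cosine similarity to a historical centroid) might look like the following minimal sketch. The helper names and the 0.95 alert threshold are hypothetical; a real threshold would be calibrated on historical traffic.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a degenerate centroid as maximally drifted
    return dot / (norm_a * norm_b)


def drift_alert(historical_embs, recent_embs, min_similarity=0.95):
    """Return (similarity, alert) comparing recent traffic to the historical centroid."""
    sim = cosine_similarity(centroid(historical_embs), centroid(recent_embs))
    return sim, sim < min_similarity
```

A centroid comparison is cheap but coarse: it can miss shifts that change the spread of embeddings without moving the mean, so it is best paired with the input-level and metric-level monitors listed above.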

Worked example

Design a scalable multimodal LLM that supports image+text queries and returns grounded answers.
First 30s: clarify scope, i.e. expected throughput, latency SLO (e.g., p95 ≤ 300ms), modalities (static images vs video frames), and whether retrieval is allowed.
Skeleton answer pillars:
(1) Input & preprocessing: standardized image resizing, patching, and tokenization pipelines with deterministic transforms.
(2) Model architecture: a dual-encoder baseline (image encoder + text encoder) for retrieval, plus a lightweight cross-attention generator when grounded generation is required.
(3) Training objectives: combine contrastive alignment (InfoNCE) with generative fine-tuning on paired QA data, balancing the two losses with $\lambda$.
(4) Serving & scaling: host the encoders in GPU-backed microservices, use a FAISS index for retrieval, and route generation to a separate GPU pool with a looser latency budget.
A key tradeoff to flag: heavy cross-attention in the main path yields better multimodal fusion but can violate tight latency SLOs; choose late fusion + cached retrieval for low-latency interactive use.
Close by stating the validation plan (offline metrics + human eval) and next steps: prototype dual-encoder retrieval and measure p95 latency and quality before a full joint fine-tune.
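The retrieval path of the dual-encoder baseline can be prototyped before any ANN index is involved. The sketch below is an exact, brute-force cosine search that stands in for the FAISS index in the design; it is not FAISS, the class and document ids are hypothetical, and the embeddings are assumed to come from the image and text encoders that share one embedding space.

```python
import heapq
import math


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vec]


class ExactCosineIndex:
    """Brute-force stand-in for an ANN index: O(N) per query, exact results."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, doc_id, embedding):
        # Corpus embeddings are computed offline and stored unit-normalized.
        self._ids.append(doc_id)
        self._vectors.append(_unit(embedding))

    def search(self, query_embedding, k=3):
        """Return up to k (score, doc_id) pairs, highest cosine similarity first."""
        q = _unit(query_embedding)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for doc_id, vec in zip(self._ids, self._vectors)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

An exact index like this also serves as the ground truth for measuring recall@k once an approximate index replaces it, which is a useful point to raise when discussing the quality side of the latency tradeoff.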

A second angle

Consider low-latency on-device image captioning for resource-constrained phones. The same concepts apply, but the constraints shift the design choices: prefer model compression (quantization to 4–8 bits, pruning) and distilled transformers for both image and text encoders, and avoid large cross-attention by using a compact multimodal adapter that feeds a small number of visual tokens into a lightweight decoder. Emphasize deterministic on-device preprocessing and a fallback to cloud generation when confidence is low. Monitoring shifts toward client-side telemetry (tokenization failures, camera exposure distributions) and careful privacy-preserving logging.
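To make the 4–8 bit quantization tradeoff tangible, here is a minimal sketch of symmetric, per-tensor, round-to-nearest quantization. This is the simplest scheme and is chosen only for illustration; production toolchains typically use per-channel scales or calibration-based methods, and the function names here are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_abs_error(weights, quantized, scale):
    """Worst-case reconstruction error, a cheap proxy to check before task metrics."""
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

Reconstruction error grows as the bit width shrinks, which is why the lower end of the 4–8 bit range needs validation on the actual captioning metrics rather than on weight error alone.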

Common pitfalls

Pitfall: Overfitting to automatic metrics. Relying only on BLEU or CLIPScore can reward safe, generic captions and miss hallucinations; always include targeted human evaluation and task-specific checks.

Pitfall: Ignoring inference parity. Training transforms that differ from serving transforms (e.g., different normalization/cropping) produce large offline/online gaps; enforce shared transform code and unit tests.
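A parity unit test can be as small as running both preprocessing paths on fixed fixtures and comparing outputs. In the sketch below, the mean/std constants are the commonly used ImageNet statistics (an assumption, since the notes do not name a dataset), and `skewed_serving_preprocess` is a deliberately broken hypothetical path that forgets the 0-255 to 0-1 rescale.

```python
# Assumed normalization constants (widely used ImageNet statistics).
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)


def shared_preprocess(rgb):
    """Reference transform: scale 0-255 to 0-1, then normalize per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD))


def skewed_serving_preprocess(rgb):
    """Hypothetical buggy serving path that skips the 0-255 -> 0-1 rescale."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD))


def parity_gap(train_fn, serve_fn, fixtures):
    """Largest absolute per-channel difference between the two pipelines."""
    return max(
        abs(a - b)
        for pixel in fixtures
        for a, b in zip(train_fn(pixel), serve_fn(pixel))
    )


FIXTURES = [(0, 0, 0), (128, 64, 255), (255, 255, 255)]
```

A CI check would assert that `parity_gap` between the training and serving transforms stays below a small tolerance on the fixtures; the skewed path above fails that check by a wide margin.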

Pitfall: Treating modalities interchangeably. A tempting simplification is naively concatenating image tokens with text tokens; that often blows up compute and hurts performance unless you set an explicit patch/token budget and handle positional encodings for each modality.
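A quick way to justify a patch/token budget is to count attention scores. The sketch below counts query-key pairs per head per layer for cross-attention versus joint self-attention over the concatenated sequence; it assumes square images, square non-overlapping patches, and no class token, and it ignores the constant factors from head dimension and layer count.

```python
def num_patches(image_size, patch_size):
    """Patches for a square image split into non-overlapping square patches."""
    return (image_size // patch_size) ** 2


def cross_attention_scores(text_tokens, image_patches):
    """Text queries attending to image keys: L_text * L_patch scores."""
    return text_tokens * image_patches


def joint_attention_scores(text_tokens, image_patches):
    """Self-attention over the concatenated sequence: (L_text + L_patch)^2 scores."""
    return (text_tokens + image_patches) ** 2


text_tokens = 512
patches_224 = num_patches(224, 16)  # 196
patches_448 = num_patches(448, 16)  # 784

print(cross_attention_scores(text_tokens, patches_224))  # 100352
print(joint_attention_scores(text_tokens, patches_224))  # 501264
print(joint_attention_scores(text_tokens, patches_448))  # 1679616
```

Doubling the image side quadruples the patch count, so joint attention over 512 text tokens goes from about 0.5M to about 1.7M scores per head per layer, which is the kind of number that motivates downsampling or pooling visual tokens before fusion.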

Connections

Interviewers may pivot to retrieval-augmented generation, model compression & distillation, or RLHF/alignment for multimodal outputs. They might also ask about dataset provenance and bias mitigation, which touches data engineering and policy but expects engineering-level mitigation plans.