LLM Architecture, Tuning, And Evaluation

What's being tested

Interviewers are probing whether you can reason about LLM architecture, fine-tuning, serving, and evaluation as an ML Engineer, not just recite Transformer terminology. You need to connect model internals (attention, tokenization, loss, routing, decoding) to operational concerns like latency, cost, validation coverage, safety regressions, and offline/online parity. Amazon cares because LLM-backed systems must be reliable at scale: small evaluation gaps can create customer-facing hallucinations, safety failures, or costly inference inefficiencies. A strong answer shows you can move between model math, training/evaluation pipelines, and production deployment tradeoffs.

Core knowledge

Transformer decoder architecture is the default foundation for generative LLMs: token embeddings plus positional information feed stacked blocks of masked self-attention, MLP layers, residual connections, and layer normalization. At inference, autoregressive decoding predicts $p(x_t \mid x_{<t})$ one token at a time.
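Self-attention computes $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ and, implemented naively, has $O(n^2)$ time and memory in sequence length $n$. For long contexts, mention mitigations and say what each one buys: KV caching avoids recomputing past keys and values during decoding, grouped-query attention shrinks the KV cache, FlashAttention reduces attention memory traffic without changing the result, sliding-window attention bounds the attended span, and retrieval augmentation or chunking keeps the prompt short.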
Tokenization usually uses subword tokenizers such as BPE, WordPiece, or SentencePiece. Tokenization affects latency, multilingual quality, prompt length, and evaluation comparability. Edge cases include rare identifiers, URLs, code, non-Latin scripts, and whitespace-sensitive formats like JSON or Python.
Pretraining optimizes next-token prediction over broad corpora, while instruction tuning teaches task-following behavior from curated prompt-response pairs. RLHF or DPO then aligns outputs with preference data. For an MLE, the key is data pipeline quality, reproducible training jobs, checkpointing, evaluation gates, and deployment compatibility.
Surprisal measures how unexpected a token is: $I(x)=-\log p(x)$. With $\log_2$, units are bits; with the natural log, units are nats. Average surprisal over a sequence is the cross-entropy $H$, and perplexity is $\exp(H)$ when $H$ is in nats or $2^H$ when $H$ is in bits.
Perplexity is useful for language-model fit but insufficient for assistant quality. A model can have better perplexity and worse helpfulness, safety, factuality, or instruction adherence. Pair it with task-level metrics such as exact match, pass@k, factuality checks, toxicity classifiers, human preference win rate, and calibrated LLM-as-judge evaluations.
Mixture-of-Experts models activate a subset of expert MLPs per token, often top-1 or top-2 routed. This increases parameter count without proportional FLOPs, but creates load-balancing, routing instability, and distributed communication challenges. Training often adds auxiliary load-balancing losses to avoid expert collapse.
MoE serving is not “free sparsity.” Tokens routed to different experts require all-to-all communication across devices, and batching becomes harder because expert assignment is data-dependent. Good answers discuss throughput, p99 latency, capacity factors, dropped tokens, expert parallelism, and fallback behavior during overload.
Fine-tuning choices include full fine-tuning, LoRA, QLoRA, prefix tuning, and prompt tuning. LoRA freezes the base weights and trains low-rank update matrices added to selected weight matrices, which reduces trainable parameters substantially. It is attractive when you need cheaper experimentation, safer rollback, and multiple task-specific adapters sharing one base model.
Evaluation systems should test multiple layers: numerical integrity, data contamination, prompt formatting, model behavior, safety, latency, cost, and regression against known failures. A production validation suite often combines fixed golden sets, adversarial prompts, synthetic tests, shadow traffic, canaries, and human review for high-risk categories.
Generation settings change quality and reproducibility. Temperature, top-p, top-k, max tokens, repetition penalties, and stop sequences affect hallucination, diversity, latency, and determinism. For validation, use deterministic decoding (for example, greedy) where possible, store seeds and configs, and test the distribution of stochastic outputs separately.
RAG systems shift quality risk from only model weights to retrieval, chunking, embeddings, ranking, prompt assembly, and citation grounding. MLE-relevant evaluation includes recall@k for retrieval, answer faithfulness, source attribution, latency budget split, index freshness, and online/offline feature or embedding parity.
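The attention formula and the autoregressive masking described in the list above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not how a production kernel is written; the function names and the toy `Q`, `K`, `V` values are assumptions made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head masked self-attention over lists of row vectors.

    Q and K are n x d_k, V is n x d_v. Position t attends only to
    positions 0..t, which is the mask used for autoregressive decoding.
    """
    scale = math.sqrt(len(K[0]))
    d_v = len(V[0])
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Scaled dot products against keys 0..t; later keys are masked out.
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[j])) / scale
            for j in range(t + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the visible value vectors.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(t + 1)) for c in range(d_v)]
        )
    return outputs, weights


# Toy inputs (hypothetical values): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
```

The nested loop touches n(n+1)/2 query-key pairs, which is where the quadratic cost comes from. The first token can only see itself, so its output equals its own value vector.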

Worked example

For “Design an LLM quality validation system”, start by framing scope in the first 30 seconds: “Are we validating a base model, a fine-tuned assistant, or an end-to-end RAG application? What are the launch gates—quality, safety, latency, cost, or all of them? Is this for offline release validation, online monitoring, or both?” Then declare assumptions: a customer-facing assistant, multiple model versions, automated CI-style checks before deployment, and post-launch drift monitoring.

Organize the answer into four pillars. First, define evaluation coverage: curated golden prompts, task-specific benchmarks, safety/adversarial sets, regression tests from prior incidents, and representative production-like prompts with privacy-safe sampling. Second, define metrics: exact match or rubric score for structured tasks, LLM-as-judge preference with calibration, hallucination or groundedness for RAG, toxicity/safety rates, refusal correctness, latency p50/p95/p99, tokens per second, and cost per 1K requests. Third, describe the system architecture: model registry, prompt/version registry, evaluation runner, deterministic inference harness, result store, dashboard, and automated deployment gates. Fourth, cover online validation: shadow tests, canary rollout, alerting on metric regressions, drift detection in prompt distribution, and rollback.
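The "automated deployment gates" pillar can be made concrete with a small threshold check. The sketch below is only one possible shape for such a gate; the `Gate` class, the metric names, and every threshold are hypothetical values chosen for illustration, not recommended launch criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool


def evaluate_gates(metrics, gates):
    """Return (passed, failures). A missing metric counts as a failure."""
    failures = []
    for g in gates:
        value = metrics.get(g.metric)
        if value is None:
            failures.append(f"{g.metric}: missing")
        elif g.higher_is_better and value < g.threshold:
            failures.append(f"{g.metric}: {value} < {g.threshold}")
        elif not g.higher_is_better and value > g.threshold:
            failures.append(f"{g.metric}: {value} > {g.threshold}")
    return (not failures, failures)


# Hypothetical gates and candidate-model results.
gates = [
    Gate("groundedness", 0.90, True),
    Gate("unsafe_response_rate", 0.01, False),
    Gate("latency_p99_ms", 2000.0, False),
]
candidate = {
    "groundedness": 0.93,
    "unsafe_response_rate": 0.02,
    "latency_p99_ms": 1800.0,
}
passed, failures = evaluate_gates(candidate, gates)
```

Here the candidate is blocked because the unsafe-response rate exceeds its limit, even though quality and latency pass. Treating a missing metric as a failure is a deliberate choice in this sketch: a broken evaluation job should not silently promote a model.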

A specific tradeoff to flag is LLM-as-judge versus human evaluation. LLM judges scale and catch many semantic issues, but they can be biased, brittle to prompt wording, and poorly calibrated for safety-critical categories; human review should remain the source of truth for high-risk or ambiguous cases. Close by saying: “If I had more time, I’d add red-team dataset refresh, inter-rater agreement tracking, and slice-based reporting by locale, device, prompt length, and task type.”
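One way to check whether an LLM judge is trustworthy on a category is to measure its agreement with human labels on the same items, corrected for chance. The sketch below uses Cohen's kappa, $\kappa = (p_o - p_e)/(1 - p_e)$, as an assumed choice of agreement statistic; the label lists are made-up toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical): human review vs. LLM judge on four responses.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
```

Raw agreement here is 75%, but kappa is lower because half of that agreement would be expected by chance. Tracking the statistic per slice (for example, per safety category) shows where the judge should not replace human review.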

A second angle

For “Explain Transformers and MoE in LLMs”, the same concept shifts from validation-system design to architecture and scaling mechanics. Instead of leading with dashboards and release gates, lead with the Transformer block: masked self-attention, feed-forward layers, residual paths, layer norm, and autoregressive decoding. Then introduce MoE as a sparse replacement for dense feed-forward computation, where a router sends each token to a small number of experts. The MLE angle is not just “MoE has more parameters”; it is how routing affects training stability, GPU utilization, distributed all-to-all communication, checkpoint layout, serving latency, and monitoring for expert imbalance. A strong answer explicitly contrasts dense models’ simpler serving path with MoE’s better parameter/FLOP scaling but higher systems complexity.
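The routing step and its systems consequences can be illustrated with a toy dispatcher. This is a simplified sketch under stated assumptions: gate weights are renormalized over the selected experts, and per-expert capacity is computed as `capacity_factor * tokens * k / num_experts`, which is one common convention rather than the only one. All names and logits are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def dispatch(batch_logits, num_experts, k=2, capacity_factor=1.25):
    """Assign tokens to experts under a per-expert capacity limit.

    Assignments that would exceed an expert's capacity are dropped,
    which is the overflow behavior this sketch assumes.
    """
    capacity = math.ceil(capacity_factor * len(batch_logits) * k / num_experts)
    load = [0] * num_experts
    assignments, dropped = [], 0
    for token_id, logits in enumerate(batch_logits):
        for expert, weight in top_k_route(logits, k):
            if load[expert] < capacity:
                load[expert] += 1
                assignments.append((token_id, expert, weight))
            else:
                dropped += 1
    return assignments, load, dropped


# Hypothetical worst case: every token prefers expert 0.
skewed = [[2.0, 1.0, 0.0, -1.0]] * 4
assignments, load, dropped = dispatch(skewed, num_experts=4, k=1, capacity_factor=1.0)
```

With a collapsed router, one expert fills its capacity after a single token and the remaining three tokens are dropped while three experts sit idle. That imbalance is what auxiliary load-balancing losses and expert-load monitoring are meant to prevent.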

Common pitfalls

Pitfall: Treating perplexity as the only quality metric.

Perplexity measures average next-token likelihood, not whether the assistant follows instructions, tells the truth, refuses unsafe requests correctly, or produces useful task outputs. A better answer says perplexity is one offline signal, then layers on task-specific, safety, human preference, and production telemetry metrics.
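To make concrete what perplexity does measure, here is a minimal sketch that turns per-token probabilities into surprisal, cross-entropy, and perplexity. The function names and the toy probabilities are assumptions for illustration; in practice the probabilities would come from the model's output distribution over a held-out corpus.

```python
import math


def surprisal(p, base=math.e):
    """Surprisal -log p(x): nats for base e, bits for base 2."""
    return -math.log(p) / math.log(base)


def cross_entropy(token_probs, base=math.e):
    """Average surprisal over the probabilities assigned to the observed tokens."""
    return sum(surprisal(p, base) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """exp(H) with H in nats; equal to 2**H when H is measured in bits."""
    return math.exp(cross_entropy(token_probs))


# Toy case (hypothetical): the model assigns probability 0.25 to every observed token.
probs = [0.25, 0.25, 0.25, 0.25]
h_bits = cross_entropy(probs, base=2)
ppl = perplexity(probs)
```

A uniform probability of 0.25 gives 2 bits per token and a perplexity of 4, which reads as "as uncertain as a fair four-way choice." Nothing in this computation looks at whether the generated answer is correct, safe, or on-task, which is why it cannot stand alone as a quality gate.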

Pitfall: Explaining architecture without operational consequences.

A tempting answer defines attention, MoE, and RLHF correctly but never mentions batching, KV cache memory, latency, cost, rollout gates, or monitoring. For an ML Engineer interview, tie every architectural choice to training pipeline complexity, inference behavior, validation coverage, or deployment risk.
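A quick back-of-the-envelope calculation is one way to tie architecture to serving cost. The sketch below estimates KV cache size as 2 (keys and values) x layers x KV heads x head dimension x sequence length x batch size x bytes per element. The model configuration is hypothetical and does not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size; the factor 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Hypothetical config: 32 layers, head_dim 128, 4096-token context, 16-bit cache.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

mha_gib = mha / GIB  # 2.0 GiB per sequence with 32 KV heads
gqa_gib = gqa / GIB  # 0.5 GiB per sequence with 8 KV heads
```

Under these assumptions, cutting KV heads from 32 to 8 with grouped-query attention shrinks the per-sequence cache by 4x, which translates directly into a larger feasible batch size on the same GPU.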

Pitfall: Being vague about evaluation data.

Saying “I’d test on a benchmark and some human labels” is too shallow. Stronger answers describe golden sets, adversarial sets, regression suites, slice analysis, contamination checks, deterministic generation configs, and a clear pass/fail threshold before model promotion.
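As an example of what a contamination check might look like, the sketch below flags evaluation prompts that share any word n-gram with the training corpus. It is a deliberately simple heuristic: whitespace tokenization, lowercase matching, and the choice of `n` are all assumptions, and a real pipeline would likely need normalization and a scalable index.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_prompts, training_docs, n=8):
    """Return (fraction flagged, flagged prompts) by n-gram overlap with training data."""
    if not eval_prompts:
        return 0.0, []
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = [p for p in eval_prompts if ngrams(p, n) & train_ngrams]
    return len(flagged) / len(eval_prompts), flagged


# Toy data (hypothetical), with a short n so the overlap is visible.
training_docs = ["the quick brown fox jumps over the lazy dog"]
eval_prompts = ["A quick brown fox appears", "entirely different words here"]
rate, flagged = contamination_rate(eval_prompts, training_docs, n=3)
```

The first prompt is flagged because it shares the trigram "quick brown fox" with the training text. Flagged items can be removed from the golden set or reported as a separate slice so that inflated scores do not drive a promotion decision.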

Connections

Interviewers may pivot from here into RAG evaluation, model serving optimization, feature and embedding drift monitoring, or fine-tuning pipeline design. Be ready to discuss SageMaker, model registries, canary deployments, GPU memory bottlenecks, approximate nearest neighbor retrieval, and how offline validation connects to online rollback criteria.
