LLM Foundations, Embeddings, Prompts, And Fine-Tuning

What's being tested

Interviewers are probing whether you understand how modern large language models work well enough to apply them to product, ranking, integrity, ads, support, or creator tools without treating them as magic. For a Data Scientist at Meta, the key skill is translating model behavior into measurable business and user outcomes: accuracy, latency, cost, safety, engagement, fairness, and experimentation validity. They are testing whether you can choose among prompting, embeddings, retrieval-augmented generation, and fine-tuning under real constraints, not whether you can recite Transformer trivia. Strong answers connect technical design choices to metrics like CTR, DAU, violation recall, human review load, latency p95, and cost per inference.

Core knowledge

Transformer models use self-attention to contextualize tokens: $\text{Attention}(Q,K,V)=\text{softmax}(QK^\top/\sqrt{d_k})V,$ where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension. Attention cost grows as $O(n^2)$ in sequence length $n$, so long-context applications create latency and memory tradeoffs unless the design uses sparse or sliding-window attention, or retrieval to keep the context short.
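To make the formula concrete, here is a minimal single-head sketch in plain Python. The helper names `softmax` and `attention` and the toy matrices are illustrative assumptions, not any library's API; real implementations are batched, multi-headed, and run on accelerators.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: n queries times n keys is the O(n^2) cost.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (hypothetical): two tokens, two dimensions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each query attends most strongly to the key it aligns with, so the first output row is weighted toward the first value row. Doubling the number of tokens quadruples the number of scores computed.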
Tokenization usually splits text into subword units learned with byte-pair encoding, WordPiece, or SentencePiece-style models. Token counts matter for cost, latency, truncation, and prompt design; short English text may average ~0.75 words/token, but URLs, code, multilingual text, and emojis can tokenize inefficiently.
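A back-of-the-envelope estimate shows why token counts drive cost. This sketch uses the ~0.75 words/token rule of thumb from the text; the request volume and the price per 1,000 tokens are hypothetical placeholders, not real pricing.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for short English text

def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Rough token count for English prose; other content can be much worse."""
    return math.ceil(word_count / words_per_token)

def estimate_cost(prompt_words, output_words, requests, price_per_1k_tokens):
    """Total cost for a workload, assuming one flat price per 1,000 tokens."""
    tokens_per_request = estimate_tokens(prompt_words) + estimate_tokens(output_words)
    return tokens_per_request * requests * price_per_1k_tokens / 1000

# Hypothetical workload: 300-word prompt, 75-word answer, 1M requests.
prompt_tokens = estimate_tokens(300)
output_tokens = estimate_tokens(75)
total_cost = estimate_cost(300, 75, 1_000_000, price_per_1k_tokens=0.002)
```

Under these assumptions each request uses about 500 tokens, so trimming the prompt by a third cuts the bill by roughly a quarter. An actual tokenizer should replace the rule of thumb before any real budgeting.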
Embeddings map text, images, users, items, or sessions into dense vectors where semantic similarity can be measured using cosine similarity, dot product, or Euclidean distance. Cosine similarity is $\cos(x,y)=\frac{x\cdot y}{\|x\|\|y\|},$ and works well when vector magnitude is not meaningful.
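The cosine formula can be checked on small vectors. This is a minimal sketch; the function name and the toy vectors are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; ignores vector magnitude."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the magnitude
c = [-3.0, 0.0, 1.0]   # orthogonal to a

sim_ab = cosine_similarity(a, b)  # close to 1: direction matches
sim_ac = cosine_similarity(a, c)  # close to 0: no shared direction
dot_ab = sum(p * q for p, q in zip(a, b))  # 28.0: the dot product grows with magnitude
```

The dot product of `a` and `b` is twice that of `a` with itself even though the direction is identical, which is why the dot product is the better choice only when magnitude carries signal, such as popularity or confidence.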
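Approximate nearest neighbor (ANN) search becomes necessary when exact search over the corpus no longer fits the latency or memory budget, which in practice often happens somewhere in the millions of vectors. Exact search over 100M embeddings is usually too slow for online serving; libraries such as FAISS and ScaNN, and index types such as HNSW and IVF-PQ, trade recall for latency, memory, and update complexity.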
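The scale argument is easy to quantify. The sketch below uses the 100M corpus size from the text; the 768-dimensional float32 embedding and the 64-byte product-quantization code are assumptions chosen for illustration.

```python
NUM_VECTORS = 100_000_000   # corpus size from the text
DIM = 768                   # assumed embedding dimension
BYTES_PER_FLOAT32 = 4
PQ_BYTES_PER_VECTOR = 64    # assumed compressed code size per vector

# Memory for raw float32 vectors versus compressed codes, in gigabytes.
raw_gb = NUM_VECTORS * DIM * BYTES_PER_FLOAT32 / 1e9
pq_gb = NUM_VECTORS * PQ_BYTES_PER_VECTOR / 1e9

# Exact search scores every vector against the query.
multiply_adds_per_query = NUM_VECTORS * DIM

compression_ratio = raw_gb / pq_gb
```

Under these assumptions the raw vectors need about 307 GB and a single exact query costs about 77 billion multiply-adds, while the compressed codes fit in about 6.4 GB before index overhead such as codebooks and ids. That gap is what the recall tradeoff buys.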
Contrastive learning trains embeddings by pulling positives together and pushing negatives apart. A common loss is InfoNCE: $L=-\log\frac{\exp(s(q,k^+)/\tau)}{\sum_i \exp(s(q,k_i)/\tau)},$ where $s$ is a similarity score, $\tau$ is a temperature, and the sum runs over the positive $k^+$ and all sampled negatives. Negative sampling quality is crucial; easy negatives inflate offline metrics without improving product performance.
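A small numeric sketch of InfoNCE for a single query shows why easy negatives are misleading. The function name, the temperature default, and the similarity scores are illustrative assumptions.

```python
import math

def info_nce(pos_score, neg_scores, tau=0.1):
    """InfoNCE loss for one query: negative log softmax of the positive."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Same positive score, different negative difficulty (hypothetical values).
loss_easy = info_nce(0.9, [0.0, 0.0])   # negatives far from the query
loss_hard = info_nce(0.9, [0.8, 0.8])   # negatives nearly as close as the positive
loss_uniform = info_nce(0.5, [0.5, 0.5, 0.5])  # equals log(4): no separation at all
```

With easy negatives the loss is already near zero, so the model gets almost no gradient and the offline number looks good without the embedding learning fine-grained distinctions. Hard negatives keep the loss informative.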
Prompt engineering controls model behavior through task instructions, examples, constraints, schemas, and context. Good prompts specify role, objective, inputs, output format, edge cases, and refusal policy; brittle prompts rely on vague instructions like “be accurate” without verifiable criteria.
Retrieval-augmented generation combines search with generation: retrieve relevant documents or examples, insert them into the prompt, and ask the model to answer using that context. It is often preferable to fine-tuning when knowledge changes frequently, provenance matters, or hallucination risk is high.
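The retrieve-then-prompt flow can be sketched without calling any model. Everything here is hypothetical: the three toy documents, their hand-written vectors (standing in for real embeddings), and the prompt wording. The final generation call is deliberately left out.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def retrieve(query_vec, corpus, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(question, docs):
    """Insert retrieved documents into the prompt with ids for citation."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return (
        "Answer using only the context below and cite document ids. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy corpus with made-up vectors in place of real embeddings.
corpus = [
    {"id": "doc1", "text": "Refunds are processed within 5 days.", "vec": [0.9, 0.1, 0.0]},
    {"id": "doc2", "text": "Accounts can be recovered by email.", "vec": [0.1, 0.9, 0.0]},
    {"id": "doc3", "text": "Ads are reviewed before delivery.", "vec": [0.0, 0.1, 0.9]},
]
query_vec = [0.8, 0.2, 0.0]  # stands in for the embedded question
top_docs = retrieve(query_vec, corpus, k=2)
prompt = build_prompt("How long do refunds take?", top_docs)
```

Because the knowledge lives in the corpus rather than in the weights, updating an answer means editing a document and re-embedding it, and the document ids give reviewers a provenance trail.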
Fine-tuning updates model weights for task specialization, tone, format, or domain adaptation. Full fine-tuning is expensive in compute and storage and can degrade general capabilities (catastrophic forgetting); parameter-efficient approaches like LoRA and adapters train a small number of added parameters while the base weights stay frozen, reducing compute and enabling faster iteration.
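The savings from LoRA follow from simple counting: instead of updating a full weight matrix of shape d_out x d_in, it trains two low-rank factors of shapes d_out x r and r x d_in. The matrix size and rank below are assumed values for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full update vs. LoRA factors."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

# Assumed: one 4096 x 4096 projection matrix adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
fraction = lora / full
```

For this single matrix, full fine-tuning trains 16,777,216 parameters while LoRA trains 65,536, about 0.39% as many. The tradeoff is capacity: a low rank may underfit tasks that need large changes to model behavior.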
Instruction tuning, RLHF, and DPO align models to human preferences. For DS interviews, focus less on implementation details and more on data quality, reward misspecification, preference bias, policy drift, and evaluating whether alignment improves user-facing outcomes.
Evaluation needs both offline and online layers. Offline metrics may include exact match, F1, ROUGE, BLEU, embedding recall@k, toxicity rate, human preference win rate, and calibration; online evaluation needs guardrail metrics such as latency, reports, hides, retention, and reviewer escalations.
Hallucination is not just factual error; it includes unsupported claims, invalid citations, unsafe advice, and overconfident classification. Mitigations include constrained decoding, retrieval with citations, abstention thresholds, human-in-the-loop review, confidence calibration, and post-generation validators.
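One of the simplest post-generation validators checks that every citation in an answer points at a document that was actually retrieved. This is a minimal sketch: the bracketed-id citation format, the function name, and the return strings are assumptions, and passing this check does not prove the cited document supports the claim.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def check_citations(answer, retrieved_ids):
    """Reject answers with no citations or with citations to unseen sources."""
    cited = set(CITATION.findall(answer))
    unknown = sorted(cited - set(retrieved_ids))
    if not cited:
        return "reject: no citations"
    if unknown:
        return "reject: unknown sources " + ", ".join(unknown)
    return "accept"

retrieved = ["doc1", "doc2"]
ok = check_citations("Refunds take 5 days [doc1].", retrieved)
invented = check_citations("Refunds take 2 days [doc9].", retrieved)
unsupported = check_citations("Refunds are instant.", retrieved)
```

Rejected answers can be regenerated, routed to a human, or replaced with an abstention message. A stronger validator would also check that the cited passage entails the claim.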
Privacy and integrity constraints matter at Meta scale. Training or prompting with user data requires minimization, access control, retention limits, PII redaction, auditability, and careful leakage testing; memorization can expose rare strings, private messages, or policy-sensitive content.

Worked example

Design an LLM-based classifier for policy-violating content

A strong candidate would start by clarifying the scope: “Are we classifying text only, or multimodal posts with images, comments, and user history? Is the goal enforcement, review prioritization, or user-facing explanation?” They would also ask about the target policy area, label availability, acceptable false positive rate, and whether decisions must be real-time in Feed or can run asynchronously. The answer can then be organized around four pillars: data and labels, modeling approach, evaluation, and deployment/monitoring. For modeling, they might propose starting with a prompted LLM or embedding-based classifier as a baseline, then moving to fine-tuning if the policy taxonomy is stable and enough high-quality reviewer labels exist. For evaluation, they should separate offline metrics like precision, recall, F1, and area under the precision-recall curve from online metrics such as reviewer queue reduction, successful appeals, content takedown accuracy, and user reports. One explicit tradeoff is recall versus false positives: aggressive enforcement may reduce harmful exposure but incorrectly demote borderline speech, so thresholds may differ by severity and market. They should mention human-in-the-loop review for low-confidence or high-severity cases, plus calibration by language, region, and content type. A strong close would be: “If I had more time, I’d design an experiment comparing LLM-assisted review against the current classifier, with guardrails for appeal rate, latency, and fairness across languages.”
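The recall versus false-positive tradeoff can be made concrete by picking a threshold from labeled scores. In this sketch the scores, labels, and precision targets are hypothetical; the idea is that each severity tier gets its own precision floor, and the lowest threshold that clears it maximizes recall because recall can only fall as the threshold rises.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def lowest_threshold_meeting_precision(scores, labels, min_precision):
    """Lowest observed score whose precision meets the floor, or None."""
    for t in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, t)
        if precision >= min_precision:
            return t, precision, recall
    return None

# Hypothetical reviewer-labeled sample: 1 = violating, 0 = benign.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

high_severity = lowest_threshold_meeting_precision(scores, labels, 0.8)
low_severity = lowest_threshold_meeting_precision(scores, labels, 0.95)
```

On this sample the looser floor yields threshold 0.6 with recall 0.8, while the stricter floor yields threshold 0.9 with recall 0.4. Items scoring between the two thresholds are natural candidates for human review rather than automatic action.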

A second angle

Use embeddings to improve search or recommendations

The same foundations apply, but the framing shifts from generation or classification to candidate retrieval and ranking. Instead of asking “What should the model say?”, the core question is “Which items should we retrieve from a very large corpus, quickly and by semantic relevance?” A good answer would discuss embedding users, queries, posts, reels, or ads into a shared vector space, retrieving top candidates with an ANN index such as HNSW or IVF-PQ (for example, via FAISS), then re-ranking with a heavier model using engagement, freshness, integrity, and personalization features. The major constraint is scale: an offline embedding model with great semantic quality may be unusable if it cannot refresh vectors fast enough for new posts or handle billions of items within latency budgets. Evaluation should include recall@k, downstream ranking lift, diversity, freshness, and online impact on CTR, watch time, hides, and long-term retention.
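Recall@k comes in two flavors that are easy to confuse in an interview: recall against ground-truth relevant items, and recall of an approximate index against exact search. A minimal sketch of both, with illustrative function names and toy ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ann_recall_at_k(approx_ids, exact_ids, k):
    """Share of the exact top-k that the approximate index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Quality of the embedding model: did we retrieve what users engaged with?
model_recall = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=3)

# Quality of the index: how much did approximation lose versus brute force?
index_recall = ann_recall_at_k([1, 2, 4, 7], [1, 2, 3, 4], k=4)
```

Separating the two helps with diagnosis: low index recall points at ANN parameters, while low model recall with high index recall points at the embeddings or the training data.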

Common pitfalls

Pitfall: Treating fine-tuning as the default answer.

A tempting answer is “just fine-tune a model on Meta data.” That misses the key design decision: if the task needs fresh knowledge, auditability, or source grounding, retrieval-augmented generation or a classifier over embeddings may be safer, cheaper, and easier to update than changing model weights.

Pitfall: Reporting only generic model accuracy.

Saying “we’ll optimize accuracy” is weak because many LLM and integrity tasks are imbalanced, high-stakes, or threshold-dependent. A better answer names precision-recall tradeoffs, severity-weighted costs, calibration, human review capacity, and product guardrails such as latency p95, appeal rate, and user report rate.
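Severity-weighted costs can be turned into a threshold choice directly. In this sketch the scores, labels, and cost weights are hypothetical; a missed violation and a wrongful flag are each assigned a cost, and the threshold with the lowest total cost on a labeled sample is selected.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Weighted cost of misses and false flags at a given threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fn * cost_fn + fp * cost_fp

def best_threshold(scores, labels, cost_fn, cost_fp):
    """Observed score that minimizes total cost; ties go to the lowest."""
    candidates = sorted(set(scores))
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp))

# Hypothetical labeled sample: 1 = violating, 0 = benign.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

# High-severity policy: a miss costs 10x a false flag.
severe = best_threshold(scores, labels, cost_fn=10, cost_fp=1)
# Borderline-speech policy: a false flag costs 10x a miss.
borderline = best_threshold(scores, labels, cost_fn=1, cost_fp=10)
```

The same model and the same data give a threshold of 0.3 for the high-severity policy and 0.9 for the borderline one, which is the point: a single accuracy number hides the decision that actually matters.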

Pitfall: Overexplaining architecture while ignoring product constraints.

Some candidates spend five minutes describing attention heads and pretraining but never say how the system will be evaluated or deployed. Interviewers usually care more about whether you can make the model useful in a Meta product: data quality, failure modes, experimentation, monitoring, privacy, and operational cost.

Connections

Expect pivots into ranking systems, experimentation, causal inference, and responsible AI. If the interviewer pushes on evaluation, be ready to discuss A/B testing, counterfactual logging, inter-rater reliability, calibration, and bias by language or demographic group. If they push on systems, expect questions about ANN indexes, latency budgets, feature freshness, and model monitoring.