LLM Fundamentals For Data Scientists

What's being tested

Interviewers are testing whether you understand large language models well enough to reason about product tradeoffs, evaluation, and failure modes, not whether you can recite transformer trivia. For a Data Scientist at Meta, the key skill is translating model behavior into measurable user, integrity, and business outcomes across products like Feed, Reels, Ads, Messenger, and creator tools. Expect probes on how LLMs are trained, why they fail, how to evaluate them offline and online, and how to decide whether a model-powered feature is safe and worth shipping. The interviewer is looking for structured thinking: can you connect model concepts (tokens, embeddings, attention, fine-tuning, hallucination, latency) to metrics like CTR, DAU, retention, and violation rate?

Core knowledge

Tokenization converts text into subword units using methods like Byte Pair Encoding or SentencePiece. LLMs predict token sequences, not words; token boundaries affect cost, latency, multilingual quality, and bias. A 4,000-token context is often only ~2,500–3,000 English words.
Transformer architectures use self-attention to model relationships between all tokens in a context. The core operation is $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, which lets each token condition on other tokens but costs $O(n^2)$ time and memory in sequence length $n$.
Autoregressive language modeling trains models to predict the next token by minimizing $\mathcal{L}=-\sum_t \log P(x_t \mid x_{<t})$. Models like GPT and Llama generate left-to-right; lower loss usually improves fluency, but product quality may not track cross-entropy perfectly.
Perplexity is $\exp(\text{cross-entropy})$, with the cross-entropy averaged per token, and measures how surprised the model is by held-out text. It is useful for comparing base models on similar distributions with the same tokenizer, but weak for judging instruction-following, factuality, safety, or user satisfaction.
Embeddings map text, users, posts, or queries into dense vectors where similarity can be measured by cosine similarity: $\cos(a,b)=\frac{a \cdot b}{\|a\|\|b\|}$. For retrieval at Meta scale, approximate nearest neighbor systems like FAISS are preferred over exact search once vectors reach millions or billions.
Pretraining learns general language patterns from massive corpora; supervised fine-tuning adapts the model to instruction-response data; RLHF or preference optimization aligns outputs with human judgments. Fine-tuning can improve task fit but may reduce generality or introduce regressions.
Prompting controls behavior without changing model weights. Strong prompts specify task, audience, constraints, examples, and output format. However, prompts are brittle: small wording changes can alter outputs, and prompt-only solutions often fail under adversarial or long-tail inputs.
Retrieval-augmented generation combines search with generation: retrieve documents, inject them into context, then generate grounded answers. RAG improves freshness and factuality but introduces new failure modes: bad retrieval, context truncation, conflicting sources, and citations that appear plausible but are unsupported.
Hallucination occurs when a model produces fluent but false or unsupported content. It is not just a knowledge problem; it arises from next-token likelihood optimization, ambiguous prompts, missing context, and overconfident decoding. Mitigations include retrieval, refusal policies, calibration, and human review.
Decoding parameters shape generation. Temperature controls randomness; top-k and top-p sampling limit candidate tokens. For deterministic classification or extraction, use low temperature near 0; for creative generation, higher values increase diversity but also inconsistency and safety risk.
Evaluation should combine offline and online evidence. Offline metrics include human preference win rate, task accuracy, toxicity rate, factuality, latency, and cost per request. Online tests track product metrics like DAU, session time, CTR, hide/report rate, retention, and support escalations.
Safety and privacy matter especially for social platforms. LLMs can leak memorized data, amplify harmful content, generate policy-violating text, or behave differently across languages and demographic groups. Evaluate slices by locale, age group where allowed, content type, and risk category.
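
The three formulas stated in the list above (attention, perplexity, cosine similarity) can be sketched in plain Python. This is a minimal illustration on tiny lists rather than how a production model computes them; the function names are hypothetical, and perplexity here takes the mean negative log-probability per token, which is the usual convention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of lists.

    Q is n x d_k, K is m x d_k, V is m x d_v; the result is n x d_v.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A useful sanity check for interviews: a model that assigns probability 0.25 to every observed token has perplexity 4, as if it were choosing uniformly among four options at each step.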
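
The decoding parameters described in the list above can also be made concrete. The sketch below assumes a toy vocabulary of three tokens with made-up logits; the helper names are hypothetical and real inference stacks implement these steps on tensors, but the logic is the same: temperature rescales logits before the softmax, top-k keeps a fixed number of candidates, and top-p keeps the smallest set of candidates whose cumulative probability reaches p.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

# Made-up logits for a three-token vocabulary, for illustration only
logits = [2.0, 1.0, 0.0]
probs = apply_temperature(logits, 1.0)
token_id = sample(top_p_filter(probs, 0.9), random.Random(0))
```

With these logits, dropping the temperature from 1.0 to 0.1 moves almost all probability onto the first token, which is why near-zero temperature suits classification and extraction.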

Worked example

How would you evaluate whether an LLM-powered comment-summary feature improves user experience?

A strong candidate would first frame the product goal: “I’d clarify whether the summary is meant to save users time, increase meaningful engagement, reduce exposure to toxic comments, or help creators understand feedback.” They would also ask where the summary appears, whether users can expand comments, what languages are in scope, and whether the model is summarizing public comments or sensitive/private content.

The answer should be organized around four pillars: offline quality, safety/integrity, online experimentation, and operational constraints. Offline, they would propose human-rated summary faithfulness, coverage, toxicity, and readability, plus slice analysis by language, topic, comment volume, and creator type. Online, they would run an A/B test with guardrails: primary metrics might be comment-thread engagement, creator satisfaction, or time saved; guardrails might include hide/report rate, misinformation exposure, negative feedback, latency, and model cost.

A key tradeoff is that increasing summary compression may reduce cognitive load but also hide minority viewpoints or misrepresent sentiment, especially in polarized threads. They should explicitly flag that LLM quality cannot be judged only by engagement because a misleading summary could increase clicks while harming trust. A strong close would be: “If I had more time, I’d add longitudinal metrics, creator surveys, and post-launch monitoring for drift, especially around breaking news and high-risk content.”
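
To make the primary-versus-guardrail reading concrete, here is a minimal sketch of a two-proportion z-test applied to one primary metric and one guardrail. The counts are invented for illustration, the function name is hypothetical, and a real experiment would also need a pre-registered decision rule, variance handling for user-level clustering, and often a non-inferiority margin for guardrails.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (rate_b - rate_a, z statistic, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: control (a) vs. comment-summary treatment (b)
primary = two_proportion_ztest(5000, 100_000, 5300, 100_000)    # thread engagement
guardrail = two_proportion_ztest(200, 100_000, 260, 100_000)    # report rate

for name, (diff, z, p) in [("engagement", primary), ("report rate", guardrail)]:
    print(f"{name}: diff={diff:+.4f}, z={z:.2f}, p={p:.4f}")
```

In this invented scenario both rates rise, which is the pattern the answer above warns about: an engagement win that arrives together with a guardrail regression should not be read as a clean launch signal.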

A second angle

Why do transformers use attention instead of RNNs?

The same fundamentals apply, but the framing is more model-mechanics than product-evaluation. Here, the candidate should explain that recurrent neural networks process tokens sequentially, making long-range dependencies and parallel training harder, while transformers use self-attention to connect any token pair directly. The tradeoff is that transformers parallelize well on GPUs and scale effectively, but standard attention becomes expensive for long contexts because compute and memory grow roughly as $O(n^2)$ in sequence length $n$. A product-aware answer would connect this to real constraints: longer chat history or more retrieved documents may improve relevance, but can increase latency, cost, and truncation risk. The best candidates bridge mechanics to decisions: use smaller context windows, retrieval, summarization, caching, or long-context architectures depending on the feature’s quality-latency-cost frontier.
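
A quick back-of-the-envelope calculation shows what the quadratic term means for context length. This sketch counts only the entries of the attention score matrix and ignores everything that scales linearly (such as feed-forward layers) and any optimized attention kernels, so it is a rough guide, not a latency model; the function names are hypothetical.

```python
def attention_score_entries(seq_len, n_heads=1):
    """Entries in the seq_len x seq_len score matrix, per layer."""
    return n_heads * seq_len * seq_len

def relative_attention_cost(new_len, base_len):
    """How many times larger the score matrix is at new_len than at base_len."""
    return attention_score_entries(new_len) / attention_score_entries(base_len)

# Going from a 2,000-token to an 8,000-token context is 4x the tokens
# but 16x the attention score entries.
ratio = relative_attention_cost(8000, 2000)
```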

Common pitfalls

Pitfall: Treating LLM evaluation as “just measure accuracy.”

Many LLM tasks do not have a single correct answer, so exact-match accuracy can be misleading. A better answer separates task types: classification may use precision/recall/F1; summarization needs faithfulness and coverage; chat assistants need preference ratings, safety rates, latency, and online user outcomes.
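
For the classification case, a small sketch shows why accuracy alone can mislead on imbalanced labels such as policy violations. The labels below are made up, and the helper name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 violating items out of 10, the classifier catches only 1
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here accuracy is 0.8 even though the classifier misses two of the three violating items; recall of one third and F1 of 0.5 expose the gap.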

Pitfall: Giving a model-centric answer when the role is Data Scientist.

A tempting answer is to spend five minutes on transformer layers, feed-forward blocks, and positional encodings without connecting to Meta product decisions. The stronger response explains enough mechanics to justify measurement choices, then moves to experiment design, metrics, slices, guardrails, and launch criteria.

Pitfall: Ignoring failure modes in edge populations and long-tail content.

Saying “we can fine-tune on more data” is too shallow. Interviewers expect discussion of hallucination, toxicity, multilingual degradation, prompt injection, privacy leakage, distribution shift, and subgroup performance, especially because social platforms operate across many languages, cultures, and content domains.

Connections

Interviewers may pivot from LLM fundamentals into recommendation systems, especially embeddings, retrieval, ranking, and personalization. They may also pivot into A/B testing, causal inference, or integrity measurement, asking how to prove that an AI feature improves user value without increasing harm. For deeper technical follow-ups, expect questions on RAG, fine-tuning strategy, human-label quality, calibration, or cost-latency tradeoffs.