Probability, Bayes, And Base Rates

What's being tested

Interviewers are testing whether you can reason under uncertainty without ignoring prevalence, selection effects, or asymmetric costs. For a Meta Data Scientist, this shows up in integrity classifiers, ads quality, notification targeting, ranking experiments, fraud detection, and measurement systems where rare events can dominate interpretation. The interviewer is usually not looking for memorized Bayes’ theorem alone; they are probing whether you can translate a product scenario into conditional probabilities, choose the right denominator, and explain the business implication clearly. Strong answers combine math, intuition, and decision-making: “What is the probability this user/content/ad is truly problematic given the signal we observed, and what should we do next?”

Core knowledge

Bayes’ theorem is the central identity:
$P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}$
In product terms: posterior = likelihood × prior / evidence. The prior, often the base rate, is usually the part candidates forget.
For binary classification, always distinguish $P(\text{flag} \mid \text{bad})$ from $P(\text{bad} \mid \text{flag})$. The first is recall/sensitivity; the second is precision/positive predictive value. They can differ dramatically when the bad-event rate is low.
The denominator in Bayes is usually expanded with the law of total probability:
$P(B)=P(B \mid A)P(A)+P(B \mid A^c)P(A^c)$
This is essential when computing posterior probabilities for tests, classifiers, moderation systems, or fraud alerts.
In rare-event settings, even highly accurate classifiers can produce many false positives. If prevalence is $0.1\%$, sensitivity is $99\%$, and false-positive rate is $1\%$, then only about $9\%$ of flagged examples are truly positive: the negative class is so large that $1\%$ of it outnumbers the true positives roughly ten to one.
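
The rare-event arithmetic above can be checked with a minimal sketch. The function name `posterior_given_flag` and its parameter names are illustrative, not taken from any library; it simply applies Bayes' theorem with the denominator expanded by the law of total probability.

```python
def posterior_given_flag(prevalence: float, sensitivity: float,
                         false_positive_rate: float) -> float:
    """Return P(bad | flag) from the base rate and the two error rates."""
    # Numerator: P(flag | bad) * P(bad)
    true_flag_mass = sensitivity * prevalence
    # Second term of P(flag): P(flag | not bad) * P(not bad)
    false_flag_mass = false_positive_rate * (1.0 - prevalence)
    evidence = true_flag_mass + false_flag_mass
    if evidence == 0.0:
        raise ValueError("P(flag) is zero, so the posterior is undefined")
    return true_flag_mass / evidence


# Numbers from the text: 0.1% prevalence, 99% sensitivity, 1% false-positive rate.
precision = posterior_given_flag(prevalence=0.001, sensitivity=0.99,
                                 false_positive_rate=0.01)
print(f"P(bad | flag) = {precision:.3f}")
```

With these inputs the posterior comes out near 0.09, so roughly nine in ten flags are false positives.
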
Confusion-matrix metrics answer different product questions. Precision asks “when we take action, how often are we right?” Recall asks “how much of the bad population do we catch?” Specificity asks “how well do we avoid harming good users?” Accuracy is often misleading under class imbalance.
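
As a rough illustration of why accuracy misleads under imbalance, the sketch below compares two hypothetical classifiers on 100,000 posts with 100 true violations. The counts are assumptions chosen to match a 99% sensitivity, 1% false-positive-rate classifier; `ConfusionCounts` is an illustrative helper, not a library class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # bad and flagged
    fp: int  # good but flagged
    fn: int  # bad but missed
    tn: int  # good and not flagged

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else float("nan")

    def recall(self) -> float:
        bad = self.tp + self.fn
        return self.tp / bad if bad else float("nan")

    def specificity(self) -> float:
        good = self.tn + self.fp
        return self.tn / good if good else float("nan")

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else float("nan")


# Hypothetical classifier: 99% sensitivity, 1% false-positive rate.
classifier = ConfusionCounts(tp=99, fp=999, fn=1, tn=98_901)
# Degenerate baseline that never flags anything.
never_flag = ConfusionCounts(tp=0, fp=0, fn=100, tn=99_900)

for name, counts in [("classifier", classifier), ("never_flag", never_flag)]:
    print(f"{name}: accuracy={counts.accuracy():.3f} "
          f"precision={counts.precision():.3f} "
          f"recall={counts.recall():.3f} "
          f"specificity={counts.specificity():.3f}")
```

The never-flag baseline scores higher on accuracy than the real classifier while catching nothing, which is why precision and recall are the more useful product metrics here.
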
Likelihood ratios are a compact way to update odds. Posterior odds = prior odds × likelihood ratio, where
$LR^+ = \frac{P(+ \mid A)}{P(+ \mid A^c)}$
This is useful when combining multiple signals: likelihood ratios multiply only when the signals are conditionally independent given the true class, which is a strong assumption.
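
The odds form can be sketched as follows. The helper names are illustrative, and the two-signal case assumes the signals are conditionally independent given the true class, which real detectors rarely satisfy exactly.

```python
from typing import Iterable


def prob_to_odds(p: float) -> float:
    return p / (1.0 - p)


def odds_to_prob(odds: float) -> float:
    return odds / (1.0 + odds)


def update_with_likelihood_ratios(prior: float,
                                  likelihood_ratios: Iterable[float]) -> float:
    """Posterior probability after multiplying prior odds by each LR.

    Multiplying the ratios is only valid if the signals are conditionally
    independent given the true class (an assumption, not a given).
    """
    odds = prob_to_odds(prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds_to_prob(odds)


# LR+ for a signal with 99% sensitivity and a 1% false-positive rate.
lr_plus = 0.99 / 0.01

one_signal = update_with_likelihood_ratios(0.001, [lr_plus])
two_signals = update_with_likelihood_ratios(0.001, [lr_plus, lr_plus])
print(f"one signal:  {one_signal:.3f}")
print(f"two signals: {two_signals:.3f}")
```

One positive signal moves a 0.1% prior to roughly 9%; a second, genuinely independent signal of the same strength moves it to roughly 91%.
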
Base rates may vary by segment: geography, language, device, user tenure, advertiser type, content surface, or acquisition channel. A global posterior can be wrong for a specific cohort; pooled rates can even reverse segment-level patterns (Simpson’s paradox), and a shifting segment mix can show up as calibration drift.
Calibration matters when model scores are interpreted probabilistically. If a model assigns score 0.8, about 80% of such cases should be positive. Common calibration methods include Platt scaling and isotonic regression; reliability diagrams and expected calibration error (ECE) are diagnostics for checking how well calibrated the scores are.
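
A minimal sketch of expected calibration error, assuming equal-width score bins and binary labels. This is one common way to define the metric, not the only one; the function name and the toy inputs are illustrative.

```python
from typing import Sequence


def expected_calibration_error(scores: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean score - observed positive rate| per bin."""
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 goes into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_score - positive_rate)
    return ece


# Ten cases scored 0.8: calibrated if 8 are positive, overconfident if only 5 are.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.8] * 10, [1] * 5 + [0] * 5)
print(f"calibrated ECE:    {calibrated:.3f}")
print(f"overconfident ECE: {overconfident:.3f}")
```
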
Sampling design affects observed base rates. A labeled moderation dataset enriched for suspicious content will not reflect production prevalence. To estimate real-world precision or prevalence, account for sampling weights, random audits, or inverse propensity weighting.
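
To make the sampling point concrete, the sketch below reweights a labeled set that oversamples suspicious content. All strata sizes and label counts are hypothetical, and the estimator is a simple inverse-sampling-probability weighting over two strata.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stratum:
    name: str
    population_size: int  # items in production traffic
    sample_size: int      # items sent for labeling
    positives: int        # labeled violating within the sample


def naive_prevalence(strata: Sequence[Stratum]) -> float:
    """Positive rate in the labeled set, ignoring how it was sampled."""
    return sum(s.positives for s in strata) / sum(s.sample_size for s in strata)


def weighted_prevalence(strata: Sequence[Stratum]) -> float:
    """Each labeled item counts for population_size / sample_size items."""
    weighted_positives = 0.0
    weighted_total = 0.0
    for s in strata:
        weight = s.population_size / s.sample_size
        weighted_positives += weight * s.positives
        weighted_total += weight * s.sample_size
    return weighted_positives / weighted_total


# Hypothetical design: the high-score stratum is heavily oversampled.
strata = [
    Stratum("high_score", population_size=10_000, sample_size=1_000, positives=300),
    Stratum("low_score", population_size=990_000, sample_size=1_000, positives=2),
]

naive = naive_prevalence(strata)
weighted = weighted_prevalence(strata)
print(f"naive prevalence:    {naive:.4f}")
print(f"weighted prevalence: {weighted:.4f}")
```

In this made-up design the labeled set suggests about 15% prevalence, while the reweighted estimate for production is about 0.5%.
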
Threshold choice is a decision problem, not only a modeling problem. If false positives are costly, optimize for high precision or low false-positive rate; if misses are costly, optimize recall. Expected cost can be written as $C_{FP} \cdot FP + C_{FN} \cdot FN$, where $FP$ and $FN$ are the error counts at a given threshold and $C_{FP}$, $C_{FN}$ are the per-error costs.
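
One way to sketch cost-based threshold selection: count false positives and false negatives at each candidate threshold on a labeled validation set, then pick the threshold with the lowest total cost. The scores, labels, candidate thresholds, and cost values below are toy assumptions.

```python
from typing import Sequence


def total_cost(scores: Sequence[float], labels: Sequence[int],
               threshold: float, cost_fp: float, cost_fn: float) -> float:
    """C_FP * FP + C_FN * FN when flagging every score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   candidates: Sequence[float],
                   cost_fp: float, cost_fn: float) -> float:
    """Candidate threshold with the lowest total cost on this sample."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Toy validation set (hypothetical scores and ground-truth labels).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Wrongly actioning a good user is 10x worse than a miss, and vice versa.
fp_costly = best_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1)
fn_costly = best_threshold(scores, labels, candidates, cost_fp=1, cost_fn=10)
print(f"threshold when false positives are costly: {fp_costly}")
print(f"threshold when misses are costly:          {fn_costly}")
```

The same model ends up with a strict threshold under one cost structure and a lenient one under the other, which is the sense in which threshold choice is a decision problem.
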
In sequential or multi-signal systems, posteriors should not double-count correlated evidence. For example, “reported by many users” and “high comment toxicity” may both reflect the same underlying controversy. Naive Bayes can work surprisingly well but fails when dependencies are strong.
Use natural frequencies to communicate clearly. Instead of saying “posterior probability is 9%,” say: “Out of 100,000 posts, 100 are truly violating; the classifier catches 99, but also flags 999 clean posts, so only about 9% of flags are true violations.”
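
The translation into counts can be automated with a small helper. The sketch below assumes a hypothetical population of 1 million posts with 0.1% prevalence, 99% sensitivity, and a 1% false-positive rate; the function name and returned keys are illustrative.

```python
def natural_frequencies(population: int, prevalence: float,
                        sensitivity: float, false_positive_rate: float) -> dict:
    """Convert rates into whole-number counts for a population of a given size."""
    bad = round(population * prevalence)
    clean = population - bad
    caught = round(bad * sensitivity)
    false_flags = round(clean * false_positive_rate)
    flagged = caught + false_flags
    return {
        "population": population,
        "bad": bad,
        "caught": caught,
        "false_flags": false_flags,
        "share_true": caught / flagged if flagged else float("nan"),
    }


counts = natural_frequencies(population=1_000_000, prevalence=0.001,
                             sensitivity=0.99, false_positive_rate=0.01)
print(
    f"Out of {counts['population']:,} posts, {counts['bad']:,} truly violate; "
    f"the classifier catches {counts['caught']:,} but also flags "
    f"{counts['false_flags']:,} clean posts, so about "
    f"{counts['share_true']:.0%} of flags are true violations."
)
```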

Worked example

“99% Accurate Harmful Content Classifier” is a classic framing: a classifier flags posts as violating policy, the model is described as “99% accurate,” and you are asked how likely a flagged post is to truly violate policy. In the first 30 seconds, a strong candidate would clarify what “accurate” means: overall accuracy, sensitivity (true-positive rate), specificity (true-negative rate), or 99% on both sensitivity and specificity. They would also ask for or assume the base rate of violating content, because without the prevalence $P(\text{violation})$, the posterior $P(\text{violation} \mid \text{flag})$ cannot be determined. The answer skeleton should have four pillars: define events, write Bayes’ theorem, plug in prevalence plus true/false positive rates, then interpret the result for product action. A clean setup is: $V=$ truly violating, $F=$ flagged, so $P(V \mid F)=\frac{P(F \mid V)P(V)}{P(F \mid V)P(V)+P(F \mid \neg V)P(\neg V)}$. The candidate should explicitly flag that rare prevalence can make precision low even when recall and specificity look excellent. A key tradeoff is whether to auto-remove flagged content or send it to human review: low precision may be unacceptable for enforcement, but high recall may still be useful for ranking down or prioritizing review queues. They should close by saying that if they had more time, they would segment by surface/language, validate calibration on production traffic, and evaluate costs of false positives versus false negatives rather than optimizing generic accuracy.

A second angle

“Two Tests for the Same Event” uses the same principle but shifts the emphasis from a single posterior to evidence combination. Suppose a user is flagged by both a behavioral anomaly detector and a graph-based fraud model; the tempting answer is to multiply their likelihood ratios as if the signals were independent. The candidate should instead ask whether the detectors share features, training labels, or upstream reporting mechanisms, because correlated signals provide less incremental evidence than independent ones. The same Bayes/odds framework applies, but the constraint is dependence: multiplying per-signal likelihood ratios is valid only if the signals are conditionally independent given the true class. A strong answer would recommend validating combined precision empirically on a randomly sampled audit set rather than relying only on theoretical multiplication.
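
The cost of ignoring dependence can be illustrated with a deliberately simple toy model. The assumption here (not something from a real system) is that detector B copies detector A's output with probability `copy_prob` and otherwise fires independently at the same rate, so each detector's marginal sensitivity and false-positive rate stay fixed while the overlap varies.

```python
def joint_flag_rate(rate: float, copy_prob: float) -> float:
    """P(both detectors fire | class) under the toy copy model."""
    # With probability copy_prob B repeats A (both fire with probability `rate`);
    # otherwise B fires independently (both fire with probability rate ** 2).
    return copy_prob * rate + (1.0 - copy_prob) * rate * rate


def posterior_both_flag(prior: float, sensitivity: float,
                        false_positive_rate: float, copy_prob: float) -> float:
    """P(bad | both detectors flag) under the toy copy model."""
    both_given_bad = joint_flag_rate(sensitivity, copy_prob)
    both_given_clean = joint_flag_rate(false_positive_rate, copy_prob)
    numerator = both_given_bad * prior
    return numerator / (numerator + both_given_clean * (1.0 - prior))


# Hypothetical detectors: 90% sensitivity, 10% false-positive rate, 1% prior.
independent = posterior_both_flag(0.01, 0.9, 0.1, copy_prob=0.0)
half_shared = posterior_both_flag(0.01, 0.9, 0.1, copy_prob=0.5)
duplicated = posterior_both_flag(0.01, 0.9, 0.1, copy_prob=1.0)
print(f"independent signals: {independent:.3f}")
print(f"half shared:         {half_shared:.3f}")
print(f"fully duplicated:    {duplicated:.3f}")
```

Under independence the double flag lifts the posterior to 45%; when the second detector is a pure copy it adds nothing beyond the first, and the posterior stays near 8%. Real dependence structures are messier, which is why an empirical audit sample is the safer check.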

Common pitfalls

Analytical mistake: confusing inverse conditional probabilities. A common wrong answer is treating $P(\text{flag} \mid \text{violation})=99\%$ as if it means $P(\text{violation} \mid \text{flag})=99\%$. What lands better is immediately naming the two quantities: “99% recall does not imply 99% precision; I need the base rate and false-positive rate.”

Communication mistake: staying in abstract algebra too long. Interviewers may follow the math, but product stakeholders often will not. After writing Bayes’ theorem, translate into counts: “Imagine 1 million posts; if 0.1% violate, that is 1,000 true violations…” Natural frequencies make the base-rate effect obvious and show product intuition.

Depth mistake: assuming the base rate is fixed and global. A candidate might compute one posterior and stop. A stronger answer says the posterior could differ materially by market, policy type, traffic source, or time since launch, and that production monitoring should track calibration and precision by segment to catch drift or fairness issues.
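
A short sketch of how far a single global posterior can sit from the segment-level ones. The segment names, traffic shares, and prevalences are hypothetical, and the classifier is assumed to have the same 99% sensitivity and 1% false-positive rate in every segment, which is itself something to verify in practice.

```python
def precision_from_rates(prevalence: float, sensitivity: float,
                         false_positive_rate: float) -> float:
    """P(bad | flag) via Bayes' theorem."""
    true_mass = sensitivity * prevalence
    false_mass = false_positive_rate * (1.0 - prevalence)
    return true_mass / (true_mass + false_mass)


# Hypothetical segments: (share of traffic, prevalence of violations).
segments = {
    "market_a": (0.6, 0.0005),
    "market_b": (0.3, 0.001),
    "market_c": (0.1, 0.02),
}
SENSITIVITY = 0.99
FALSE_POSITIVE_RATE = 0.01

by_segment = {
    name: precision_from_rates(prevalence, SENSITIVITY, FALSE_POSITIVE_RATE)
    for name, (_, prevalence) in segments.items()
}

# The pooled base rate is the traffic-weighted average of segment prevalences.
pooled_prevalence = sum(share * prev for share, prev in segments.values())
pooled = precision_from_rates(pooled_prevalence, SENSITIVITY, FALSE_POSITIVE_RATE)

for name, value in by_segment.items():
    print(f"{name}: P(bad | flag) = {value:.3f}")
print(f"pooled:   P(bad | flag) = {pooled:.3f}")
```

In this made-up mix the pooled precision is about 21%, yet it ranges from under 5% in the lowest-prevalence segment to about 67% in the highest, so a single enforcement policy tuned on the pooled number would fit none of them well.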

Connections

Interviewers often pivot from this topic into classification metrics, threshold tuning, calibration, and experiment interpretation under low-base-rate outcomes. They may also connect it to causal inference when asking whether an observed posterior or segment difference reflects true risk, selection bias, or a changed sampling process.