Fake-Account Screening with Threshold on 5 Signals
You are designing a rule-based screener that flags an account if at least k of 5 binary signals fire. Signals behave differently for fake vs. authentic accounts:
- For fake accounts, each signal fires with probability p_f = 0.8.
- For authentic accounts, each signal fires with probability p_a = 0.05.
- Signals are independent within an account unless otherwise stated.
Let S be the number of signals that fire for an account. Under independence, S ~ Binomial(n = 5, p), with p = p_f for fake accounts and p = p_a for authentic accounts. An account is flagged if S ≥ k.
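The flagging rule above amounts to an upper-tail binomial probability, P(S ≥ k) = Σ_{s=k..5} C(5, s) p^s (1 − p)^(5−s). A minimal Python sketch of that tail is shown here, assuming independent signals; the function and constant names (flag_probability, P_FAKE_SIGNAL, P_AUTH_SIGNAL) are illustrative and not part of the problem statement.

```python
from math import comb

N_SIGNALS = 5
P_FAKE_SIGNAL = 0.8   # p_f from the problem statement
P_AUTH_SIGNAL = 0.05  # p_a from the problem statement


def flag_probability(k, p, n=N_SIGNALS):
    """Return P(S >= k) for S ~ Binomial(n, p), assuming independent signals."""
    return sum(comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(k, n + 1))


if __name__ == "__main__":
    # Tabulate the flag rate per class for every usable threshold k.
    for k in range(1, N_SIGNALS + 1):
        fake = flag_probability(k, P_FAKE_SIGNAL)
        auth = flag_probability(k, P_AUTH_SIGNAL)
        print(f"k={k}  P(flagged|fake)={fake:.5f}  P(flagged|authentic)={auth:.7f}")
```

The same helper can be reused for every threshold in the tasks, since only k and p change.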
Tasks
(a) For k = 2, compute P(flagged | fake) and P(flagged | authentic) using the binomial distribution.
(b) Assume a base rate P(fake) = 1.5%. Compute P(fake | flagged) via Bayes' theorem, and the expected number of flagged accounts in a day with 5,000,000 accounts scanned. Is a manual review queue of 80,000 per day sufficient?
(c) Find the smallest k such that expected flagged volume fits within 80,000 ± 5% (i.e., 76,000–84,000 per day) while maximizing P(fake | flagged). Show work and justify the precision–recall trade-off.
(d) If signals are not independent and have equal within-class pairwise correlation ρ = 0.2, explain how your answers and assumptions change. Provide a reasonable way to model this dependence and illustrate its impact on volume and precision.
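One possible way to explore tasks (b) through (d) numerically is sketched below. It is an assumption-laden illustration, not a reference solution: for the dependent case it assumes a beta-binomial model, in which each account draws its own firing probability from a Beta(a, b) distribution with mean p and a + b = (1 − ρ)/ρ, so that any two signals within a class have pairwise correlation ρ = 1/(a + b + 1). Other dependence models (for example a Gaussian copula) would give different numbers. All function and constant names are hypothetical.

```python
from math import comb, exp, lgamma

N_SIGNALS = 5
P_FAKE_SIGNAL = 0.8
P_AUTH_SIGNAL = 0.05
BASE_RATE = 0.015            # P(fake)
DAILY_ACCOUNTS = 5_000_000


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def tail(k, p, rho, n=N_SIGNALS):
    """P(S >= k): binomial if rho == 0, else beta-binomial (assumed model)."""
    if rho == 0:
        def pmf(s):
            return comb(n, s) * p**s * (1 - p) ** (n - s)
    else:
        total = (1 - rho) / rho            # a + b, from rho = 1 / (a + b + 1)
        a, b = p * total, (1 - p) * total  # keeps the marginal firing rate at p

        def pmf(s):
            return comb(n, s) * exp(log_beta(s + a, n - s + b) - log_beta(a, b))
    return sum(pmf(s) for s in range(k, n + 1))


def screen(k, rho):
    """Recall, precision, and expected daily flagged volume for threshold k."""
    recall = tail(k, P_FAKE_SIGNAL, rho)
    false_positive_rate = tail(k, P_AUTH_SIGNAL, rho)
    p_flag = BASE_RATE * recall + (1 - BASE_RATE) * false_positive_rate
    return {
        "recall": recall,
        "precision": BASE_RATE * recall / p_flag,  # Bayes' theorem
        "volume": DAILY_ACCOUNTS * p_flag,
    }


if __name__ == "__main__":
    for rho in (0.0, 0.2):
        print(f"rho = {rho}")
        for k in range(1, N_SIGNALS + 1):
            r = screen(k, rho)
            print(f"  k={k}  recall={r['recall']:.4f}  "
                  f"precision={r['precision']:.4f}  volume={r['volume']:,.0f}")
```

Comparing the two printed tables row by row shows how much the review-queue volume and precision at a given k depend on the independence assumption; the beta-binomial choice keeps each signal's marginal firing rate unchanged, so any difference comes from the correlation alone.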