Sales Outreach Correlation Analysis: Inference, Multiple Testing, Power, and Simpson’s Paradox
Context
You are analyzing sales data to understand how outreach actions relate to deal outcomes. In the tasks below, compute inferential statistics for a correlation, control the false discovery rate across multiple tests, estimate the minimal detectable effect size for a given sample size, and explain a Simpson’s paradox scenario using equations.
Tasks
(a) For n = 3200 deals, the Pearson correlation between call_count in the first 14 days and is_won is r = 0.23. Using Fisher's z-transform, compute the 95% confidence interval for r and the two-sided p-value. Show intermediate Fisher z, standard error (SE), z-interval, and back-transform steps.
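The steps requested in (a) can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming the usual large-sample approximation in which atanh(r) is roughly normal with SE = 1/sqrt(n − 3); the function name fisher_ci and the dictionary keys are hypothetical, not part of the task.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf=0.95):
    """Fisher z confidence interval and two-sided p-value for a Pearson r (hypothetical helper)."""
    z = math.atanh(r)                      # Fisher z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    z_lo, z_hi = z - z_crit * se, z + z_crit * se
    # Two-sided p-value for H0: rho = 0, via the complementary error function
    # (numerically stable for very large test statistics).
    p_value = math.erfc(abs(z) / se / math.sqrt(2))
    return {
        "z": z, "se": se, "z_lo": z_lo, "z_hi": z_hi,
        "r_lo": math.tanh(z_lo), "r_hi": math.tanh(z_hi),  # back-transform to r scale
        "p_value": p_value,
    }

res = fisher_ci(0.23, 3200)
for name, value in res.items():
    print(f"{name}: {value:.6g}")
```

With r = 0.23 and n = 3200 this gives z of about 0.2342, SE of about 0.0177, and a 95% interval for r of roughly (0.197, 0.263); the p-value is vanishingly small (well below 1e-30). Because is_won is binary, r here is a point-biserial correlation, so the normal approximation should be read as approximate.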
(b) You tested m = 24 correlations (different channels/time windows). Sorted p-values are:
[0.0004, 0.0010, 0.0040, 0.0090, 0.0120, 0.0190, 0.0260, 0.0310, 0.0410, 0.0530, 0.0610, 0.0740, 0.0810, 0.0940, 0.1100, 0.1300, 0.1700, 0.2100, 0.2700, 0.3400, 0.4100, 0.5500, 0.6800, 0.7900].
Apply the Benjamini–Hochberg procedure at q = 0.10, show the thresholds i × (q/m) for each rank i, and state which hypotheses you reject.
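The Benjamini–Hochberg step-up rule for (b) can be sketched as follows: find the largest rank k with p_(k) ≤ k × (q/m), then reject the hypotheses ranked 1 through k. The function name benjamini_hochberg is a hypothetical helper written for this illustration, not a library call.

```python
def benjamini_hochberg(p_values, q):
    """Return (k, thresholds): k = number of rejections under the BH step-up rule."""
    p_sorted = sorted(p_values)
    m = len(p_sorted)
    thresholds = [i * q / m for i in range(1, m + 1)]
    k = 0
    for i, (p, t) in enumerate(zip(p_sorted, thresholds), start=1):
        if p <= t:
            k = i  # keep the largest rank that passes, even if earlier ranks failed
    return k, thresholds

p_vals = [0.0004, 0.0010, 0.0040, 0.0090, 0.0120, 0.0190, 0.0260, 0.0310,
          0.0410, 0.0530, 0.0610, 0.0740, 0.0810, 0.0940, 0.1100, 0.1300,
          0.1700, 0.2100, 0.2700, 0.3400, 0.4100, 0.5500, 0.6800, 0.7900]

k, thresholds = benjamini_hochberg(p_vals, q=0.10)
for i, (p, t) in enumerate(zip(sorted(p_vals), thresholds), start=1):
    print(f"i={i:2d}  p={p:.4f}  threshold={t:.5f}  {'reject' if i <= k else 'keep'}")
```

For these p-values the largest passing rank is k = 8 (0.0310 ≤ 8 × 0.10/24 ≈ 0.0333, while 0.0410 > 9 × 0.10/24 = 0.0375), so the eight smallest p-values are rejected.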
(c) What is the minimal detectable correlation (two-sided, α = 0.05, power = 0.80) for n = 500 using the Fisher z power approximation? Provide the formula and numeric result.
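For (c), the Fisher z power approximation sets the detectable effect on the z scale to (z_{1−α/2} + z_{power}) / sqrt(n − 3) and back-transforms with tanh. The sketch below assumes that approximation and ignores the negligible probability of rejecting in the wrong tail; min_detectable_r is a hypothetical name.

```python
import math
from statistics import NormalDist

def min_detectable_r(n, alpha=0.05, power=0.80):
    """Minimal detectable correlation under the Fisher z approximation (hypothetical helper)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # two-sided critical value, about 1.96
    z_power = nd.inv_cdf(power)           # about 0.8416 for 80% power
    z_effect = (z_alpha + z_power) / math.sqrt(n - 3)
    return math.tanh(z_effect)            # back-transform from z scale to r scale

mdr = min_detectable_r(500)
print(f"minimal detectable r at n=500: {mdr:.4f}")
```

For n = 500 this yields a minimal detectable correlation of about 0.125.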
(d) You observe overall corr(discount_rate, is_won) = −0.10, but within each region {East, West, Central} the correlations are {+0.05, +0.04, +0.03}. Explain, with equations, how region-mix imbalance can yield this Simpson’s paradox and how to diagnose it numerically (e.g., weighted covariance decomposition and partial correlation controlling for region).
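One way to diagnose (d) numerically is the law of total covariance, Cov(X, Y) = Σ_g w_g Cov_g(X, Y) + Σ_g w_g (x̄_g − x̄)(ȳ_g − ȳ), where w_g is the share of deals in region g. If regions with higher average discounts also have lower average win rates, the second (between-region) term is negative and can outweigh a positive first (within-region) term. The sketch below checks the identity on synthetic data; the region means, slope, and sample sizes are invented for illustration and are not calibrated to the correlations quoted in the task. The partial correlation controlling for region is computed as the correlation of region-demeaned values, which is equivalent to residualizing on region dummies.

```python
import math
import random
from statistics import fmean

def cov(xs, ys):
    """Population covariance (divide by n) so the decomposition holds exactly."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def corr(xs, ys):
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

def decompose(xs, ys, groups):
    """Split the overall covariance into within-group and between-group parts."""
    n, mx, my = len(xs), fmean(xs), fmean(ys)
    within = between = 0.0
    for g in sorted(set(groups)):
        gx = [x for x, k in zip(xs, groups) if k == g]
        gy = [y for y, k in zip(ys, groups) if k == g]
        w = len(gx) / n
        within += w * cov(gx, gy)
        between += w * (fmean(gx) - mx) * (fmean(gy) - my)
    return within, between

def partial_corr_by_group(xs, ys, groups):
    """Correlation after removing each group's mean from x and y."""
    means = {g: (fmean([x for x, k in zip(xs, groups) if k == g]),
                 fmean([y for y, k in zip(ys, groups) if k == g]))
             for g in set(groups)}
    rx = [x - means[g][0] for x, g in zip(xs, groups)]
    ry = [y - means[g][1] for y, g in zip(ys, groups)]
    return corr(rx, ry)

# Synthetic data (assumed values): (mean discount, baseline win rate) per region.
random.seed(0)
params = {"East": (0.05, 0.60), "Central": (0.15, 0.40), "West": (0.25, 0.20)}
discount, won, region = [], [], []
for g, (mu, base) in params.items():
    for _ in range(2000):
        d = random.gauss(mu, 0.03)
        p = min(1.0, max(0.0, base + 1.0 * (d - mu)))  # mild positive within-region effect
        discount.append(d)
        won.append(1.0 if random.random() < p else 0.0)
        region.append(g)

total = cov(discount, won)
within, between = decompose(discount, won, region)
overall_r = corr(discount, won)
partial_r = partial_corr_by_group(discount, won, region)
print(f"total={total:.5f}  within={within:.5f}  between={between:.5f}")
print(f"overall r={overall_r:.3f}  partial r (region removed)={partial_r:.3f}")
```

On real data, a negative between-region term that dominates the within-region term, together with a partial correlation whose sign differs from the overall correlation, points to region mix as the source of the reversal.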