This question evaluates a data scientist's experimental-design and statistical-analysis competencies, including choosing the unit of analysis and randomization scheme, defining primary and guardrail metrics, conducting power/sample-size calculations, and specifying stopping and multiple-comparisons procedures.
You need to test whether replacing the ~10% of verbal questions classified as "other" with additional grammar questions improves productivity in a 15-minute section of 19 questions total (i.e., ~2 questions replaced). Assume the item bank for both question types is pre-calibrated so that items are of comparable difficulty within each type.
Specify the following:
(a) Unit of analysis, randomization scheme, and whether you use parallel arms (between-subjects) or a crossover (within-subjects) with counterbalancing to control learning effects.
(b) Primary metric (e.g., correct answers per minute) and at least two guardrail metrics (e.g., fairness across native/non-native speakers, speed–accuracy trade-off).
(c) Inclusion/exclusion criteria, pre-registration, blinding, and quality controls (e.g., proctoring, minimum reading times).
(d) A sample-size calculation for detecting a +0.4 correct/attempt improvement with SD = 2.8 at α = 0.05 and 80% power using a two-sample t-test—show the formula and the final per-arm n.
(e) Your stopping rule, multiple-comparisons plan (e.g., Holm–Bonferroni across subtypes), and how you will interpret effects if time-on-question changes but accuracy does not.
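For part (d), a minimal sketch of the standard normal-approximation sample-size formula, n per arm = 2 * ((z_{1-α/2} + z_{1-β}) * σ / δ)^2, implemented with only the Python standard library (the function name `n_per_arm` is an illustrative choice, not part of the question):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sample t-test (normal approximation).

    delta : minimum detectable difference in means
    sd    : common standard deviation
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.960 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.842 for 80% power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return ceil(n)  # round up to the next whole participant

# For the values in part (d): delta = 0.4, sd = 2.8
print(n_per_arm(0.4, 2.8))  # → 770 per arm
```

With δ = 0.4 and σ = 2.8 this gives roughly 770 participants per arm; an exact t-based calculation would add a handful more.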
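For the multiple-comparisons plan in part (e), a minimal sketch of the Holm–Bonferroni step-down procedure across question subtypes (the function name and example p-values are illustrative, not taken from the question):

```python
def holm_bonferroni(pvals: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm-Bonferroni step-down: return reject/retain decision per hypothesis.

    Sort p-values ascending; compare the k-th smallest against alpha / (m - k),
    stopping at the first failure (all later hypotheses are retained).
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, retain all remaining
    return reject

# Hypothetical p-values for three grammar subtypes:
print(holm_bonferroni([0.010, 0.040, 0.030]))  # → [True, False, False]
```

Note the step-down structure: 0.010 passes its threshold of 0.05/3 ≈ 0.0167, but 0.030 fails 0.05/2 = 0.025, so it and every larger p-value are retained.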