This question evaluates competency in statistical inference and sampling design. It covers hypothesis testing for distribution equality in univariate and multivariate settings, the distinction between tests for continuous and categorical data, assumptions such as independence and sample comparability, power and sample-size considerations, multiple-testing issues, and common sampling methods. It is commonly asked in the Statistics & Math domain to assess a data scientist's reasoning about data provenance and comparative analysis. It examines both conceptual understanding of hypotheses and assumptions and the practical application of tests and sampling strategies.
You have two datasets collected from different systems or populations, and you want to determine whether they come from the same distribution.
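Explain how you would test distribution equality in both univariate and multivariate settings. Discuss which tests apply to continuous versus categorical data, the assumptions they rely on (such as independence and comparability of the two samples), power and sample-size considerations, and multiple-testing issues.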
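As a starting point for an answer, the sketch below shows one possible test for each setting. It assumes NumPy and SciPy are available and uses synthetic data. The helper names `energy_statistic` and `permutation_pvalue` are hypothetical, chosen for illustration. The energy-distance permutation test is only one of several reasonable multivariate choices, not a prescribed method.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Continuous, univariate: two-sample Kolmogorov-Smirnov test.
# Assumes independent observations within and between samples.
a = rng.normal(0.0, 1.0, size=500)
b = rng.normal(0.5, 1.0, size=500)
ks_result = stats.ks_2samp(a, b)

# Categorical: chi-square test of homogeneity on a 2 x 3 table of counts
# (rows = datasets, columns = categories). The counts are made up.
counts = np.array([[120, 200, 180],
                   [150, 190, 160]])
chi2_stat, chi2_p, chi2_dof, _ = stats.chi2_contingency(counts)


def energy_statistic(x, y):
    """Energy-distance statistic (V-statistic form) between two samples.

    x and y are arrays of shape (n, d) and (m, d).
    """
    return 2 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()


def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y share one distribution.

    Relies on exchangeability of the pooled observations under H0.
    """
    perm_rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = perm_rng.permutation(len(pooled))
        stat = energy_statistic(pooled[idx[:n]], pooled[idx[n:]])
        exceed += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Multivariate: two 3-dimensional samples with shifted means.
x = rng.normal(0.0, 1.0, size=(100, 3))
y = rng.normal(1.0, 1.0, size=(100, 3))
mv_p = permutation_pvalue(x, y)
```

If a separate univariate test is run for each feature, the resulting p-values should be adjusted for multiple testing (for example with a Bonferroni correction, multiplying each p-value by the number of tests). A multivariate test such as the one above avoids that issue at the cost of being harder to interpret feature by feature.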
Also describe common sampling methods that are useful when collecting evaluation data, such as simple random sampling, stratified sampling, cluster sampling, importance sampling, and reservoir sampling, and explain when each is appropriate.
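Three of the sampling methods named above can be sketched in a few lines using only the Python standard library. The function names and the `key` and `fraction` parameters are hypothetical choices for illustration. Cluster sampling and importance sampling are left out to keep the example short.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, seed=None):
    """Draw k items uniformly without replacement from a finite population."""
    return random.Random(seed).sample(list(items), k)


def stratified_sample(items, key, fraction, seed=None):
    """Sample the same fraction from every stratum defined by key(item).

    Useful when some subgroups are rare and must still be represented.
    At least one item is taken from each stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


def reservoir_sample(stream, k, seed=None):
    """Uniform sample of k items from a stream of unknown length (Algorithm R).

    Only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: 90 records from segment "a" and 10 from segment "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
srs = simple_random_sample(records, 10, seed=0)
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=0)
res = reservoir_sample(iter(range(1000)), 10, seed=0)
```

In this example, stratified sampling guarantees that segment "b" appears in the sample, whereas a simple random sample of the same size may miss it entirely. Reservoir sampling is the natural fit when the data arrives as a stream and cannot be held in memory or counted in advance.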