This question evaluates skills in designing reliable LLM inference pipelines and in statistical modeling of stochastic outputs, including reproducibility engineering, uncertainty quantification, and the use of correlation metrics (e.g., Pearson) to measure stability.
You run an LLM-based sentiment model to score a fixed dataset of texts. The inference API does not let you set the temperature, so outputs are stochastic: scoring the same texts on different days produces slightly different score vectors.
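To make the setup concrete, here is a minimal sketch of how day-to-day stability could be measured with the Pearson correlation between two score vectors over the same texts. The data is simulated and purely hypothetical (a latent sentiment per text plus independent per-run noise), and the helper name `pearson` and the noise level are assumptions chosen for illustration, not part of the original problem.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length score vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical simulation: each text has a fixed latent sentiment, and each
# day's run adds independent noise (standing in for stochastic decoding).
rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(200)]
day1 = [s + rng.gauss(0.0, 0.05) for s in latent]
day2 = [s + rng.gauss(0.0, 0.05) for s in latent]

r = pearson(day1, day2)
print(f"day-to-day Pearson r = {r:.3f}")
```

A value of r close to 1 means the two runs rank and scale the texts almost identically. Note that Pearson correlation is insensitive to a constant shift or rescaling between days, so a separate check on the mean and spread of the scores may be needed.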
Tasks: