You run an LLM-based sentiment model to score a fixed dataset of texts. The inference API doesn’t let you set temperature, so decoding is stochastic and the model produces slightly different score vectors on different days.
- Day 1 inference output is a vector y1 (one score per item).
- Day 2 inference output is y2.
- The observed Pearson correlation is corr(y1, y2) = 0.95.
Tasks:
- System/ML design: How would you make inference outputs more reproducible (or at least stable) in production given limited decoding controls?
- Modeling question: Propose a reasonable statistical model for this randomness and derive how many independent inference runs (e.g., days) you’d need to aggregate so that the correlation between aggregated outputs from two independent aggregations exceeds 0.99 (state assumptions clearly).
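
One possible way to approach the modeling question, offered as a sketch rather than the intended answer: assume each run's score is a fixed per-item "true" score plus independent, zero-mean noise with the same variance on every run (a classical test-theory style assumption that the question itself does not state). Under that assumption the run-to-run correlation is rho = var(signal) / (var(signal) + var(noise)), and averaging n runs divides the noise variance by n, so the correlation between two independent n-run averages is n*rho / (1 + (n - 1)*rho), the Spearman-Brown formula. The function names below (`aggregated_corr`, `runs_needed`, `simulate_aggregated_corr`) are hypothetical helpers written for illustration, and the simulation uses Gaussian signal and noise purely for convenience.

```python
import math
import random


def aggregated_corr(rho, n):
    """Correlation between two independent n-run averages under the
    assumed additive model: score = signal + i.i.d. noise (Spearman-Brown)."""
    return n * rho / (1.0 + (n - 1) * rho)


def runs_needed(rho, target):
    """Smallest n for which the aggregated correlation strictly exceeds target."""
    if not (0.0 < rho < 1.0 and 0.0 < target < 1.0):
        raise ValueError("rho and target must lie strictly between 0 and 1")
    n = 1
    while aggregated_corr(rho, n) <= target:
        n += 1
    return n


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def simulate_aggregated_corr(rho, n, n_items=20000, seed=0):
    """Monte Carlo check of the formula (Gaussian signal and noise are an
    assumption made only for this simulation)."""
    rng = random.Random(seed)
    # With unit signal variance, rho = 1 / (1 + noise_var).
    noise_sd = math.sqrt((1.0 - rho) / rho)
    signal = [rng.gauss(0.0, 1.0) for _ in range(n_items)]

    def average_of_runs():
        # Average n independent noisy runs for every item.
        return [
            s + sum(rng.gauss(0.0, noise_sd) for _ in range(n)) / n
            for s in signal
        ]

    return pearson(average_of_runs(), average_of_runs())


if __name__ == "__main__":
    n = runs_needed(0.95, 0.99)
    print("runs per aggregate:", n)
    print("predicted correlation:", aggregated_corr(0.95, n))
    print("simulated correlation:", simulate_aggregated_corr(0.95, n))
```

Under these assumptions, `runs_needed(0.95, 0.99)` evaluates to 6: five runs per aggregate give 4.75 / 4.80, just under 0.99, while six give 5.70 / 5.75, just over it. Two independent aggregates therefore need 12 runs in total. If the noise is correlated across runs (for example, a systematic day-level shift) or its variance depends on the item, the formula would be optimistic and more runs may be required.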