PracHub

Stabilize LLM inference and estimate needed repeats

Last updated: Mar 29, 2026

Quick Overview

This question evaluates skills in designing reliable LLM inference pipelines and in statistical modeling of stochastic outputs, including reproducibility engineering, uncertainty quantification, and the use of correlation metrics (e.g., Pearson) to measure stability.

Company: Citadel

Role: Data Scientist

Category: ML System Design

Difficulty: medium

Interview Round: Technical Screen

Oct 9, 2025, 12:00 AM
You run an LLM-based sentiment model to score a fixed dataset of texts. Because the inference API doesn’t let you set temperature (and outputs are stochastic), the model produces slightly different score vectors on different days.

  • Day 1 inference output is a vector \(y_1\) (one score per item).
  • Day 2 inference output is \(y_2\).
  • The observed Pearson correlation is \(\mathrm{corr}(y_1, y_2) = 0.95\).
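
As a rough illustration of the stability metric above, the sketch below computes the Pearson correlation between two score vectors. The arrays are synthetic stand-ins (a hypothetical latent score plus independent noise), not real model outputs, and the noise scale is chosen only so that the correlation lands near 0.95.

```python
import numpy as np


def pearson(a, b) -> float:
    """Pearson correlation between two equal-length score vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("score vectors must have the same shape")
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in data (assumption): unit-variance latent score per item,
# plus independent per-run noise with variance 0.05 / 0.95, so that
# var_signal / (var_signal + var_noise) = 0.95.
rng = np.random.default_rng(0)
n_items = 50_000
latent = rng.normal(size=n_items)
noise_sd = np.sqrt(0.05 / 0.95)
y1 = latent + rng.normal(scale=noise_sd, size=n_items)  # "day 1" run
y2 = latent + rng.normal(scale=noise_sd, size=n_items)  # "day 2" run

rho_hat = pearson(y1, y2)  # close to 0.95, up to sampling error
```

In practice `y1` and `y2` would be the stored outputs of two inference runs over the same items in the same order.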

Tasks:

  1. System/ML design: How would you make inference outputs more reproducible (or at least stable) in production given limited decoding controls?
  2. Modeling question: Propose a reasonable statistical model for this randomness and derive how many independent inference runs (e.g., days) you’d need to aggregate so that the correlation between aggregated outputs from two independent aggregations exceeds 0.99 (state assumptions clearly).
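
For task 1, one common approach when decoding parameters are out of reach is to treat each score as an immutable artifact: average several runs once, then cache the result under a key built from everything that could change it. The sketch below is a minimal illustration under assumptions. `StableScorer`, `score_fn`, `model_version`, and `prompt_version` are hypothetical names, and the in-memory dict stands in for whatever persistent store a real system would use.

```python
import hashlib
import json
import statistics
from typing import Callable, Dict


class StableScorer:
    """Averages repeated stochastic scores once and serves them from a cache."""

    def __init__(
        self,
        score_fn: Callable[[str], float],  # hypothetical wrapper around the LLM API
        model_version: str,
        prompt_version: str,
        n_repeats: int = 5,
    ) -> None:
        if n_repeats < 1:
            raise ValueError("n_repeats must be at least 1")
        self.score_fn = score_fn
        self.model_version = model_version
        self.prompt_version = prompt_version
        self.n_repeats = n_repeats
        self._cache: Dict[str, float] = {}  # stand-in for a persistent store

    def _key(self, text: str) -> str:
        # Any change to model, prompt, or input text yields a different key,
        # so stale scores are never silently reused.
        payload = json.dumps(
            [self.model_version, self.prompt_version, text], ensure_ascii=False
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, text: str) -> float:
        key = self._key(text)
        if key not in self._cache:
            samples = [self.score_fn(text) for _ in range(self.n_repeats)]
            self._cache[key] = statistics.fmean(samples)
        return self._cache[key]
```

Repeated calls for the same text then return an identical value, and the averaging reduces run-to-run noise for anything scored for the first time. Pinning a model version only helps if the provider exposes one; that is also an assumption here.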
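
For task 2, one reasonable model (an assumption, not the only choice) is classical measurement error: run \(d\) on item \(i\) returns \(y_{i,d} = s_i + \varepsilon_{i,d}\), where \(s_i\) is a stable latent score with variance \(\sigma_s^2\) across items and \(\varepsilon_{i,d}\) is zero-mean noise with variance \(\sigma_\varepsilon^2\), independent across runs and of \(s_i\). Then \(\rho = \mathrm{corr}(y_1, y_2) = \sigma_s^2 / (\sigma_s^2 + \sigma_\varepsilon^2) = 0.95\). Averaging \(n\) runs shrinks the noise variance to \(\sigma_\varepsilon^2 / n\), so two independent \(n\)-run averages correlate at \(\rho_n = n\rho / (1 + (n-1)\rho)\), the Spearman-Brown formula. Requiring \(\rho_n > 0.99\) gives \(n > 0.99(1-\rho) / (0.01\rho) \approx 5.21\), so \(n = 6\). The helper names below are illustrative.

```python
def aggregated_corr(rho: float, n: int) -> float:
    """Correlation between two independent n-run averages (Spearman-Brown),
    given single-run correlation rho under the additive-noise model."""
    return n * rho / (1 + (n - 1) * rho)


def runs_needed(rho: float, target: float) -> int:
    """Smallest n such that aggregated_corr(rho, n) strictly exceeds target."""
    if not (0.0 < rho < 1.0 and 0.0 < target < 1.0):
        raise ValueError("rho and target must lie strictly between 0 and 1")
    n = 1
    # aggregated_corr increases toward 1 as n grows, so this loop terminates.
    while aggregated_corr(rho, n) <= target:
        n += 1
    return n


n_required = runs_needed(0.95, 0.99)              # 6 under this model
corr_at_5 = aggregated_corr(0.95, 5)              # about 0.9896, just short
corr_at_6 = aggregated_corr(0.95, n_required)     # about 0.9913
```

Under these assumptions six runs per aggregate suffice (twelve in total if two independent aggregates are actually built to verify the result). The answer is sensitive to the 0.95 estimate and no longer holds if runs are not independent, for example when several calls are served from the same cached response.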


© 2026 PracHub. All rights reserved.