Stabilize LLM inference and estimate needed repeats

Last updated: Mar 29, 2026

Quick Overview

This question evaluates skills in designing reliable LLM inference pipelines and in statistical modeling of stochastic outputs, including reproducibility engineering, uncertainty quantification, and the use of correlation metrics (e.g., Pearson) to measure stability.


Company: Citadel

Role: Data Scientist

Category: ML System Design

Difficulty: medium

Interview Round: Technical Screen

You run an LLM-based sentiment model to score a fixed dataset of texts. Because the inference API doesn’t let you set `temperature` (and outputs are stochastic), the model produces slightly different score vectors on different days.

  • Day 1 inference output is a vector \(y_1\) (one score per item).
  • Day 2 inference output is \(y_2\).
  • The observed Pearson correlation is \(\mathrm{corr}(y_1, y_2) = 0.95\).

Tasks:

  1. **System/ML design:** How would you make inference outputs more reproducible (or at least stable) in production given limited decoding controls?
  2. **Modeling question:** Propose a reasonable statistical model for this randomness and derive how many independent inference runs (e.g., days) you’d need to aggregate so that the correlation between aggregated outputs from two independent aggregations exceeds **0.99** (state assumptions clearly).
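
A minimal sketch of one way to approach the modeling task, under an assumed additive-noise model that the question itself does not specify: each run is \(y_k = s + \varepsilon_k\), where \(s\) is a stable per-item score with variance \(\sigma_s^2\) across items and \(\varepsilon_k\) is run noise with variance \(\sigma_\varepsilon^2\), independent across runs and independent of \(s\). Under that assumption the single-run correlation is \(\rho = \sigma_s^2 / (\sigma_s^2 + \sigma_\varepsilon^2)\), and averaging \(n\) runs divides the noise variance by \(n\), so two independent \(n\)-run averages correlate at \(\rho_n = n\rho / (1 + (n-1)\rho)\) (the Spearman-Brown formula). The function names below are hypothetical, chosen only for illustration.

```python
def aggregated_corr(rho: float, n: int) -> float:
    """Correlation between two independent n-run averages.

    Assumes y_k = s + e_k with i.i.d. run noise, so averaging n runs
    shrinks the noise variance by a factor of n (Spearman-Brown).
    """
    return n * rho / (1.0 + (n - 1) * rho)


def runs_needed(rho: float, target: float) -> int:
    """Smallest n whose aggregated correlation strictly exceeds target."""
    if not (0.0 < rho < 1.0 and 0.0 < target < 1.0):
        raise ValueError("rho and target must lie strictly between 0 and 1")
    n = 1
    # rho_n increases toward 1 as n grows, so this loop always terminates.
    while aggregated_corr(rho, n) <= target:
        n += 1
    return n


if __name__ == "__main__":
    rho, target = 0.95, 0.99  # values taken from the question
    n = runs_needed(rho, target)
    print(n, round(aggregated_corr(rho, n), 4))
```

With \(\rho = 0.95\), five runs give roughly 0.9896 and six give roughly 0.9913, so under these assumptions six runs per aggregate would be enough. If run-to-run noise is not independent, the formula no longer applies and the estimate should be checked empirically.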

Oct 9, 2025, 12:00 AM