
Design a ChatGPT-like serving system

Last updated: May 5, 2026

Quick Overview

This question evaluates expertise in designing scalable machine-learning inference systems: chat-completion architecture, GPU capacity planning for large transformers, and stateful KV-cache design (layout, latency, and consistency trade-offs).

  • Microsoft
  • System Design
  • Software Engineer


Company: Microsoft

Role: Software Engineer

Category: System Design


Interview Round: Technical Screen



Related Interview Questions

  • Design A Scalable Web Crawler - Microsoft (medium)
  • Design User Re-engagement Notifications - Microsoft (medium)
  • Design a typeahead search service - Microsoft (hard)
  • Design a Secure Copilot API - Microsoft
  • Design a URL Shortener - Microsoft (hard)
Asked: Mar 1, 2026

Design a ChatGPT-like system for inference serving.

Your design discussion should cover:

  1. High-level architecture for chat completion (request routing, tokenization, model execution, streaming output, safety, observability).
  2. A rough GPU count estimate to host a 400B-parameter transformer model using BF16 weights.
  3. If you store the KV cache (attention keys/values for the prompt/history) in Redis, explain:
    • what data is stored (granularity/layout),
    • how inference workers read/write it during prefill and decode,
    • latency/bandwidth implications and mitigations,
    • failure modes and consistency choices.

Assume modern datacenter GPUs (e.g., 80GB class) and high-throughput networking. State any assumptions you make (context length, throughput targets, replication, etc.).
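For part 2, one back-of-envelope calculation (the headroom fraction and node size are our assumptions, not given in the question): BF16 uses 2 bytes per parameter, so the weights alone occupy 800 GB, and some GPU memory must be reserved for KV cache and activations.

```python
import math

# Back-of-envelope GPU count for hosting a 400B-parameter model in BF16.
# Assumed (not stated in the question): 80 GB GPUs, ~30% of each GPU's
# memory reserved for KV cache and activations.
params = 400e9
bytes_per_param = 2                       # BF16 = 16 bits = 2 bytes
weight_gb = params * bytes_per_param / 1e9   # 800 GB of weights

gpu_mem_gb = 80
usable_frac = 0.7                         # leave headroom for KV/activations
usable_gb = gpu_mem_gb * usable_frac      # 56 GB per GPU for weights

min_gpus = math.ceil(weight_gb / gpu_mem_gb)        # 10 if weights filled GPUs entirely
practical_gpus = math.ceil(weight_gb / usable_gb)   # 15 with headroom
print(weight_gb, min_gpus, practical_gpus)
```

In practice this would round up to 16 GPUs (two 8-GPU nodes) per model replica under tensor/pipeline parallelism; the total fleet then scales with the throughput target and replication factor.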

Solution
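As a sketch of the KV-cache sizing that part 3 hinges on: the question fixes only the parameter count, so the model shape below is hypothetical (120 layers, grouped-query attention with 8 KV heads of dimension 128). Per token, each layer stores one K and one V vector per KV head.

```python
# KV-cache footprint for a hypothetical 400B-class decoder.
# All architecture numbers are assumptions for illustration.
layers   = 120   # assumed
kv_heads = 8     # assumed (grouped-query attention)
head_dim = 128   # assumed
bytes_el = 2     # BF16 element size

# 2x for keys and values.
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_el   # ~0.47 MB/token

context = 8192                                                  # assumed context length
per_request_gib = per_token_bytes * context / 2**30
print(per_token_bytes, per_request_gib)
```

At roughly half a megabyte per token, an 8K-token session carries a multi-gigabyte KV cache, which is why the granularity and placement of that state (GPU memory vs. Redis) dominates the design discussion.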

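For the Redis layout in part 3, one plausible scheme (all key names and sizes here are hypothetical) is to store KV tensors in fixed-size token blocks keyed by session, layer, and block index: prefill writes whole blocks in one pass, and decode appends to the last partial block. The sketch below uses a dict-backed stand-in for a real Redis client; the `set`/`get` calls have the same shape as redis-py's.

```python
import numpy as np

# Hypothetical layout: one value per (session, layer, block) triple, where a
# block holds BLOCK tokens' keys and values for one layer. Fixed-size blocks
# make decode-time writes append-only and let workers fetch only what they need.
BLOCK, KV_HEADS, HEAD_DIM = 16, 8, 128

def kv_key(session_id: str, layer: int, block: int) -> str:
    return f"kv:{session_id}:{layer}:{block}"

class FakeRedis:
    """Dict-backed stand-in; real code would use redis.Redis() and pipelines."""
    def __init__(self):
        self.store = {}
    def set(self, k, v): self.store[k] = v
    def get(self, k): return self.store.get(k)

r = FakeRedis()

def write_prefill(session_id, layer, kv):
    # kv: (2, tokens, kv_heads, head_dim); float16 stands in for BF16,
    # which NumPy does not support natively.
    for b in range(kv.shape[1] // BLOCK):
        blk = kv[:, b * BLOCK:(b + 1) * BLOCK]
        r.set(kv_key(session_id, layer, b), blk.tobytes())

def read_block(session_id, layer, block):
    raw = r.get(kv_key(session_id, layer, block))
    return np.frombuffer(raw, dtype=np.float16).reshape(2, BLOCK, KV_HEADS, HEAD_DIM)

kv = np.zeros((2, 32, KV_HEADS, HEAD_DIM), dtype=np.float16)
write_prefill("sess-1", 0, kv)
blk = read_block("sess-1", 0, 1)
```

Because every generated token adds KV for every layer, pulling or pushing the full cache through Redis on each decode step would dominate latency; common mitigations are keeping hot KV in GPU/host memory and treating Redis as a spill or session-resume tier, with pipelined writes and optional block compression or quantization.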

