This question evaluates expertise in designing scalable machine-learning inference systems, covering chat-completion serving architecture, GPU capacity planning for large transformer models, and stateful KV-cache design, including layout, latency, and consistency considerations.
Design a ChatGPT-like system for inference serving.
Your design discussion should cover:
- The end-to-end chat-completion serving architecture
- GPU capacity planning for large transformer models
- Stateful KV-cache design, including layout, latency, and consistency considerations
Assume modern datacenter GPUs (e.g., 80GB class) and high-throughput networking. State any assumptions you make (context length, throughput targets, replication, etc.).
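A good answer to the capacity-planning portion usually starts with a back-of-envelope KV-cache sizing estimate. The sketch below shows one such calculation; the model parameters (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 activations, 8K context) are illustrative assumptions, not values fixed by the question.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored per generated token.

    Factor of 2 accounts for both the K and V tensors at each layer.
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention (all values assumed):
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_sequence = per_token * 8192  # assumed 8K-token context length

print(f"{per_token} bytes/token")            # 327680 bytes/token (~320 KiB)
print(f"{per_sequence / 2**30:.1f} GiB/seq") # 2.5 GiB per full-context sequence
```

Under these assumptions, a full-context sequence consumes about 2.5 GiB of cache, so after subtracting model weights from an 80 GB GPU, only a modest number of concurrent full-length sequences fit per device; this motivates the batching, paging, and cache-eviction discussion the question asks for.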