Introduction to Generative AI System Design
Why Generative AI System Design Matters
If you have been following the AI industry over the past few years, you have witnessed something remarkable. Generative AI has moved from a niche research topic to the backbone of products used by hundreds of millions of people. ChatGPT reached 100 million users faster than any consumer application in history. Midjourney generates millions of images daily. ElevenLabs powers voice synthesis for content creators worldwide. Behind each of these products lies a carefully designed system that balances model quality, latency, cost, and reliability at massive scale.
This course exists because there is a growing gap between understanding how generative models work in theory and knowing how to design the systems that bring them to production. Whether you are preparing for a system design interview at a top AI company or building your own GenAI product, the ability to reason about these systems end-to-end is now a critical skill.
Interview tip: When an interviewer asks you to "design a system like ChatGPT," they are not asking you to derive the transformer architecture from scratch. They want to see that you can reason about the full stack: data pipelines, training infrastructure, serving architecture, evaluation, and deployment.
The Intersection of Distributed ML and System Design
Traditional system design focuses on building scalable web services: load balancers, databases, caches, message queues, and CDNs. GenAI system design introduces an entirely new dimension: the machine learning model itself becomes the most expensive, most complex, and most latency-sensitive component in your architecture.
Consider the differences:
| Aspect | Traditional System Design | GenAI System Design |
|---|---|---|
| Primary bottleneck | Database I/O, network | GPU compute, memory |
| Scaling unit | CPU instances | GPU instances (up to ~$30/hr each) |
| Latency drivers | Network hops, DB queries | Model inference (100ms-10s) |
| Storage concerns | User data, media files | Model weights (10GB-1TB), training data (TB-PB) |
| Cost structure | Relatively cheap compute | GPUs are 10-100x more expensive per unit |
| Failure modes | Server crashes, network partitions | OOM errors, GPU failures, model quality degradation |
| Deployment | Rolling updates in seconds | Model swaps requiring minutes of loading |
In GenAI system design, you need to think about both the traditional distributed systems concerns AND the ML-specific challenges. A well-designed GenAI system handles data collection, preprocessing, distributed training across hundreds of GPUs, model evaluation, efficient serving with tight latency budgets, and continuous monitoring of model quality in production.
What Makes GenAI Systems Uniquely Challenging
GPU memory is the new bottleneck. A single GPT-3-scale model (175B parameters) requires roughly 350GB just to store its weights in FP16. That exceeds the combined memory of four A100 GPUs (80GB each), so you need at least five just for the weights, before you account for activations, KV cache, or optimizer states during training.
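The weights-only arithmetic above can be sanity-checked with a few lines. This is a rough sketch: it counts weight memory only and ignores activations, KV cache, and optimizer states.

```python
import math

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights (FP16 = 2 bytes/param)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def min_gpus(params_billions: float, gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPU count to fit the weights alone (A100-80GB by default)."""
    return math.ceil(weight_memory_gb(params_billions) / gpu_memory_gb)

print(weight_memory_gb(175))  # 350.0 GB for a GPT-3-scale model in FP16
print(min_gpus(175))          # 5 A100-80GB GPUs just for the weights
```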
Inference cost scales with output length. Unlike a traditional API where serving cost is roughly constant per request, GenAI models have costs that scale with the number of tokens generated. A 500-token response costs roughly 5x more compute than a 100-token response.
Quality is probabilistic, not deterministic. A traditional web service either returns the correct data or it does not. A GenAI system might generate a plausible-sounding but factually incorrect response. Monitoring and evaluation require entirely different approaches.
Training is a massive distributed computing problem. Training a frontier model like GPT-4 or Llama 3 requires thousands of GPUs running in coordination for weeks or months. A single GPU failure can waste hours of work if checkpointing is not done properly.
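To see why checkpoint frequency matters at this scale, here is a toy expected-loss model. The per-GPU MTBF and cluster size below are illustrative assumptions, not measured figures.

```python
def expected_lost_gpu_hours(num_gpus: int, run_hours: float,
                            gpu_mtbf_hours: float,
                            checkpoint_interval_hours: float) -> float:
    """Toy model: failures arrive at rate num_gpus / gpu_mtbf_hours, and each
    failure loses, on average, half a checkpoint interval of work across the
    whole cluster (all GPUs roll back to the last checkpoint together)."""
    expected_failures = num_gpus * run_hours / gpu_mtbf_hours
    lost_per_failure = (checkpoint_interval_hours / 2) * num_gpus
    return expected_failures * lost_per_failure

# 1,000 GPUs, a 30-day run, and a hypothetical 50,000-hour per-GPU MTBF:
print(expected_lost_gpu_hours(1000, 720, 50_000, 1.0))   # hourly checkpoints
print(expected_lost_gpu_hours(1000, 720, 50_000, 0.25))  # 15-minute checkpoints
```

More frequent checkpoints cut the expected loss proportionally, at the cost of extra I/O while saving state.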
Generative AI Modalities
This course covers the major generative AI modalities that companies are building products around today. Each modality has its own unique challenges, but they share common architectural patterns.
Text-to-Text Generation
This is the most mature and widely deployed modality. Systems like ChatGPT, Claude, and Gemini take text prompts and generate text responses. The core architecture is the autoregressive transformer, which generates text one token at a time.
Key challenges: Managing long context windows (up to 128K+ tokens), reducing latency for streaming responses, handling multi-turn conversations, content safety filtering, and managing the enormous cost of serving billions of tokens per day.
Scale reference: OpenAI reportedly serves over 100 million weekly active users, processing billions of tokens daily across their API and ChatGPT products.
Text-to-Image Generation
Systems like DALL-E, Midjourney, and Stable Diffusion generate images from text descriptions. Most modern systems use diffusion models that start with random noise and iteratively denoise it into a coherent image, guided by the text prompt.
Key challenges: Generating high-resolution images (1024x1024+) with fine details, ensuring text-image alignment, controlling generation (style, composition, specific elements), NSFW filtering, and managing the compute cost of the iterative denoising process.
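The cost of the iterative denoising process is roughly linear in the number of steps, which is why reducing step count is a major optimization lever. A minimal sketch, with hypothetical step counts and per-step latency:

```python
def diffusion_gpu_seconds(steps: int, seconds_per_step: float) -> float:
    """Total GPU time for one image = denoising steps x per-step latency."""
    return steps * seconds_per_step

# Hypothetical numbers: 50 denoising steps at 60 ms per step on one GPU.
print(diffusion_gpu_seconds(50, 0.06))
# Halving the steps (e.g., with a distilled or faster sampler) halves the cost:
print(diffusion_gpu_seconds(25, 0.06))
```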
Scale reference: Midjourney has generated over a billion images since launch, with millions of new images created daily.
Text-to-Speech Generation
Systems like ElevenLabs and Play.ht convert text into natural-sounding speech. Modern TTS systems can clone voices from short audio samples and generate speech with natural prosody, emotion, and style.
Key challenges: Real-time streaming synthesis for conversational AI, voice cloning with minimal reference audio, maintaining speaker consistency across long passages, multilingual support, and preventing misuse (deepfake voices).
Scale reference: ElevenLabs processes millions of characters of text per day, serving content creators, game developers, and accessibility applications.
Text-to-Video Generation
The newest frontier, with systems like Sora and Runway Gen-3 generating video clips from text descriptions. This combines the challenges of image generation with temporal consistency and motion modeling.
Key challenges: Maintaining temporal consistency across frames, modeling realistic motion and physics, extremely high compute requirements (orders of magnitude more than image generation), long generation times, and video quality at higher resolutions.
Scale reference: Video generation is still in its early stages commercially, but the compute requirements are staggering. Generating a single minute of high-quality video can require hundreds of GPU-seconds.
The SCALED Framework
Throughout this course, we use a systematic 6-step framework called SCALED for approaching GenAI system design problems. This framework gives you a structured way to work through any GenAI design question, whether in an interview or in a real project.
| Step | Name | What You Do |
|---|---|---|
| S | Scenario | Define the problem scope, use cases, user requirements, and constraints |
| C | Capabilities | Choose the model architecture, training approach, and key ML decisions |
| A | Architecture | Design the system components, data pipelines, and serving infrastructure |
| L | Latency/Throughput | Analyze performance requirements and optimize for your SLA targets |
| E | Evaluation | Define metrics, set up monitoring, and design feedback loops |
| D | Deployment | Plan rollout strategy, scaling, fault tolerance, and CI/CD for models |
Here is how you might use the SCALED framework to think through a text-to-text system:
S - Scenario: "We need a conversational AI assistant that handles 10M daily active users, supports multi-turn conversations up to 8K tokens, and responds in under 2 seconds for the first token."
C - Capabilities: "We will use a transformer-based autoregressive model with 70B parameters, trained on a mix of web data and curated instruction data, fine-tuned with RLHF for helpfulness and safety."
A - Architecture: "The system has an API gateway for rate limiting and auth, a conversation manager for context handling, a model serving layer with continuous batching using vLLM, and a safety filter pipeline."
L - Latency/Throughput: "Time-to-first-token target is 500ms. We need to serve 1,000 requests per second at peak. Each H100 can serve roughly 50 concurrent requests with continuous batching, so we need about 20 H100s for the model serving layer alone."
E - Evaluation: "We track perplexity on held-out data, human preference ratings via thumbs up/down, safety metrics from our content filter, and latency percentiles (p50, p95, p99)."
D - Deployment: "We use canary deployments for new model versions, rolling out to 1% of traffic first. Model weights are stored in S3 and loaded on instance startup. We use Kubernetes for orchestration with GPU node pools."
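The GPU sizing in the L step above follows Little's law: concurrent requests in flight equal arrival rate times average request duration. A minimal sketch, assuming an average request duration of about one second (that duration is an assumption, not stated in the scenario):

```python
import math

def gpus_needed(peak_rps: float, avg_request_seconds: float,
                concurrent_per_gpu: int) -> int:
    """Little's law: in-flight requests = arrival rate x duration.
    Divide by per-GPU concurrency (with continuous batching) to size the fleet."""
    in_flight = peak_rps * avg_request_seconds
    return math.ceil(in_flight / concurrent_per_gpu)

# 1,000 RPS peak, ~50 concurrent requests per H100, ~1 s average duration:
print(gpus_needed(1000, 1.0, 50))  # 20 H100s for the serving layer
```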
Interview tip: Memorize the SCALED acronym. In an interview, explicitly tell the interviewer which step you are on. This demonstrates structured thinking and makes it easy for them to follow your design.
Why Companies Ask GenAI System Design Questions
The explosive growth of generative AI has created a new category of interview questions. Companies like Google, Meta, OpenAI, Anthropic, Amazon, Microsoft, and many startups now ask candidates to design GenAI systems. Here is why:
The talent gap is real. There are far more companies wanting to build GenAI products than there are engineers who understand how to design these systems end-to-end. If you can demonstrate this knowledge, you stand out immediately.
It tests both ML and systems knowledge. GenAI system design sits at the intersection of machine learning and distributed systems. It tests whether you can reason about model architectures, training pipelines, serving infrastructure, and operational concerns all at once.
It reflects real work. At companies building GenAI products, engineers routinely make decisions about model selection, serving architecture, cost optimization, and quality monitoring. These interviews mirror the actual work.
What Interviewers Evaluate
When you are asked to design a GenAI system, interviewers are looking for:
Breadth of knowledge: Can you reason about the full stack, from data to deployment?
Depth on tradeoffs: Can you articulate why you would choose one approach over another?
Quantitative reasoning: Can you estimate compute requirements, costs, and latency?
Practical awareness: Do you understand real-world constraints like GPU availability, cost, and failure modes?
Communication: Can you explain complex ML concepts clearly?
Key Differences Between Traditional and GenAI System Design
If you have prepared for traditional system design interviews (designing URL shorteners, chat systems, or news feeds), you already have a solid foundation. GenAI system design builds on those skills but adds several new dimensions.
You Must Think About Two Phases: Training and Serving
Traditional system design only deals with the serving phase. GenAI system design requires you to think about both:
Training phase: How do you collect and process training data? How do you distribute training across a GPU cluster? How long will training take? How much will it cost? How do you evaluate whether the model is good enough?
Serving phase: How do you load a massive model onto GPUs? How do you handle the sequential nature of autoregressive generation? How do you batch requests efficiently? How do you manage GPU memory?
Cost Is Dominated by Compute, Not Storage
In traditional systems, storage (databases) is often the main cost driver. In GenAI systems, GPU compute dominates everything. A single H100 GPU rents for a few dollars per hour on cloud providers, so a cluster of 1,000 H100s costs thousands of dollars per hour, or tens of thousands of dollars per day. Serving costs scale with the number of tokens generated.
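The cluster-cost arithmetic is simple multiplication. The hourly rate below is purely illustrative; actual H100 pricing varies widely by provider and commitment level.

```python
def cluster_cost(num_gpus: int, dollars_per_gpu_hour: float, hours: float) -> float:
    """Total cluster cost = GPU count x hourly rate x duration."""
    return num_gpus * dollars_per_gpu_hour * hours

rate = 4.0  # $/GPU-hour -- an assumption for illustration only
print(cluster_cost(1000, rate, 1))        # cost per hour for 1,000 GPUs
print(cluster_cost(1000, rate, 24 * 30))  # cost for a 30-day training run
```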
Latency Has a Different Character
In traditional systems, latency is primarily network and I/O bound. In GenAI systems, latency is compute bound. Generating a 200-token response requires 200 sequential forward passes through the model. You cannot parallelize this. The only ways to reduce latency are to use faster hardware, smaller models, or techniques like speculative decoding.
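Because decoding is sequential, end-to-end latency grows linearly with output length. A sketch with a hypothetical per-token decode time (30 ms/token is an assumption, not a benchmark):

```python
def generation_latency_ms(num_tokens: int, ms_per_token: float,
                          time_to_first_token_ms: float = 0.0) -> float:
    """Autoregressive decoding is sequential: total latency grows
    linearly with the number of generated tokens."""
    return time_to_first_token_ms + num_tokens * ms_per_token

print(generation_latency_ms(200, 30))  # 200 tokens at 30 ms/token
# A 200-token response takes 4x as long to decode as a 50-token one:
print(generation_latency_ms(200, 30) / generation_latency_ms(50, 30))
```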
Quality Monitoring Is Fundamentally Different
You cannot write unit tests for a generative model. The model might generate a perfectly grammatical, confident-sounding response that is completely wrong. Quality monitoring requires a combination of automated metrics, human evaluation, and production monitoring systems that detect quality degradation over time.
What You Will Learn in This Course
This course is structured to take you from foundational concepts to designing complete GenAI systems:
Section 1 (this page): Introduction and overview of GenAI system design.
Section 2: Fundamental Concepts. Deep dives into the ML building blocks you need: transformers, attention, tokenization, evaluation metrics, parallelism strategies, inference optimization, RAG, and finetuning.
Section 3: Back-of-the-Envelope Calculations. How to estimate GPU memory requirements, training time, training cost, inference latency, and serving costs. These calculations are critical for interview credibility.
Section 4: The SCALED Framework. A detailed walkthrough of the 6-step framework with examples and interview strategies.
Sections 5-9: Case Studies. Complete system designs for five different GenAI modalities:
Text-to-text (ChatGPT-like systems)
Text-to-image (DALL-E / Stable Diffusion-like systems)
Text-to-speech (ElevenLabs-like systems)
Text-to-video (Sora-like systems)
Image captioning (multimodal understanding systems)
Each case study covers both the training infrastructure and the deployment architecture.
Section 10: Conclusion. Key takeaways, study plan, and practice problems.
Who This Course Is For
This course is designed for:
ML engineers who understand models but want to learn how to design full systems around them
Software engineers transitioning into AI/ML roles who need to understand the ML-specific components
Engineering managers who need to make informed decisions about GenAI infrastructure
Interview candidates preparing for GenAI system design interviews at top companies
Prerequisites: Basic understanding of machine learning concepts (what a neural network is, what training means) and basic system design knowledge (load balancers, databases, APIs). We review the key ML concepts in Section 2, but this is not an intro to ML course.
How to Get the Most Out of This Course
Here are some practical tips:
Read with a whiteboard. For each case study, try drawing the architecture diagram yourself before reading the solution. This simulates the interview experience.
Practice the calculations. The back-of-the-envelope section is one of the highest-value parts of this course. Being able to quickly estimate that "a 70B parameter model needs about 140GB in FP16, so we need at least 2 A100-80GBs" is the kind of quantitative reasoning that impresses interviewers.
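The 70B example generalizes to any precision: weight memory in GB is simply parameter count (in billions) times bytes per parameter. A quick sketch of the same estimate at common precisions (weights only, no KV cache or activations):

```python
def model_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory at a given precision: billions of params x bytes each."""
    return params_billions * bytes_per_param

# A 70B-parameter model at common precisions:
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {model_memory_gb(70, nbytes):.0f} GB")
# FP16 gives 140 GB -> at least two A100-80GB GPUs for the weights alone.
```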
Focus on tradeoffs, not memorization. Interviewers do not expect you to remember exact numbers. They want to see that you understand the tradeoffs: why would you use quantization vs distillation? When is RAG better than finetuning? What are the pros and cons of continuous batching?
Practice explaining out loud. GenAI system design interviews are collaborative conversations. Practice explaining your designs verbally, as if you were talking to a colleague.
Interview tip: The best candidates do not just describe components. They explain WHY they chose each component, what alternatives they considered, and what tradeoffs they made. This is what separates a senior answer from a junior one.
Let us get started. In the next section, we will build your foundation by reviewing the key machine learning concepts that underpin every GenAI system.