Design an AWS fine-tuning platform for LLMs
Company: OpenAI
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Scenario
You need to build a system that lets customers fine-tune their own large language model (LLM) on AWS.
## Task
Design a managed platform where users can:
- Upload datasets (raw text, optionally with instruction/response pairs).
- Choose a base model and a fine-tuning method (full fine-tuning, LoRA/QLoRA, or adapters).
- Launch training jobs, monitor progress, and evaluate results.
- Deploy the resulting model behind an inference endpoint.
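
One way to make the user actions above concrete is to sketch the request a customer might submit when launching a job. The following is a minimal illustration, not a prescribed API: the class name `FineTuneJobRequest`, its fields, the `Method` enum, and the `lora_rank` hyperparameter key are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Method(str, Enum):
    """Fine-tuning methods named in the task (string values are assumptions)."""
    FULL = "full"
    LORA = "lora"
    QLORA = "qlora"
    ADAPTER = "adapter"


@dataclass(frozen=True)
class FineTuneJobRequest:
    """Hypothetical payload for a 'launch training job' call."""
    tenant_id: str
    dataset_uri: str   # assumed to point at an uploaded dataset in S3
    base_model: str
    method: Method
    hyperparameters: dict = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of human-readable validation errors (empty if valid)."""
        errors = []
        if not self.dataset_uri.startswith("s3://"):
            errors.append("dataset_uri must be an s3:// URI")
        if not self.base_model:
            errors.append("base_model is required")
        needs_rank = self.method in (Method.LORA, Method.QLORA)
        if needs_rank and "lora_rank" not in self.hyperparameters:
            errors.append("lora_rank is required for LoRA/QLoRA")
        return errors
```

Validating the request at the API boundary, before any compute is provisioned, keeps malformed jobs from consuming quota or GPU time.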
## Constraints / Requirements
- Multi-tenant isolation and security (VPC, IAM, encryption).
- Cost controls and quotas.
- Support for distributed training and checkpointing.
- Reproducibility (job config, code versioning, dataset versioning).
- Observability (metrics, logs, traces) and failure recovery.
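
The reproducibility requirement can be illustrated with a deterministic job fingerprint that ties together the job config, code version, and dataset version. This is a sketch under assumptions: the function name `job_fingerprint` and the shape of its inputs are hypothetical, and a real platform might store these identifiers separately rather than hashing them together.

```python
import hashlib
import json


def job_fingerprint(config: dict, code_version: str, dataset_version: str) -> str:
    """Return a stable SHA-256 hex digest identifying a training run's inputs."""
    payload = {
        "config": config,
        "code_version": code_version,        # e.g. a git commit hash (assumption)
        "dataset_version": dataset_version,  # e.g. a dataset snapshot ID (assumption)
    }
    # Sorted keys and fixed separators give a canonical serialization, so
    # logically equal configs hash identically regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two jobs with the same fingerprint were launched from the same config, code, and data, which gives a simple key for deduplicating runs or tagging checkpoints.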
## Deliverables
Describe the architecture, key APIs, training workflow, data management approach, and evaluation strategy.
Quick Answer: This question evaluates a candidate's competency in designing cloud-native ML platforms for fine-tuning large language models, covering distributed training, multi-tenant security and isolation, cost controls, reproducibility, observability, checkpointing, and deployment workflows.