Scenario
You need to build a system that lets customers fine-tune large language models (LLMs) on their own data on AWS.
Task
Design a managed platform where users can:
- Upload datasets (text + optional instruction/response pairs).
- Choose a base model and fine-tuning method (full FT, LoRA/QLoRA, adapters).
- Launch training jobs, monitor progress, and evaluate results.
- Deploy the resulting model behind an inference endpoint (an illustrative API sketch follows this list).
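For illustration only, the user-facing API surface might look something like the sketch below (FastAPI with Pydantic v2 assumed; route paths, models, and field names are placeholders, not a required contract):

```python
# Illustrative only: a minimal fine-tuning control-plane API.
# Routes, models, and field names are assumptions, not a real service schema.
from typing import Literal
from uuid import uuid4

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}  # stand-in for a real job store (e.g. a database table)

class CreateJobRequest(BaseModel):
    base_model: str                                   # approved base model identifier
    method: Literal["full", "lora", "qlora", "adapter"]
    dataset_id: str                                   # previously registered dataset
    hyperparameters: dict = {}

@app.post("/v1/datasets")
def register_dataset(name: str, s3_uri: str) -> dict:
    # A real system would validate, version, and catalog the upload here.
    return {"dataset_id": str(uuid4()), "name": name, "s3_uri": s3_uri}

@app.post("/v1/jobs")
def create_job(req: CreateJobRequest) -> dict:
    job_id = str(uuid4())
    jobs[job_id] = {"status": "QUEUED", "request": req.model_dump()}
    return {"job_id": job_id, "status": "QUEUED"}

@app.get("/v1/jobs/{job_id}")
def get_job(job_id: str) -> dict:
    # Would also surface training metrics, logs, and checkpoint locations.
    return jobs[job_id]

@app.post("/v1/jobs/{job_id}/deploy")
def deploy(job_id: str) -> dict:
    # Would create or update an inference endpoint serving the fine-tuned weights.
    return {"job_id": job_id, "endpoint": f"https://inference.example.com/{job_id}"}
```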
Constraints / Requirements
- Multi-tenant isolation and security (VPC, IAM, encryption).
- Cost controls and quotas.
- Support for distributed training and checkpointing.
- Reproducibility (job config, code versioning, dataset versioning); a manifest sketch follows this list.
- Observability (metrics, logs, traces) and failure recovery.
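One way to address the reproducibility and failure-recovery constraints together is to persist an immutable manifest per job (code commit, dataset version, full config, seed) alongside periodic checkpoints, so a retried job resolves to exactly the same inputs. A rough sketch; the field names, S3 layout, and example values are all assumptions:

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

# Illustrative manifest pinning a training job to exact code, data, and config.
@dataclass(frozen=True)
class JobManifest:
    job_id: str
    code_commit: str          # git SHA of the training code / container
    dataset_version: str      # content hash or registered dataset version
    base_model: str
    config: dict = field(default_factory=dict)  # full hyperparameter snapshot
    seed: int = 0
    checkpoint_prefix: str = ""                 # e.g. s3://<tenant-bucket>/jobs/<job_id>/ckpt/

    def fingerprint(self) -> str:
        # Deterministic hash so a retried job can verify nothing drifted.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

manifest = JobManifest(
    job_id="ft-0001",
    code_commit="9f1c2ab",
    dataset_version="v3",
    base_model="llama-3-8b",
    config={"epochs": 3, "learning_rate": 2e-4, "lora_rank": 16},
    seed=1234,
    checkpoint_prefix="s3://acme-ft/jobs/ft-0001/ckpt/",
)
print(manifest.fingerprint())
```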
Deliverables
Architecture, key APIs, training workflow, data management, and evaluation strategy.
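For the training-workflow deliverable, one lightweight way to present it is as an explicit state machine enforced by the job orchestrator; the states and transitions below are one plausible shape, not a required design:

```python
from enum import Enum

# Illustrative job lifecycle; states and transitions are one plausible design.
class JobState(str, Enum):
    QUEUED = "QUEUED"
    PROVISIONING = "PROVISIONING"   # acquire GPUs, mount data, pull container
    TRAINING = "TRAINING"           # distributed training with periodic checkpoints
    EVALUATING = "EVALUATING"       # run held-out / benchmark evaluation
    READY = "READY"                 # artifacts registered, deployable
    DEPLOYED = "DEPLOYED"
    FAILED = "FAILED"

ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.PROVISIONING, JobState.FAILED},
    JobState.PROVISIONING: {JobState.TRAINING, JobState.FAILED},
    JobState.TRAINING: {JobState.EVALUATING, JobState.FAILED},
    JobState.EVALUATING: {JobState.READY, JobState.FAILED},
    JobState.READY: {JobState.DEPLOYED},
    # Failed jobs can re-enter the queue and resume from the last checkpoint.
    JobState.FAILED: {JobState.QUEUED},
    JobState.DEPLOYED: set(),
}

def transition(current: JobState, target: JobState) -> JobState:
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = transition(JobState.QUEUED, JobState.PROVISIONING)
```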