This question evaluates a candidate's competency in designing cloud-native ML platforms for fine-tuning large language models, covering distributed training, multi-tenant security and isolation, cost controls, reproducibility, observability, checkpointing, and deployment workflows.
You need to build a system that lets customers fine-tune their own large language model (LLM) on AWS.
Design a managed platform where users can:
Your design should cover the architecture, key APIs, training workflow, data management, and evaluation strategy.