This question evaluates an engineer's competency in ML system design for validating large language models. It tests knowledge of evaluation metrics, validation scopes ranging from internal numerics to final responses, system architecture for automated validation pipelines, and operational concerns, and it requires both conceptual architecture understanding and practical application-level thinking. Interviewers commonly use it to assess the ability to reason about correctness, safety, regression risk, scalability, reproducibility, and monitoring trade-offs when deploying production-grade validation pipelines, and to gauge familiarity with complementary evaluation methods and end-to-end data flows.
You are asked to design an end-to-end LLM quality validation system for a team that trains and serves large language models. The goal is to automatically and systematically evaluate correctness, safety, and regression risk for new model versions and for production responses.
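As a concrete illustration of what "regression risk" gating could mean in such a system, here is a minimal sketch of a promotion gate comparing a candidate model's golden-set metrics against a baseline. The metric names, thresholds, and data shapes are assumptions for illustration, not part of the question:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model_version: str
    accuracy: float               # fraction of golden-set prompts answered correctly
    safety_violation_rate: float  # fraction of responses flagged by a safety classifier


def passes_regression_gate(candidate: EvalResult,
                           baseline: EvalResult,
                           max_accuracy_drop: float = 0.01,
                           max_safety_increase: float = 0.0) -> bool:
    """Block promotion if the candidate regresses beyond tolerances."""
    if baseline.accuracy - candidate.accuracy > max_accuracy_drop:
        return False
    if candidate.safety_violation_rate - baseline.safety_violation_rate > max_safety_increase:
        return False
    return True


baseline = EvalResult("v1.4", accuracy=0.912, safety_violation_rate=0.004)
candidate = EvalResult("v1.5", accuracy=0.918, safety_violation_rate=0.003)
print(passes_regression_gate(candidate, baseline))  # True: no regression
```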
Design this system, covering:
- the evaluation metrics you would use;
- validation scopes, from internal numerics to final responses;
- the system architecture for the automated validation pipeline;
- operational concerns (scalability, reproducibility, monitoring).
Outline your design with clear components, data flows, and trade-offs, and call out important assumptions. Assume the underlying cloud primitives (compute instances, object storage, queues) are available.
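For orientation, one possible shape of the pipeline over those primitives is sketched below. This is a hedged sketch, not a reference implementation: the in-memory `Queue` and `ObjectStore` classes stand in for the assumed cloud primitives, and `score_response` is a hypothetical placeholder for real scorers:

```python
import json
import uuid


# Stand-ins for the assumed cloud primitives (queue + object storage).
# A real deployment would use managed services behind the same interfaces.
class Queue:
    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)

    def get(self):
        return self._items.pop(0) if self._items else None


class ObjectStore:
    def __init__(self):
        self._blobs = {}

    def write(self, key, data):
        self._blobs[key] = data

    def read(self, key):
        return self._blobs[key]


def score_response(prompt: str, response: str, expected: str) -> dict:
    """Toy correctness check; a real system would plug in exact-match,
    model-graded, and safety-classifier scorers here."""
    return {"exact_match": response.strip() == expected.strip()}


def run_worker(jobs: Queue, results: ObjectStore):
    """Validation worker: pull an eval job, score it, persist the result."""
    while (job := jobs.get()) is not None:
        metrics = score_response(job["prompt"], job["response"], job["expected"])
        key = f"eval-results/{job['model_version']}/{uuid.uuid4()}.json"
        results.write(key, json.dumps({**job, **metrics}))


jobs, results = Queue(), ObjectStore()
jobs.put({"model_version": "v1.5", "prompt": "2+2?", "response": "4", "expected": "4"})
run_worker(jobs, results)
```

The queue decouples evaluation submission from scoring throughput, and writing one immutable result object per job keeps the pipeline reproducible and easy to aggregate downstream.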