System Design: Secure, Ethical, Multi‑Tenant ML Data and Inference Platform
Context
Design a cloud-based ML platform used by multiple internal product teams. The platform must cover data ingestion, storage, training, and online/offline inference, while meeting strict security, privacy, and ethical standards. Assume:
- Multiple tenants (teams) share infrastructure but require strong isolation.
- A mix of structured and unstructured data, including PII and sensitive content.
- Both batch (training/offline scoring) and real-time (online inference) workloads.
Requirements
- Multi-tenant isolation, data classification, and PII handling
  - Isolation across compute, storage, and network.
  - Data classification taxonomy and enforcement.
  - PII handling: tokenization/de-identification, data minimization, and retention/deletion.
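One way to make the tokenization requirement concrete is deterministic, keyed tokenization at the ingestion boundary: a minimal sketch using HMAC-SHA256, assuming a hypothetical per-tenant key fetched from a secret manager (the key name below is illustrative, not part of the design).

```python
import hmac
import hashlib

def tokenize_pii(value: str, key: bytes) -> str:
    """Deterministically tokenize a PII value with keyed HMAC-SHA256.

    The same (key, value) pair always maps to the same token, so joins
    across datasets still work, while the raw value never needs to leave
    the ingestion boundary. Without the key, tokens are not reversible,
    and rotating the key invalidates all previously issued tokens.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical per-tenant key; in practice this comes from a KMS/vault.
tenant_key = b"per-tenant-secret-from-kms"
token = tokenize_pii("alice@example.com", tenant_key)
```

Deterministic tokens preserve join keys for training pipelines; fully random tokens with a lookup vault are the alternative when re-identification must be possible under controlled access.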
- Secrets, network, and access controls
  - Secret management with automated key rotation.
  - Network segmentation and egress controls.
  - Least-privilege access (RBAC/ABAC) with short-lived credentials.
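The short-lived-credential requirement can be sketched as a signed, expiring capability token, in the spirit of a JWT. This is an illustrative toy (the signing key and 15-minute TTL are assumptions); a production design would use an established token service rather than hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key; rotated automatically by the secret manager.
SIGNING_KEY = b"rotated-by-secret-manager"

def issue_credential(subject: str, role: str, ttl_s: int = 900) -> str:
    """Issue a short-lived token binding a subject to a role."""
    claims = {"sub": subject, "role": role, "exp": int(time.time()) + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_credential(token: str):
    """Return the claims if the signature is valid and unexpired, else None."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < time.time():
        return None  # expired: forces re-authorization, limiting blast radius
    return claims
```

Short expiries mean a leaked credential is only useful briefly, which pairs with key rotation: even long-lived leaks die when the signing key rolls.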
- Model governance
  - Approval gates in CI/CD, model registry, and lineage.
  - Red‑teaming and bias/safety/abuse audits.
  - Rollback and kill‑switch plans.
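The approval-gate idea reduces to a simple invariant: a model version cannot reach production until every required gate has signed off. A minimal sketch, assuming hypothetical gate names (`bias_audit`, `red_team`, `security_review` are illustrative):

```python
from dataclasses import dataclass, field

# Hypothetical gate set; in practice this is policy-driven per tenant/model tier.
REQUIRED_GATES = {"bias_audit", "red_team", "security_review"}

@dataclass
class ModelVersion:
    name: str
    version: int
    passed_gates: set = field(default_factory=set)
    stage: str = "staging"

def promote(model: ModelVersion) -> bool:
    """Promote to production only if every required approval gate passed."""
    if REQUIRED_GATES <= model.passed_gates:  # subset check: all gates cleared
        model.stage = "production"
        return True
    return False

def kill_switch(model: ModelVersion) -> None:
    """Immediately pull a model from serving, independent of any gate state."""
    model.stage = "disabled"
```

Keeping the gate check in the registry (rather than scattered across pipelines) gives one auditable enforcement point, and the kill switch deliberately bypasses the gate logic so incident response is never blocked on process.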
- Compliance and audit logging
  - High-level alignment with SOC 2 and GDPR/CCPA.
  - Tamper‑evident audit logging and retention.
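"Tamper-evident" is usually achieved by hash-chaining log entries so any after-the-fact edit breaks every subsequent hash. A self-contained sketch (in-memory list standing in for append-only storage):

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor

def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining its hash to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(event, sort_keys=True)  # canonical serialization
    entry_hash = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "hash": entry_hash})

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Periodically anchoring the latest hash in external write-once storage (or a signing service) extends this from tamper-evident to tamper-resistant, since an attacker would also need to alter the anchor.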
- Reliability, cost, and monitoring
  - SLOs for training and serving.
  - Cost controls and quotas.
  - Monitoring for data/model drift and misuse.
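For the drift-monitoring requirement, a common baseline metric is the Population Stability Index (PSI), comparing live feature distributions against the training snapshot. A minimal sketch (the 0.1/0.25 thresholds are the usual rule of thumb, not a platform mandate):

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and live traffic.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant.
    Bins are derived from the baseline's observed range.
    """
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features

    def bin_fractions(sample: list) -> list:
        counts = [0] * bins
        for x in sample:
            # clamp so out-of-range live values land in the edge bins
            i = min(max(int((x - lo) / span * bins), 0), bins - 1)
            counts[i] += 1
        # floor at a small epsilon so empty bins don't blow up the log term
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Running this per feature on a schedule, and alerting per tenant when the index crosses a threshold, gives a cheap first line of defense before heavier model-quality checks.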
- Architecture and trade-offs
  - Provide an end-to-end architecture diagram.
  - Discuss the key trade-offs of the design.