PracHub

Design scalable, highly available GenAI serving

Last updated: Mar 29, 2026

Quick Overview

This question evaluates your understanding of scalable, highly available generative AI inference platforms. It covers distributed systems, ML model serving, autoscaling and GPU scheduling, global request routing, model and version management, stateful dependency handling, observability, and rate limiting.



Company: Oracle

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design the deployment of a generative AI model for high scalability and high availability. Describe the inference serving architecture, request routing, autoscaling (including GPU scheduling), multi-region failover, model versioning and rollout, stateful dependency management (tokenizer, embeddings, caches), observability, rate limiting, and strategies to meet latency/throughput SLOs under traffic spikes and failures.


Sep 6, 2025, 12:00 AM

System Design: Highly Scalable, Highly Available Generative AI Inference Platform

Context

Design a production-grade deployment for a generative AI text model (decoder-only Transformer, 7B–70B parameters) serving enterprise, multi-tenant traffic. The platform must scale with demand, stay highly available across regions, and absorb unpredictable traffic spikes.

You may make minimal, explicit assumptions to ground your design (e.g., target SLOs for time-to-first-token and throughput, typical prompt/output lengths, GPU types).
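
One way to make such assumptions explicit is a back-of-envelope KV-cache estimate, since KV-cache memory often bounds how many sequences a GPU can serve concurrently. The sketch below is illustrative only: the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values), the 40 GiB KV budget, and the 2,048-token sequence length are all assumptions chosen for the example, not values given in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; every value here is an assumption."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # fp16/bf16 KV cache (assumed)


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key and one value vector (factor 2) per layer, per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def max_concurrent_sequences(m: ModelShape, kv_budget_bytes: int,
                             tokens_per_seq: int) -> int:
    # How many full-length sequences fit in the memory reserved for KV cache.
    return kv_budget_bytes // (kv_bytes_per_token(m) * tokens_per_seq)


# Assumed 7B-class shape without grouped-query attention.
ASSUMED_7B = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_bytes_per_token(ASSUMED_7B)  # 524288 bytes = 0.5 MiB
# Assumed: 40 GiB left for KV cache after weights, 2048 tokens per sequence.
seqs = max_concurrent_sequences(ASSUMED_7B, 40 * 1024**3, 2048)  # 40

print(f"{per_token / 2**20:.2f} MiB per token, {seqs} concurrent sequences")
```

Under these assumptions each sequence needs about 1 GiB of KV cache, which is why batching limits, paged KV allocation, and prompt/output length caps tend to show up early in the design discussion.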

Requirements

Describe and justify your design for the following:

  1. Inference serving architecture
    • Components and data/control planes
    • Streaming vs non-streaming; batching; cache usage
  2. Request routing
    • Global and regional routing, session affinity, retries/hedging
  3. Autoscaling (including GPU scheduling)
    • Replica scaling signals, node autoscaling, bin-packing/MIG, warm pools
  4. Multi-region strategy
    • Active-active vs active-passive, failover triggers, data/control plane considerations
  5. Model versioning and rollout
    • Registry, artifact management, canary/blue-green, rollback, compatibility (tokenizer/adapters)
  6. Stateful dependency management
    • Tokenizer/embeddings versioning, KV/prompt caches, locality/affinity, external stores
  7. Observability
    • Metrics/traces/logs at model/tenant/version levels; GPU health; SLO dashboards and alerting
  8. Rate limiting and fairness
    • Per-tenant budgets, token-based limits, concurrency caps, overload protection
  9. Meeting latency/throughput SLOs under spikes and failures
    • Admission control, dynamic batching, speculative decoding, degradation and fallbacks
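
As one illustration of item 8, per-tenant token-based limits are often modeled as a token bucket whose unit is model tokens (prompt plus requested output) rather than requests. The following is a minimal sketch under that assumption; `TokenBucket`, `TenantLimiter`, and the rate and capacity values are hypothetical names and numbers, and a real deployment would typically keep this state in a shared store rather than in process memory.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Token bucket measured in model tokens (hypothetical helper)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # model tokens refilled per second
        self.capacity = capacity    # maximum burst, in model tokens
        self.clock = clock          # injectable clock for testing
        self.level = capacity       # start with a full bucket
        self.last = clock()

    def try_consume(self, cost: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.level = min(self.capacity,
                         self.level + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, created lazily with the same assumed budget."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def admit(self, tenant: str, prompt_tokens: int,
              max_new_tokens: int) -> bool:
        bucket = self.buckets.setdefault(
            tenant, TokenBucket(self.rate, self.capacity, self.clock))
        # Charge the worst case up front: prompt plus requested output.
        return bucket.try_consume(prompt_tokens + max_new_tokens)


limiter = TenantLimiter(rate=1_000.0, capacity=8_000.0)
print(limiter.admit("tenant-a", prompt_tokens=512, max_new_tokens=256))
```

Charging the worst-case output length at admission is a conservative choice; an alternative is to charge the prompt up front and settle the actual output tokens when the stream ends.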

Provide a clear end-to-end flow and the key trade-offs behind your choices.
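
To ground the dynamic batching point from item 9, the sketch below shows the basic trade-off in request-level batching: wait a little longer to fill a batch (better throughput) or dispatch early (better latency). It is a simplified illustration only. `collect_batch`, `max_batch`, and `max_wait_s` are hypothetical names, and production LLM servers commonly use continuous (iteration-level) batching instead, which this sketch does not implement.

```python
import queue
import time
from typing import Any, List


def collect_batch(q: "queue.Queue[Any]", max_batch: int,
                  max_wait_s: float) -> List[Any]:
    """Gather up to max_batch requests, waiting at most max_wait_s
    after the first request arrives."""
    batch = [q.get()]  # block until at least one request is available
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget for batch formation is spent
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch


requests: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    requests.put(f"req-{i}")

first = collect_batch(requests, max_batch=4, max_wait_s=0.01)
second = collect_batch(requests, max_batch=4, max_wait_s=0.01)
print(first, second)
```

The `max_wait_s` window adds directly to time-to-first-token, so it is usually set well below the TTFT SLO and tightened or relaxed based on queue depth.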
