PracHub

Explain SRE architecture and troubleshooting scenarios

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in SRE architecture and reliability engineering, covering Kubernetes fundamentals, pod failure troubleshooting, SLO/SLA/SLI reasoning, multi-layer latency diagnosis, and SQL versus NoSQL trade-offs within the System Design domain.

Company: TikTok

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Explain the difference between a Pod and a Deployment in Kubernetes and describe the role of a Service. How would you debug a Pod that fails to start? List the concrete steps and typical kubectl commands you would use. Define and contrast SLO, SLA, and SLI; if the actual error rate exceeds the SLO, what actions would you take, and how would you prioritize them? Given a distributed system with elevated request latency, outline a structured troubleshooting approach across the network, load balancer, database, cache, and the service itself, including the specific metrics to check and the experiments to run. Compare SQL and NoSQL data stores and describe scenarios where NoSQL is the better fit.

Sep 6, 2025, 12:00 AM

Kubernetes, Reliability, and Data Store Concepts (Technical Screen)

Context: Assume you operate a high-scale, latency-sensitive microservices platform running on Kubernetes. Answer the following practical questions.

1) Kubernetes Building Blocks

  • Explain the difference between a Pod and a Deployment in Kubernetes.
  • Describe the role of a Service.
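One way to ground an answer here is to show how the three objects relate through labels. The sketch below is illustrative only: the names `web`, `example.com/web:1.0`, and port 8080 are assumptions, and `service_selects` is a hypothetical helper that mimics equality-based label selection, not a Kubernetes API. The dictionaries mirror the shape of `apps/v1` Deployment and `v1` Service manifests.

```python
import json

# Hypothetical labels shared by the Pod template and the Service selector.
LABELS = {"app": "web"}

# A Deployment declares a desired replica count and a Pod template;
# its controller creates and replaces Pods to match that declaration.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": LABELS},
        "template": {  # this template is the Pod definition
            "metadata": {"labels": LABELS},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "example.com/web:1.0",  # assumed image name
                        "ports": [{"containerPort": 8080}],
                    }
                ]
            },
        },
    },
}

# A Service gives the changing set of Pods one stable name and port.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web"},
    "spec": {"selector": LABELS, "ports": [{"port": 80, "targetPort": 8080}]},
}


def service_selects(svc, pod_labels):
    """Return True if every selector key/value appears in the Pod's labels."""
    selector = svc["spec"]["selector"]
    return all(pod_labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    print(service_selects(service, pod_labels))
    print(json.dumps(service, indent=2))
```

The point of the sketch is that the Service never references the Deployment by name; it only matches Pod labels, which is why a mismatched selector is a common cause of a Service with no endpoints.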

2) Debugging a Pod That Fails to Start

List concrete steps and typical kubectl commands you would use to troubleshoot a Pod that is stuck in Pending, ImagePullBackOff, or CrashLoopBackOff, has been OOMKilled, or is failing health probes.
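A possible way to organize an answer is a symptom-to-command table. The sketch below only builds command strings; it does not run kubectl. The grouping, the symptom key `ProbeFailure`, the helper name `triage_commands`, and port 8080 are assumptions made for illustration, and `kubectl top` additionally assumes a metrics server is installed in the cluster.

```python
# Commands worth running for any Pod that will not start.
COMMON = [
    "kubectl get pod {pod} -n {ns} -o wide",
    "kubectl describe pod {pod} -n {ns}",
    "kubectl get events -n {ns} --sort-by=.metadata.creationTimestamp",
]

# Hypothetical triage table: symptom -> follow-up commands.
BY_SYMPTOM = {
    # Scheduling problems: node capacity, taints, unbound volume claims.
    "Pending": [
        "kubectl get nodes",
        "kubectl describe nodes",
        "kubectl get pvc -n {ns}",
    ],
    # Image name, tag, or registry credentials.
    "ImagePullBackOff": [
        "kubectl get pod {pod} -n {ns} -o yaml",
        "kubectl get secrets -n {ns}",
    ],
    # The container starts and exits: read logs of the previous attempt.
    "CrashLoopBackOff": [
        "kubectl logs {pod} -n {ns} --previous",
        "kubectl logs {pod} -n {ns} --all-containers",
    ],
    # Compare memory limits in the spec with observed usage.
    "OOMKilled": [
        "kubectl get pod {pod} -n {ns} -o yaml",
        "kubectl top pod {pod} -n {ns} --containers",
    ],
    # Hit the probe endpoint by hand; port 8080 is an assumed value.
    "ProbeFailure": [
        "kubectl logs {pod} -n {ns}",
        "kubectl port-forward pod/{pod} 8080:8080 -n {ns}",
    ],
}


def triage_commands(symptom, pod, ns="default"):
    """Return the common commands followed by symptom-specific ones."""
    templates = COMMON + BY_SYMPTOM.get(symptom, [])
    return [template.format(pod=pod, ns=ns) for template in templates]


if __name__ == "__main__":
    for command in triage_commands("CrashLoopBackOff", "web-0", "prod"):
        print(command)
```

Starting with `describe` and events for every symptom reflects a common habit: the Events section usually names the failing step (scheduling, image pull, probe) before any logs exist.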

3) SLO, SLA, and SLI

  • Define and contrast SLO, SLA, and SLI.
  • If the actual error rate exceeds the SLO, what actions would you take and how would you prioritize them?
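The relationship between the three terms can be made concrete with a small error-budget calculation. This is a minimal sketch under stated assumptions: the SLI is taken to be the fraction of successful requests in a window, the SLO is a target for that fraction, and `error_budget_report` and its field names are hypothetical. An SLA would sit on top of this as a contractual commitment, typically looser than the SLO.

```python
def error_budget_report(total_requests, failed_requests, slo_target):
    """Summarize SLI, error budget, and burn rate for one window.

    slo_target is a success ratio strictly between 0 and 1, e.g. 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")

    error_rate = failed_requests / total_requests
    sli = 1 - error_rate                  # measured success ratio
    allowed_error_rate = 1 - slo_target   # the error budget, as a ratio
    # Burn rate above 1 means the budget is being spent faster than allowed.
    burn_rate = error_rate / allowed_error_rate
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_error_rate * total_requests,
        "burn_rate": burn_rate,
    }


if __name__ == "__main__":
    report = error_budget_report(
        total_requests=1_000_000, failed_requests=2_000, slo_target=0.999
    )
    for name, value in report.items():
        print(name, value)
```

In this framing, a burn rate well above 1 argues for mitigation first (roll back, shed load, fail over) and for pausing risky releases, while a rate only slightly above 1 may be handled through prioritized reliability work.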

4) Elevated Request Latency in a Distributed System

Outline a structured troubleshooting approach across the network, load balancer, database, cache, and the service itself. Include:

  • Specific metrics to check at each layer.
  • Concrete experiments to run to isolate the bottleneck.
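One experiment that fits this outline is to compare tail latency per layer using span durations from distributed traces. The sketch below assumes such durations have already been exported as lists of milliseconds keyed by layer; the layer names, the sample values, and the helpers `percentile` and `slowest_layer` are illustrative, and the percentile uses the nearest-rank method rather than any particular monitoring system's definition.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def slowest_layer(spans_ms, pct=99):
    """Return (layer_name, tails) where tails maps each layer to its percentile."""
    tails = {layer: percentile(values, pct) for layer, values in spans_ms.items()}
    return max(tails, key=tails.get), tails


if __name__ == "__main__":
    # Made-up span durations in milliseconds, for illustration only.
    spans_ms = {
        "network": [1, 2, 3, 4],
        "load_balancer": [1, 1, 2, 2],
        "database": [5, 50, 7, 6],
        "cache": [1, 1, 1, 2],
        "service": [8, 9, 10, 12],
    }
    layer, tails = slowest_layer(spans_ms)
    print(layer, tails)
```

Comparing a tail percentile rather than the mean matters because a few slow database queries or cache misses can dominate user-visible latency while leaving averages almost unchanged.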

5) SQL vs NoSQL

Compare SQL and NoSQL data stores and describe scenarios where NoSQL is the better fit.
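The contrast in access patterns can be sketched with the standard library alone. Below, an in-memory SQLite database stands in for a relational store and a plain dictionary stands in for a key-addressed document store; the schema, the `user:1` key format, and the sample data are assumptions for illustration, and a dictionary obviously omits the distribution and replication behavior of a real NoSQL system.

```python
import sqlite3

# Relational style: normalized tables, constraints, and an ad hoc join.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total_cents INTEGER NOT NULL
    );
    INSERT INTO users VALUES (1, 'ada');
    INSERT INTO orders VALUES (10, 1, 500), (11, 1, 250);
    """
)
sql_total = conn.execute(
    "SELECT SUM(o.total_cents) "
    "FROM users u JOIN orders o ON o.user_id = u.id "
    "WHERE u.name = ?",
    ("ada",),
).fetchone()[0]

# Document style: one denormalized record per key, read in a single lookup.
documents = {
    "user:1": {
        "name": "ada",
        "orders": [
            {"id": 10, "total_cents": 500},
            {"id": 11, "total_cents": 250},
        ],
    }
}
doc_total = sum(order["total_cents"] for order in documents["user:1"]["orders"])

if __name__ == "__main__":
    print(sql_total, doc_total)
```

The relational version can answer new questions with a different query, at the cost of joins; the document version serves one known access pattern with a single key lookup, which is the shape of workload where a NoSQL store tends to be the better fit.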
