Design low-latency large-scale hotel booking system
Company: TikTok
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
You are asked to design the backend for a **large-scale hotel booking system** that runs behind a very high-traffic consumer app (think a TikTok-like app where a hotel goes viral and suddenly millions of users click into the same property).
Users can:
- Browse hotels for a city and date range.
- View **near real-time availability and prices** for a specific hotel.
- Place a **booking request** and receive a **confirmation or rejection**.
The interviewer gives you the high-level requirements and asks you to reason carefully about **business properties first**, then derive the architecture and key trade-offs.
### Functional requirements
- Search for hotels by city, date range, and basic filters.
- For a given hotel and date range, show up-to-date availability (rooms left) and price.
- Create a booking for a chosen hotel, room type, and date range.
- Guarantee that **the same room-night is not double-booked**.
- Optionally support cancellation (you can keep this high-level).
### Non-functional requirements
- **High concurrency:**
  - Assume up to ~100k read requests/sec (availability checks) and ~10k booking attempts/sec at peak.
- **Low latency:**
  - P99 latency for availability checks and booking confirmation should be **< 200 ms** end-to-end from the client’s perspective.
- **High availability:**
  - The system must continue working if individual nodes or even a whole zone go down.
- **Consistency characteristics:**
  - **Weak/eventual consistency is acceptable for displaying availability counts on the UI.** A user may occasionally see an outdated count.
  - **Strong correctness is required for final booking confirmation** (no double-booking).
- **Message loss tolerance:**
  - For streaming / push-style **availability updates** to clients, it is acceptable if **some update messages are lost** (they will be refreshed soon anyway).
  - It is **not acceptable** to lose actual booking requests or confirmations.
- **Hotspot handling:**
  - Some hotels may become **extremely hot** (e.g., after a viral video), causing a huge, skewed load.
  - The system should **avoid concentrating all traffic for one hot hotel on a single node**.
- **Latency vs reliability trade-off:**
  - You should discuss **protocol choices** (e.g., HTTP vs WebSocket vs UDP or similar) and how they impact latency and reliability.
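
The message-loss tolerance above is easier to satisfy if each pushed availability update carries an absolute count plus a version, rather than a delta. The following is a minimal sketch under that assumption. `AvailabilityUpdate`, `AvailabilityView`, and the per-key `version` counter are hypothetical names introduced for illustration, not something the question prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvailabilityUpdate:
    hotel_id: str
    room_type: str
    version: int     # assumed to increase monotonically per (hotel, room type)
    rooms_left: int  # absolute count, so a lost message never corrupts state


class AvailabilityView:
    """Client-side view that tolerates lost, duplicated, or reordered updates."""

    def __init__(self) -> None:
        self._latest: dict[tuple[str, str], AvailabilityUpdate] = {}

    def apply(self, update: AvailabilityUpdate) -> bool:
        """Apply an update; return False if it was stale and ignored."""
        key = (update.hotel_id, update.room_type)
        current = self._latest.get(key)
        if current is not None and update.version <= current.version:
            return False  # older than, or the same as, what we already show
        self._latest[key] = update
        return True

    def rooms_left(self, hotel_id: str, room_type: str) -> Optional[int]:
        """Return the last displayed count, or None if nothing was received."""
        update = self._latest.get((hotel_id, room_type))
        return None if update is None else update.rooms_left
```

With this shape, a dropped update only delays the display until the next one arrives, which is why such a channel could plausibly run over a lossy or best-effort transport. Booking requests and confirmations carry no such slack and would need a reliable, acknowledged path.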
### Specific discussion points the interviewer cares about
1. **Latency vs reliability:**
   - When and why might you choose a lower-overhead transport such as **UDP**, or a **persistent connection such as WebSocket**, instead of plain HTTP request/response for certain flows?
   - Which parts of the system can tolerate message loss, and which cannot?
2. **Preventing double bookings for the same hotel/room:**
   - Under high concurrency, how do you ensure two users cannot both successfully book the **last available room-night**?
   - Discuss techniques such as **rate limiting, request coalescing/merging, throttling, and concurrency control** (locks, atomic counters, queues, etc.).
3. **Sharding / horizontal scaling for hot hotels:**
   - How would you horizontally partition the load for a very hot hotel?
   - Compare strategies like **sharding by room** (e.g., room ID or room type) versus **bucketizing by user** (e.g., hashing on user ID) and discuss pros/cons.
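
To make the "bucketizing by user" option concrete, here is a rough single-process sketch. It assumes the stock for one hot room-night is pre-split across several buckets, each of which would sit on a different shard in a real deployment, and that a user's hashed ID picks the bucket to try first. `BucketedInventory` and its methods are hypothetical names for illustration.

```python
import hashlib


class BucketedInventory:
    """Splits one hot room-night's stock across independent buckets."""

    def __init__(self, total_rooms: int, num_buckets: int) -> None:
        if num_buckets < 1 or total_rooms < 0:
            raise ValueError("need at least one bucket and non-negative stock")
        base, extra = divmod(total_rooms, num_buckets)
        # The first `extra` buckets get one more room so the sum is exact.
        self.buckets = [base + (1 if i < extra else 0)
                        for i in range(num_buckets)]

    def _home_bucket(self, user_id: str) -> int:
        # Stable hash, so the same user always starts at the same bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.buckets)

    def try_reserve(self, user_id: str) -> bool:
        """Take one room, probing the home bucket first and then the others."""
        start = self._home_bucket(user_id)
        count = len(self.buckets)
        for offset in range(count):
            i = (start + offset) % count
            # In a distributed version this check-and-decrement would have
            # to be a single atomic operation on the shard that owns bucket i.
            if self.buckets[i] > 0:
                self.buckets[i] -= 1
                return True
        return False
```

The trade-off this exposes: bucketizing by user spreads write load evenly even when everyone wants the same room type, but stock can be stranded in other buckets, so a sold-out answer needs a fallback probe or a rebalancing step. Sharding by room or room type keeps each count in one place, which makes the count simple to reason about, but one viral room type still lands on one shard.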
Assume you are free to pick any technologies (e.g., relational vs NoSQL databases, caches like Redis, message queues like Kafka, etc.).
**Task:**
Design the system at a high level:
- Identify the major components/services.
- Propose a data model for hotels, rooms, inventory, and bookings.
- Describe the end-to-end flows for **availability lookup** and **booking confirmation**.
- Explain how your design achieves:
  - Low latency with acceptable reliability trade-offs.
  - No double-booking despite high concurrency.
  - Good handling of **hot hotels** via sharding/partitioning.
- Explicitly call out the key trade-offs and why you made those choices.
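
As a starting point for the data-model part of the task, the entities might be sketched as below. This is one possible shape, not a reference answer. Every class and field name is an assumption made for illustration, including the `idempotency_key` used so that a retried booking request is not recorded twice.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class BookingStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    CANCELLED = "cancelled"


@dataclass(frozen=True)
class Hotel:
    hotel_id: str
    name: str
    city: str


@dataclass(frozen=True)
class RoomType:
    hotel_id: str
    room_type_id: str
    description: str
    total_rooms: int


@dataclass
class InventoryRow:
    # One row per (hotel, room type, night): the unit that must not oversell.
    hotel_id: str
    room_type_id: str
    night: date
    rooms_left: int
    price_cents: int


@dataclass
class Booking:
    booking_id: str
    idempotency_key: str  # chosen by the client, reused on every retry
    user_id: str
    hotel_id: str
    room_type_id: str
    check_in: date
    check_out: date
    status: BookingStatus = BookingStatus.PENDING


class BookingStore:
    """In-memory stand-in for a durable bookings table."""

    def __init__(self) -> None:
        self._by_key: dict[str, Booking] = {}

    def create(self, booking: Booking) -> Booking:
        # A retry with the same key gets the original record back instead
        # of creating a second booking.
        return self._by_key.setdefault(booking.idempotency_key, booking)
```

Keying inventory by night rather than by stay keeps the no-double-booking check local to individual rows, at the cost of touching several rows for a multi-night stay.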
Quick Answer: This question evaluates understanding of large-scale backend and distributed system design, including scalability, high availability, consistency models, concurrency control to prevent double-booking, hotspot mitigation, and protocol trade-offs for low-latency booking and availability flows.