You are the on-call engineer for a delivery platform.
System context
- Couriers use a mobile app to accept and complete deliveries.
- The mobile app calls a public gateway service (Dasher Service), which then calls a Payment Card Integration Service.
- For some merchants, the courier must pay in person using a prepaid debit card.
- That card is funded programmatically during checkout through a third-party payment card provider.
- The integration service also relies on Redis for card and account information caching.
- The company is in the middle of migrating from a monolith to microservices.
High-level flow:
Courier App -> Dasher Service -> Payment Card Integration Service -> Third-Party Card Provider
Payment Card Integration Service <-> Redis cache
Incident
It is 4:30 PM Pacific, during a busy period, and you are paged because the Payment Card Integration Service is showing much higher than expected memory utilization.
Explain how you would handle this on-call investigation. Your answer should cover:
- How you would assess severity and business impact.
- What metrics, dashboards, and logs you would check first.
- The most likely causes of high memory usage in this architecture.
- How you would determine whether the issue is caused by traffic, a recent deploy, Redis behavior, retries, or the third-party provider.
- Immediate mitigation steps you would consider.
- How you would communicate during the incident.
- What long-term fixes or follow-up actions you would propose after recovery.
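
As a possible starting point for the cause-isolation question, here is a minimal sketch that ranks candidate signals by how strongly they track memory usage over the same time window. It is an illustration, not part of the scenario: the signal names (`request_rate`, `retry_rate`, `redis_latency_ms`) and the sample values are hypothetical, and in practice the series would come from whatever metrics system the team uses. Correlation only suggests where to look first; it does not establish the cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat series cannot explain a change in memory.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(memory_mb, signals):
    """Return (signal name, correlation) pairs, strongest relationship first."""
    scores = {name: pearson(memory_mb, series) for name, series in signals.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken at the same four timestamps.
    memory_mb = [100, 200, 300, 400]
    signals = {
        "request_rate": [50, 50, 50, 50],      # flat traffic
        "retry_rate": [1, 2, 3, 4],            # retries climbing with memory
        "redis_latency_ms": [4, 1, 3, 2],      # noisy, weakly related
    }
    for name, score in rank_suspects(memory_mb, signals):
        print(f"{name}: {score:+.2f}")
```

With these made-up samples, retries rise in step with memory while traffic stays flat, so retry behavior would be the first thing to inspect. A deploy marker could be checked the same way by comparing memory growth before and after the deploy timestamp.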