You work on a mobile music streaming team responsible for improving playback reliability while minimizing battery drain.
When the app is streaming audio, the network may appear to disconnect. Sometimes the failure is temporary, and retrying the network request a few times is enough to reconnect. Other times the network is truly unavailable, and repeated retries waste battery and radio power.
You are given historical event-level data with two columns:
| Column | Type | Description |
|---|---|---|
| retry_count | integer | Number of retry attempts made during a network interruption event before the event ended. |
| is_success | boolean | Whether the app eventually reconnected and obtained audio data. |

The team wants to set a retry threshold T: the app will retry at most T times, then stop retrying and fail gracefully.
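
To make the policy concrete, here is a minimal sketch of how a cap of T retries could be replayed over the historical events. It is an illustration under assumptions, not a prescribed method. The `Event` class, the `replay_threshold` helper, and the sample events are hypothetical. The sketch also assumes that an event's outcome would have been the same under a different cap, which the logged data may not support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One historical interruption event (hypothetical container for the two columns)."""
    retry_count: int
    is_success: bool


def replay_threshold(events, t):
    """Naively replay a cap of t retries over historical events.

    Assumption: a success that needed k retries still succeeds if k <= t,
    and becomes a failure after t retries if k > t. Failed events spend
    min(k, t) retries. Real behavior under a new cap may differ.
    """
    if not events:
        raise ValueError("events must not be empty")
    total = len(events)
    recovered = sum(1 for e in events if e.is_success and e.retry_count <= t)
    retries_spent = sum(min(e.retry_count, t) for e in events)
    return {
        "threshold": t,
        "success_rate": recovered / total,    # share of events that reconnect
        "mean_retries": retries_spent / total,  # rough proxy for battery cost
    }


# Made-up events for illustration only, not real data.
events = [
    Event(1, True),
    Event(2, True),
    Event(5, True),
    Event(3, False),
    Event(8, False),
]

for t in range(0, 9):
    print(replay_threshold(events, t))
```

Each printed row pairs a candidate T with the share of events that would still reconnect and the average number of retries spent, which is one way to see the reliability versus battery tradeoff side by side.
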
How would you use this data to choose an optimal retry threshold? Discuss the metrics, tradeoffs, statistical approach, possible biases in the historical data, and how you would validate the threshold before launch.