You are given an approximate histogram of search-query frequencies. Each bucket i is represented as (left_bd_i, right_bd_i, bucket_count_i), where:
-
left_bd_i
and
right_bd_i
are the lower and upper boundaries of the bucket,
-
bucket_count_i
is the number of queries whose true
search_count
falls in that bucket.
Assume there are K non-overlapping buckets sorted by boundary, and you do not have access to the raw per-query search_count values.
How would you estimate the nth percentile of the underlying search_count distribution?
Your answer should address:
-
How to identify which bucket contains the desired percentile.
-
Why simply returning the midpoint of that bucket can be a poor estimate.
-
How to improve the estimate using interpolation within the bucket.
-
What assumptions are required for that interpolation to be reasonable.
-
How to handle edge cases such as empty buckets, percentiles near 0 or 100, and coarse or uneven bucket widths.