This question evaluates understanding of binary classification metrics (precision and recall) and the ability to handle unreliable API behavior so that metric computation stays robust. It tests both conceptual understanding and practical application.
You have 10 image files. Each file has a ground-truth label indicating whether it contains a dog.
You can call an API such as searchDogs(k), which is intended to return k file IDs that the system predicts are dogs (e.g., the top-k results for the query "dog").
Tasks:
Suppose searchDogs(k) is unreliable (e.g., it returns None, throws, returns fewer than k items, returns more than k items, returns duplicates, or returns unknown file IDs). How would you handle these cases so that the precision and recall computation is robust and well-defined?
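One possible way to handle the failure modes listed above is sketched below in Python. It is an illustration under stated assumptions, not a reference answer: the names `sanitize` and `precision_recall`, the `retries` parameter, and the `labels` dictionary (file ID mapped to True when the file contains a dog) are hypothetical, and `search` stands in for searchDogs. The sketch assumes file IDs are strings, retries only when the call throws, treats a None result as an empty result, truncates over-long results to the first k valid IDs, and uses the number of valid returned IDs (not k) as the precision denominator. Other conventions are equally defensible as long as they are stated up front.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


def sanitize(raw: Optional[Iterable[str]], k: int, known_ids: Set[str]) -> List[str]:
    """Drop unknown IDs and duplicates, keep rank order, cap at k items."""
    if raw is None:  # assumption: None means "no results"
        return []
    clean: List[str] = []
    seen: Set[str] = set()
    for file_id in raw:
        if len(clean) >= k:  # ignore anything beyond the first k valid IDs
            break
        if file_id in known_ids and file_id not in seen:
            seen.add(file_id)
            clean.append(file_id)
    return clean


def precision_recall(
    search: Callable[[int], Optional[Iterable[str]]],
    k: int,
    labels: Dict[str, bool],
    retries: int = 2,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); a value is None when its denominator is zero."""
    raw = None
    for _ in range(retries + 1):
        try:
            raw = search(k)
            break
        except Exception:
            continue  # assumption: retry on any exception, then give up

    predicted = sanitize(raw, k, set(labels))
    true_positives = sum(1 for file_id in predicted if labels[file_id])
    total_positives = sum(1 for is_dog in labels.values() if is_dog)

    # Precision is undefined when nothing valid was returned;
    # recall is undefined when the ground truth has no dogs.
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / total_positives if total_positives else None
    return precision, recall
```

Returning None for an undefined metric, rather than silently substituting 0 or 1, keeps the "no valid predictions" case distinguishable from "all predictions wrong". If a single number is required, the substitution rule should be an explicit, documented choice.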