Build an Ollama Eval Harness
Company: Mercor
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
Design and implement a lightweight evaluation harness for a language model served locally through Ollama.
The harness should:
- Read a dataset of evaluation cases, where each case contains an `id`, a `prompt`, and optionally a `reference_answer` or expected label.
- Send each prompt to an Ollama-hosted model and capture the model output.
- Record metadata such as model name, latency, failures, and retry attempts.
- Compute useful evaluation metrics such as exact match for deterministic tasks, pass rate, average latency, and error rate.
- Save both per-example results and an aggregate summary report.
- Expose configuration for model name, timeout, retries, temperature, and concurrency.
- Be structured so that another model backend could be added later with minimal changes.
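The requirements above can be sketched as a small core loop. This is a hedged sketch, assuming a JSONL dataset (one JSON object per line) and an injectable `call_model` callable so the transport can be swapped; all names here (`EvalCase`, `CaseResult`, `run_case`) are illustrative, not a fixed API:

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    id: str
    prompt: str
    reference_answer: Optional[str] = None

@dataclass
class CaseResult:
    case_id: str
    output: Optional[str]
    latency_s: float
    attempts: int
    error: Optional[str] = None

def load_cases(path: str) -> list[EvalCase]:
    """Read one JSON object per line: {"id", "prompt", optional "reference_answer"}."""
    cases = []
    with open(path) as f:
        for line in f:
            if line.strip():
                d = json.loads(line)
                cases.append(EvalCase(d["id"], d["prompt"], d.get("reference_answer")))
    return cases

def run_case(case: EvalCase, call_model: Callable[[str], str],
             max_retries: int = 2, backoff_s: float = 0.5) -> CaseResult:
    """Call the model with exponential-backoff retries; failures are recorded, not raised."""
    start = time.monotonic()
    attempts, output, error = 0, None, None
    for attempt in range(max_retries + 1):
        attempts = attempt + 1
        try:
            output = call_model(case.prompt)
            error = None
            break
        except Exception as exc:
            error = str(exc)
            time.sleep(backoff_s * (2 ** attempt))
    return CaseResult(case.id, output, time.monotonic() - start, attempts, error)
```

Keeping `call_model` as a plain callable means the same loop serves a live Ollama client, a mock in tests, or a future backend, and the `CaseResult` record already carries the latency, retry, and failure metadata the harness must report.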
Explain the system design, major components, data flow, and how you would make the harness reliable and extensible.
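One common way to make the backend swappable is a minimal interface that new providers implement. The sketch below uses a `Protocol` with a single `generate` method; `OllamaBackend` targets Ollama's non-streaming `/api/generate` HTTP endpoint, while the class and parameter names (`ModelBackend`, `EchoBackend`, `host`, `timeout_s`) are illustrative choices, not a prescribed design:

```python
import json
import urllib.request
from typing import Protocol, runtime_checkable

@runtime_checkable
class ModelBackend(Protocol):
    """Anything that turns a prompt into text; new backends implement only this."""
    def generate(self, prompt: str) -> str: ...

class OllamaBackend:
    """Calls Ollama's /api/generate endpoint with streaming disabled."""
    def __init__(self, model: str, host: str = "http://localhost:11434",
                 temperature: float = 0.0, timeout_s: float = 60.0):
        self.model = model
        self.host = host
        self.temperature = temperature
        self.timeout_s = timeout_s

    def generate(self, prompt: str) -> str:
        payload = json.dumps({
            "model": self.model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": self.temperature},
        }).encode()
        req = urllib.request.Request(
            f"{self.host}/api/generate", data=payload,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=self.timeout_s) as resp:
            # Non-streaming responses carry the full completion in "response".
            return json.loads(resp.read())["response"]

class EchoBackend:
    """Trivial stand-in backend, useful for tests and as a template for new providers."""
    def generate(self, prompt: str) -> str:
        return f"echo:{prompt}"
```

Because the harness depends only on `ModelBackend`, adding another provider later means writing one small class rather than touching the evaluation loop or the metrics code.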
Quick Answer: This question evaluates a candidate's ability to design and implement a lightweight evaluation harness for a locally hosted language model. It tests ML system design, service integration, metrics instrumentation, reliability engineering (timeouts, retries), concurrency control, configuration management, and extensibility.
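The metrics side of the answer can be shown concretely. This sketch aggregates per-example records into the summary report; it assumes "pass rate" means the fraction of cases that completed without error (one possible reading) and that exact match applies only to cases that have a reference answer. `Record` and `summarize` are illustrative names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    output: Optional[str]
    reference: Optional[str]
    latency_s: float
    error: Optional[str] = None

def summarize(records: list[Record]) -> dict:
    """Aggregate exact match, pass rate, average latency, and error rate."""
    total = len(records)
    errors = sum(1 for r in records if r.error is not None)
    ok = [r for r in records if r.error is None]
    # Exact match is only defined for cases that carry a reference answer.
    scored = [r for r in ok if r.reference is not None]
    matches = sum(1 for r in scored
                  if r.output is not None and r.output.strip() == r.reference.strip())
    return {
        "total": total,
        "error_rate": errors / total if total else 0.0,
        "pass_rate": len(ok) / total if total else 0.0,  # assumption: pass = no error
        "exact_match": matches / len(scored) if scored else None,
        "avg_latency_s": sum(r.latency_s for r in ok) / len(ok) if ok else None,
    }
```

Returning `None` rather than `0.0` when no case is scorable keeps the report honest: an empty denominator is reported as "not applicable" instead of a misleading zero.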