Clean OCR data and build an LLM dataset
Company: Microsoft
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
## Problem: OCR data practice (cleaning → LLM-ready data)
You are given an OCR dataset intended for training or fine-tuning an LLM that improves OCR text quality.
### Input
A dataset of records like:
- `image_id` (or image path)
- `ocr_text` (raw OCR output)
- Optional: `ground_truth_text` (human-labeled), `language`, `source`, timestamps, confidence scores, bounding boxes, etc.
### Tasks
1. **Data cleaning & normalization**
- Propose a cleaning pipeline to prepare high-quality text pairs for training.
- Handle common OCR artifacts (broken Unicode, random whitespace/newlines, hyphenation at line breaks, repeated headers/footers, page numbers, garbage tokens).
- Define rules/heuristics and what you would log/measure.
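A cleaning pipeline for the artifacts above can be sketched as a few regex/Unicode heuristics. This is a minimal, illustrative rule set, not an exhaustive one; real pipelines would also log how often each rule fires so thresholds can be tuned:

```python
import re
import unicodedata

def clean_ocr_text(text: str) -> str:
    """Heuristic cleanup of common OCR artifacts (illustrative rules only)."""
    # Normalize Unicode: folds decomposed characters and many mojibake cases.
    text = unicodedata.normalize("NFKC", text)
    # Re-join words hyphenated across line breaks: "informa-\ntion" -> "information".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that consist only of a page number.
    text = re.sub(r"^\s*\d{1,4}\s*$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces/tabs and of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()
```

Repeated headers/footers need a corpus-level pass (e.g. counting identical first/last lines across pages of the same document), so they are deliberately not handled per-record here.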
2. **Filtering & quality control**
- Identify and remove low-quality or risky samples (PII, toxic content, extremely noisy OCR, duplicates, near-duplicates, misaligned labels if `ground_truth_text` exists).
- If no ground truth exists, explain how you would estimate quality.
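When no ground truth exists, a cheap quality proxy is the share of tokens that look like clean words, combined with a normalized hash for duplicate removal. The regex, threshold, and field names below are assumptions for illustration; near-duplicate detection at scale would use MinHash/SimHash instead of exact hashes:

```python
import hashlib
import re

def quality_score(text: str) -> float:
    """Crude quality proxy without ground truth: fraction of word-like tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    ok = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?", t))
    return ok / len(tokens)

def dedup_key(text: str) -> str:
    """Hash of case/whitespace-normalized text, for exact-duplicate detection."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(norm.encode()).hexdigest()

def filter_samples(records: list[dict], min_score: float = 0.7) -> list[dict]:
    """Drop low-scoring and duplicate records (assumes an 'ocr_text' field)."""
    seen, kept = set(), []
    for r in records:
        if quality_score(r["ocr_text"]) < min_score:
            continue
        key = dedup_key(r["ocr_text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

PII and toxicity filtering are omitted here; those typically rely on dedicated classifiers or pattern libraries rather than hand-rolled regexes.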
3. **Create an LLM training/eval split**
- Prevent leakage (e.g., same document across train/test).
- Propose evaluation sets and metrics for OCR-correction.
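Leakage prevention and the core metric can both be sketched briefly. Splitting by a hash of the document ID (an assumed `doc_id` field) keeps all pages of a document on the same side of the split, and character error rate (CER, Levenshtein distance over reference length) is the standard metric for OCR correction:

```python
import hashlib

def split_by_document(records: list[dict], test_frac: float = 0.1):
    """Assign whole documents to train/test via a stable hash of doc_id,
    so pages of one document never straddle the split."""
    train, test = [], []
    for r in records:
        h = int(hashlib.md5(r["doc_id"].encode()).hexdigest(), 16)
        (test if (h % 1000) / 1000 < test_frac else train).append(r)
    return train, test

def cer(hyp: str, ref: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))  # one-row DP over the reference
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,               # deletion
                        dp[j - 1] + 1,           # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)
```

Word error rate (WER) and exact-match rate are natural companions; evaluation sets would be stratified by language, source, and noise level.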
4. **Modeling approach**
- Explain how you would train an LLM (or instruction-tune) for OCR correction given the cleaned dataset.
- Specify input/output format (prompting style), loss/objective, and any baselines.
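One common formulation is instruction tuning on (noisy OCR, corrected text) pairs with the standard causal-LM cross-entropy loss, masked so that only completion tokens contribute. The prompt template and field names below are illustrative assumptions, not a fixed standard; a copy-through baseline (output = input) and a spell-checker baseline give easy reference points:

```python
def to_training_example(ocr_text: str, corrected_text: str) -> dict:
    """Format one (noisy OCR, corrected) pair as an instruction-tuning record.
    Template and keys are illustrative; adapt to the trainer's expected format."""
    return {
        "prompt": (
            "Correct the OCR errors in the following text. "
            "Preserve the original wording; fix only recognition mistakes.\n\n"
            f"OCR text:\n{ocr_text}\n\nCorrected text:\n"
        ),
        "completion": corrected_text,
    }
```

During loss computation, prompt tokens would be masked (label `-100` in most trainers) so the model is optimized only on producing the corrected text.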
5. **Deliverables**
- Describe what artifacts you would produce (cleaned dataset schema, reports, dashboards, model cards, etc.).
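The cleaned-dataset schema, one of the deliverables above, can be pinned down as a typed record. The fields here are a plausible sketch assembled from the input description, not a prescribed format; versioning the pipeline in each record makes reports and dashboards reproducible:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CleanedSample:
    """Illustrative schema for one record in the cleaned dataset artifact."""
    image_id: str
    ocr_text: str                       # cleaned OCR output
    ground_truth_text: Optional[str]    # None when unlabeled
    language: Optional[str]
    quality_score: float                # heuristic or model-based estimate
    split: str                          # "train" | "val" | "test"
    pipeline_version: str               # ties the record to a cleaning run
```

Alongside the dataset, a data report (artifact counts per cleaning rule, score distributions, dedup rates) and a model card documenting training data provenance would round out the deliverables.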
Quick Answer: This question evaluates competency in OCR data cleaning and normalization, dataset engineering for LLM fine-tuning, quality filtering, and evaluation design for text-correction tasks.