Clean OCR data and build an LLM dataset
Company: Microsoft
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
## Problem: OCR data practice (cleaning → LLM-ready data)
You are given an OCR dataset intended for training or fine-tuning an LLM that improves OCR text quality.
### Input
A dataset of records like:
- `image_id` (or image path)
- `ocr_text` (raw OCR output)
- Optional: `ground_truth_text` (human-labeled), `language`, `source`, timestamps, confidence scores, bounding boxes, etc.
### Tasks
1. **Data cleaning & normalization**
- Propose a cleaning pipeline to prepare high-quality text pairs for training.
- Handle common OCR artifacts (broken Unicode, random whitespace/newlines, hyphenation at line breaks, repeated headers/footers, page numbers, garbage tokens).
- Define rules/heuristics and what you would log/measure.
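A cleaning pipeline for the artifacts above can be sketched as a few regex/Unicode heuristics. This is a minimal, illustrative rule set, not an exhaustive one; real pipelines would also log how often each rule fires so thresholds can be tuned:

```python
import re
import unicodedata

def clean_ocr_text(text: str) -> str:
    """Heuristic cleanup of common OCR artifacts (illustrative rules only)."""
    # Normalize Unicode: folds decomposed characters and many mojibake cases.
    text = unicodedata.normalize("NFKC", text)
    # Re-join words hyphenated across line breaks: "informa-\ntion" -> "information".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that consist only of a page number.
    text = re.sub(r"^\s*\d{1,4}\s*$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces/tabs and of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()
```

Repeated headers/footers need a corpus-level pass (e.g. counting identical first/last lines across pages of the same document), so they are deliberately not handled per-record here.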
2. **Filtering & quality control**
- Identify and remove low-quality or risky samples (PII, toxic content, extremely noisy OCR, duplicates, near-duplicates, misaligned labels if `ground_truth_text` exists).
- If no ground truth exists, explain how you would estimate quality.
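When no ground truth exists, a cheap quality proxy is the share of tokens that look like clean words, combined with a normalized hash for duplicate removal. The regex, threshold, and field names below are assumptions for illustration; near-duplicate detection at scale would use MinHash/SimHash instead of exact hashes:

```python
import hashlib
import re

def quality_score(text: str) -> float:
    """Crude quality proxy without ground truth: fraction of word-like tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    ok = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?", t))
    return ok / len(tokens)

def dedup_key(text: str) -> str:
    """Hash of case/whitespace-normalized text, for exact-duplicate detection."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(norm.encode()).hexdigest()

def filter_samples(records: list[dict], min_score: float = 0.7) -> list[dict]:
    """Drop low-scoring and duplicate records (assumes an 'ocr_text' field)."""
    seen, kept = set(), []
    for r in records:
        if quality_score(r["ocr_text"]) < min_score:
            continue
        key = dedup_key(r["ocr_text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

PII and toxicity filtering are omitted here; those typically rely on dedicated classifiers or pattern libraries rather than hand-rolled regexes.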
3. **Create an LLM training/eval split**
- Prevent leakage (e.g., same document across train/test).
- Propose evaluation sets and metrics for OCR-correction.
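Leakage prevention and the core metric can both be sketched briefly. Splitting by a hash of the document ID (an assumed `doc_id` field) keeps all pages of a document on the same side of the split, and character error rate (CER, Levenshtein distance over reference length) is the standard metric for OCR correction:

```python
import hashlib

def split_by_document(records: list[dict], test_frac: float = 0.1):
    """Assign whole documents to train/test via a stable hash of doc_id,
    so pages of one document never straddle the split."""
    train, test = [], []
    for r in records:
        h = int(hashlib.md5(r["doc_id"].encode()).hexdigest(), 16)
        (test if (h % 1000) / 1000 < test_frac else train).append(r)
    return train, test

def cer(hyp: str, ref: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))  # one-row DP over the reference
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,               # deletion
                        dp[j - 1] + 1,           # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)
```

Word error rate (WER) and exact-match rate are natural companions; evaluation sets would be stratified by language, source, and noise level.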
4. **Modeling approach**
- Explain how you would train an LLM (or instruction-tune) for OCR correction given the cleaned dataset.
- Specify input/output format (prompting style), loss/objective, and any baselines.
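One common formulation is instruction tuning on (noisy OCR, corrected text) pairs with the standard causal-LM cross-entropy loss, masked so that only completion tokens contribute. The prompt template and field names below are illustrative assumptions, not a fixed standard; a copy-through baseline (output = input) and a spell-checker baseline give easy reference points:

```python
def to_training_example(ocr_text: str, corrected_text: str) -> dict:
    """Format one (noisy OCR, corrected) pair as an instruction-tuning record.
    Template and keys are illustrative; adapt to the trainer's expected format."""
    return {
        "prompt": (
            "Correct the OCR errors in the following text. "
            "Preserve the original wording; fix only recognition mistakes.\n\n"
            f"OCR text:\n{ocr_text}\n\nCorrected text:\n"
        ),
        "completion": corrected_text,
    }
```

During loss computation, prompt tokens would be masked (label `-100` in most trainers) so the model is optimized only on producing the corrected text.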
5. **Deliverables**
- Describe what artifacts you would produce (cleaned dataset schema, reports, dashboards, model cards, etc.).
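The cleaned-dataset schema, one of the deliverables above, can be pinned down as a typed record. The fields here are a plausible sketch assembled from the input description, not a prescribed format; versioning the pipeline in each record makes reports and dashboards reproducible:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CleanedSample:
    """Illustrative schema for one record in the cleaned dataset artifact."""
    image_id: str
    ocr_text: str                       # cleaned OCR output
    ground_truth_text: Optional[str]    # None when unlabeled
    language: Optional[str]
    quality_score: float                # heuristic or model-based estimate
    split: str                          # "train" | "val" | "test"
    pipeline_version: str               # ties the record to a cleaning run
```

Alongside the dataset, a data report (artifact counts per cleaning rule, score distributions, dedup rates) and a model card documenting training data provenance would round out the deliverables.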
Quick Answer: This question evaluates competency in OCR data cleaning and normalization, dataset engineering for LLM fine-tuning, quality filtering, and evaluation design for text-correction tasks.