Problem: OCR data practice (cleaning → LLM-ready data)
You are given an OCR dataset intended for training or fine-tuning an LLM to improve OCR text quality.
Input
A dataset of records like:
- `image_id` (or image path)
- `ocr_text` (raw OCR output)
- Optional: `ground_truth_text` (human-labeled), `language`, `source`, timestamps, confidence scores, bounding boxes, etc.
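
For concreteness, a single record might look like the following; only `image_id` and `ocr_text` are assumed mandatory here, and the optional field values are purely illustrative:

```python
# One illustrative JSONL-style record (optional fields may be absent).
record = {
    "image_id": "scan_000123.png",
    "ocr_text": "Tbe  quick bro-\nwn fox jumped\n17\nover the lazy dog.",
    "ground_truth_text": "The quick brown fox jumped over the lazy dog.",
    "language": "en",
    "source": "book_scans_v2",            # assumed source name
    "timestamp": "2024-06-01T12:00:00Z",
    "confidence": 0.81,                   # OCR engine's own score
    "bboxes": [[40, 112, 590, 138], [40, 150, 310, 176]],
}
```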
Tasks
- Data cleaning & normalization
  - Propose a cleaning pipeline to prepare high-quality text pairs for training (a minimal sketch follows this item).
  - Handle common OCR artifacts (broken Unicode, random whitespace/newlines, hyphenation at line breaks, repeated headers/footers, page numbers, garbage tokens).
  - Define rules/heuristics and what you would log/measure.
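
  A minimal per-record sketch of such a pipeline, using only the standard library. The regexes, their ordering, and the artifact set covered are assumptions; document-level steps such as stripping headers/footers repeated across pages are omitted because they need cross-page context:

  ```python
  import re
  import unicodedata

  PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
  HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")  # "bro-\nwn" -> "brown"

  def clean_ocr_text(text: str) -> str:
      # 1. Normalize Unicode compatibility forms (ligatures, fullwidth chars).
      text = unicodedata.normalize("NFKC", text)
      # 2. Re-join words hyphenated across line breaks.
      text = HYPHEN_BREAK.sub(r"\1\2", text)
      # 3. Drop lines that are bare page numbers.
      lines = [ln for ln in text.splitlines() if not PAGE_NUMBER.match(ln)]
      # 4. Collapse stray whitespace, then squeeze runs of blank lines.
      text = "\n".join(re.sub(r"[ \t]+", " ", ln).strip() for ln in lines)
      return re.sub(r"\n{3,}", "\n\n", text).strip()
  ```

  Each rule should emit a counter (lines dropped, hyphens joined, characters changed) so the cleaning report can show per-rule impact.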
- Filtering & quality control
  - Identify and remove low-quality or risky samples (PII, toxic content, extremely noisy OCR, duplicates, near-duplicates, misaligned labels if `ground_truth_text` exists).
  - If no ground truth exists, explain how you would estimate quality (see the heuristic sketch below).
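
  A sketch of quality heuristics and a dedup key for the no-ground-truth case. The feature set and cutoffs are assumed values to be calibrated on a small hand-audited sample; PII and toxicity screening would use dedicated classifiers and is out of scope here. Where `ground_truth_text` exists, a CER threshold (see the metric sketch further below) can flag misaligned pairs:

  ```python
  import hashlib
  import re

  def quality_features(text: str) -> dict:
      n = max(len(text), 1)
      return {
          "alpha_ratio": sum(c.isalpha() for c in text) / n,
          "digit_ratio": sum(c.isdigit() for c in text) / n,
          "junk_ratio": sum(not (c.isalnum() or c.isspace()
                                 or c in ".,;:!?'\"()[]-") for c in text) / n,
      }

  def looks_noisy(text: str) -> bool:
      f = quality_features(text)
      return f["alpha_ratio"] < 0.6 or f["junk_ratio"] > 0.05  # assumed cutoffs

  def dedup_key(text: str) -> str:
      # Exact-duplicate key over aggressively normalized text; this also
      # catches trivial near-dups. Proper near-duplicate detection
      # (MinHash/LSH) is omitted from this sketch.
      norm = re.sub(r"[\W\d_]+", "", text.lower())
      return hashlib.sha1(norm.encode("utf-8")).hexdigest()
  ```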
- Create an LLM training/eval split
  - Prevent leakage (e.g., the same document appearing in both train and test).
  - Propose evaluation sets and metrics for OCR correction (see the split-and-metric sketch below).
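
  Two sketches for this step: a deterministic document-level split (hashing the document ID keeps whole documents in one split and is stable across reruns) and character error rate, the standard OCR-correction metric alongside word error rate. The 98/1/1 ratios are an assumption:

  ```python
  import hashlib

  def split_for(doc_id: str, ratios=(0.98, 0.01, 0.01)) -> str:
      # Hash the *document* ID, not the page/crop ID, to prevent leakage.
      h = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16)
      u = (h % 10_000) / 10_000  # deterministic pseudo-uniform value in [0, 1)
      if u < ratios[0]:
          return "train"
      if u < ratios[0] + ratios[1]:
          return "validation"
      return "test"

  def cer(hyp: str, ref: str) -> float:
      # Character error rate = Levenshtein distance / reference length.
      prev = list(range(len(ref) + 1))
      for i, hc in enumerate(hyp, 1):
          cur = [i]
          for j, rc in enumerate(ref, 1):
              cur.append(min(prev[j] + 1,                 # deletion
                             cur[j - 1] + 1,              # insertion
                             prev[j - 1] + (hc != rc)))   # substitution
          prev = cur
      return prev[-1] / max(len(ref), 1)
  ```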
- Modeling approach
  - Explain how you would train (or instruction-tune) an LLM for OCR correction given the cleaned dataset.
  - Specify the input/output format (prompting style), loss/objective, and any baselines (an example format is sketched below).
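
  One possible instruction-tuning format; the template wording and field names are assumptions rather than a prescribed schema. With a causal LM the loss would typically be masked so that only the completion (corrected text) tokens contribute:

  ```python
  PROMPT_TEMPLATE = (
      "Below is raw OCR output that may contain recognition errors.\n"
      "Rewrite it as clean, corrected text. Preserve the original wording\n"
      "and fix only OCR artifacts.\n\n"
      "OCR:\n{ocr}\n\nCorrected:\n"
  )

  def to_training_example(record: dict) -> dict:
      return {
          "prompt": PROMPT_TEMPLATE.format(ocr=record["ocr_text"]),
          "completion": record["ground_truth_text"],  # loss on these tokens only
      }
  ```

  A useful baseline is the identity mapping (output = input OCR text); the model must beat its CER for training to be worthwhile.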
- Deliverables
  - Describe the artifacts you would produce (cleaned dataset schema, cleaning/filtering reports, dashboards, model cards, etc.).