Clean OCR data and build an LLM dataset | Microsoft Interview Question