This question evaluates named entity recognition, noisy-text preprocessing, entity disambiguation, and hybrid rule- and model-based pipeline design for extracting organization names from unstructured resumes and web snippets.
You receive messy resume text (PDF-to-text/OCR, varying casing) and scraped web snippets (boilerplate, menus, ads). Your goal is to extract company names (organizations) accurately under noise such as Unicode artifacts, misspellings, acronyms, and ambiguous tokens (e.g., Apple vs apple).
(a) Design a hybrid system that combines rule-based patterns (e.g., legal suffixes and context windows) with a machine-learned NER model. Describe the end-to-end pipeline and how you will handle casing, Unicode noise, and misspellings.
(b) Explain feature choices or embeddings (e.g., subword, contextual) and how to incorporate a company gazetteer with fuzzy matching while avoiding label leakage.
(c) Define evaluation metrics (entity-level precision, recall, F1) and an error analysis plan, with special attention to acronyms and ambiguous tokens.
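For part (a), the normalization and rule-based stages might be sketched as follows. This is a minimal sketch using only the Python standard library, not a complete pipeline: the names `normalize_text`, `rule_candidates`, and `LEGAL_SUFFIXES` are hypothetical, the suffix list is deliberately incomplete, and the machine-learned NER model whose spans would be merged with these candidates is not shown. Fully lowercased input would need a truecasing step that this sketch omits.

```python
import re
import unicodedata

# Illustrative, incomplete suffix list (an assumption, not a standard inventory).
LEGAL_SUFFIXES = ["Inc", "LLC", "Ltd", "Corp", "Corporation", "GmbH", "PLC", "Co"]

# A name token starts with an uppercase letter; this also covers ALL-CAPS resumes.
_NAME_TOKEN = r"[A-Z][\w&'-]*"

# One to four name tokens, an optional comma, then a legal suffix.
# The suffix is matched case-insensitively; a trailing period is left out of the
# span so that a sentence-final "." is not absorbed into the name.
SUFFIX_RE = re.compile(
    r"\b((?:" + _NAME_TOKEN + r"\s+){0,3}" + _NAME_TOKEN + r"),?\s+(?i:"
    + "|".join(LEGAL_SUFFIXES) + r")\b"
)


def normalize_text(text: str) -> str:
    """Reduce common PDF/OCR Unicode noise before any matching."""
    # NFKC folds ligatures (U+FB01 -> "fi"), full-width forms, and no-break spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens and zero-width characters left behind by extraction.
    text = re.sub(r"[\u00ad\u200b-\u200d\ufeff]", "", text)
    # Map curly quotes to ASCII so apostrophes in names compare equal.
    text = text.translate({0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'})
    # Collapse whitespace runs, including line breaks from column layouts.
    return re.sub(r"\s+", " ", text).strip()


def rule_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, surface) for spans that end in a legal suffix."""
    return [(m.start(), m.end(), m.group(0)) for m in SUFFIX_RE.finditer(text)]


if __name__ == "__main__":
    raw = "Worked at Acme\u00a0Wid\u200bgets, Inc. (\ufb01nance team)"
    clean = normalize_text(raw)
    print(clean)
    print(rule_candidates(clean))
```

Treating the rule output as candidates rather than final labels leaves room for the model to veto or extend a span, which matters for names that carry no legal suffix at all.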
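For part (b), one way to fold a gazetteer into the system is to expose a fuzzy match score as a feature for the model rather than as a hard override. The sketch below uses `difflib.SequenceMatcher` from the standard library; `canonical_key`, `build_gazetteer`, `fuzzy_lookup`, the 0.85 cutoff, and the four-character threshold for short names are all illustrative assumptions. To limit label leakage, the gazetteer here is assumed to come from an external company list only, never from the annotated dev or test spans.

```python
import difflib
import re

# Legal suffixes stripped before comparison (illustrative, incomplete list).
_SUFFIX = re.compile(r"\b(inc|llc|ltd|corp|corporation|gmbh|plc|co)\b\.?", re.I)


def canonical_key(name: str) -> str:
    """Casefold, drop legal suffixes and punctuation, collapse whitespace."""
    key = _SUFFIX.sub(" ", name.casefold())
    key = re.sub(r"[^\w\s&]", " ", key)
    return re.sub(r"\s+", " ", key).strip()


def build_gazetteer(external_names):
    """Map canonical key -> display name.

    Assumption: `external_names` comes from a source independent of the
    annotated evaluation data, so the feature cannot memorize gold labels.
    """
    gazetteer = {}
    for name in external_names:
        key = canonical_key(name)
        if key:  # skip entries that are nothing but a suffix
            gazetteer.setdefault(key, name)
    return gazetteer


def fuzzy_lookup(span, gazetteer, cutoff=0.85):
    """Return (matched_name or None, similarity score in [0, 1])."""
    key = canonical_key(span)
    # Short keys (typically acronyms) must match exactly: one edit in a
    # three-letter string is a different company, not a misspelling.
    if len(key) <= 4:
        name = gazetteer.get(key)
        return (name, 1.0) if name else (None, 0.0)
    best_name, best_score = None, 0.0
    for gaz_key, name in gazetteer.items():
        score = difflib.SequenceMatcher(None, key, gaz_key).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= cutoff:
        return best_name, best_score
    return None, best_score


if __name__ == "__main__":
    gaz = build_gazetteer(["Acme Widgets, Inc.", "IBM", "Globex Corporation"])
    print(fuzzy_lookup("ACME WIDGTES", gaz))  # OCR-style transposition
    print(fuzzy_lookup("IBN", gaz))           # near-miss acronym, rejected
```

The linear scan is fine for a small list; a large gazetteer would need blocking or an index, which is outside the scope of this sketch.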
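For part (c), entity-level scoring under exact span matching can be sketched as below. Each entity is assumed to be a `(doc_id, start, end, surface)` tuple, and a prediction counts as correct only if the whole tuple matches a gold entity; partial-overlap credit is a separate design choice not shown here. The helper names `entity_prf`, `error_buckets`, and `is_acronym`, and the five-letter acronym heuristic, are illustrative assumptions.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 under exact span matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def is_acronym(surface: str) -> bool:
    """Crude heuristic: short, alphabetic, all uppercase."""
    return surface.isalpha() and surface.isupper() and len(surface) <= 5


def error_buckets(gold, pred):
    """Split errors into slices that are worth reading by hand."""
    gold, pred = set(gold), set(pred)
    false_neg = gold - pred  # missed entities
    false_pos = pred - gold  # spurious entities
    return {
        "missed_acronyms": sorted(e for e in false_neg if is_acronym(e[3])),
        "missed_other": sorted(e for e in false_neg if not is_acronym(e[3])),
        # Lowercase false positives often come from ambiguous common nouns.
        "spurious_lowercase": sorted(e for e in false_pos if e[3].islower()),
        "spurious_other": sorted(e for e in false_pos if not e[3].islower()),
    }


if __name__ == "__main__":
    gold = [("r1", 10, 26, "Acme Widgets Inc"), ("r1", 40, 43, "IBM"),
            ("r2", 0, 5, "Apple")]
    pred = [("r1", 10, 26, "Acme Widgets Inc"), ("r2", 0, 5, "Apple"),
            ("r2", 30, 35, "apple")]
    print(entity_prf(gold, pred))
    print(error_buckets(gold, pred))
```

In this toy example two of three predictions are correct and two of three gold entities are found, so precision, recall, and F1 all equal 2/3; the missed `IBM` lands in the acronym slice and the spurious lowercase `apple` in the ambiguous-token slice.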