PracHub

Extract companies from noisy text

Last updated: Mar 29, 2026

Quick Overview

This question evaluates named entity recognition, noisy-text preprocessing, entity disambiguation, and hybrid rule- and model-based pipeline design for extracting organization names from unstructured resumes and web snippets.

Company: Other

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

Oct 13, 2025, 9:49 PM

Extracting Company Names from Noisy Resumes and Web Snippets

Context

You receive messy resume text (PDF-to-text/OCR, varying casing) and scraped web snippets (boilerplate, menus, ads). Your goal is to extract company names (organizations) accurately under noise such as Unicode artifacts, misspellings, acronyms, and ambiguous tokens (e.g., Apple vs apple).

Tasks

(a) Design a hybrid system that combines rule-based patterns (e.g., legal suffixes and context windows) with a machine-learned NER model. Describe the end-to-end pipeline and how you will handle casing, Unicode noise, and misspellings.
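As a starting point for the rule-based half of task (a), the following is a minimal sketch of Unicode cleanup followed by a legal-suffix pattern. The suffix list, the `normalize` and `find_company_candidates` names, and the one-to-four-token window are illustrative assumptions, not part of the question. Lowercased or all-caps input would not match this pattern and would be left to the NER model or a truecasing step.

```python
import re
import unicodedata

# Hypothetical, deliberately short suffix list; a real system would need many more.
LEGAL_SUFFIXES = ["Inc", "LLC", "Ltd", "Corp", "GmbH", "PLC", "Co"]

# One to four capitalized tokens (optional trailing comma) followed by a legal suffix.
SUFFIX_RE = re.compile(
    r"\b(?:[A-Z][\w&'-]*,?\s+){1,4}(?:" + "|".join(LEGAL_SUFFIXES) + r")\b\.?"
)


def normalize(text: str) -> str:
    """Fold compatibility characters, drop invisible characters, collapse spaces."""
    # NFKC maps ligatures, full-width letters and non-breaking spaces to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Remove format (Cf) and control (Cc) characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cf", "Cc")
    )
    return re.sub(r"[ \t]+", " ", text).strip()


def find_company_candidates(text: str) -> list[str]:
    """Return high-precision candidate spans that end in a legal suffix."""
    return [m.group(0) for m in SUFFIX_RE.finditer(text)]


if __name__ == "__main__":
    raw = "Worked at Acme\u00a0Widgets, Inc. from 2019"
    print(find_company_candidates(normalize(raw)))
```

In a hybrid design these candidates could either be accepted directly or passed to the learned model as features, since suffix rules tend to be precise but miss names without a suffix.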

(b) Explain feature choices or embeddings (e.g., subword, contextual) and how to incorporate a company gazetteer with fuzzy matching while avoiding label leakage.
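For the gazetteer part of task (b), one possible sketch uses the standard-library `difflib.SequenceMatcher` to turn a fuzzy match into a model feature. The helper names, the 0.85 threshold, and the sample gazetteer are assumptions made for illustration. The leakage guard shown here is one simple reading of the requirement: names harvested from annotations are added from the training fold only, never from dev or test labels.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list used only to canonicalize names before comparison.
_SUFFIX_RE = re.compile(r"\b(inc|llc|ltd|corp|co|gmbh|plc)\b")


def canonical(name: str) -> str:
    """Casefold, replace punctuation with spaces, and remove legal suffixes."""
    name = re.sub(r"[^\w\s&]", " ", name.casefold())
    name = _SUFFIX_RE.sub(" ", name)
    return " ".join(name.split())


def build_gazetteer(external_names, train_gold_names):
    """Merge an external list with TRAINING-fold names, deduplicated by canonical form."""
    seen = {}
    for name in list(external_names) + list(train_gold_names):
        seen.setdefault(canonical(name), name)
    return list(seen.values())


def gazetteer_feature(span: str, gazetteer, threshold: float = 0.85) -> dict:
    """Return the best fuzzy match for a candidate span as features for the NER model."""
    key = canonical(span)
    best_name, best_score = None, 0.0
    for name in gazetteer:
        score = SequenceMatcher(None, key, canonical(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return {"gaz_name": best_name, "gaz_score": best_score,
            "gaz_hit": best_score >= threshold}


if __name__ == "__main__":
    gaz = build_gazetteer(["Apple Inc", "Microsoft Corp"], ["Acme Widgets, Inc."])
    print(gazetteer_feature("Microsft", gaz))   # misspelling still matches
    print(gazetteer_feature("apple pie", gaz))  # below threshold, no hit
```

Feeding the score in as a feature, rather than hard-labeling every match, leaves the model free to use context for ambiguous tokens such as Apple vs apple. The linear scan is fine for a sketch; a large gazetteer would need blocking or an index.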

(c) Define evaluation metrics (entity-level precision, recall, F1) and an error analysis plan, with special attention to acronyms and ambiguous tokens.
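The entity-level metrics in task (c) can be sketched as follows, assuming strict matching: a prediction counts as correct only if its document id and both span boundaries equal a gold entity. The `(doc_id, start, end)` tuple representation and the function name are assumptions for illustration; relaxed (overlap-based) matching would be a separate variant.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 under exact span match.

    gold and pred are iterables of (doc_id, start, end) tuples.
    Also returns the false positives and false negatives for error analysis.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_pos = gold_set & pred_set
    precision = len(true_pos) / len(pred_set) if pred_set else 0.0
    recall = len(true_pos) / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": sorted(pred_set - gold_set),
        "false_negatives": sorted(gold_set - pred_set),
    }


if __name__ == "__main__":
    gold = [("d1", 10, 28), ("d1", 40, 46), ("d2", 0, 5)]
    # One boundary error in d1 and one spurious span in d2.
    pred = [("d1", 10, 28), ("d1", 40, 45), ("d2", 0, 5), ("d2", 9, 14)]
    result = entity_prf(gold, pred)
    print(result["precision"], result["recall"], result["f1"])
```

The returned false positives and false negatives can then be bucketed by hand, for example into acronyms, lowercase ambiguous tokens, and boundary errors, to drive the error analysis.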

