Implement string and basic ML algorithms

Q: Implement string and basic ML algorithms

This question evaluates proficiency in algorithmic string processing and implementation of foundational machine learning models, covering skills such as designing an efficient distinct-character substring algorithm, implementing binary logistic regression training and prediction, and building a multinomial Naive Bayes text classifier.

Q: How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master coding & algorithms interviews.

Question

You are given three implementation tasks that mix algorithms and basic machine learning models.

Task 1: Longest substring with all distinct characters

Given a string s consisting of ASCII characters, find the length of the longest contiguous substring in which all characters are distinct (no character appears more than once within that substring).

Input: a single string s .
Output: an integer representing the maximum length of any substring of s that has all unique characters.

Example:

s = abcabcbb → the answer is 3 (one such substring is abc ).
s = bbbbb → the answer is 1 (any single b ).
s = "" (empty string) → the answer is 0 .

Constraints (typical):

0 <= len(s) <= 10^5
Expected time: O(n) where n = len(s)
Expected extra space: O(min(n, alphabet_size))

Implement a function with a signature similar to:

int longestDistinctSubstringLength(const string& s) (C++), or
int longest_distinct_substring_length(string s) (Python-style pseudocode).

Task 2: Implement binary logistic regression from scratch

Implement a binary logistic regression classifier without using any machine learning libraries.

Design a class LogisticRegression that supports:

Constructor: takes at least:
- learning_rate: float
- num_iterations: int
fit(X, y):
- X is an n x d numeric matrix of features (e.g., list of length n , each element a length- d list of floats).
- y is a length- n list/array of labels, each label is either 0 or 1 .
- Initialize weights and bias (e.g., to zeros).
- Train using batch gradient descent to minimize the logistic (cross-entropy) loss :
  - Model: for each sample i , compute z_i = w · x_i + b , then probability of class 1 is p_i = sigmoid(z_i) = 1 / (1 + exp(-z_i)) .
  - Loss over the dataset:

$L(w, b) = -\frac{1}{n} \sum_{i=1}^n \Bigl( y_i \log p_i + (1 - y_i) \log(1 - p_i) \Bigr).$

- Update `w` and `b` with gradient descent for `num_iterations` steps.

No regularization is required.
predict_proba(X):
- Given a feature matrix X , return a list/array of predicted probabilities p_i for class 1 for each row.
predict(X):
- Given X , first compute probabilities with predict_proba , then convert to labels 0 or 1 using threshold 0.5 .

Constraints (typical):

n (number of samples) up to about 10^4 .
d (number of features) up to about 10^2 .
Number of iterations up to around 10^4 .

You may implement in any language, but do not use built-in ML training utilities.

Task 3: Implement a multinomial Naive Bayes text classifier

Implement a multinomial Naive Bayes classifier for text classification.

Assume:

Training data: docs_train , a list of documents, where each document is represented as a list of token strings, e.g. [["this", "movie", "is", "great"], ["awful", "film"], ...] .
Training labels: y_train , a list of integers where y_train[i] is the class index for docs_train[i] (values in {0, 1, ..., C-1} ).
Test data: docs_test , same format as docs_train but without labels.

Design a class NaiveBayes that supports:

fit(docs_train, y_train):
- Build a vocabulary of all tokens seen in docs_train .
- Let C be the number of classes.
- Estimate class prior probabilities:

$P(y = c) = \frac{\text{count of training docs with label } c}{\text{total number of docs}}.$

For each class c and token w in the vocabulary, estimate conditional probabilities with Laplace smoothing (add-one smoothing):

$P(w \mid y = c) = \frac{\text{total count of token } w \text{ in class } c + 1}{\text{total token count in class } c + V},$

where `V` is the vocabulary size.

Store log probabilities ( log P(y = c) and log P(w | y = c) ) to avoid underflow.
predict(docs_test):
- For each document, treat it as a bag of words.
- For each class c , compute the log posterior up to a constant:

$\log P(y = c \mid \text{doc}) \propto \log P(y = c) + \sum_{w \in \text{doc}} \text{count}(w \text{ in doc}) \cdot \log P(w \mid y = c).$

Predict the class c with the highest log score.
Ignore tokens not in the training vocabulary (or handle them with an appropriate strategy such as skipping them).

Constraints (typical):

Number of documents up to a few 10^4 .
Vocabulary size up to a few 10^4 .
Must be efficient in both time and memory.

Focus on correctly implementing the core training (fit) and prediction (predict) logic; you do not need to handle file I/O or text preprocessing (assume tokenization is already done).

Implement string and basic ML algorithms

Quick Overview

Task 1: Longest substring with all distinct characters

Task 2: Implement binary logistic regression from scratch

Task 3: Implement a multinomial Naive Bayes text classifier

Comments (0)