You are given three implementation tasks that mix algorithms and basic machine learning models.
Task 1: Longest substring with all distinct characters
Given a string s consisting of ASCII characters, find the length of the longest contiguous substring in which all characters are distinct (no character appears more than once within that substring).
-
Input: a single string
s
.
-
Output: an integer representing the maximum length of any substring of
s
that has all unique characters.
Example:
-
s = abcabcbb
→ the answer is
3
(one such substring is
abc
).
-
s = bbbbb
→ the answer is
1
(any single
b
).
-
s = ""
(empty string) → the answer is
0
.
Constraints (typical):
-
0 <= len(s) <= 10^5
-
Expected time:
O(n)
where
n = len(s)
-
Expected extra space:
O(min(n, alphabet_size))
Implement a function with a signature similar to:
-
int longestDistinctSubstringLength(const string& s)
(C++), or
-
int longest_distinct_substring_length(string s)
(Python-style pseudocode).
Task 2: Implement binary logistic regression from scratch
Implement a binary logistic regression classifier without using any machine learning libraries.
Design a class LogisticRegression that supports:
-
Constructor:
takes at least:
-
learning_rate: float
-
num_iterations: int
-
fit(X, y):
-
X
is an
n x d
numeric matrix of features (e.g., list of length
n
, each element a length-
d
list of floats).
-
y
is a length-
n
list/array of labels, each label is either
0
or
1
.
-
Initialize weights and bias (e.g., to zeros).
-
Train using
batch gradient descent
to minimize the
logistic (cross-entropy) loss
:
-
Model: for each sample
i
, compute
z_i = w · x_i + b
, then probability of class 1 is
p_i = sigmoid(z_i) = 1 / (1 + exp(-z_i))
.
-
Loss over the dataset:
L(w,b)=−n1∑i=1n(yilogpi+(1−yi)log(1−pi)).
- Update `w` and `b` with gradient descent for `num_iterations` steps.
-
No regularization is required.
-
predict_proba(X):
-
Given a feature matrix
X
, return a list/array of predicted probabilities
p_i
for class
1
for each row.
-
predict(X):
-
Given
X
, first compute probabilities with
predict_proba
, then convert to labels
0
or
1
using threshold
0.5
.
Constraints (typical):
-
n
(number of samples) up to about
10^4
.
-
d
(number of features) up to about
10^2
.
-
Number of iterations up to around
10^4
.
You may implement in any language, but do not use built-in ML training utilities.
Task 3: Implement a multinomial Naive Bayes text classifier
Implement a multinomial Naive Bayes classifier for text classification.
Assume:
-
Training data:
docs_train
, a list of documents, where each document is represented as a list of token strings, e.g.
[["this", "movie", "is", "great"], ["awful", "film"], ...]
.
-
Training labels:
y_train
, a list of integers where
y_train[i]
is the class index for
docs_train[i]
(values in
{0, 1, ..., C-1}
).
-
Test data:
docs_test
, same format as
docs_train
but without labels.
Design a class NaiveBayes that supports:
-
fit(docs_train, y_train):
-
Build a vocabulary of all tokens seen in
docs_train
.
-
Let
C
be the number of classes.
-
Estimate class prior probabilities:
P(y=c)=total number of docscount of training docs with label c.
-
For each class
c
and token
w
in the vocabulary, estimate conditional probabilities with
Laplace smoothing
(add-one smoothing):
P(w∣y=c)=total token count in class c+Vtotal count of token w in class c+1,
where `V` is the vocabulary size.
-
Store log probabilities (
log P(y = c)
and
log P(w | y = c)
) to avoid underflow.
-
predict(docs_test):
-
For each document, treat it as a bag of words.
-
For each class
c
, compute the log posterior up to a constant:
logP(y=c∣doc)∝logP(y=c)+∑w∈doccount(w in doc)⋅logP(w∣y=c).
-
Predict the class
c
with the highest log score.
-
Ignore tokens not in the training vocabulary (or handle them with an appropriate strategy such as skipping them).
Constraints (typical):
-
Number of documents up to a few
10^4
.
-
Vocabulary size up to a few
10^4
.
-
Must be efficient in both time and memory.
Focus on correctly implementing the core training (fit) and prediction (predict) logic; you do not need to handle file I/O or text preprocessing (assume tokenization is already done).