# MLE Knowledge Compilation V2 Part 2: Loss Functions, Classification Models, Clustering, and Deep Learning

## Table of Contents

1. Loss Functions
2. Logistic Regression
3. Support Vector Machines
4. Decision Trees
5. Clustering Algorithms
6. Deep Learning Fundamentals
7. Neural Network Architectures
8. Training Deep Networks

---

## Loss Functions

### 1. Is Logistic Regression with MSE Loss Convex?

No, it is not convex.

- Combining MSE loss with the sigmoid output of logistic regression creates a non-convex optimization landscape
- Can lead to multiple local minima
- This is why we use cross-entropy loss for logistic regression instead

### 2. Mean Squared Error (MSE)

Formula:

`MSE = (1/N) × Σ(Yi - Ŷi)²`

When to use MSE:
- Regression problems with normally distributed errors
- When you want to emphasize larger errors (due to squaring)
- When outliers should be heavily penalized
- Default choice for linear regression

Properties:
- Always non-negative
- Differentiable everywhere
- Sensitive to outliers
- Units are squared (e.g., dollars² if predicting prices)

### 3. Relationship Between Least Squares and MSE

- MSE is the objective function minimized by the Least Squares Method
- Least Squares finds coefficients that minimize the sum of squared residuals
- This is equivalent to minimizing MSE
- For linear regression: β* = argmin_β MSE(β)

### 4. KL Divergence (Relative Entropy)

Definition: Measures the difference between two probability distributions

Formula:

`D_KL(P||Q) = Σ P(x) × log(P(x)/Q(x))`

Interpretation:
- Expected value of the log-ratio between P and Q, taken with respect to P
- Always non-negative (D_KL ≥ 0)
- D_KL = 0 if and only if P = Q
- Not symmetric: D_KL(P||Q) ≠ D_KL(Q||P)

Relationship to Cross-Entropy:

`H(P,Q) = -Σ P(x) × log Q(x) = H(P) + D_KL(P||Q)`

### 5. Logistic Regression Loss Function

Binary Cross-Entropy Loss:

`L = -[y × log(p) + (1-y) × log(1-p)]`

Where:
- y is the true label (0 or 1)
- p is the predicted probability

For all samples:

`L = -(1/N) × Σ[yi × log(pi) + (1-yi) × log(1-pi)]`

### 6. Logistic Regression Loss Derivation (MLE)

Maximum Likelihood Estimation approach:

1. Likelihood for a single sample:
   - P(y=1|x) = σ(wᵀx) = p
   - P(y=0|x) = 1 - σ(wᵀx) = 1 - p
   - Combined: P(y|x) = p^y × (1-p)^(1-y)
2. Log-likelihood for all samples:
   `LL = Σ[yi × log(pi) + (1-yi) × log(1-pi)]`
3. Negative log-likelihood (our loss):
   `L = -LL = -Σ[yi × log(pi) + (1-yi) × log(1-pi)]`

### 7. SVM Loss Function

Hinge Loss:

`L = max(0, 1 - y × f(x))`

Where:
- y ∈ {-1, +1} (class labels)
- f(x) = wᵀx + b (decision function)

Properties:
- Zero loss for correctly classified points with margin ≥ 1
- Linear penalty for margin violations
- Non-differentiable at the hinge point

Soft-Margin SVM Objective:

`min (1/2)||w||² + C × Σ max(0, 1 - yi × f(xi))`

### 8. Why Cross-Entropy for Multiclass Classification?

Reasons:
1. Natural extension of binary cross-entropy
2. Probabilistic interpretation via softmax
3. Maximum likelihood formulation
4. Well-behaved gradients for optimization
5. Handles multiple classes elegantly

Formula:

`L = -Σi Σj yij × log(pij)`

Where yij is 1 if sample i belongs to class j, else 0

### 9. Decision Tree Split Objectives

For Classification:
- Gini Impurity: Σ pi × (1 - pi)
- Entropy: -Σ pi × log(pi)
- Information Gain: Entropy(parent) - Weighted_Avg(Entropy(children))

For Regression:
- MSE: Minimize variance within nodes
- MAE: Minimize absolute deviations
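The formulas above are easy to sanity-check numerically. Below is a minimal NumPy sketch of MSE, binary cross-entropy, hinge loss, and the Gini/entropy split criteria; the `eps` clipping constants and the toy arrays at the bottom are arbitrary illustration choices, not values from these notes.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/N) * sum((y - y_hat)^2)."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy (log loss); eps keeps log() away from 0."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge_loss(y_pm1, scores):
    """Hinge loss for labels in {-1, +1} and decision scores f(x)."""
    return np.mean(np.maximum(0.0, 1.0 - y_pm1 * scores))

def gini_impurity(class_counts):
    """Gini impurity: sum_i p_i * (1 - p_i) = 1 - sum_i p_i^2."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts, eps=1e-12):
    """Entropy: -sum_i p_i * log(p_i)."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + eps))

# Tiny smoke test with made-up labels and probabilities
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y, p))        # log loss over 4 samples
print(hinge_loss(2 * y - 1, 2 * p - 1))  # probabilities rescaled to crude scores
print(gini_impurity([30, 10]), entropy([30, 10]))
```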
### 10. Log Loss (Cross-Entropy Loss)

Definition: Measures performance by quantifying the discrepancy between predicted probabilities and true labels

Binary Classification:

`LogLoss = -[y × log(p) + (1-y) × log(1-p)]`

When to use:
- Classification problems
- When you need probabilistic outputs
- When the model should be well-calibrated
- Default for logistic regression and neural networks

---

## Logistic Regression

### 1. Logistic Regression vs SVM

Key Differences:

| Aspect | Logistic Regression | SVM |
|--------|---------------------|-----|
| Objective | Model probability P(y\|x) | Find maximum-margin hyperplane |
| Output | Probabilities [0,1] | Decision scores |
| Loss Function | Log loss | Hinge loss |
| Decision Boundary | Linear (can be extended) | Linear/Non-linear (kernels) |
| Optimization | Convex, smooth | Convex, non-smooth |
| Outlier Sensitivity | More sensitive | More robust (margin) |
| Interpretability | Probabilistic interpretation | Geometric interpretation |
| When to use | Need probabilities, well-calibrated | Need robust classifier, kernels |

Optimization Methods:
- Logistic Regression: Gradient descent, Newton's method, L-BFGS
- SVM: Quadratic programming, SMO (Sequential Minimal Optimization)

---

## Support Vector Machines

### Key Concepts

1. Maximum Margin Classifier
   - Finds the hyperplane with the largest margin between classes
   - Support vectors: the points closest to the decision boundary
   - Only support vectors affect the decision boundary

2. Kernel Trick
   - Maps data to higher dimensions implicitly
   - Common kernels: Linear, RBF, Polynomial, Sigmoid
   - Allows non-linear decision boundaries
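To see the comparison table and the kernel trick in practice, here is a minimal scikit-learn sketch: logistic regression exposes probabilities via `predict_proba` and fits a linear boundary, while an RBF-kernel SVM returns decision scores, keeps only its support vectors, and fits a non-linear boundary. The `make_moons` dataset, `C=1.0`, and `gamma="scale"` are arbitrary illustration choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# A small, non-linearly separable toy dataset
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Logistic regression: linear boundary, probabilistic output
logreg = LogisticRegression().fit(X_tr, y_tr)
print("LogReg accuracy:", logreg.score(X_te, y_te))
print("LogReg P(y=1|x) for first test point:", logreg.predict_proba(X_te[:1])[0, 1])

# RBF-kernel SVM: non-linear boundary via the kernel trick, score output
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("SVM accuracy:", svm.score(X_te, y_te))
print("SVM decision score for first test point:", svm.decision_function(X_te[:1])[0])
print("Support vectors per class:", svm.n_support_)
```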