Implement a generic mini-batch gradient descent routine. Inputs: a differentiable loss L(θ; x), an initial parameter vector θ_0, a batch size b, a step budget T, and a learning-rate schedule η_t.
(a) Provide stopping criteria (e.g., a gradient-norm threshold and patience on the validation loss).
(b) Compare full-batch gradient descent, SGD (b = 1), and mini-batch gradient descent in terms of gradient noise, convergence behavior, and wall-clock performance.
(c) Explain how batch size affects generalization, and how learning-rate warmup and cosine decay are typically used.
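A minimal sketch of the routine and the stopping criteria in (a), assuming the loss is supplied through a gradient function `grad_L(theta, X_batch, y_batch)` rather than L itself; the names `minibatch_gd`, `grad_tol`, `patience`, and `val_loss` are illustrative choices, not fixed by the problem statement:

```python
import numpy as np


def minibatch_gd(grad_L, theta0, X, y, batch_size, num_steps, lr_schedule,
                 val_loss=None, grad_tol=1e-6, patience=10, seed=None):
    """Mini-batch gradient descent with two stopping criteria (part a).

    grad_L(theta, X_batch, y_batch): gradient of the loss on one batch.
    lr_schedule(t): learning rate eta_t at step t.
    val_loss(theta): optional scalar validation loss used for early stopping.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = X.shape[0]
    best_val, bad_steps = np.inf, 0

    for t in range(num_steps):
        # Sample a mini-batch uniformly without replacement.
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        g = grad_L(theta, X[idx], y[idx])

        # Stopping criterion 1: the (stochastic) gradient norm is tiny.
        if np.linalg.norm(g) < grad_tol:
            break

        theta -= lr_schedule(t) * g

        # Stopping criterion 2: validation loss has not improved for `patience` steps.
        if val_loss is not None:
            v = val_loss(theta)
            if v < best_val:
                best_val, bad_steps = v, 0
            else:
                bad_steps += 1
                if bad_steps >= patience:
                    break
    return theta


# Quick check on least-squares regression: gradient of (1/2b)*||X_b theta - y_b||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=512)

def grad_L(theta, Xb, yb):
    return Xb.T @ (Xb @ theta - yb) / len(yb)

theta_hat = minibatch_gd(grad_L, np.zeros(3), X, y,
                         batch_size=32, num_steps=2000,
                         lr_schedule=lambda t: 0.1)
print(theta_hat)  # should land close to [1.0, -2.0, 0.5]
```

Note that setting batch_size=1 recovers SGD and batch_size=n recovers full-batch descent, which gives one convenient way to run the comparison asked for in (b) with a single implementation.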
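For part (c), a sketch of a linear-warmup-plus-cosine-decay schedule that can be passed as `lr_schedule`; the function name and the `warmup_steps`/`peak_lr`/`min_lr` parameters are assumptions for illustration:

```python
import math


def warmup_cosine_schedule(t, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """eta_t: linear warmup to peak_lr, then cosine decay down to min_lr."""
    if t < warmup_steps:
        # Linear warmup keeps early steps small while the iterates are still far from any optimum.
        return peak_lr * (t + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Plugs into the routine above, e.g.:
# lr_schedule=lambda t: warmup_cosine_schedule(t, warmup_steps=100, total_steps=2000, peak_lr=0.1)
```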