Explain Transformers, activations, and training optimization | DRW