Master's Thesis

A Survey on the Concepts behind Large Language Models

GitHub

In June 2025, I successfully defended my master's thesis, which provides a mathematical description of training large language models. While most online resources omit the gradient derivations, my thesis aims to give a complete treatment. A Python implementation using AutoCompyute is provided alongside.
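To give a flavour of the kind of gradient derivation the thesis works through, here is a minimal sketch in plain NumPy (my own illustration, not the thesis code and not AutoCompyute's API): the hand-derived gradient of softmax cross-entropy with respect to the logits, checked against a central finite-difference approximation.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())          # subtract max for numerical stability
        return e / e.sum()

    def loss(z, y):
        return -np.log(softmax(z)[y])    # negative log-likelihood of the true class

    z = np.array([0.5, -1.2, 2.0])       # logits
    y = 2                                # true class index

    # Analytical gradient: dL/dz = softmax(z) - one_hot(y)
    grad = softmax(z)
    grad[y] -= 1.0

    # Numerical check with central differences
    eps = 1e-6
    num = np.array([
        (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
        for i in range(3)
    ])
    print(np.allclose(grad, num, atol=1e-6))  # True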

Abstract

Following the success of ChatGPT and the exposure of artificial intelligence (AI) capabilities to the general public, this work intends to provide a formal description of the concepts behind such applications. It presents a mathematical description of the Transformer architecture introduced by Vaswani et al. and of the training process of large language models (LLMs). It furthermore elaborates on why the Transformer was successful in addressing the challenges of recurrent neural networks (RNNs). Throughout this work, each component of a typical decoder-Transformer is detailed with mathematical formalisms, including gradient derivations. To support the theoretical framework, a library was developed and validated against PyTorch. It provides a transparent and readable implementation, while showing sufficient performance to train models with millions of parameters. Using this library, an LLM was successfully pre-trained on the Tiny Shakespeare dataset to demonstrate the learning capabilities of a Transformer model. This work serves as an introductory guide for mathematicians and computer scientists. All code is available on GitHub.
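As a rough illustration of the validation approach mentioned in the abstract (a hand-written sketch under my own assumptions, not the actual library code), one can derive the backward pass of scaled dot-product attention by hand and compare the result against PyTorch's autograd:

    import torch

    torch.manual_seed(0)
    d = 8
    Q = torch.randn(5, d, requires_grad=True)
    K = torch.randn(5, d, requires_grad=True)
    V = torch.randn(5, d, requires_grad=True)

    # Forward pass: O = softmax(Q K^T / sqrt(d)) V
    S = Q @ K.T / d**0.5
    A = torch.softmax(S, dim=-1)
    O = A @ V
    O.sum().backward()                    # reference gradients via autograd

    # Hand-derived backward pass for the same loss (sum of all outputs)
    dO = torch.ones_like(O)
    dV = A.T @ dO
    dA = dO @ V.T
    dS = A * (dA - (dA * A).sum(dim=-1, keepdim=True))   # softmax Jacobian-vector product
    dQ = dS @ K / d**0.5
    dK = dS.T @ Q / d**0.5

    print(torch.allclose(dQ, Q.grad, atol=1e-6),
          torch.allclose(dK, K.grad, atol=1e-6),
          torch.allclose(dV, V.grad, atol=1e-6))          # True True True

Checks of this kind, applied component by component, are how a custom implementation can be verified against an established framework such as PyTorch.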

Full Thesis

Note: If the PDF does not display, you can download the thesis here.