Background and Motivation

I stumbled across Andrej Karpathy’s video Let’s build GPT: from scratch, in code, spelled out. While I already knew the basics of Transformers and NLP, I thought following along would be fun and would give me a deeper understanding of the architecture. It was particularly interesting given my background in AI explainability, my relative lack of exposure to explainability for Transformers and LLMs, and their increasing popularity and impact in the world.


The core goal for me was to do a deep dive into the Transformer architecture (its structures, data flows, bottlenecks, and so on) to build a strong intuition for future experiments and work using the architecture.

The goal for the project was to build up each component of a language-modelling Transformer from the very basics, all the way to a full Transformer that could, in theory, scale to LLM size.

Starting out - the basics

Andrej does a great job of covering the basic concepts of language modelling: bigram models, and the idea of predicting the next token from some number of previous tokens.
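To make the bigram idea concrete, here is a minimal sketch of a count-based bigram model over characters. The toy corpus and the `predict_next` helper are my own illustrative choices, not code from the video (which builds a neural bigram model in PyTorch instead):

```python
from collections import Counter, defaultdict

text = "hello world, hello there"  # toy corpus, purely illustrative

# Count how often each character is followed by each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def predict_next(ch):
    """Predict the most frequent successor of ch seen in the corpus."""
    return counts[ch].most_common(1)[0][0]

print(predict_next("h"))  # 'h' is always followed by 'e' in this corpus
```

A bigram model only ever looks one token back; the rest of the video is essentially about widening that context window in a principled way.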

He doesn’t really tackle tokenisation, so we stick with characters as tokens, which keeps things simple.
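Character-level tokenisation amounts to mapping each unique character to an integer and back. A minimal sketch in the spirit of the video (the video builds the same `stoi`/`itos` tables over the Tiny Shakespeare text; the toy string here is mine):

```python
text = "hello world"  # toy string standing in for the training corpus

# The vocabulary is just the sorted unique characters of the text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> string

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # a list of small integers, one per character
print(decode(ids))  # round-trips back to "hello"
```

The upside is a tiny vocabulary and no tokenizer machinery; the downside is long sequences, since every character costs one token.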