Large Language Models from Scratch
Personal Project · 2024
A full transformer implementation built without external AI APIs: tokenization, embeddings, multi-head attention, and MLP layers. It can train on any text corpus, or fine-tune GPT-2 on custom data for text generation. Based on the "Attention Is All You Need" paper.
Live demo — text generation

The transformer, layer by layer
Each component of the transformer is implemented from first principles. The architecture follows the original "Attention is All You Need" paper, extended with modern improvements from GPT-2.
- **Tokenizer:** GPT-2 byte-pair encoding converts raw text into integer token IDs. Vocabulary size: 50,257 tokens.
- **Embeddings:** Each token ID is mapped to a dense vector; a separate positional embedding encodes sequence order.
- **Multi-head attention:** The key mechanism: each token attends to itself and all preceding tokens (causal masking). Multiple heads learn different relational patterns in parallel.
- **Layer norm and residuals:** The pre-norm architecture stabilizes training; residual connections let gradients flow through deep networks.
- **MLP:** Two linear layers with GELU activation apply a per-token transformation that mixes information across the embedding dimension.
- **Output head:** A final linear projection maps embeddings to vocabulary logits; softmax yields next-token probabilities.
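The attention step above can be sketched in a few lines of NumPy. This is an illustrative single-head version with toy shapes, not code from the project; multi-head attention runs several such heads in parallel on split projections and concatenates the results:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)              # (seq_len, seq_len)
    # Causal mask: each position attends only to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                       # each row sums to 1
    return weights @ v                              # (seq_len, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 tokens, d_model=8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

Because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.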
Fine-tuning GPT-2 on The Little Prince
We load GPT-2 XL (1.5B parameters) and fine-tune it on The Little Prince by Antoine de Saint-Exupéry. The shift in style, vocabulary, and thematic content is immediate — even with just 5 minutes of training.
The essential is invisible to the common man, but that which is essential to us so is called in the proper name. And, therefore, the power or work of generation is an essential to the nature of an orderly creature, for it serves for the preservation of the natural order of the generation. Secondly, God by the very power which is ordained to man, produces a true and sensible image of himself, as has been observed.
The essential is invisible to the eye; the indispensable is invisible to the mind. It is what makes one day different from other days, one hour from other hours. It is what makes one day a good day, and one hour a bad day. So then I chose a flower that was unique in the world. I chose the rose that belonged to me. And it is she that I shall put under the glass globe.
Before: GPT-2 XL with no fine-tuning, producing generic philosophical prose. After: same weights shifted toward the book's voice, vocabulary, and imagery.
What the models produce
Examples from the fine-tuned GPT-2 XL model. The model was not given any summary instructions — it generates text by continuation. Results are cherry-picked to show the ceiling of what the approach can achieve.
And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential is invisible to the eye. "What is essential is invisible to the eye," the little prince repeated, so that he would be sure to remember. "It is the time you have wasted for your rose that makes your rose so important."
Love is the answer that sometimes requires a little meditation. I want you to understand, very clearly, why it is that during the fifty-four years that I have had you as my friend, I have had no desire to harm you. In fact, I have constantly tried to help you. I have tried to make you happy when you were angry, and I have tried to make you happier still when you were happy.
This time, the little prince came back to his idea. "I myself own a flower. Do you think that her colour is an accident of birth?" "Of course it is." The businessman raised his head. "Flowers have been growing thorns for a long time. And if the thorns are not stopped, they bring disaster on the plants."
Disclaimer: Examples above are cherry-picked to show the best achievable output. GPT-2 XL cannot perform abstract reasoning tasks (e.g., "summarize The Little Prince"). Larger models like GPT-3+ are needed for that. Do not draw statistical conclusions from individual generations.
GPT-2 variants supported
All GPT-2 variants are available via init_from='online'. Weights are downloaded automatically on first use.
| model_path | Layers | Heads | Embed dims | Parameters | Size |
|---|---|---|---|---|---|
| gpt2 | 12 | 12 | 768 | 124M | 500 MB |
| gpt2-medium | 24 | 16 | 1024 | 350M | 1.4 GB |
| gpt2-large | 36 | 20 | 1280 | 774M | 3 GB |
| gpt2-xl (used above) | 48 | 25 | 1600 | 1,558M | 6 GB |
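The parameter counts in the table follow directly from the architecture. A back-of-the-envelope estimate (this helper is illustrative, not part of the project):

```python
def approx_gpt2_params(n_layer, n_embd, vocab_size=50257, ctx=1024):
    """Rough parameter count for a GPT-2 style model.

    Each transformer block has ~12 * n_embd^2 weights:
    4 * n_embd^2 for the attention projections (Q, K, V, output)
    and 8 * n_embd^2 for the MLP (two n_embd <-> 4*n_embd layers).
    The token embedding is shared with the output head (weight tying),
    so it is counted once; positional embeddings add ctx * n_embd.
    """
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = vocab_size * n_embd + ctx * n_embd
    return blocks + embeddings

approx_gpt2_params(12, 768)     # close to the table's 124M for gpt2
approx_gpt2_params(48, 1600)    # close to the table's 1,558M for gpt2-xl
```

Biases and layer-norm scales are omitted, which is why the estimate lands slightly off the published totals.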
Start generating text in minutes
Train a small model from scratch:

```python
from llm.train import Trainer

trainer = Trainer(
    model_path='results/my_model',
    training_data_path='https://...book.txt',
    n_layer=6,
    n_head=6,
    n_embd=384,
)
trainer.run()
```

Generate text from the trained model:

```python
from llm.sample import Sampler

sampler = Sampler(model_path='results/my_model')
text = sampler.generate_text(
    prompt='Once upon a time',
    max_tokens=200,
)
print(text)
```

Fine-tune pretrained GPT-2 XL on your own text:

```python
from llm.train import Trainer

trainer = Trainer(
    model_path='results/finetuned',
    training_data_path='my_text.txt',
    init_from='gpt2-xl',  # load pretrained weights
)
trainer.run()
```

Sample from a pretrained GPT-2 variant directly:

```python
from llm.sample import Sampler

sampler = Sampler(
    init_from='online',
    model_path='gpt2-xl',
)
print(sampler.generate_text(
    prompt='Today I decided to',
))
```
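Under the hood, generation is a loop of softmax-and-sample over the output head's logits. A minimal sketch of how next-token sampling typically works (the project's actual Sampler internals may differ):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token ID from vocabulary logits.

    Lower temperature sharpens the distribution toward the most
    likely token; temperature 1.0 samples from the raw softmax.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()              # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1, -1.0])        # toy 4-token vocabulary
token = sample_next_token(logits, temperature=0.8)
```

At very low temperature this reduces to greedy decoding (always picking the argmax token); generation then appends the sampled token and feeds the extended sequence back through the model.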