ML / AI · Project

Large Language Models from Scratch

Personal Project · 2024

Python · PyTorch · Full Transformer Implementation

Live demo — text generation
🔨 Built from scratch: No OpenAI API, no Hugging Face trainer. Every layer — tokenizer, embedding, attention, MLP — is implemented and explained.

📖 Train on any text: Feed it a book, a webpage, or a corpus. The model learns the style and content of whatever you provide.

🔧 Fine-tune GPT-2: Load pre-trained GPT-2 weights and nudge them with your own text. The model inherits GPT-2's language understanding and adapts it to your domain.
28M params (from scratch) · trains in 5 min · $0.20
1.5B params (fine-tuned) · GPT-2 XL · 48 layers
6 attention heads · from-scratch config
$0.50 fine-tune cost · on a 24-vCPU / 80 GB GPU instance
Architecture

The transformer, layer by layer

Each component of the transformer is implemented from first principles. The architecture follows the original "Attention is All You Need" paper, extended with modern improvements from GPT-2.

01 · Tokenization & Encoding

GPT-2 byte-pair encoding converts raw text into integer token IDs. Vocabulary size: 50,257 tokens.
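
For reference, the same encoding is exposed by the tiktoken library; a minimal sketch (the project implements its own tokenizer code, so this is illustrative only):

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2 byte-pair encoding
ids = enc.encode("The essential is invisible to the eye.")
print(ids)                            # list of integer token IDs
print(enc.n_vocab)                    # 50257
print(enc.decode(ids))                # round-trips back to the original text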

02 · Token + Positional Embedding

Each token ID is mapped to a dense vector. A separate positional embedding encodes sequence order.
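
A minimal PyTorch sketch of this step; n_embd matches the from-scratch config quoted later, while block_size and the example token IDs are assumed for illustration:

import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 50257, 256, 384   # block_size is an assumed context length
tok_emb = nn.Embedding(vocab_size, n_embd)   # one learned vector per token ID
pos_emb = nn.Embedding(block_size, n_embd)   # one learned vector per position

idx = torch.tensor([[464, 2740, 318]])       # example token IDs, shape (batch=1, seq=3)
pos = torch.arange(idx.size(1))
x = tok_emb(idx) + pos_emb(pos)              # (1, 3, 384) input to the transformer blocks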

03 · Multi-Head Self-Attention

The key mechanism: each token attends to every earlier token in the sequence (causal attention). Multiple heads learn different relational patterns in parallel.
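
A condensed sketch of causal multi-head self-attention in PyTorch; class and variable names are illustrative, not the project's exact module:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=384, n_head=6, block_size=256):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so all heads attend in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)   # scaled dot-product scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)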

04 · Layer Norm + Residual

Pre-norm architecture (LayerNorm applied before each sublayer) stabilizes training. Residual connections let gradients flow through deep networks.
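
The pre-norm residual pattern, sketched as a transformer block; the attention and MLP modules are passed in (simple linear placeholders keep the snippet self-contained):

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=384, attn=None, mlp=None):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # swap in the real attention / MLP modules; Linear stands in for brevity
        self.attn = attn if attn is not None else nn.Linear(n_embd, n_embd)
        self.mlp = mlp if mlp is not None else nn.Linear(n_embd, n_embd)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # normalize before the sublayer, then add the residual
        x = x + self.mlp(self.ln2(x))
        return x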

05 · MLP (Feed-Forward)

Two linear layers with a GELU activation between them, applied independently to each token to mix information across the embedding dimension.
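
A minimal sketch of the feed-forward sublayer; the 4x expansion factor follows the GPT-2 convention and is an assumption here:

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embd=384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand the embedding dimension 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),  # project back down
        )

    def forward(self, x):
        return self.net(x)   # applied independently at every position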

06 · Language Model Head

Final linear projection maps embeddings to vocabulary logits. Softmax yields next-token probabilities.
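
A minimal sketch of the head and the sampling step (tensor shapes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 50257, 384
lm_head = nn.Linear(n_embd, vocab_size)       # embeddings -> vocabulary logits

h = torch.randn(1, 8, n_embd)                 # hidden states for an 8-token sequence
logits = lm_head(h)                           # (1, 8, 50257)
probs = F.softmax(logits[:, -1, :], dim=-1)   # next-token distribution at the last position
next_id = torch.multinomial(probs, num_samples=1)   # sample the next token ID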

Results

Fine-tuning GPT-2 on The Little Prince

We load GPT-2 XL (1.5B parameters) and fine-tune it on The Little Prince by Antoine de Saint-Exupéry. The shift in style, vocabulary, and thematic content is immediate — even with just 5 minutes of training.

Input: "The essential is invisible to"
Before fine-tuning

The essential is invisible to the common man, but that which is essential to us so is called in the proper name. And, therefore, the power or work of generation is an essential to the nature of an orderly creature, for it serves for the preservation of the natural order of the generation. Secondly, God by the very power which is ordained to man, produces a true and sensible image of himself, as has been observed.

After fine-tuning

The essential is invisible to the eye; the indispensable is invisible to the mind. It is what makes one day different from other days, one hour from other hours. It is what makes one day a good day, and one hour a bad day. So then I chose a flower that was unique in the world. I chose the rose that belonged to me. And it is she that I shall put under the glass globe.

Before: GPT-2 XL with no fine-tuning, producing generic philosophical prose. After: same weights shifted toward the book's voice, vocabulary, and imagery.

Generated Text

What the models produce

Examples from the fine-tuned GPT-2 XL model. The model was not given any summary instructions — it generates text by continuation. Results are cherry-picked to show the ceiling of what the approach can achieve.

Prompt: "And now here is my secret"
Fine-tuned · GPT-2 XL

And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential is invisible to the eye. "What is essential is invisible to the eye," the little prince repeated, so that he would be sure to remember. "It is the time you have wasted for your rose that makes your rose so important."

Prompt: "Love is the answer"
Fine-tuned · GPT-2 XL

Love is the answer that sometimes requires a little meditation. I want you to understand, very clearly, why it is that during the fifty-four years that I have had you as my friend, I have had no desire to harm you. In fact, I have constantly tried to help you. I have tried to make you happy when you were angry, and I have tried to make you happier still when you were happy.

Prompt: "\n" (no prompt)
Fine-tuned · GPT-2 XL

This time, the little prince came back to his idea. "I myself own a flower. Do you think that her colour is an accident of birth?" "Of course it is." The businessman raised his head. "Flowers have been growing thorns for a long time. And if the thorns are not stopped, they bring disaster on the plants."

Disclaimer: Examples above are cherry-picked to show the best achievable output. GPT-2 XL cannot perform abstract reasoning tasks (e.g., "summarize The Little Prince"). Larger models like GPT-3+ are needed for that. Do not draw statistical conclusions from individual generations.

Pre-trained Models

GPT-2 variants supported

All GPT-2 variants are available via init_from='online'. Weights are downloaded automatically on first use.

model_path              Layers  Heads  Embed dims  Parameters  Size
gpt2                    12      12     768         124M        500 MB
gpt2-medium             24      16     1024        350M        1.4 GB
gpt2-large              36      20     1280        774M        3 GB
gpt2-xl (used above)    48      25     1600        1,558M      6 GB
Usage

Start generating text in minutes

Train from scratch on any text
from llm.train import Trainer

trainer = Trainer(
    model_path='results/my_model',
    training_data_path='https://...book.txt',
    n_layer=6,
    n_head=6,
    n_embd=384,
)
trainer.run()
Stops automatically when evaluation loss plateaus
Generate text from a trained model
from llm.sample import Sampler

sampler = Sampler(model_path='results/my_model')
text = sampler.generate_text(
    prompt='Once upon a time',
    max_tokens=200,
)
print(text)
Fine-tune GPT-2 on your own corpus
from llm.train import Trainer

trainer = Trainer(
    model_path='results/finetuned',
    training_data_path='my_text.txt',
    init_from='gpt2-xl',  # load pretrained
)
trainer.run()
First run downloads GPT-2 weights automatically
Use a pre-trained GPT-2 model directly
from llm.sample import Sampler

sampler = Sampler(
    init_from='online',
    model_path='gpt2-xl',
)
print(sampler.generate_text(
    prompt='Today I decided to',
))
Downloads ~6 GB on first use · runs on CPU