Large Language Models from Scratch
Personal Project · 2024
A full transformer implementation built without external AI APIs: tokenization, embeddings, multi-head attention, and MLP layers. It can train on any text corpus, or fine-tune GPT-2 on custom data for text generation. Based on the "Attention Is All You Need" paper.
Live demo — text generation

The transformer, layer by layer
Each component of the transformer is implemented from first principles. The architecture follows the original "Attention is All You Need" paper, extended with modern improvements from GPT-2.
- **Tokenizer:** GPT-2 byte-pair encoding converts raw text into integer token IDs. Vocabulary size: 50,257 tokens.
- **Embeddings:** Each token ID is mapped to a dense vector; a separate positional embedding encodes sequence order.
- **Multi-head attention:** The key mechanism: each token attends to itself and all preceding tokens (causal masking). Multiple heads learn different relational patterns in parallel.
- **Layer norm and residuals:** The pre-norm architecture stabilizes training; residual connections let gradients flow through deep networks.
- **MLP:** Two linear layers with GELU activation apply a per-token transformation that mixes information across the embedding dimension.
- **Output head:** A final linear projection maps embeddings to vocabulary logits; softmax yields next-token probabilities.
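The attention step above can be sketched in a few lines of NumPy. This is an illustrative single-head version with toy shapes, not code from the project; multi-head attention runs several such heads in parallel on split projections and concatenates the results:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)              # (seq_len, seq_len)
    # Causal mask: each position attends only to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                       # each row sums to 1
    return weights @ v                              # (seq_len, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 tokens, d_model=8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

Because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.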
Fine-tuning GPT-2 on The Little Prince
We load GPT-2 XL (1.5B parameters) and fine-tune it on The Little Prince by Antoine de Saint-Exupéry. The shift in style, vocabulary, and thematic content is immediate — even with just 5 minutes of training.
The essential is invisible to the common man, but that which is essential to us so is called in the proper name. And, therefore, the power or work of generation is an essential to the nature of an orderly creature, for it serves for the preservation of the natural order of the generation. Secondly, God by the very power which is ordained to man, produces a true and sensible image of himself, as has been observed.
The essential is invisible to the eye; the indispensable is invisible to the mind. It is what makes one day different from other days, one hour from other hours. It is what makes one day a good day, and one hour a bad day. So then I chose a flower that was unique in the world. I chose the rose that belonged to me. And it is she that I shall put under the glass globe.
Before: GPT-2 XL with no fine-tuning, producing generic philosophical prose. After: same weights shifted toward the book's voice, vocabulary, and imagery.
What the models produce
Examples from the fine-tuned GPT-2 XL model. The model was not given any summary instructions — it generates text by continuation. Results are cherry-picked to show the ceiling of what the approach can achieve.
And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential is invisible to the eye. "What is essential is invisible to the eye," the little prince repeated, so that he would be sure to remember. "It is the time you have wasted for your rose that makes your rose so important."
Love is the answer that sometimes requires a little meditation. I want you to understand, very clearly, why it is that during the fifty-four years that I have had you as my friend, I have had no desire to harm you. In fact, I have constantly tried to help you. I have tried to make you happy when you were angry, and I have tried to make you happier still when you were happy.
This time, the little prince came back to his idea. "I myself own a flower. Do you think that her colour is an accident of birth?" "Of course it is." The businessman raised his head. "Flowers have been growing thorns for a long time. And if the thorns are not stopped, they bring disaster on the plants."
Disclaimer: Examples above are cherry-picked to show the best achievable output. GPT-2 XL cannot perform abstract reasoning tasks (e.g., "summarize The Little Prince"). Larger models like GPT-3+ are needed for that. Do not draw statistical conclusions from individual generations.
GPT-2 variants supported
All GPT-2 variants are available via init_from='online'. Weights are downloaded automatically on first use.
| model_path | Layers | Heads | Embed dims | Parameters | Size |
|---|---|---|---|---|---|
| gpt2 | 12 | 12 | 768 | 124M | 500 MB |
| gpt2-medium | 24 | 16 | 1024 | 350M | 1.4 GB |
| gpt2-large | 36 | 20 | 1280 | 774M | 3 GB |
| gpt2-xl (used above) | 48 | 25 | 1600 | 1,558M | 6 GB |
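The parameter counts in the table follow directly from the architecture. A back-of-the-envelope estimate (this helper is illustrative, not part of the project):

```python
def approx_gpt2_params(n_layer, n_embd, vocab_size=50257, ctx=1024):
    """Rough parameter count for a GPT-2 style model.

    Each transformer block has ~12 * n_embd^2 weights:
    4 * n_embd^2 for the attention projections (Q, K, V, output)
    and 8 * n_embd^2 for the MLP (two n_embd <-> 4*n_embd layers).
    The token embedding is shared with the output head (weight tying),
    so it is counted once; positional embeddings add ctx * n_embd.
    """
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = vocab_size * n_embd + ctx * n_embd
    return blocks + embeddings

approx_gpt2_params(12, 768)     # close to the table's 124M for gpt2
approx_gpt2_params(48, 1600)    # close to the table's 1,558M for gpt2-xl
```

Biases and layer-norm scales are omitted, which is why the estimate lands slightly off the published totals.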
Start generating text in minutes
Train a small model from scratch:

```python
from llm.train import Trainer

trainer = Trainer(
    model_path='results/my_model',
    training_data_path='https://...book.txt',
    n_layer=6,
    n_head=6,
    n_embd=384,
)
trainer.run()
```

Generate text from the trained model:

```python
from llm.sample import Sampler

sampler = Sampler(model_path='results/my_model')
text = sampler.generate_text(
    prompt='Once upon a time',
    max_tokens=200,
)
print(text)
```

Fine-tune pretrained GPT-2 XL on your own text:

```python
from llm.train import Trainer

trainer = Trainer(
    model_path='results/finetuned',
    training_data_path='my_text.txt',
    init_from='gpt2-xl',  # load pretrained weights
)
trainer.run()
```

Sample from a pretrained GPT-2 variant directly:

```python
from llm.sample import Sampler

sampler = Sampler(
    init_from='online',
    model_path='gpt2-xl',
)
print(sampler.generate_text(
    prompt='Today I decided to',
))
```
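Under the hood, generation is a loop of softmax-and-sample over the output head's logits. A minimal sketch of how next-token sampling typically works (the project's actual Sampler internals may differ):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token ID from vocabulary logits.

    Lower temperature sharpens the distribution toward the most
    likely token; temperature 1.0 samples from the raw softmax.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()              # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1, -1.0])        # toy 4-token vocabulary
token = sample_next_token(logits, temperature=0.8)
```

At very low temperature this reduces to greedy decoding (always picking the argmax token); generation then appends the sampled token and feeds the extended sequence back through the model.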