NolanGPT — gjain.ai

Tech spec

What you're interacting with

Architecture

GPT

Decoder-only transformer, the same architecture family as the GPT models behind ChatGPT

Training data

1.4M

Characters across 8 Nolan screenplays

Training steps

70K

~57 minutes on Mac Mini M4

Final loss

1.27

Down from 4.71 at random initialisation
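To put the two loss figures in context, here is a small sketch. It assumes the loss is mean cross-entropy in nats per character (the usual default, but not stated on this page). Under that assumption, a model that guesses uniformly over the 97 tokens would score ln(97), about 4.57, and the exponential of the final loss gives a per-character perplexity.

```python
import math

VOCAB_SIZE = 97      # unique characters, from the spec
INITIAL_LOSS = 4.71  # reported loss at random initialisation
FINAL_LOSS = 1.27    # reported final loss

# Assumption: loss is mean cross-entropy measured in nats per character.
uniform_loss = math.log(VOCAB_SIZE)      # loss of a uniform guess over all tokens
final_perplexity = math.exp(FINAL_LOSS)  # effective number of equally likely next characters

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"final perplexity:   {final_perplexity:.2f}")
```

Read this way, the trained model is roughly as uncertain as a choice among 3 to 4 characters at each step, compared with 97 for a uniform guess.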

Parameters

~3.7M

GPT-3 has 175 billion parameters. Same idea, roughly 47,000x smaller.

Attention heads

8

Multi-head self-attention per block

Transformer blocks

4

Each block = attention + feedforward + layer norm

Embedding dim

128

Each token represented as a 128-dimensional vector

Token + Position Embeddings

Every character mapped to a 128-dim vector. Position embeddings added so the model knows where each character appears in the sequence.
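The lookup-and-add step can be sketched as follows. This is an illustration only: the tables hold random values instead of learned weights, and `BLOCK_SIZE` is a hypothetical context length that the spec does not state.

```python
import random

EMBED_DIM = 128   # from the spec
VOCAB_SIZE = 97   # from the spec
BLOCK_SIZE = 64   # hypothetical context length, not stated in the spec

rng = random.Random(0)

def random_table(rows, cols):
    """Stand-in for a learned embedding table (random values here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(VOCAB_SIZE, EMBED_DIM)     # one row per character id
position_table = random_table(BLOCK_SIZE, EMBED_DIM)  # one row per position

def embed(token_ids):
    """Return one EMBED_DIM vector per token: token row plus position row."""
    if len(token_ids) > BLOCK_SIZE:
        raise ValueError("sequence longer than the context length")
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# The same character id (5) at positions 0 and 2 gets two different vectors.
x = embed([5, 17, 5])
```

Because the position row differs, a repeated character is distinguishable by where it occurs, which is what lets attention use order.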

Multi-Head Self-Attention

8 attention heads run in parallel in each block. Each head can learn different relationships, such as character names, dialogue patterns, and scene structure. Based on "Attention Is All You Need" (Vaswani et al., 2017).
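A minimal sketch of causal multi-head attention with these sizes (128 dimensions split across 8 heads, so 16 per head). It is a simplification, not the site's implementation: the learned query/key/value and output projections are left out, so `q`, `k`, and `v` are random stand-ins, and the helper names are hypothetical.

```python
import math
import random

EMBED_DIM = 128                   # from the spec
N_HEADS = 8                       # from the spec
HEAD_DIM = EMBED_DIM // N_HEADS   # 16 dimensions per head

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_head(q, k, v):
    """Scaled dot-product attention for one head; position t sees only 0..t."""
    out = []
    for t in range(len(q)):
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(HEAD_DIM)
            for s in range(t + 1)  # causal mask: future positions are skipped
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[s][d] for s, w in enumerate(weights))
            for d in range(HEAD_DIM)
        ])
    return out

def split_heads(x):
    """Slice each 128-dim vector into 8 contiguous 16-dim chunks, one per head."""
    return [
        [row[h * HEAD_DIM:(h + 1) * HEAD_DIM] for row in x]
        for h in range(N_HEADS)
    ]

def multi_head_attention(q, k, v):
    """Run every head independently, then concatenate the results per position."""
    heads = [
        causal_head(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q), split_heads(k), split_heads(v))
    ]
    return [sum((head[t] for head in heads), []) for t in range(len(q))]

rng = random.Random(0)

def random_sequence(length):
    return [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(length)]

q, k, v = random_sequence(5), random_sequence(5), random_sequence(5)
out = multi_head_attention(q, k, v)  # 5 positions, 128 dims each
```

The causal mask is what makes the model decoder-only: the first position can attend only to itself, so its output here equals its own value vector.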

Feedforward + Residual

The feedforward layer processes each token's vector independently, transforming what attention gathered from the other positions. Residual connections let gradients flow cleanly through the 4 stacked blocks during training.
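The feedforward and residual path can be sketched like this. Several details are assumptions, since the spec does not give them: layer norm is applied before the sublayer (pre-norm), the activation is ReLU, learnable layer-norm gain and bias are omitted, and the dimensions are toy sizes so the example runs quickly. Attention is left out to keep the focus on the residual path.

```python
import math
import random

DIM = 8        # toy size for illustration; the spec's embedding dim is 128
HIDDEN = 32    # assumed 4x expansion, not stated in the spec
N_BLOCKS = 4   # from the spec

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def relu_mlp(x, w1, w2):
    """Two-layer feedforward network applied to a single token vector."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def feedforward_block(x_seq, w1, w2):
    """Residual update x + MLP(LayerNorm(x)), computed per position."""
    return [
        [xi + yi for xi, yi in zip(x, relu_mlp(layer_norm(x), w1, w2))]
        for x in x_seq
    ]

def run_stack(x_seq, blocks):
    for w1, w2 in blocks:
        x_seq = feedforward_block(x_seq, w1, w2)
    return x_seq

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

blocks = [(random_matrix(HIDDEN, DIM), random_matrix(DIM, HIDDEN)) for _ in range(N_BLOCKS)]
x_seq = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3)]
y_seq = run_stack(x_seq, blocks)
```

Each block adds a correction to its input instead of replacing it, so an identity path runs from the first block to the last. That path is what keeps gradients from vanishing as blocks are stacked.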

Trained on

Batman Begins · The Dark Knight · Inception · Interstellar · Dunkirk · Tenet · The Prestige · Oppenheimer. Character-level tokenisation — 97 unique tokens.
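Character-level tokenisation amounts to giving every distinct character an integer id. A minimal sketch, using a short made-up string as a stand-in for the screenplay corpus (the function names are hypothetical; run on the real corpus, the vocabulary would have the 97 entries reported above):

```python
def build_vocab(text):
    """Map each distinct character to an integer id, in sorted order."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

# Made-up sample in screenplay style, standing in for the training corpus.
sample = "INT. ROOM - NIGHT\n\nA man waits by the window.\n"

stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
restored = decode(ids, itos)
```

There is no subword merging and no unknown-token handling: any character seen in training has an id, and decoding reverses encoding exactly.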