LLM Training Masterclass | Build SOTA Models

[ SYSTEM OVERVIEW ]

this is your complete guide to building state-of-the-art large language models. from foundational transformer architecture to deployment at scale. everything you need to go from research papers to production models.

what you'll master

transformer architecture internals (attention, ffn, normalization)
dataset curation and preprocessing at scale
pre-training, instruction tuning, RLHF, and DPO
scaling laws and compute-optimal training (chinchilla)
distributed training (FSDP, DeepSpeed ZeRO)
mixed precision, quantization, LoRA
evaluation frameworks and benchmarks
production deployment strategies

key resources

RECOMMENDED PATH: follow modules in order. each builds on previous concepts. start small (1-7B params), reproduce papers, then scale up. hands-on practice > theory.

[ TRANSFORMER ARCHITECTURE ]

core components

self-attention: QKV projections, scaled dot-product, softmax over context
multi-head attention: parallel attention with different learned projections
feed-forward network: position-wise FFN, typically 4x hidden dim; thought to store much of the model's factual knowledge
layer normalization: pre-norm vs post-norm, RMSNorm for efficiency
residual connections: enable deep networks, gradient flow

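# simplified scaled dot-product attention
import math

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask broadcasts to (..., seq_len, seq_len), 0 = blocked
    d_k = Q.size(-1)
    # scale by sqrt(d_k) so logits don't grow with the head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # -inf stays representable in fp16/bf16 (-1e9 does not); a fully masked row would give NaN
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)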

modern architectures

GPT

decoder-only, causal attention, learned positional embeddings

LLaMA

RMSNorm, SwiGLU activation, RoPE positional embeddings

Mistral

sliding window attention, grouped-query attention (GQA)

attention optimizations

Flash Attention: IO-aware algorithm, 2-4x speedup, exact attention
Multi-Query Attention (MQA): shared K/V across heads, faster inference
Grouped-Query Attention (GQA): balance between MHA and MQA
Sliding Window: local attention for long contexts
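
the MHA / GQA / MQA trade-off above is easiest to see as KV-cache arithmetic. a minimal sketch, assuming a hypothetical 32-layer model with 32 query heads of dimension 128, a 4096-token context, and 2-byte (bf16) cache entries; none of these numbers come from a specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # factor of 2: one tensor for keys and one for values, in every layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# hypothetical model shape (assumption, for illustration only)
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

for name, kv_heads in [("MHA", Q_HEADS), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    mib = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**20
    print(f"{name:>18}: {mib:,.0f} MiB per sequence")
```

under these assumptions the cache shrinks from 2048 MiB (MHA) to 512 MiB (GQA with 8 KV heads) to 64 MiB (MQA) per sequence, which is why fewer KV heads means larger batches at inference time.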

KEY INSIGHT: attention enables contextual understanding (look at all previous tokens), while FFN enables factual recall (parameters store knowledge). this combination is what makes transformers powerful.

[ DATASET ENGINEERING ]

pre-training datasets

Dataset	Size	Type
The Pile	825 GiB	diverse text corpus
RedPajama	1.2T tokens	LLaMA training data replica
RefinedWeb	5T tokens	high-quality web data
FineWeb	15T tokens	deduplicated web corpus
The Stack	6 TB	code (permissive licenses)

instruction & chat datasets

Dataset	Size	Purpose
OpenAssistant	161K	multi-turn conversations
UltraChat	1.5M	diverse dialogues
Dolly 15K	15K	human-generated instructions

data processing pipeline

deduplication: exact and fuzzy matching (MinHash, SimHash)
quality filtering: perplexity scores, length filters, language detection
PII removal: regex patterns, NER models, privacy protection
toxicity filtering: perspective API, custom classifiers
format standardization: consistent tokenization, special tokens

DATA QUALITY MATTERS: garbage in = garbage out. invest heavily in data curation. better data > better architecture. chinchilla showed that many large models were undertrained: they saw too few tokens for their parameter count.

tokenization

# train BPE tokenizer with HuggingFace
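from tokenizers import Tokenizer, decoders, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()  # maps byte-level tokens back to text on decode

trainer = trainers.BpeTrainer(
    vocab_size=50000,
    # placeholder special tokens (assumed names); use the set your model expects
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
    # seed the vocab with all 256 byte symbols so no input is out-of-vocabulary
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)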

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

[ TRAINING METHODOLOGIES ]

pre-training from scratch

initialization: scaled random init, He/Xavier initialization
warmup: linear warmup for 2K-10K steps prevents instability
learning rate schedule: cosine decay with warmup
batch size: large batches (2M-4M tokens) for stability
gradient clipping: max norm 1.0 to prevent explosions
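
the warmup and cosine-decay items above combine into a single schedule function. a minimal sketch in plain python, using the peak LR, warmup, and step counts from this guide's config; decaying to 10% of the peak is an assumption, not something the guide specifies.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, max_steps=500_000, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        # ramp linearly so the first updates are small
        return peak_lr * (step + 1) / warmup_steps
    # fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 1000, 2000, 250_000, 500_000):
    print(f"step {step:>7}: lr = {lr_at_step(step):.2e}")
```

in a real run the same shape usually comes from the framework (for example `lr_scheduler_type="cosine"` with `warmup_steps` in the training script later in this guide); writing it out by hand is mainly useful for plotting and sanity checks.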

scaling laws (chinchilla)

compute-optimal: ~20 tokens per parameter
more compute → scale model size AND training tokens in equal proportion (doubling compute ≈ 1.4x each)
most models are undertrained (too few tokens)
70B model needs ~1.4T tokens for optimal training
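
the rule of thumb above turns into a few lines of arithmetic. a sketch, assuming the ~20 tokens-per-parameter ratio from the list and the common approximation that training costs about 6 FLOPs per parameter per token; the function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20  # rule of thumb from the list above


def optimal_tokens(params):
    """Training tokens suggested for a model with `params` parameters."""
    return TOKENS_PER_PARAM * params


def training_flops(params, tokens):
    """Approximate training cost: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * params * tokens


def compute_optimal_split(flops):
    """Solve 6 * N * (20 * N) = C for N, then derive the token count."""
    params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    return params, optimal_tokens(params)


n = 70e9
d = optimal_tokens(n)
c = training_flops(n, d)
print(f"70B params -> {d:.2e} tokens, ~{c:.2e} FLOPs")

n2, d2 = compute_optimal_split(2 * c)
print(f"2x compute -> params x{n2 / n:.2f}, tokens x{d2 / d:.2f}")
```

the 70B case reproduces the ~1.4T tokens quoted above, and doubling the budget scales both parameters and tokens by about 1.41 (the square root of 2).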

# typical pre-training config
{
    "hidden_size": 4096,
    "num_layers": 32,
    "num_heads": 32,
    "intermediate_size": 11008,
    "vocab_size": 50000,
    "max_position_embeddings": 4096,
    "learning_rate": 3e-4,
    "warmup_steps": 2000,
    "max_steps": 500000,
    "batch_size": 256,
    "gradient_accumulation_steps": 16,
    "weight_decay": 0.1,
    "bf16": true
}
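
to see what size of model a config like this describes, here is a rough parameter count. it is a sketch under stated assumptions: LLaMA-style blocks (four attention projections, a three-matrix SwiGLU FFN, two RMSNorm weights per layer), no biases, and untied input/output embeddings. the config itself does not state any of these.

```python
def estimate_params(hidden_size, num_layers, intermediate_size, vocab_size,
                    tied_embeddings=False):
    """Rough parameter count for a LLaMA-style decoder (assumed layout, no biases)."""
    attention = 4 * hidden_size * hidden_size      # Q, K, V, O projections
    mlp = 3 * hidden_size * intermediate_size      # gate, up, down (SwiGLU)
    norms = 2 * hidden_size                        # two RMSNorm weight vectors
    per_layer = attention + mlp + norms
    embeddings = vocab_size * hidden_size
    if not tied_embeddings:
        embeddings *= 2                            # separate output projection
    return num_layers * per_layer + embeddings + hidden_size  # + final norm


total = estimate_params(hidden_size=4096, num_layers=32,
                        intermediate_size=11008, vocab_size=50000)
print(f"~{total / 1e9:.2f}B parameters")
```

with the values from the config this lands at roughly 6.9B parameters, i.e. a 7B-class model, which at ~20 tokens per parameter would call for on the order of 140B training tokens.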

instruction tuning (SFT)

convert base model to instruction-following
much lower LR than pre-training (1e-5 to 5e-5)
2-3 epochs typically sufficient
50K-200K high-quality examples

RLHF pipeline

step 1: train reward model on preference pairs
step 2: use PPO to optimize policy against reward
step 3: KL penalty to prevent drift from SFT model
complex, requires careful tuning, prone to reward hacking
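
the two quantities at the heart of the pipeline above fit in a few lines: the pairwise loss used to train the reward model, and the KL-penalized reward the policy is optimized against. a minimal sketch on scalar values, assuming a Bradley-Terry style pairwise loss and a per-sample log-ratio penalty; the `kl_coef` value is an arbitrary placeholder.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Step 1: push the reward of the preferred response above the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def kl_shaped_reward(reward, logp_policy, logp_sft, kl_coef=0.1):
    """Steps 2-3: reward minus a penalty for drifting away from the SFT model."""
    return reward - kl_coef * (logp_policy - logp_sft)


print(reward_model_loss(2.0, 0.0))          # correct ranking -> small loss
print(reward_model_loss(0.0, 2.0))          # wrong ranking -> large loss
print(kl_shaped_reward(1.0, -4.0, -5.0))    # policy more confident than SFT -> reward reduced
```

reward hacking shows up when the policy finds outputs with a high raw reward but a large log-ratio against the SFT model; raising `kl_coef` trades reward for staying close to the reference.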

DPO (simpler alternative)

train directly on preference pairs, no reward model needed
more stable than PPO, easier to implement
better for smaller teams and limited compute

# DPO training with TRL
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    learning_rate=5e-7,
    beta=0.1,  # KL penalty
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

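# model, ref_model, tokenizer, and preference_dataset are assumed to be loaded already;
# the dataset needs "prompt", "chosen", and "rejected" columns
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)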

trainer.train()
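
for intuition about what the trainer optimizes, here is the DPO loss for a single preference pair, written on scalar sequence log-probabilities. a sketch only: real implementations batch this over tensors, and the argument names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # stable form of -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# policy identical to the reference: margin is 0, loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# policy has raised the chosen response and lowered the rejected one: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

`beta` plays the role of the KL penalty: a larger value makes the loss more sensitive to any movement away from the reference model's log-probabilities.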

TRAINING PHASES: pre-training (learn language) → instruction tuning (learn to follow instructions) → alignment (learn human preferences). each phase serves a distinct purpose.

[ OPTIMIZATION & EFFICIENCY ]

mixed precision training

BF16: best for training, stable, same range as FP32
FP16: needs loss scaling, but widely supported
about half the memory per tensor versus FP32, and often 2-3x faster on GPUs with tensor cores
use torch.amp for automatic mixed precision
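
the "same range as FP32" point follows directly from the bit layouts. a small sketch that derives the largest finite value and the spacing near 1.0 for each format from its exponent and mantissa widths (IEEE-style layout assumed, with the all-ones exponent reserved for inf/NaN).

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of a binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


# (exponent bits, mantissa bits) for each format
FORMATS = {"fp32": (8, 23), "bf16": (8, 7), "fp16": (5, 10)}

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.3e}, eps = {epsilon(m_bits):.1e}")
```

bf16 keeps fp32's 8 exponent bits, so it covers essentially the same range with much coarser precision; fp16 tops out at 65504, which is why gradients can overflow or underflow and need loss scaling.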

gradient checkpointing

trade compute for memory (recompute activations in backward)
~30-40% slower but uses ~50% less memory
essential for training large models on limited VRAM
enables longer sequences and larger batch sizes

distributed training

DDP: data parallel, replicate model across GPUs
FSDP: fully sharded, split params/grads/optimizer across GPUs
DeepSpeed ZeRO: stage 1 (optimizer), stage 2 (gradients), stage 3 (parameters)
pipeline parallelism: split layers across GPUs
tensor parallelism: split individual layers across GPUs

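# FSDP training (assumes torch.distributed is already initialized, e.g. via torchrun)
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # an enum value, not a string
    device_id=torch.cuda.current_device(),
)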

# train normally - FSDP handles sharding automatically
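
what each sharding stage buys can be estimated with simple arithmetic. a sketch assuming mixed-precision Adam at 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights plus two optimizer moments) and ignoring activations; FSDP's full sharding corresponds roughly to ZeRO stage 3.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Per-GPU bytes for weights, gradients, and optimizer state at a ZeRO stage (0-3)."""
    params = 2 * num_params    # half-precision weights
    grads = 2 * num_params     # half-precision gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus      # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus     # stage 3 also shards the weights themselves
    return params + grads + optim


# hypothetical setup: 7.5B parameters on 64 GPUs
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

for this hypothetical 7.5B model on 64 GPUs the model-state footprint falls from 120 GB (plain DDP) to about 31.4, 16.6, and 1.9 GB at stages 1, 2, and 3; activation memory comes on top of that.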

LoRA (parameter-efficient fine-tuning)

learn low-rank updates to weight matrices
train only 0.1-1% of parameters
much less memory, faster training
merge adapters back into model after training

# LoRA with PEFT
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# typically well under 1% of params end up trainable; print the exact figure:
model.print_trainable_parameters()
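
the trainable fraction can be worked out by hand. a sketch assuming every targeted projection is a square hidden x hidden matrix (true for plain multi-head attention, not for GQA models) and a hypothetical base model of about 6.9B parameters with 32 layers and hidden size 4096.

```python
def lora_trainable_params(hidden_size, num_layers, rank, num_target_modules):
    """Parameters added by LoRA when each target is a hidden x hidden matrix."""
    # each adapted matrix gains A (rank x hidden) and B (hidden x rank)
    per_matrix = rank * (hidden_size + hidden_size)
    return num_layers * num_target_modules * per_matrix


BASE_PARAMS = 6.9e9  # hypothetical base model size (assumption)

added = lora_trainable_params(hidden_size=4096, num_layers=32, rank=16, num_target_modules=4)
print(f"LoRA params: {added:,} ({100 * added / BASE_PARAMS:.2f}% of the base model)")
```

with r=16 on the four attention projections this comes to about 16.8M parameters, roughly 0.24% of the assumed base model; adding the FFN projections to `target_modules` or raising the rank increases it proportionally.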

quantization

8-bit: bitsandbytes, minimal quality loss
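4-bit: GPTQ or bitsandbytes NF4 for extreme compression; QLoRA fine-tunes adapters on top of a 4-bit base
a 70B model at 4-bit needs ~35 GB for weights alone, so it fits across two 24 GB consumer GPUs rather than one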
inference speedup + memory reduction
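
the core idea behind 8-bit weight quantization fits in a few lines. this is a toy per-tensor absmax scheme in plain python, shown only to illustrate the round trip; libraries such as bitsandbytes and GPTQ use more elaborate methods (block-wise scales, outlier handling, error-compensating rounding).

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.004, 0.61]  # made-up values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max error={max_error:.4f}")

# weight memory for a 70B-parameter model at different bit widths
for bits in (16, 8, 4):
    print(f"{bits}-bit: {70e9 * bits / 8 / 1e9:.0f} GB")
```

the rounding error per weight is bounded by half the scale, so a single large outlier inflates the error for every other weight in the tensor; that is the main motivation for block-wise scales.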

OOM ERRORS? try in order: (1) reduce batch size, (2) enable gradient checkpointing, (3) use FSDP/ZeRO-3, (4) use LoRA, (5) reduce model size or sequence length.

[ EVALUATION & BENCHMARKS ]

key benchmarks

Benchmark	Measures
MMLU	multitask accuracy (57 subjects)
HellaSwag	commonsense reasoning
TruthfulQA	truthfulness, avoiding misconceptions
GSM8K	grade school math reasoning
HumanEval	code generation (python)

leaderboards

Open LLM Leaderboard - comprehensive benchmark suite
Chatbot Arena - Elo ratings from human preferences
AlpacaEval - instruction-following evaluation

running evaluations

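# lm-eval-harness
pip install lm-eval

# humaneval is left out of this run: it executes generated code, and the
# harness expects an explicit opt-in for that (see the lm-eval docs)
lm_eval --model hf \
    --model_args pretrained=your-model \
    --tasks mmlu,hellaswag,gsm8k \
    --device cuda \
    --batch_size 8 \
    --output_path results/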

metrics to track

perplexity: exp(cross-entropy loss), lower = better
accuracy: exact match on multiple choice / classification
pass@k: code correctness (k samples)
human eval: side-by-side comparisons, Elo ratings
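
two of the metrics above are short enough to write out. a sketch in plain python: perplexity from per-token log-probabilities (natural log), and the unbiased pass@k estimator from the HumanEval paper, which assumes n samples were generated per problem and c of them passed the tests.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# a model that assigns probability 1/4 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 8))

# 10 samples per problem, 3 of them correct
print(pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=3, k=5))
```

pass@k is averaged over all problems in the benchmark; generating n > k samples and using the estimator gives a lower-variance number than drawing exactly k samples once.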

EVALUATION BEST PRACTICES: use multiple benchmarks, watch for contamination, prioritize human eval for alignment tasks, track both capability and safety metrics.

[ ESSENTIAL RESEARCH PAPERS ]

foundational papers

Attention Is All You Need (2017)

Vaswani et al., Google

introduced transformer architecture. self-attention, multi-head attention, positional encodings.

GPT-3: Language Models are Few-Shot Learners (2020)

Brown et al., OpenAI

demonstrated few-shot in-context learning at 175B parameters, with performance improving steadily as models scale.

Training Compute-Optimal Large Language Models (2022)

Hoffmann et al., DeepMind

chinchilla scaling laws. most LLMs undertrained. ~20 tokens per parameter optimal.

architecture & optimization

Flash Attention (2022)

Dao et al., Stanford

IO-aware attention algorithm. 2-4x speedup with no approximation.

LLaMA: Open and Efficient Foundation Language Models (2023)

Touvron et al., Meta

RMSNorm, SwiGLU, RoPE. strong performance at 7B-65B scale.

LoRA: Low-Rank Adaptation (2021)

Hu et al., Microsoft

parameter-efficient fine-tuning. train 0.1% of params with minimal quality loss.

training & alignment

InstructGPT (2022)

Ouyang et al., OpenAI

RLHF methodology. reward modeling + PPO for alignment.

Direct Preference Optimization (2023)

Rafailov et al., Stanford

train on preferences directly without reward model. simpler than RLHF.

Constitutional AI (2022)

Bai et al., Anthropic

self-critique and principle-based feedback. scalable oversight.

[ IMPLEMENTATION GUIDE ]

training stack

PyTorch: primary deep learning framework
HuggingFace Transformers: model implementations
HuggingFace Accelerate: distributed training abstraction
DeepSpeed: optimization and scaling
TRL: RLHF and alignment tools
Weights & Biases: experiment tracking

complete training script

# train.py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "base-model",
    torch_dtype=torch.bfloat16,
    use_cache=False,
)
model.gradient_checkpointing_enable()

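tokenizer = AutoTokenizer.from_pretrained("base-model")
if tokenizer.pad_token is None:
    # many causal LM tokenizers ship without a pad token; the collator needs one to batch
    tokenizer.pad_token = tokenizer.eos_token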

# load and tokenize dataset
dataset = load_dataset("your-dataset")

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

tokenized = dataset.map(tokenize, batched=True)

# training arguments
args = TrainingArguments(
    output_dir="./model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=3e-4,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=100,
    save_steps=1000,
)

# train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
trainer.save_model()

instruction tuning

# instruction tuning with TRL
from trl import SFTTrainer

# format dataset
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

dataset = dataset.map(format_instruction)

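# train (model and args are reused from train.py; for SFT, lower learning_rate
# to the 1e-5 to 5e-5 range recommended in the instruction tuning section)
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    # newer TRL releases take these two options on SFTConfig rather than here
    dataset_text_field="text",
    max_seq_length=2048,
)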

trainer.train()

multi-GPU training

# launch with accelerate
accelerate config  # run once to configure

accelerate launch train.py

# or use torchrun
torchrun --nproc_per_node=8 train.py

# or DeepSpeed
deepspeed train.py --deepspeed ds_config.json

DEBUGGING CHECKLIST: loss is NaN → reduce LR | OOM → reduce batch size, enable checkpointing | slow training → check GPU util, increase batch size | loss not decreasing → verify data quality and labels

[ DEPLOYMENT & PRODUCTION ]

inference optimization

quantization: 8-bit, 4-bit (GPTQ, bitsandbytes)
model pruning: remove redundant weights
distillation: train smaller model to mimic larger
speculative decoding: draft + verify for faster generation

serving frameworks

vLLM

PagedAttention, highest throughput, production-ready

Text Generation Inference

HuggingFace's official server, easy to use

Ollama

run LLMs locally with ease, great for development

vLLM deployment

# start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 4096

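# then, from a separate Python process, use an OpenAI-compatible client
from openai import OpenAI
# the key is a placeholder; vLLM only checks it if the server was started with one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "hello!"}]
)
print(response.choices[0].message.content)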

monitoring

latency (p50, p95, p99 percentiles)
throughput (requests/sec, tokens/sec)
error rate and failure modes
GPU utilization and memory
cost per request
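
the latency and throughput numbers above are simple to compute from request logs. a minimal sketch using the nearest-rank percentile definition, with made-up latencies; a production setup would more likely rely on histogram metrics from its monitoring stack.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_second(total_tokens, wall_seconds):
    """Aggregate generation throughput over a measurement window."""
    return total_tokens / wall_seconds


# made-up request latencies in milliseconds: 1, 2, ..., 100
latencies_ms = list(range(1, 101))

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
print(f"throughput: {tokens_per_second(12_000, 8.0):.0f} tokens/sec")
```

tail percentiles (p95, p99) matter more than the mean for user-facing services, because a handful of slow requests dominates perceived responsiveness.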

safety & content filtering

input validation and sanitization
output content filtering (toxicity, PII)
rate limiting and abuse prevention
logging for audit and improvement

PRODUCTION CHECKLIST: implement error handling, monitoring, rate limiting, cost tracking, and safety measures BEFORE serving to users. test with production-like loads.

[ NEXT STEPS ]

start small, scale up

begin with 1-7B parameter models
reproduce a paper's results to validate setup
fine-tune existing models on your domain
contribute to open source projects
share findings with the community

community resources

HuggingFace Forums - active Q&A community
EleutherAI Discord - open source LLM research
r/LocalLLaMA - local LLM enthusiasts

follow research

ArXiv: cs.CL, cs.LG, cs.AI
Labs: Anthropic, OpenAI, Google DeepMind, Meta AI, Mistral AI
Open Source: EleutherAI, HuggingFace, Together AI

FINAL ADVICE: the field moves fast. stay curious, experiment often, don't be afraid to fail. consistent practice > reading papers. build, deploy, iterate. the best way to learn is by doing.
