[ LLM TRAINING MASTERCLASS ]

Build State-of-the-Art Language Models | From Theory to Production

[ SYSTEM OVERVIEW ]

this is your complete guide to building state-of-the-art large language models, from foundational transformer architecture to deployment at scale: everything you need to go from research papers to production models.

what you'll master

  • transformer architecture internals (attention, ffn, normalization)
  • dataset curation and preprocessing at scale
  • pre-training, instruction tuning, RLHF, and DPO
  • scaling laws and compute-optimal training (chinchilla)
  • distributed training (FSDP, DeepSpeed ZeRO)
  • mixed precision, quantization, LoRA
  • evaluation frameworks and benchmarks
  • production deployment strategies

RECOMMENDED PATH: follow modules in order. each builds on previous concepts. start small (1-7B params), reproduce papers, then scale up. hands-on practice > theory.

[ TRANSFORMER ARCHITECTURE ]

core components

  • self-attention: QKV projections, scaled dot-product, softmax over context
  • multi-head attention: parallel attention with different learned projections
  • feed-forward network: position-wise FFN, typically ~4x hidden dim (about 8/3x with gated variants such as SwiGLU), widely thought to store much of the model's factual knowledge
  • layer normalization: pre-norm vs post-norm, RMSNorm for efficiency
  • residual connections: enable deep networks, gradient flow
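# simplified attention implementation
import math

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask broadcasts to (..., seq_len, seq_len), 0 = masked out
    d_k = Q.size(-1)
    # scale by sqrt(d_k) so the logits do not grow with the head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # dtype-aware fill value: a hard-coded -1e9 overflows in fp16
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)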

modern architectures

GPT

decoder-only, causal attention, learned positional embeddings

LLaMA

RMSNorm, SwiGLU activation, RoPE positional embeddings

Mistral

sliding window attention, grouped-query attention (GQA)

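the rotary position embeddings (RoPE) listed for LLaMA above can be sketched in a few lines. this is a minimal pure-python illustration, not LLaMA's actual implementation: it uses the interleaved pairing from the RoFormer formulation, and the vectors `q` and `k`, the helper names, and `base=10000.0` are assumptions chosen for the example. it shows the property that makes RoPE useful: the attention logit depends only on the distance between positions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by pos * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # pair frequency: base^(-2k/d) with k = i/2
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# hypothetical query/key vectors for a single head of dimension 4
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# same offset (2 positions apart) at very different absolute positions
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 103), rope(k, 101))
print(near, far)  # the two logits agree up to floating-point error
```

because position enters only through a rotation, no learned positional table is needed.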

attention optimizations

  • Flash Attention: IO-aware algorithm, 2-4x speedup, exact attention
  • Multi-Query Attention (MQA): shared K/V across heads, faster inference
  • Grouped-Query Attention (GQA): balance between MHA and MQA
  • Sliding Window: local attention for long contexts
KEY INSIGHT: attention provides contextual understanding (each token can look at all previous tokens), while the FFN layers are widely thought to handle factual recall (their parameters store knowledge). this combination is what makes transformers powerful.
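to see why MQA and GQA speed up inference, it helps to count the bytes in the KV cache. the sketch below is plain arithmetic under stated assumptions: a hypothetical 32-layer model with 32 query heads of dimension 128, a 4096-token context, and 2 bytes per cached value (fp16/bf16). the function name is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # factor 2 = one key and one value vector per layer, KV head and position
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

layers, heads, head_dim, seq_len = 32, 32, 128, 4096

for name, kv_heads in [("MHA", heads), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
    print(f"{name}: {gib:.4f} GiB per sequence")
```

the cache shrinks in direct proportion to the number of KV heads, which is why GQA sits between MHA and MQA on both memory and quality.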

[ DATASET ENGINEERING ]

pre-training datasets

Dataset      Size           Type
The Pile     825 GiB        diverse text corpus
RedPajama    1.2T tokens    LLaMA training data replica
RefinedWeb   ~5T tokens     high-quality web data
FineWeb      15T tokens     deduplicated web corpus
The Stack    6 TB           code (permissive licenses)

instruction & chat datasets

Dataset         Size             Purpose
OpenAssistant   161K messages    multi-turn conversations
UltraChat       1.5M dialogues   diverse dialogues
Dolly 15K       15K examples     human-generated instructions

data processing pipeline

  • deduplication: exact and fuzzy matching (MinHash, SimHash)
  • quality filtering: perplexity scores, length filters, language detection
  • PII removal: regex patterns, NER models, privacy protection
  • toxicity filtering: perspective API, custom classifiers
  • format standardization: consistent tokenization, special tokens
DATA QUALITY MATTERS: garbage in = garbage out. invest heavily in data curation. better data > better architecture. chinchilla showed that most large models of its era were undertrained: they saw too few tokens for their parameter count.
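the fuzzy deduplication step in the pipeline above can be sketched with MinHash. this is a toy version under stated assumptions: word 3-gram shingles, 64 hash functions built from salted `hashlib.blake2b`, and made-up example documents. production pipelines add locality-sensitive hashing on top so they never compare all pairs.

```python
import hashlib

def shingles(text, n=3):
    # overlapping word n-grams; short texts fall back to a single shingle
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    # one salted hash function per slot; keep the minimum hash over all shingles
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # fraction of matching slots approximates the Jaccard similarity of the shingle sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "large language models are trained on trillions of tokens of text"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated text scores low
```

a typical pipeline drops one document of any pair whose estimated similarity exceeds a chosen threshold.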

tokenization

# train BPE tokenizer with HuggingFace
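from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()  # so decode() restores the original text

trainer = trainers.BpeTrainer(
    vocab_size=50000,
    # placeholder names (assumed): substitute the special tokens your model expects
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
    # seed the vocab with all byte symbols so no input is out-of-vocabulary
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)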

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
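what the BPE trainer does internally can be shown with a toy merge loop. this is a character-level sketch on a made-up word list, not the byte-level algorithm the tokenizers library runs. the function name and corpus are assumptions for illustration.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # every word starts as a tuple of single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # rewrite every word with the winning pair fused into one symbol
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "wider"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

frequent character sequences become single tokens first, which is why common words end up as one token and rare words split into several.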

[ TRAINING METHODOLOGIES ]

pre-training from scratch

  • initialization: small-variance normal init (e.g. std 0.02), often with residual output projections scaled down by depth as in GPT-2
  • warmup: linear warmup for 2K-10K steps prevents instability
  • learning rate schedule: cosine decay with warmup
  • batch size: large batches (2M-4M tokens) for stability
  • gradient clipping: max norm 1.0 to prevent explosions
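the warmup and cosine decay items above combine into a single function of the step count. a minimal sketch follows. the peak LR, warmup length and total steps are example values, and the floor at 10% of the peak (`min_lr_ratio`) is an assumption, not something the list specifies.

```python
import math

def lr_at(step, max_lr=3e-4, warmup_steps=2000, max_steps=500_000, min_lr_ratio=0.1):
    # linear warmup from ~0 up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # cosine decay from max_lr down to min_lr over the remaining steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = max_lr * min_lr_ratio
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 1000, 2000, 250_000, 500_000):
    print(step, f"{lr_at(step):.2e}")
```

in PyTorch the same shape is usually passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.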

scaling laws (chinchilla)

  • compute-optimal: ~20 tokens per parameter
  • scaling compute → grow model size AND training tokens in equal proportion (2x compute ≈ 1.4x each)
  • most models are undertrained (too few tokens)
  • 70B model needs ~1.4T tokens for optimal training
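the rule of thumb above turns into a quick budget calculation. the sketch assumes the ~20 tokens per parameter ratio from the list and the common approximation that training costs about 6 FLOPs per parameter per token. the function name is made up for illustration.

```python
def chinchilla_budget(num_params, tokens_per_param=20):
    tokens = tokens_per_param * num_params
    flops = 6 * num_params * tokens  # ~6 FLOPs per parameter per token (approximation)
    return tokens, flops

for n in (1e9, 7e9, 70e9):
    tokens, flops = chinchilla_budget(n)
    print(f"{n / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens, ~{flops:.2e} training FLOPs")
```

dividing the FLOP estimate by your cluster's sustained throughput gives a first guess at wall-clock training time.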
# typical pre-training config
{
    "hidden_size": 4096,
    "num_layers": 32,
    "num_heads": 32,
    "intermediate_size": 11008,
    "vocab_size": 50000,
    "max_position_embeddings": 4096,
    "learning_rate": 3e-4,
    "warmup_steps": 2000,
    "max_steps": 500000,
    "batch_size": 256,
    "gradient_accumulation_steps": 16,
    "weight_decay": 0.1,
    "bf16": true
}
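a quick sanity check on a config like this is to estimate its parameter count. the sketch below assumes a LLaMA-style block (no biases, a three-matrix gated MLP, RMSNorm, untied input and output embeddings). none of this is stated by the config itself, so treat the result as an estimate.

```python
def estimate_params(hidden_size, num_layers, intermediate_size, vocab_size,
                    tie_embeddings=False):
    attn = 4 * hidden_size * hidden_size       # Q, K, V and output projections
    mlp = 3 * hidden_size * intermediate_size  # gate, up and down projections (assumed gated MLP)
    norms = 2 * hidden_size                    # two norm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return num_layers * per_layer + embeddings + lm_head + final_norm

total = estimate_params(hidden_size=4096, num_layers=32,
                        intermediate_size=11008, vocab_size=50000)
print(f"~{total / 1e9:.2f}B parameters")
```

under these assumptions the config describes a model of roughly 7B parameters.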

instruction tuning (SFT)

  • convert base model to instruction-following
  • much lower LR than pre-training (1e-5 to 5e-5)
  • 2-3 epochs typically sufficient
  • 50K-200K high-quality examples

RLHF pipeline

  • step 1: train reward model on preference pairs
  • step 2: use PPO to optimize the policy (initialized from the SFT model) against the reward model
  • step 3: add a KL penalty to the PPO reward to prevent drift from the SFT model
  • complex, requires careful tuning, prone to reward hacking

DPO (simpler alternative)

  • train directly on preference pairs, no reward model needed
  • more stable than PPO, easier to implement
  • better for smaller teams and limited compute
# DPO training with TRL
from trl import DPOTrainer, DPOConfig

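config = DPOConfig(
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL penalty toward ref_model
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

# model, ref_model, tokenizer and preference_dataset are assumed to be loaded already;
# the dataset needs "prompt", "chosen" and "rejected" columns
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen reference, usually a copy of the SFT model
    args=config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # newer TRL releases rename this argument to processing_class
)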

trainer.train()
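the objective the trainer minimizes is short enough to write out. this is a sketch of the per-pair DPO loss from the Rafailov et al. paper, taking summed sequence log-probabilities as plain floats. the function name and example numbers are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # implicit rewards: how far the policy has moved from the reference on each response
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# policy identical to the reference: loss is log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# policy now prefers the chosen response more than the reference does: loss drops
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

beta controls how strongly the policy is held near the reference: larger values penalize the same log-probability shift more.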
TRAINING PHASES: pre-training (learn language) → instruction tuning (learn to follow instructions) → alignment (learn human preferences). each phase serves a distinct purpose.

[ OPTIMIZATION & EFFICIENCY ]

mixed precision training

  • BF16: best for training, stable, same range as FP32
  • FP16: needs loss scaling, but widely supported
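  • roughly halves activation memory and often gives a 2-3x speedup on GPUs with tensor cores (optimizer states and any FP32 master weights are not reduced)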
  • use torch.amp for automatic mixed precision

gradient checkpointing

  • trade compute for memory (recompute activations in backward)
  • ~30-40% slower but uses ~50% less memory
  • essential for training large models on limited VRAM
  • enables longer sequences and larger batch sizes

distributed training

  • DDP: data parallel, replicate model across GPUs
  • FSDP: fully sharded, split params/grads/optimizer across GPUs
  • DeepSpeed ZeRO: stage 1 (optimizer), stage 2 (gradients), stage 3 (parameters)
  • pipeline parallelism: split layers across GPUs
  • tensor parallelism: split individual layers across GPUs
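# FSDP training
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# assumes torch.distributed is already initialized (e.g. launched via torchrun)
model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # an enum, not a string
    device_id=torch.cuda.current_device(),
)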

# train normally - FSDP handles sharding automatically
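the ZeRO stages listed above differ in which training state gets sharded, and the memory effect is easy to estimate. the sketch follows the accounting in the ZeRO paper for mixed-precision Adam (2 bytes of weights, 2 of gradients, 12 of optimizer state per parameter) and ignores activations and buffers. the function name is hypothetical.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    # bytes per parameter: 16-bit weights, 16-bit grads, fp32 master copy + Adam moments
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 also shards parameters
    return num_params * (weights + grads + optim) / 1e9

for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

for a 7.5B model on 64 GPUs this reproduces the paper's headline figures: about 120 GB unsharded versus under 2 GB per GPU at stage 3. FSDP's full-shard mode corresponds to stage 3.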

LoRA (parameter-efficient fine-tuning)

  • learn low-rank updates to weight matrices
  • train only 0.1-1% of parameters
  • much less memory, faster training
  • merge adapters back into model after training
# LoRA with PEFT
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# typically well under 1% of params are trainable; check with model.print_trainable_parameters()
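the trainable fraction can be estimated by hand. the sketch assumes a hypothetical 32-layer model with 4096x4096 attention projections and roughly 6.9B base parameters. the exact percentage depends on the model, the rank and the target modules.

```python
def lora_trainable_params(d_in, d_out, rank):
    # the frozen weight W is (d_out x d_in); LoRA trains A (rank x d_in) and B (d_out x rank)
    return rank * (d_in + d_out)

hidden, layers, rank = 4096, 32, 16
per_module = lora_trainable_params(hidden, hidden, rank)
total = per_module * 4 * layers  # q_proj, k_proj, v_proj, o_proj in every layer
base_params = 6.9e9              # assumed size of the frozen base model

print(f"{total:,} trainable parameters ({100 * total / base_params:.2f}% of the base model)")
```

only these adapter weights need gradients and optimizer state, which is where most of the memory saving comes from.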

quantization

  • 8-bit: bitsandbytes, minimal quality loss
  • 4-bit: QLoRA, GPTQ for extreme compression
  • enables running 70B models on consumer hardware (at 4-bit the weights are ~35 GB, e.g. two 24 GB GPUs)
  • inference speedup + memory reduction
OOM ERRORS? try in order: (1) reduce batch size, (2) enable gradient checkpointing, (3) use FSDP/ZeRO-3, (4) use LoRA, (5) reduce model size or sequence length.
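the core idea behind 8-bit quantization is a scale factor plus rounding. the sketch below shows the simplest variant, symmetric absmax quantization of one small weight vector. it is not what bitsandbytes or GPTQ actually implement (those add per-block scales, outlier handling or error compensation), and the weights are made-up values.

```python
def quantize_absmax_int8(values):
    # map the largest magnitude to 127 and round everything else onto the int8 grid
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.0, 1.27, -0.9]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.5f}")
```

the round-trip error is bounded by half the scale, so a single large outlier weight inflates the error for every other value in the group. this is why practical schemes quantize in small blocks.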

[ EVALUATION & BENCHMARKS ]

key benchmarks

Benchmark    Measures
MMLU         multitask accuracy (57 subjects)
HellaSwag    commonsense reasoning
TruthfulQA   truthfulness, avoiding misconceptions
GSM8K        grade school math reasoning
HumanEval    code generation (python)

running evaluations

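# lm-eval-harness
# note: humaneval executes model-generated code; run it sandboxed and expect to enable code execution explicitly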
pip install lm-eval

lm_eval --model hf \
    --model_args pretrained=your-model \
    --tasks mmlu,hellaswag,gsm8k,humaneval \
    --device cuda \
    --batch_size 8 \
    --output_path results/

metrics to track

  • perplexity: exp(cross-entropy loss), lower = better
  • accuracy: exact match on multiple choice / classification
  • pass@k: probability that at least one of k sampled solutions passes the unit tests
  • human eval: side-by-side comparisons, Elo ratings
EVALUATION BEST PRACTICES: use multiple benchmarks, watch for contamination, prioritize human eval for alignment tasks, track both capability and safety metrics.
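two of the metrics above are one-liners. the sketch computes perplexity from per-token negative log-likelihoods and pass@k with the unbiased estimator from the Codex paper (generate n samples, count the c that pass, estimate for k <= n). the function names and example numbers are illustrative.

```python
import math

def perplexity(token_nlls):
    # exp of the mean per-token cross-entropy (natural log)
    return math.exp(sum(token_nlls) / len(token_nlls))

def pass_at_k(n, c, k):
    # 1 - P(all k drawn samples fail), drawing k of n samples without replacement
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(perplexity([math.log(4)] * 3))  # uniform guess over 4 tokens
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

sampling n > k solutions and using this estimator gives a lower-variance number than literally drawing k samples once.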

[ ESSENTIAL RESEARCH PAPERS ]

foundational papers

Attention Is All You Need (2017)

Vaswani et al., Google

introduced transformer architecture. self-attention, multi-head attention, positional encodings.

GPT-3: Language Models are Few-Shot Learners (2020)

Brown et al., OpenAI

demonstrated few-shot in-context learning at 175B parameters, with performance improving steadily with scale.

Training Compute-Optimal Large Language Models (2022)

Hoffmann et al., DeepMind

chinchilla scaling laws. most LLMs undertrained. ~20 tokens per parameter optimal.

architecture & optimization

Flash Attention (2022)

Dao et al., Stanford

IO-aware attention algorithm. 2-4x speedup with no approximation.

LLaMA: Open and Efficient Foundation Language Models (2023)

Touvron et al., Meta

RMSNorm, SwiGLU, RoPE. strong performance at 7B-65B scale.

LoRA: Low-Rank Adaptation (2021)

Hu et al., Microsoft

parameter-efficient fine-tuning. train 0.1% of params with minimal quality loss.

training & alignment

InstructGPT (2022)

Ouyang et al., OpenAI

RLHF methodology. reward modeling + PPO for alignment.

Direct Preference Optimization (2023)

Rafailov et al., Stanford

train on preferences directly without reward model. simpler than RLHF.

Constitutional AI (2022)

Bai et al., Anthropic

self-critique and principle-based feedback. scalable oversight.

more essential reads

  • BERT - masked language modeling, bidirectional pre-training
  • RoFormer (RoPE) - rotary position embeddings
  • ZeRO - memory optimizations for trillion-param models
  • QLoRA - 4-bit quantization + LoRA
  • Chain-of-Thought - step-by-step reasoning improves capabilities

[ IMPLEMENTATION GUIDE ]

training stack

  • PyTorch: primary deep learning framework
  • HuggingFace Transformers: model implementations
  • HuggingFace Accelerate: distributed training abstraction
  • DeepSpeed: optimization and scaling
  • TRL: RLHF and alignment tools
  • Weights & Biases: experiment tracking

complete training script

# train.py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "base-model",
    torch_dtype=torch.bfloat16,
    use_cache=False,
)
model.gradient_checkpointing_enable()

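tokenizer = AutoTokenizer.from_pretrained("base-model")
# many causal LM tokenizers ship without a pad token, which the collator needs
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# load and tokenize dataset
dataset = load_dataset("your-dataset")

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

# drop the raw columns so only token ids reach the collator
tokenized = dataset.map(
    tokenize,
    batched=True,
    remove_columns=dataset["train"].column_names,
)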

# training arguments
args = TrainingArguments(
    output_dir="./model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
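    learning_rate=3e-4,  # pre-training scale; use 1e-5 to 5e-5 when fine-tuning a pretrained model
    warmup_steps=2000,  # keep well below the total number of optimizer steps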
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=100,
    save_steps=1000,
)

# train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
trainer.save_model()

instruction tuning

# instruction tuning with TRL
from trl import SFTTrainer

# format dataset
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

dataset = dataset.map(format_instruction)

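# train (reuses model and args from train.py above)
# note: newer TRL releases expect dataset_text_field and the max sequence length
# to be set on SFTConfig rather than passed to SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
)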

trainer.train()

multi-GPU training

# launch with accelerate
accelerate config  # run once to configure

accelerate launch train.py

# or use torchrun
torchrun --nproc_per_node=8 train.py

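# or DeepSpeed (train.py must accept --deepspeed, e.g. via HfArgumentParser,
# or set deepspeed="ds_config.json" in TrainingArguments)
deepspeed train.py --deepspeed ds_config.json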
DEBUGGING CHECKLIST: loss is NaN → reduce LR, prefer bf16 over fp16, confirm gradient clipping is on | OOM → reduce batch size, enable checkpointing | slow training → check GPU util, increase batch size | loss not decreasing → verify data quality and labels

[ DEPLOYMENT & PRODUCTION ]

inference optimization

  • quantization: 8-bit, 4-bit (GPTQ, bitsandbytes)
  • model pruning: remove redundant weights
  • distillation: train smaller model to mimic larger
  • speculative decoding: draft + verify for faster generation

serving frameworks

vLLM

PagedAttention and continuous batching for high throughput, production-ready

Text Generation Inference

HuggingFace's official server, easy to use

Ollama

run LLMs locally with ease, great for development

vLLM deployment

# start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 4096

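# use OpenAI-compatible client (separate Python script, run once the server is up)
from openai import OpenAI
# the key is a placeholder; vLLM only checks it if the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "hello!"}]
)
print(response.choices[0].message.content)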

monitoring

  • latency (p50, p95, p99 percentiles)
  • throughput (requests/sec, tokens/sec)
  • error rate and failure modes
  • GPU utilization and memory
  • cost per request
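tail percentiles matter because averages hide slow requests. the sketch below computes nearest-rank percentiles over a made-up batch of latency samples. real deployments usually get these from a metrics system's histograms rather than raw lists.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# hypothetical request latencies in milliseconds, with one slow outlier
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 130, 99, 115]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.1f} ms")
for p in (50, 95, 99):
    print(f"p{p}={percentile(latencies_ms, p)} ms")
```

one slow request drags the mean far above the median, while p95 and p99 show the outlier directly. that is the reason to alert on percentiles.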

safety & content filtering

  • input validation and sanitization
  • output content filtering (toxicity, PII)
  • rate limiting and abuse prevention
  • logging for audit and improvement
PRODUCTION CHECKLIST: implement error handling, monitoring, rate limiting, cost tracking, and safety measures BEFORE serving to users. test with production-like loads.

[ NEXT STEPS ]

start small, scale up

  • begin with 1-7B parameter models
  • reproduce a paper's results to validate setup
  • fine-tune existing models on your domain
  • contribute to open source projects
  • share findings with the community

follow research

  • ArXiv: cs.CL, cs.LG, cs.AI
  • Labs: Anthropic, OpenAI, Google DeepMind, Meta AI, Mistral AI
  • Open Source: EleutherAI, HuggingFace, Together AI
FINAL ADVICE: the field moves fast. stay curious, experiment often, don't be afraid to fail. consistent practice > reading papers. build, deploy, iterate. the best way to learn is by doing.